
Fingerprinting threat using the TOPICS API #74

Closed
AlexandreGilotte opened this issue May 30, 2022 · 6 comments

@AlexandreGilotte

We would like to point out a potential privacy attack made possible by the “past observability” rule of the Topics API. This rule could allow building cross-domain unique user identifiers from the API, potentially on a large scale.

The attack would leverage the following statement:
"Only callers that observed the user visit a site about the topic in question within the past three weeks can receive the topic."

Under certain conditions, this rule leaks one bit of data per caller: whether that caller has observed the user or not. By multiplying the number of callers, a fingerprinting vector can be obtained.

An attacker can exploit this by using several endpoints belonging to distinct domains (say 100) to call the API, each of which would observe a different random set of users.
The attack works in two phases: during the first week, the attacker randomly "tags" the users (i.e. calls the API or not), and an identifier is retrieved in the second week.

First phase: random “tagging”

During the first week, each time a user first visits a website where the attacker's code is embedded, the following script is run (callBrowsingTopicsFrom stands for the per-endpoint invocation; a possible implementation is sketched after the loop):

for (let i = 1; i <= 100; i++) {
  if (Math.random() < tagging_proba) {  // tagging_proba is a parameter of the attack, e.g. 0.5
    callBrowsingTopicsFrom(i);          // call document.browsingTopics() from attacker_endpoint_i
  }
}

Second phase: building an identifier

During the second week, when a user visits a website the attacker is integrated on, the attacker queries the Topics API from all 100 attacker_endpoint_i domains.

Let us assume that this user has a non-random topic this week on at least one website where the attacker is present. Because of the randomness of the tagging in phase one, only a subset of the 100 attacker endpoints will return this topic. The attacker can thus build a 100-dimensional binary vector describing which of its domains “observed” the user.
This vector will be the same for a given user on all sites sharing the same weekly topic, i.e. roughly 1/5 of the sites.
Moreover, this vector would almost certainly be unique to this user, provided its components are random enough (that is, not almost all 0s or 1s, which can be achieved by tuning the tagging_proba parameter).

The attacker has thus built a unique identifier for this user on 20% of the web.
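
A sketch of this read-out, under the same illustrative setup as in phase one; queryTopicsVia is a hypothetical helper in which the frame on attacker_endpoint_i calls document.browsingTopics() and posts the resulting topic IDs back to the embedding page:

// Build the 100-dimensional binary identifier for the current site's topic.
async function readIdentifier(siteTopic) {
  const bits = [];
  for (let i = 1; i <= 100; i++) {
    const topicIds = await queryTopicsVia(i);          // topic IDs seen by attacker_endpoint_i
    bits.push(topicIds.includes(siteTopic) ? 1 : 0);   // 1 iff this endpoint "observed" the user
  }
  return bits.join('');  // stable for this user on all sites sharing this week's topic
}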

Implementation details

Some details may somewhat reduce the scope of the attack, but a careful implementation should still give the attacker a large reach:

The attacker should adapt to the fact that the start of the epoch/week is randomized, for example by tagging during only a few days instead of a whole week; the attack would then only work on sites whose "start of the week" does not fall during the tagging period.

Reading the identifier in week 2 may cause some endpoints to "observe" the user, and thus modify the identifier returned on subsequent calls.
To avoid this, the attacker could read the unique identifier only from sites whose topics are in a predefined list (this way, reading would not modify the observations of topics outside of this list).

Tuning the tagging probability: this parameter may have to be tuned to obtain unique identifiers with high probability. Let n be the number of distinct sites with a given topic on which the attacker is present. For each attack endpoint, the user is 'observed' on this topic if the random tagging fired on at least one of these n sites, which happens with probability 1 - (1 - tagging_proba)^n. For example, with tagging_proba = 0.5 and n = 10 visited sites, each caller observes the user with probability around 0.999; this does not give the identifier enough entropy (in other words, it becomes very likely that all 100 attack endpoints observed the user, which is not unique to this user). Still with n = 10, a tagging_proba of 0.1 instead brings the observation probability for each endpoint down to about 0.651, and the probability that another user shares the same 100-dimensional identifier becomes negligible. Finding a 'good' parameter that works for most users may thus require some experimentation.
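
A quick check of these numbers (plain arithmetic; the collision estimate assumes the 100 bits are independent and identically distributed, which is a simplification):

// Probability that one endpoint observed the user after n independently tagged sites.
const observed = (p, n) => 1 - (1 - p) ** n;
console.log(observed(0.5, 10).toFixed(3));   // 0.999 -> vector is almost all 1s, little entropy
console.log(observed(0.1, 10).toFixed(3));   // 0.651 -> each bit is close to a fair coin

// Probability that another user matches all 100 bits, if each bit is 1 with probability q.
const collision = (q, k) => (q ** 2 + (1 - q) ** 2) ** k;
console.log(collision(0.651, 100));          // ~5e-27, i.e. negligible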

We note that this kind of attack was mentioned in topics/topics_analysis.pdf (patcg-individual-drafts/topics), but was not detailed further.

What are your thoughts on how to prevent or mitigate this attack?

@martinthomson

This is very good. I've a few suggestions that might improve the attack, not mitigate it.

Firstly, I wouldn't draw at random. I would instead pick an identifier for every user. This would make the attack more reliable and you could probably drop down to a smaller number of bits. 20 bits are enough to target 1 million users at a time; just avoid setting all the bits to 0 so you can be sure the API is functioning. You might also add a checksum or MAC to deal with the potential for the API results being drawn randomly.
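
A sketch of that deterministic variant, assuming one witness endpoint per bit; the 4-bit checksum and the bit layout are illustrative, and callBrowsingTopicsFrom is the same placeholder as in the original attack description:

// Tag the user with a fixed, non-zero 20-bit identifier plus a small checksum.
function tagUser(userId) {              // userId in 1 .. 2**20 - 1 (never 0)
  const check = userId % 15;            // toy 4-bit checksum; a real attacker might prefer a MAC
  const payload = (userId << 4) | check;                     // 24 bits total
  for (let bit = 0; bit < 24; bit++) {
    if ((payload >> bit) & 1) callBrowsingTopicsFrom(bit);   // one endpoint per set bit
  }
}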

As you observe, repeated random draws at p=0.5 end up bearing witness to a topic too often. You therefore need to ensure either that your witness sites are used only once to create the witness, or that each witness consistently chooses the same value. If the witness is unique to one site that "sends" the identifier and one site that "receives" the identifier, this should be easy. Reading also destroys the signal, so you need to stop using your witness sites after reading the data back. Altering the tagging probability isn't necessary if you follow those rules.

The site on which the tagging occurs will have a topic allocated. The tagging is only effective if, on the target site, the browser draws that topic, which occurs with a probability of 19% provided the topic is in the user's top 5. So you also need to ensure that the topic makes the top 5; if it doesn't, you learn nothing.

It is possible to push a topic up into the top 5 if you can engineer more visits to sites with the right topic. Let's say that you create a multipage article and have each page on a different domain/site (see the subdomain hack for choosing a topic; getting on the PSL is also an option, but that will have some user experience hitches you probably want to avoid). Then you can have a single site either expand its coverage of different topics, increase the chance that a witnessed topic hits the top 5, or both. Here, you can coordinate your witness sites so that they have the same pattern of witness across topics. Reinforcing a topic only requires that one of your chosen witnesses is invoked.

Altering the top 5 rankings has the "benefit" of spoiling the use of Topics for others, though it would be easier to ensure that you get a top 5 position for your chosen topic by picking topics that are already highly ranked. Trying to push past a genuine interest might require more site visits than you can engineer.

This trick relies on navigation tracking for the "sending/witnessing" part to build up effectiveness. However, you don't need any navigation that bridges the sending and receiving sites.

Only the 5% probability of a random topic is effective at blocking this transfer. You can detect this reliably as long as you don't have all witnesses invoke the API during tagging, since a random value will be presented to all witnesses. For those users that produce a random response, you can try again next week.
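
One way to implement that detection, assuming a few endpoints are held back as sentinels that never observe anything during tagging (queryTopicsVia is the same hypothetical helper as above):

// A sentinel endpoint never observed the user, so if it still receives the site's topic,
// that topic must be this week's 5% random draw: discard the read-out and retry next epoch.
async function isRandomTopicWeek(siteTopic, sentinelIds) {
  for (const i of sentinelIds) {
    const topicIds = await queryTopicsVia(i);
    if (topicIds.includes(siteTopic)) return true;
  }
  return false;
}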

@jkarlin
Collaborator

jkarlin commented May 31, 2022

Great discussion, and thanks for raising this Alexandre. This is indeed the type of attack that was alluded to in the paper and in the explainer. And it's clearly part of the privacy trade-off between filtering and not filtering topics that needs to be made. The privacy win is that it's better than third-party cookies, but there is a potential fingerprinting risk as you've described here.

I am concerned about the general risk of this sort of attack, but not terribly so. In other words, it's one of those things that we can leverage mitigations for if/when we observe such abuse. But I'm not yet convinced that it's inevitable that we'll see it crop up.

If it does come up, here are some possible mitigations:

  1. Increase the number of top topics for the week (reducing the chance that the attacker's topic is shown to the site)
  2. Only allow the top X (say 25) observers of a topic in a given epoch to see the topic
  3. Return a random value with increasing probability the more the API is called on a page
  4. Make the epoch transition be per-caller + per-site instead of just per-site
  5. Direct browser intervention against abusers

This does remind me of one of the drawbacks of allowing the separation of observing a topic and getting a topic into two separate API calls, as described in #54, since it makes it easier to perform this sort of attack (reading topics without observing means that you can call the API on any page instead of only on pages that aren't about the given topic). But I don't think that change would really alter the overall risk/impact of the attack much.

More thoughts welcome!

@martinthomson

I feel obligated to mention the obvious mitigation: removing the topic witness requirement. That limitation is what created this problem in the first place.

Option 1 doesn't really change the dynamics. A 19% chance of pulling off this trick isn't that much different from the 9.5% chance you get with a top 10 draw. Particularly when you get into the tricks you might use to create broader witness. It could also make the API less useful.

Option 2 might double down on a bias toward larger actors.

Option 3 is genuinely interesting, but I don't know how it would work if the goal is for the invocations to produce consistent results over repeated invocations. I guess that you have to store who asked in what order for each site. For sites that want to abuse this, they can limit the number of bits they pass to keep below any threshold and simply spread the bits out over time. For an audience of 50 million people, you only need 26 bits.
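
A quick check of the bit count:

// 2 ** 26 = 67,108,864 distinct values, enough to cover an audience of 50 million users.
const bitsNeeded = Math.ceil(Math.log2(50_000_000));   // 26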

Option 4 seems to suggest that you might end up exposing more information in aggregate than you might want. I assume that you don't want to rely on the epoch transition for a caller being secret, so you gain little. Also, you end up with a mix of callers on each site, with callers getting topics from adjacent weeks, potentially with a different set of site visits and a different set of top 5 topics. I can only see this increasing the information that is revealed through the API.

Option 5 is a fairly universal backstop, but I think that we should hold new proposals to a higher standard than that.

@jkarlin
Collaborator

jkarlin commented Jun 6, 2022

I feel obligated to mention the obvious mitigation: removing the topic witness requirement. That limitation is what created this problem in the first place.

That filtering is what keeps this proposal a clear step forward in privacy over where Chrome is today with third-party cookies. Would you be comfortable with the API if we removed it?

@martinthomson

More comfortable, yes. I have a few other reservations, but creating a supercookie--imperfect as it might be--isn't good.

@jkarlin
Collaborator

jkarlin commented Jun 22, 2023

Since this discussion, we've added a requirement on Chrome that developers enroll to use the API and to attest that they won't abuse the API. That's not a technical solution, but I do believe it goes a long way to addressing this problem. Closing for now.

jkarlin closed this as completed Jun 22, 2023