Knowing the source site in the aggregation API / aggregate queries need key discovery mechanism #583

Open
alois-bissuel opened this issue Oct 12, 2022 · 15 comments


@alois-bissuel
Contributor

Hello,

We have two use cases in advertising which are hard to fit into the current version of the aggregate API. Advertisers and marketers want to know from which domains conversions were made. For fraud prevention, knowing on which domains clicks were made is paramount for detecting and banning shady websites set up to siphon money off advertisers.

The source site (ie the publisher domain) was removed from aggregatable reports in #445. Before that pull request (which addresses #439), the source site was available in the clear (as the attribution_destination currently is).

Encoding the publisher domain in the aggregate API in its current state (ie no source_site in the aggregatable reports) is a very hard problem because of the following characteristics:

  • it has a high cardinality (hundreds of thousands of domains or more, depending on the aggregation window).
  • it is dynamic (any publisher can easily monetize a new website on the open web by plugging it into an SSP).
  • it is hard to form a good a priori estimate of which publishers might lead to conversions (a campaign for high-end headphones might drive most of its conversions on a small audiophile blog).

So far, I see three potential solutions, the first two of which use plausible deniability to add the source_site back in the clear to the aggregatable reports:

  1. Add the source_site back to aggregatable reports, and with some probability send empty conversion reports (eg random key and zero value) from any website the user has visited. This might enable the exfiltration of even more user data than before (a very targeted campaign would allow a bad actor to learn the browsing habits of the targeted user group). Hence the second proposal.
  2. Same as 1., but using a domain picked from a list of the most visited publishers of the country or region. This list can be generated in a privacy-safe and decentralized manner using a mechanism such as RAPPOR.
  3. Add a mechanism for key retrieval or discovery to the aggregation service. The issue there is that encoding the domain takes a large number of bits: from 20 bits to encode a million different domains using a dictionary (the most space-efficient encoding) up to 5 bits per character (using the lowercase Latin alphabet), in which case the full 128-bit key space might not be large enough to encode many long domains (see the sketch below). This type of mechanism would be useful for any dimension with large cardinality.
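
For a rough sense of the arithmetic behind option 3, here is a small sketch (the alphabet size and the example domain length are illustrative assumptions, not part of the API):

```python
import math

KEY_BITS = 128

# Dense encoding: an index into a dictionary of ~1 million known domains.
dictionary_size = 1_000_000
dense_bits = math.ceil(math.log2(dictionary_size))  # 20 bits

# Sparse encoding: ~5 bits per character for a lowercase Latin alphabet
# (26 letters plus a few separators fit in 32 = 2**5 symbols).
bits_per_char = 5
max_chars_per_key = KEY_BITS // bits_per_char  # 25 characters

print(f"dictionary index: {dense_bits} bits")
print(f"characters that fit in one 128-bit key: {max_chars_per_key}")
# A 40-character domain would need 200 bits, i.e. it no longer fits
# in a single 128-bit key under the per-character encoding.
```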

What are your suggestions?

N.B. This issue also concerns the Private Aggregation API, as it uses the same report format as ARA for slightly different use cases. Cross-posting a very similar issue there.

@csharrison
Collaborator

Thanks for filing @alois-bissuel. The use-case makes sense. To me, something like (3) seems to be the most natural choice by putting most of the onus on the aggregation service to provide the ability to measure aggregates you are interested in. We would need to think through what that means for the input encoding and query model. One question: is it feasible for you to encode the publisher in a dense encoding i.e. using a dictionary vs. a sparse encoding like the raw domain bytes?

Note we are also thinking through a design for token-based authentication which would interact with (1) and (2). Will try to have something published soon.

cc @ruclohani @badih @palenica

@alois-bissuel
Contributor Author

Thanks for the quick answer!
Using a dense encoding for the publisher is challenging for a number of reasons:

  • the list of publishers is very dynamic. As a rule of thumb, a third of all domains we see are long-lived, meaning that we serve displays on them every day, and a third are short-lived (we serve displays on them for only a few days a month). Some of these short-lived domains might be fraudsters, to which we need to react quickly.
  • in the Privacy Sandbox, we should never see the domain in the clear thanks to Fenced Frames, if I am not mistaken. Hence a dictionary could never be updated in that case.

I hope this makes sense!

@csharrison
Collaborator

@alois-bissuel yes it does make sense, thank you for clarifying!

@palenica
Collaborator

FWIW, to me, (3) also seems the easiest choice from the API designer's perspective. It would be up to the adtech to encode the domain/URL/whatever they need using one or more 128-bit keys. We could consider changes to the aggregation service to allow discovery of keys (e.g. if the value for a given key exceeds a suitably large noisy threshold, we'd allow reporting on it even if the key has not been pre-declared in a domain file at query time). Such an API extension might let you as an adtech implement something RAPPOR-like on top of it.
(Caveat: such a change likely opens up additional privacy attack vectors; we'd want to move carefully.)

cc @cilvento

@jonasz
Contributor

jonasz commented Oct 18, 2022

// +1 to option 3, discovery of keys sounds like a more general, useful mechanism.

@bmayd

bmayd commented Nov 9, 2022

Impression source domain is a fundamental piece of information that factors into almost every ads use-case, and as privacy-preserving changes reduce or eliminate the information that lets advertisers identify who is being advertised to, the impression source domain will become an even more critical factor in impression purchase decisions. Given that, any restrictions on the availability of source domain should be considered very carefully: limiting what can be reported limits what can be measured, and that has a direct impact on what inventory buyers will support. Marketers are not going to buy impressions they can't tie to a source domain.

Given the importance of source domain, I suggest we consider making it a requirement that any measurement solution include it; if we don't, I think the degradation in usability will inhibit wide adoption and push participants to alternative, more privacy-invasive measurement tools and/or to shift their spend to contexts in which measurement is better supported.

With that preamble, could source domain be included in aggregatable reports as part of the encrypted payload? That would leave open the possibility of including it in the aggregation key when it had value for a specific report, and the aggregation service could redact, filter, or noise outputs to prevent revealing too much source-domain-related information.

@cilvento

cilvento commented Nov 9, 2022

It seems like it could be helpful to call out when source domain is needed for cross-site measurement versus same-site measurement/reporting. For example, reporting that an ad was served on a given domain (modulo restrictions in FLEDGE reporting) is a same-site reporting use-case. This same-site reporting could also be helpful for scoping the set of possible source domains for key discovery, although it's not immediately clear to me how efficient this would be (particularly if conversion rates are low).

The case @alois-bissuel is outlining above for source domain discovery makes sense for ARA, but are there other use-cases that should be considered in the design? For example, are there other use-cases for unique reach, frequency reporting, etc. that would require different source domain discovery methods in the Private Aggregation API? Or is just making the source domain available within encrypted reports a sufficient first step?

@bmayd

bmayd commented Nov 11, 2022

@cilvento I've started a response and am still thinking things through before I post it. In considering what you said, it occurred to me that I'm not entirely clear on how the "mechanism for key retrieval or discovery in the aggregation service" identified by @alois-bissuel in the 3rd option above would work. If someone could add a description, that would be most appreciated.

@csharrison
Collaborator

Hey @bmayd , I think what @alois-bissuel is referring to is something like the following:

  • Encode the source site directly in the aggregation keys (using the 128 bits or something else)
  • Add an additional query option to the aggregation service which doesn't expect you to provide a full enumeration of your desired output domain (which is required today)

This additional query option will often look like a histogram with a thresholding step applied (see this paper for some technical details). In this way, the query result helps you "discover" the non-zero keys i.e. which publishers saw conversions. This key discovery is not possible today due to the constraint that the output domain needs to be fully specified at query time.
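
As a rough illustration of that thresholding idea, here is a minimal sketch assuming Laplace noise and a fixed release threshold (the function name, noise scale, and threshold are illustrative assumptions, not the aggregation service's actual parameters):

```python
import numpy as np

def discover_keys(aggregated: dict[int, float],
                  noise_scale: float = 10.0,
                  threshold: float = 100.0) -> dict[int, float]:
    """Release only the keys whose noisy aggregate clears the threshold.

    `aggregated` maps 128-bit bucket keys (e.g. source_site x campaign)
    to summed contributions. Keys below the noisy threshold are dropped,
    so the querier "discovers" the non-zero keys without having to
    declare the full output domain up front.
    """
    released = {}
    for key, value in aggregated.items():
        noisy = value + np.random.laplace(scale=noise_scale)
        if noisy >= threshold:
            released[key] = noisy
    return released

# Three hypothetical keys; only the first has enough volume to be released.
print(discover_keys({0x01: 540.0, 0x02: 3.0, 0x03: 1.0}))
```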

@bmayd

bmayd commented Nov 14, 2022

Thanks for the additional detail, @csharrison. The paper you refer to is rather opaque to me, but I gather the gist is that there must be sufficient value inputs from partition members before the partition is revealed. In the context of A-ARA: source sites could act as partitions and could be included in outputs if there was enough contribution from them to ensure their inclusion wouldn't provide information that might allow identification of specific inputs. Please let me know if I didn't get that right. Assuming my understanding is correct, I think the approach is reasonable and assume it would allow for reporting of top-converting source sites and an "other" bucket with something like a count of unspecified source sites and the conversions attributable to them.

In terms of encoding the source site, I suggest including it in the encrypted payload and not as part of the 128-bit aggregation keys. Doing this would allow the aggregation service to control when the source site was revealed, and it would also separate the aggregation keys from the source sites: source sites would not consume key space, which would reduce key complexity and allow keys to be reused across campaigns, with source site as an additional bucketing option.
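
To make the suggestion concrete, a hypothetical decrypted payload might look like the following (only `operation` and `data` mirror today's aggregatable-report format; the `source_site` field is the proposed addition and does not exist today):

```python
# Hypothetical plaintext payload as the aggregation service would see it
# after decryption. The source site sits outside the 128-bit bucket keys,
# so it does not consume key space and keys can be reused across campaigns.
hypothetical_payload = {
    "operation": "histogram",
    "data": [
        {"bucket": 0x0000_0000_0000_0000_0000_5F3A_0000_0001, "value": 32},
    ],
    "source_site": "https://publisher.example",  # proposed addition
}
```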

I think there would be other benefits from having source site available in the aggregation service, for example:

  • It could be used to generate a list of source sites in a campaign which could be compared to reporting from other sources to identify discrepancies.
  • It could be used to identify source sites which displayed anomalous behaviors indicating fraud or suspicious activity.

There are other sources of source site information, but sourcing it directly from the browser and through the aggregation service provides a unique point of validation: it comes from browsers via a protected channel, versus other systems which are more subject to manipulation.

@csharrison
Collaborator

Thanks for the additional detail, @csharrison. The paper you refer to is rather opaque to me, but I gather the gist is that there must be sufficient value inputs from partition members before the partition is revealed. In the context of A-ARA: source sites could act as partitions and could be included in outputs if there was enough contribution from them to ensure their inclusion wouldn't provide information that might allow identification of specific inputs. Please let me know if I didn't get that right.

Yeah, the idea is that any "key" could act as a partition. For this particular use-case you could imagine a key that includes the source site, e.g. source_site x campaign is a key.

Assuming my understanding is correct, I think the approach is reasonable and assume it would allow for reporting of top-converting source sites and an "other" bucket with something like a count of unspecified source sites and the conversions attributable to them.

I think this is technically possible but we'd need to carefully design this functionality to be privacy preserving. It is not immediately available with the technique I linked (which just drops the buckets).

In terms of encoding the source site, I suggest including it in the encrypted payload and not as part of the 128-bit aggregation keys

This is an interesting suggestion and I agree it comes with some benefits, but for completeness I think it's worth discussing some downsides:

  • Increases the complexity of the query API surface (we would now need a special query language to describe the source site)
  • Increases the payload size and compute for running queries (since now we are doing string comparisons in addition to integer comparisons).

There are other sources of source site information, but sourcing it directly from the browser and through the aggregation service provides a unique point of validation: it comes from browsers via a protected channel, versus other systems which are more subject to manipulation.

Can you say more about this? Is the concern about a bad actor mutating an aggregation key, or about an aggregation key that is securely generated from bad information?

@csharrison csharrison changed the title Knowing the source site in the aggregation API Knowing the source site in the aggregation API / aggregate queries need key discovery mechanism Nov 17, 2022
@alois-bissuel
Contributor Author

Sorry for the very late answer.

First of all, I completely support @bmayd's explanation of the need to have the domain in the reports.

I think that including the domain within the encrypted part of the report is an extremely good idea, which would nicely balance the security and usability properties of the API.

For the API surface, I reckon a simple interface could be created. For instance, we could query the provided keys without the domain (eg cross-domain reporting), or ask to get the provided keys crossed with the domain (and maybe with a further thresholding step, as @csharrison introduced). I don't think we need to specifically filter by source domain (eg I want only the reports for example.com), so we don't need a specific query language to describe the source site.
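
As a sketch of those two query modes, hypothetical aggregation-service query parameters could look like this (none of these field names exist today; they only illustrate the interface idea):

```python
# Mode 1: today's behaviour, aggregate over the pre-declared keys only.
query_without_domain = {
    "output_domain": ["campaign123_productA", "campaign123_productB"],
    "group_by_source_site": False,
}

# Mode 2: proposed, cross each pre-declared key with the (encrypted)
# source site and apply the thresholding step before releasing any
# (key, source_site) pair.
query_with_domain = {
    "output_domain": ["campaign123_productA", "campaign123_productB"],
    "group_by_source_site": True,
    "release_threshold": 100,
}
```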

@bmayd

bmayd commented Dec 23, 2022

There are other sources of source site information, but sourcing it directly from the browser and through the aggregation service provides a unique point of validation: it comes from browsers via a protected channel, versus other systems which are more subject to manipulation.

Can you say more about this? Is the concern about a bad actor mutating an aggregation key, or about an aggregation key that is securely generated from bad information?

I was actually thinking of sources of site information that are outside ARA entirely, such as ad servers. As we become increasingly data-limited, our ability to confidently corroborate claims is reduced, making information provided through a trusted channel, such as ARA, much more important.

For the API surface, I reckon a simple interface could be created. For instance, we could query the provided keys without the domain (eg cross-domain reporting), or ask to get the provided keys crossed with the domain (and maybe with a further thresholding step, as @csharrison introduced). I don't think we need to specifically filter by source domain (eg I want only the reports for example.com), so we don't need a specific query language to describe the source site.

I agree with @alois-bissuel here; I think this is a good start that addresses the majority of reporting needs and will give us a solid basis for evaluating the API. If it turns out there are significant unaddressed use-cases, we can consider them when they're surfaced.

@bmayd

bmayd commented Feb 2, 2024

@csharrison It has been a long time since we discussed this, but I wasn't able to find anything regarding the resolution. Will it be possible to get the impression source domain included in the encrypted payload, so that it is available for inclusion in group-by keys and we can report by source domain?

@csharrison
Collaborator

The first step to this (introducing a key discovery mechanism) is still under consideration, and I think it is a prerequisite for this use-case. Once it is supported, the use-case of getting impression domains included is partially addressed via hash-based methods (which, as I understand from this thread, are non-ideal). However, we can probably build on this foundation toward more advanced techniques like encrypting the whole site, but we haven't made much progress on that yet.
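
For reference, a sketch of what such a hash-based method could look like, assuming the adtech reserves part of the 128-bit key for a truncated hash of the source site and later matches discovered keys against a list of candidate domains (the bit layout and helper names are assumptions, not a documented technique):

```python
import hashlib

SITE_HASH_BITS = 40  # assumed: bits of the 128-bit key reserved for the site hash

def site_hash(domain: str) -> int:
    """Truncated SHA-256 of the source site, fitted into the reserved bits."""
    digest = hashlib.sha256(domain.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") >> (256 - SITE_HASH_BITS)

def match_discovered(discovered_hashes: set[int], candidates: list[str]) -> dict[int, str]:
    """Map discovered hash prefixes back to known candidate domains.

    Hashes with no candidate match stay unresolved, which is part of why
    this method is non-ideal for a dynamic, high-cardinality publisher list.
    """
    lookup = {site_hash(d): d for d in candidates}
    return {h: lookup[h] for h in discovered_hashes if h in lookup}

candidates = ["audiophile-blog.example", "news.example"]
discovered = {site_hash("news.example"), 0xDEADBEEF}
print(match_discovered(discovered, candidates))
```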

cc @keke123
