
[Aggregated API] Using the API for both low latency reactive monitoring and detailed client reporting #732

Open · alois-bissuel opened this issue Mar 23, 2023 · 5 comments
Labels
possible-future-enhancement Feature request with no current decision on adoption

Comments

@alois-bissuel
Contributor

Hello,

There are some cases where we want to use the aggregated API for two use cases which are quite different:

  1. A low-latency campaign monitoring system, where knowing attributed sales with little delay is paramount for correct delivery. Little to no delay can be tolerated here.
  2. Detailed client reporting, where the precision and richness of the data presented are key. Here delays are more acceptable (i.e. a trade-off can be made between delay and signal-to-noise ratio).

We struggle to accommodate the two use cases within the API in its current form. Because the data can only be processed once, we have to sacrifice one of the use cases (i.e. either use a single detailed encoding and process the data hourly, meaning use case 2 gets drowned in noise, or process the data daily and sacrifice use case 1).

Supporting the two use cases at the same time could be done by allowing several passes over the data in the aggregation service. To keep the differential privacy properties of the aggregation service, we could keep track of the already consumed budget (e.g. the first pass uses ε/4, the second ε/2, and the last ε/4). Another approach would be to define broad key spaces (e.g. split the 128-bit space into 4 buckets) and allow aggregation only once per key space. This way one would encode the fast-paced campaign monitoring metrics in the first key space and query the aggregation service hourly for them, and encode the client reporting metrics elsewhere and aggregate them weekly.
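To make these two ideas a bit more concrete, here is a minimal sketch (purely illustrative: neither mechanism exists in the aggregation service today, and the prefix width, ε split, and names are assumptions on my part):

```python
# Illustrative only: neither per-pass epsilon tracking nor key-space
# partitioning exists in the aggregation service today. The 2-bit prefix,
# the epsilon split, and all names are assumptions for the sake of the example.

KEY_SPACE_BITS = 128
PREFIX_BITS = 2  # split the 128-bit key space into 4 buckets


def make_key(key_space: int, payload: int) -> int:
    """Encode an aggregation key as a 2-bit key-space prefix + 126-bit payload."""
    assert 0 <= key_space < (1 << PREFIX_BITS)
    assert 0 <= payload < (1 << (KEY_SPACE_BITS - PREFIX_BITS))
    return (key_space << (KEY_SPACE_BITS - PREFIX_BITS)) | payload


# Idea 1: several passes over the same reports, each pass consuming part of a
# total epsilon budget (e.g. eps/4 for the hourly pass, eps/2 for the weekly
# pass, eps/4 kept in reserve).
class EpsilonLedger:
    def __init__(self, total: float) -> None:
        self.remaining = total

    def spend(self, epsilon: float) -> None:
        if epsilon > self.remaining:
            raise ValueError("privacy budget exhausted")
        self.remaining -= epsilon


# Idea 2: each key-space prefix may be aggregated only once, so the hourly
# monitoring keys and the weekly reporting keys live under different prefixes.
consumed_key_spaces: set[int] = set()


def query_key_space(key_space: int) -> None:
    if key_space in consumed_key_spaces:
        raise ValueError(f"key space {key_space} was already aggregated")
    consumed_key_spaces.add(key_space)
```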

Both methods have their pros and cons: the latter is more precise (as one doesn't burn part of the budget on both use cases at the same time), while the former allows for less regret (i.e. one can always reserve some budget for a final aggregation in case of a mistake).

For both methods, the storage requirements of the aggregation service can be kept under control by setting a sensible but low limit on the number of times the data can be processed.

@csharrison
Collaborator

cc @hostirosti @ruclohani for visibility.

Thanks for filing, @alois-bissuel. I agree a more flexible way of consuming privacy budget should help satisfy the use case. Between use cases (1) and (2) you mention, do you expect the keys to be similar, or will e.g. (2) query finer-grained slices?

@RonShub

RonShub commented Mar 30, 2023

Hello,
+1 for the hourly and daily use case request

It would be super useful for us if it were possible to increase the limit from 1 to 2, i.e. the same aggregatable report could appear in two batches and hence contribute to two summary reports.
This is crucial for our clients for two main reasons:

  1. Facilitate our hourly->daily (or possibly daily->weekly if the SNR is too low) pull strategy, which will enable our clients to get noisier data first and data with improved SNR later (sketched below)
  2. Fault tolerance: in case an internal issue happens on our side with the hourly request and we lose the aggregation service response, we would be able to fall back to the daily response (and vice versa)

For our use case, we expect the keys to be similar.
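A minimal sketch of the pull strategy above, assuming the per-report limit were raised from 1 to 2 (this is the requested behaviour, not something the aggregation service supports today; identifiers are made up):

```python
from collections import defaultdict

# Hypothetical: today each aggregatable report may only be included in one
# batch. This sketch assumes a limit of 2, so the same report can feed an
# hourly batch (fast, noisier) and a daily batch (slower, better SNR).
MAX_BATCHES_PER_REPORT = 2

batches_used: dict[str, int] = defaultdict(int)  # report id -> batches consumed


def add_to_batch(batch: list[str], report_id: str) -> None:
    if batches_used[report_id] >= MAX_BATCHES_PER_REPORT:
        raise ValueError(f"report {report_id} already used in {MAX_BATCHES_PER_REPORT} batches")
    batches_used[report_id] += 1
    batch.append(report_id)


hourly_batch: list[str] = []
daily_batch: list[str] = []
add_to_batch(hourly_batch, "report-001")  # quick, noisy hourly summary
add_to_batch(daily_batch, "report-001")   # same report reused for the daily summary
# If the hourly response is lost on our side, the daily batch still covers the report.
```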

@michal-kalisz

Hi,
@alois-bissuel thanks for bringing this up. We have similar cases and addressing them seems very valuable, especially from an operational point of view.
At first glance, the second solution (dividing the 128-bit key space into several buckets) seems better.

@csharrison
Collaborator

@alois-bissuel quick clarification, you said:

Both methods have their pros and cons: the latter is more precise (as one doesn't burn part of the budget on both use cases at the same time)

Is this true? If we split the key space into a few sections and allowed you to query those sections independently, you would still need to allocate separate budget across those key spaces, just in the form of the client's L1 contribution bound rather than an epsilon.

Is this because the two use-cases will actually have different data / keys, and so querying the "high latency / detailed" key space during a low latency query is wasteful?
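To spell out that budget split with a rough back-of-the-envelope (assuming Laplace-style noise with scale on the order of L1/ε added to each bucket; the exact mechanism and the numbers below are only illustrative):

```python
import math

# Back-of-the-envelope only: assumes Laplace-style noise with scale L1/epsilon
# added to every bucket; the real mechanism and parameters may differ.
L1 = 65_536        # client-side L1 contribution bound per source event
EPSILON = 10.0     # illustrative query epsilon

noise_stddev = math.sqrt(2) * L1 / EPSILON  # same noise whatever the key split

# All of L1 devoted to one use case: values can be scaled up to the full bound.
single_signal = L1
print("relative noise, one use case:", noise_stddev / single_signal)

# L1 split 50/50 across two key spaces: each use case's contributions are
# capped at half the bound, so its relative noise doubles even though each
# key space could be queried independently.
split_signal = L1 // 2
print("relative noise, two use cases:", noise_stddev / split_signal)
```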

@alois-bissuel
Contributor Author

Catching up on my issues, sorry for the delay:

Thanks for filing, @alois-bissuel. I agree a more flexible way of consuming privacy budget should help satisfy the use case. Between use cases (1) and (2) you mention, do you expect the keys to be similar, or will e.g. (2) query finer-grained slices?

As there will be less data available for use case (1) (i.e. the fast-paced aggregation case), I expect us to encode fewer dimensions in the key so that more data is aggregated per key.
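For illustration, a coarse key for the fast pass versus a detailed key for the reporting pass could look like this (field names and bit widths are hypothetical):

```python
# Hypothetical key layouts; field names and bit widths are made up.

def coarse_key(campaign_id: int) -> int:
    # Fast hourly monitoring (use case 1): only the campaign dimension, so each
    # bucket aggregates many reports and stands out above the noise quickly.
    return campaign_id


def detailed_key(campaign_id: int, product_id: int, country: int) -> int:
    # Weekly client reporting (use case 2): finer slices are acceptable because
    # more data accumulates per key over the longer aggregation window.
    return (campaign_id << 48) | (product_id << 16) | country
```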

Both methods have their pros and cons: the latter is more precise (as one doesn't burn part of the budget on both use cases at the same time)

Is this true? If we split the key space into a few sections and allowed you to query those sections independently, you would still need to allocate separate budget across those key spaces, just in the form of the client's L1 contribution bound rather than an epsilon.

Indeed, I was not clear there. I was thinking of a separate encoding for each use case and thus finer budget tracking. Of course the L1 budget still applies. I guess my first proposal (i.e. allocating an epsilon budget per pass) rules out a different encoding per use case, hence my remark (and your final comment).
