Measurement testing guide

This guide describes how to run a standalone test of the Privacy Sandbox Attribution Reporting API. For more details, see Section 12.

  • Measurement of control and treatment arm results in the CMA's experimental designs 1 and 2 is covered in the Relevance APIs testing guidance, since the goal of those experiments is to test the efficacy of using Protected Audience & Topics. For more details, see Section 11.

Before you begin

Evaluation goals and proposed experiment setup

Goal 1 - Determining efficacy of Attribution Reporting API for reporting

We propose an A/A setup to measure the impact on reporting.

  • This proposal aligns with CMA guidance on evaluation of conversions-based metrics. For more details, see Section 21 and Section 12.
  • We prefer this method over a Mode A/B comparison because the Attribution Reporting API (ARA) can be tested by simultaneously measuring conversions on the same set of impressions with two different measurement methodologies: third-party cookies + non-third-party cookie data, and ARA + non-third-party cookie data. (A sketch of this per-campaign comparison follows this list.)
  • An A/A experiment also isolates the impact of the Attribution Reporting API on conversion measurement (for example, it avoids any changes to conversion rates due to lack of third-party cookies).
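To make the A/A comparison concrete, here is a minimal Python sketch of the kind of per-campaign join this setup enables. The campaign names and conversion counts are made up for illustration; the point is that both methodologies measure the same impressions, so their results can be joined and compared directly.

```python
# Hypothetical A/A comparison: the same impressions are measured with two
# methodologies, so per-campaign results can be joined and compared directly.
cookie_counts = {"campaign_1": 1_000, "campaign_2": 250}  # 3P cookies + non-3P-cookie data
ara_counts = {"campaign_1": 940, "campaign_2": 230}       # ARA + non-3P-cookie data

for campaign, cookie_conversions in cookie_counts.items():
    ratio = ara_counts[campaign] / cookie_conversions
    print(f"{campaign}: ARA recovered {ratio:.1%} of cookie-measured conversions")
```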

Suggested points of analysis

  • Pick a slice of traffic that is large enough to get statistically significant results and that has both third-party cookies and Privacy Sandbox APIs available. Ideally this is all traffic, except Mode B (which disables third-party cookies).
    • We recommend excluding Mode B from the A/A experiment, since third-party cookies won't be available and you won't be able to compare ARA results against third-party cookie-based attribution results.
    • If you would like to include Mode B, you should consider enabling debug reports for the Mode B slice of traffic. Debug reports will help you troubleshoot any configuration or implementation issues.
  • If you plan to test on a smaller slice of traffic, expect noisier measurement results. We recommend noting in your analysis what fraction of traffic was used and whether you are reporting results based on noised reports or un-noised debug reports.
    • For summary reports, a smaller traffic slice means smaller summary values, while the Aggregation Service adds noise from the same distribution regardless of the summary value, so the relative noise will be larger (see the sketch after this list).
  • Test different measurement methodologies on that slice of traffic:
    • Control 1 - Use current measurement methodologies (third-party cookies + non-third-party cookie data)
    • (optional) Control 2 - No Privacy Sandbox and no third-party cookies, that is, only non-third-party cookie data
      • Note that there could be some third-party cookies still available to some sites - for most accurate results, don't use those third-party cookies for measurement in the Control 2 or Treatment methodologies
    • Treatment - Privacy Sandbox APIs and non-third-party cookie data
      • Note that there could be some third-party cookies still available to some sites - for most accurate results, don't use those third-party cookies for measurement in the Control 2 or Treatment methodologies
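To illustrate the noise point above, here is a minimal sketch of how relative noise grows as the traffic slice shrinks. It assumes Laplace-style noise whose scale is the per-source contribution budget divided by epsilon; the budget, epsilon, and conversion counts below are illustrative assumptions, so check the current Aggregation Service documentation for the exact noise mechanism and parameters.

```python
import math

# A minimal sketch (not the exact Aggregation Service mechanism): assume the
# service adds Laplace-style noise with scale = contribution budget / epsilon,
# independent of the summary value.
CONTRIBUTION_BUDGET = 65536  # assumed L1 contribution budget per source
EPSILON = 10.0               # assumed privacy parameter

noise_scale = CONTRIBUTION_BUDGET / EPSILON
noise_stddev = math.sqrt(2) * noise_scale  # stddev of Laplace(scale) noise

for traffic_fraction in (1.0, 0.10, 0.01):
    # Hypothetical campaign: 50,000 conversions per day at 100% traffic, each
    # contributing the full budget to a single aggregation key.
    summary_value = 50_000 * traffic_fraction * CONTRIBUTION_BUDGET
    relative_noise = noise_stddev / summary_value
    print(f"traffic slice {traffic_fraction:>4.0%}: relative noise ~ {relative_noise:.4%}")
```

The noise term is constant while the summary value scales with traffic, so a 1% slice carries roughly 100 times the relative noise of full traffic.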

Metrics

  • Define which metrics make sense for your business to measure outcomes, and include a description of what the metric means and how it's being measured.
    • We suggest focusing on dimensions and metrics that are important for your advertisers. For example, if your advertisers focus on purchase conversions, measure conversion counts for those and purchase value.
  • Metrics based on a count or sum (for example, conversion rate) are easier to work with than cost-based metrics (for example, cost per conversion). For A/A analysis, cost metrics can be fully derived from the count or sum conversion values.
  • Specify whether the metrics are based on Event-Level Reports, Summary Reports, or a combination of both reports (and whether debug reports were used).
  • See the suggested template tables for guidance on how to format quantitative feedback.

Analysis

  • Coverage:
    • Are you able to measure across a similar set of users as compared to third-party cookies? Do you see higher coverage (for example, with app-to-web)?
    • Are you able to measure the conversions (and dimensions or metrics) that you or your advertisers care most about?
  • Quantitative feedback:
    • On advertiser reporting - for example, what percentage of key conversions would you be able to report for that advertiser, or what percentage of campaigns meet a reporting quality bar? (Deriving a quality bar helps adjust for campaigns with small conversion counts.)
    • Sliced by advertiser - for example, are there some advertisers who are more or less dependent on third-party cookies for reporting today?
  • Other qualitative feedback:
    • How does ARA affect the complexity of advertisers' measurement/attribution setup?
    • Does ARA help or hinder advertisers in focusing on the metrics and goals that matter to them?

Suggested template tables for reporting impact

(Reporting) Table 1:

Example template table for reporting experimental results to the CMA (taken from page 18, but testers should consider what metrics are most meaningful / feasible to provide and adapt the table as needed).

| | Treatment vs Control 1 (compares proposed end state with current state) | Treatment vs Control 2 (compares proposed end state with no PS APIs at all) | Control 2 vs Control 1 (compares conversion measurement with and without third-party cookies, without any PS APIs) |
| --- | --- | --- | --- |
| Measurement methodology | Compare conversion measurement for Treatment (ARA with non-third-party cookie data) against Control 1 (third-party cookie and non-third-party cookie data) | Compare conversion measurement for Treatment (ARA with non-third-party cookie data) against Control 2 (non-third-party cookie data only) | Compare conversion measurement for Control 2 (non-third-party cookie data only) against Control 1 (third-party cookie and non-third-party cookie data) |
| Conversions per dollar | Effect / Standard error / 95% confidence interval | Effect / Standard error / 95% confidence interval | Effect / Standard error / 95% confidence interval |
| Total conversions | Effect / Standard error / 95% confidence interval | Effect / Standard error / 95% confidence interval | Effect / Standard error / 95% confidence interval |
| Conversion rate | Effect / Standard error / 95% confidence interval | Effect / Standard error / 95% confidence interval | Effect / Standard error / 95% confidence interval |
| (add your own metrics) | | | |
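As one way to populate the Effect, Standard error, and 95% confidence interval cells above, here is a sketch using a simple two-proportion comparison for conversion rate. The counts are hypothetical, and because the A/A arms measure the same impressions rather than independent samples, a paired analysis may be more appropriate for your data; treat this as a starting point, not a prescribed methodology.

```python
import math

# Hypothetical sketch: effect, standard error, and 95% CI for conversion rate
# measured two ways (Treatment = ARA, Control 1 = third-party cookies).
# All counts below are made up for illustration.
def conversion_rate_effect(conv_t, imp_t, conv_c, imp_c):
    p_t, p_c = conv_t / imp_t, conv_c / imp_c
    effect = p_t - p_c
    se = math.sqrt(p_t * (1 - p_t) / imp_t + p_c * (1 - p_c) / imp_c)
    ci = (effect - 1.96 * se, effect + 1.96 * se)
    return effect, se, ci

effect, se, (lo, hi) = conversion_rate_effect(
    conv_t=9_800, imp_t=1_000_000,   # Treatment: ARA-measured conversions
    conv_c=10_000, imp_c=1_000_000,  # Control 1: cookie-measured conversions
)
print(f"Effect: {effect:+.5f}, SE: {se:.5f}, 95% CI: [{lo:+.5f}, {hi:+.5f}]")
```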
(Reporting) Table 2:

Example template table for reporting descriptive statistics for metrics in the treatment and control groups (taken from page 20, but testers should consider what metrics are most meaningful / feasible to provide and adapt the table as needed).

| Metric | Treatment (conversion measurement using ARA and any non-third-party cookie data you use) | Control 1 (conversion measurement using third-party cookies and any non-third-party cookie data you use) | Control 2 (conversion measurement using non-third-party cookie data only) |
| --- | --- | --- | --- |
| Conversions per dollar | Mean / Standard deviation / 25th and 75th percentile | Mean / Standard deviation / 25th and 75th percentile | Mean / Standard deviation / 25th and 75th percentile |
| Total conversions | Mean / Standard deviation / 25th and 75th percentile | Mean / Standard deviation / 25th and 75th percentile | Mean / Standard deviation / 25th and 75th percentile |
| Conversion rate | Mean / Standard deviation / 25th and 75th percentile | Mean / Standard deviation / 25th and 75th percentile | Mean / Standard deviation / 25th and 75th percentile |
| (add your own metrics) | | | |
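For the Mean, Standard deviation, and percentile rows above, here is a short sketch of the descriptive statistics for one metric in one arm. The per-campaign conversion counts are made up for illustration.

```python
import statistics

# Hypothetical descriptive statistics for one metric (total conversions per
# campaign) in a single arm, matching the Mean / Standard deviation /
# 25th and 75th percentile rows in Table 2.
per_campaign_conversions = [120, 95, 240, 60, 180, 75, 310, 140]

mean = statistics.mean(per_campaign_conversions)
stdev = statistics.stdev(per_campaign_conversions)
q = statistics.quantiles(per_campaign_conversions, n=4)  # quartiles [q1, q2, q3]
print(f"mean={mean:.1f}  stdev={stdev:.1f}  p25={q[0]:.1f}  p75={q[2]:.1f}")
```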

Goal 2 - Determining efficacy of Attribution Reporting API for bidding optimization

We propose an A/B setup to measure the impact on bidding optimization.

  • To measure the impact on bidding optimization, you will need to train two different machine learning models and use them on two slices of traffic: one model trained on current measurement methodologies (third-party cookies + non-third-party cookie data), applied to the control arm, and one model trained on Attribution Reporting API + non-third-party cookie data, applied to the treatment arm.
  • The model training should be based on as much traffic as the tester deems necessary to maximize performance, even if the treatment arm is a smaller slice of traffic and there is overlap between the training populations (for example, use the existing third-party cookie model that is trained on all traffic, and train the ARA model on all ARA traffic enabled for Goal 1).
    • If submitting results to the CMA, note if there is a significant difference between traffic slices used for training different models (for example, if third-party cookie-based models are trained on 100% of traffic but ARA-based models are only trained on 1% of traffic).
  • If possible, the training for both treatment and control bidding models should take place for the same length of time.
  • Consider whether you should continuously train and update the bidding models during the experiment, and if you do, whether you should train on as much traffic as possible or only on traffic from the treatment and control arms.
  • The different models should be used on disjoint slices of traffic as an A/B experiment. For user randomization and assignment across treatment and control arms, we recommend using Chrome-facilitated labeled browser groups (Mode A) or running your own experiment with randomized sets of browsers. We do not recommend using Mode B as the lack of third-party cookies will make it difficult to report on conversion-based metrics.
    • Chrome-facilitated browser groups will exclude some Chrome instances, such as Enterprise Chrome users, whereas your own randomized sets of browsers may not exclude them. Therefore, you should run your experiment only on Mode A groups, or only on non-Mode A/Mode B groups, to avoid comparing metrics obtained on Chrome-facilitated groups with metrics obtained outside of them.
    • If not using Chrome-facilitated labeled browser groups (for example, running the experiment on other traffic):
      • Ensure that the treatment and control split of users is randomized and unbiased. Regardless of experiment group setup, evaluate the characteristics of the treatment and control arms to ensure the groups are comparable (see Section 15 and the balance-check sketch after this list).
      • Ensure that user characteristics and campaign configurations of treatment and control groups are the same (for example, use similar geos in both treatment and control groups). (See: Section 28)
        • Specific examples include: ensure similar conversion types are being measured using the same attribution window and attribution logic, and that the campaigns are targeting similar audiences, interest groups, and geos and using similar ad copy and ad formats.
      • Ensure that initial population sizes for treatment and control groups are large enough to have flexibility for bidding and experimentation.
    • If using Chrome-facilitated labeled browser groups (Mode A), the randomization of Chrome browser instances to groups is handled by Chrome. It is recommended that you check, as before, that the randomization results in unbiased / comparable groups for your purposes.
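As a concrete example of the balance checks recommended above, here is a sketch that compares the geo mix of treatment and control browsers with a chi-square statistic. The group assignments and geo labels are made-up inputs, and geo is only one of the covariates you may want to check.

```python
from collections import Counter

# Hypothetical balance check: compare the geo distribution of treatment vs
# control browsers with a chi-square statistic over a 2 x |geos| table.
def chi_square_geo_balance(treatment_geos, control_geos):
    t_counts, c_counts = Counter(treatment_geos), Counter(control_geos)
    geos = set(t_counts) | set(c_counts)
    n_t, n_c = sum(t_counts.values()), sum(c_counts.values())
    chi2 = 0.0
    for geo in geos:
        for counts, n in ((t_counts, n_t), (c_counts, n_c)):
            observed = counts[geo]
            # Expected count if the geo mix were identical across arms.
            expected = n * (t_counts[geo] + c_counts[geo]) / (n_t + n_c)
            chi2 += (observed - expected) ** 2 / expected
    return chi2  # compare against a chi-square critical value, |geos| - 1 df

treatment = ["US"] * 480 + ["GB"] * 320 + ["DE"] * 200  # made-up assignments
control = ["US"] * 500 + ["GB"] * 300 + ["DE"] * 200
print(f"chi-square statistic: {chi_square_geo_balance(treatment, control):.2f}")
```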

Suggested points of analysis

  • We recommend defining control and treatment arms, and using a different machine learning model for bidding optimization for each arm:
    • Control 1 - Use the bidding optimization model trained on current measurement methodologies (third-party cookies + non-third-party cookie data)
    • (optional) Control 2 - Use the bidding optimization model trained without Privacy Sandbox or third-party cookies, that is, only on non-third-party cookie data
      • Note that there could be some third-party cookies still available to some sites - for most accurate results, don't use those third-party cookies for measurement in the Control 2 or Treatment methodologies.
    • Treatment - Use the bidding optimization model trained on Attribution Reporting API and non-third-party cookie data
      • Note that there could be some third-party cookies still available to some sites - for most accurate results, don't use those third-party cookies for measurement in the Control 2 or Treatment methodologies.

Metrics

  • Define which metrics make sense for your business to measure outcomes, and include a description of what the metric means and how it's being measured.
    • For example, the meaningful metric could be spend (publisher revenue), which aligns with CMA's guidance to understand the impact of third-party cookie deprecation on "Revenues per impression". See Section 19 for more details.
  • If reporting on any conversion-based metrics, you should use the same measurement methodology for each arm, to avoid multivariate testing (testing the impact on optimization and reporting in one experiment). See the suggested template tables for guidance on how to format quantitative feedback.
  • Consider other ways to gather metrics on bidding optimization impact - for example, by simulating bids (see the sketch after this list). Are there any simulated metrics that would be useful for understanding the impact of third-party cookies and ARA on your bidding models?
  • Specify whether the metrics are based on Event-Level Reports, Summary Reports, or a combination of both reports (and whether debug reports were used).
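As an example of the bid-simulation idea mentioned above, here is a sketch that replays the same impressions through two stand-in bidding models offline and compares the resulting bid distributions. Both model functions and the feature values are hypothetical placeholders for your actual cookie-trained and ARA-trained models.

```python
import random

# Hypothetical bid simulation: replay the same impressions through two bidding
# models (stand-ins for a cookie-trained and an ARA-trained model) and compare
# the bid distributions offline, without serving either model to live traffic.
random.seed(7)

def cookie_model(features):   # stand-in for the Control 1 model
    return 0.05 + 0.9 * features["pconv"]

def ara_model(features):      # stand-in for the Treatment model
    return 0.04 + 0.92 * features["pconv"]

def mean(xs):
    return sum(xs) / len(xs)

impressions = [{"pconv": random.random() * 0.02} for _ in range(100_000)]
cookie_bids = [cookie_model(x) for x in impressions]
ara_bids = [ara_model(x) for x in impressions]

print(f"mean simulated bid, cookie model: {mean(cookie_bids):.5f}")
print(f"mean simulated bid, ARA model:    {mean(ara_bids):.5f}")
```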

Analysis

  • Coverage:
    • Are you able to measure across a similar set of users as compared to third-party cookies? Do you see any changes in coverage (for example, with app-to-web)?
    • Are you able to measure the conversions (and dimensions or metrics) that you or your advertisers care most about?
  • How would the differences between the groups impact the following:
    • Advertiser reporting - for example, what percentage of key conversions would you be able to report?
    • Training and optimization - for example, simulate the impact of different conversion data on model performance.
  • Other qualitative feedback:
    • How does ARA affect the complexity of advertisers' bidding optimization setup?
    • Does ARA help or hinder advertisers in focusing on the metrics and goals that matter to them?

Suggested template tables for bidding impact

(Bidding) Table 1:

Example template table of experimental results that market participants should submit to the CMA (taken from page 18, but testers should consider what metrics are most meaningful / feasible to provide and adapt the table as needed).

| | Treatment vs Control 1 (compares proposed end state with current state) | Treatment vs Control 2 (compares proposed end state with no PS APIs at all) | Control 2 vs Control 1 (compares bidding optimization with and without third-party cookies, without any PS APIs) |
| --- | --- | --- | --- |
| Measurement methodology | To avoid multivariate testing, use third-party cookie and non-third-party cookie data to measure conversion-based metrics for both arms in each experiment (applies to all three comparisons). | | |
| Revenues per impression | Effect / Standard error / 95% confidence interval | Effect / Standard error / 95% confidence interval | Effect / Standard error / 95% confidence interval |
| (add your own metrics) | | | |
(Bidding) Table 2:

Example template table for reporting descriptive statistics for metrics in the treatment and control groups (taken from page 20, but testers should consider what metrics are most meaningful / feasible to provide and adapt the table as needed).

| Metric | Treatment (bidding optimization using ARA and any non-third-party cookie data you use) | Control 1 (bidding optimization using third-party cookies and any non-third-party cookie data you use) | Control 2 (bidding optimization using non-third-party cookie data only) |
| --- | --- | --- | --- |
| Measurement methodology | To avoid multivariate testing, use third-party cookie and non-third-party cookie data to measure conversion-based metrics across all arms. | | |
| Revenues per impression | Mean / Standard deviation / 25th and 75th percentile | Mean / Standard deviation / 25th and 75th percentile | Mean / Standard deviation / 25th and 75th percentile |
| (add your own metrics) | | | |

Goal 3 - Load testing the Aggregation Service

See Aggregation Service Load Testing Framework.