Abstract 


This document describes the aggregation and anonymization process applied to the initial version of Google COVID-19 Community Mobility Reports (published at http://google.com/covid19/mobility on April 2, 2020), a publicly available resource intended to help public health authorities understand what has changed in response to work-from-home, shelter-in-place, and other recommended policies aimed at flattening the curve of the COVID-19 pandemic. Our anonymization process is designed to ensure that no personal data, including an individual's location, movement, or contacts, can be derived from the resulting metrics. The high-level description of the procedure is as follows: we first generate a set of anonymized metrics from the data of Google users who opted in to Location History. Then, we compute percentage changes of these metrics from a baseline based on the historical part of the anonymized metrics. We then discard a subset which does not meet our bar for statistical reliability, and release the rest publicly in a format that compares the result to the private baseline.

PPRID: PPR272259
EMSID: EMS113570
ArXiv preprint, version 4, posted 2020 November 03

Google COVID-19 Community Mobility Reports: Anonymization Process Description (version 1.1)

This work is licensed under a CC BY 4.0 International license.

This article is a preprint. It may not have been peer reviewed.
COVID-19 Community Mobility Reports provide insights into changes in mobility patterns. These reports use anonymized, aggregated data to chart movement trends over time by geography, as well as by place categories, showing trends over several weeks. This works in a similar way to existing Google products and features. For example, Google Maps uses aggregated, anonymized data to show how busy certain types of places are, including when a local business tends to be the most crowded. Public health officials have suggested this same type of aggregated, anonymized data could also be helpful as they make critical decisions to combat COVID-19.

The COVID-19 Community Mobility Reports provide insights into what has changed in response to work-from-home, stay-at-home, and other recommended policies aimed at flattening the curve of the COVID-19 pandemic. They analyze trends in visits made to high-level categories of places, including workplaces, retail and recreational venues, groceries and pharmacies, parks, transit centers, and places of residence. Each version of the report will show trends over several weeks, with the most recent data representing 48 hours prior.

As explained in greater technical detail below, the anonymization process for these reports includes differential privacy [1], which is well-suited to produce analytics in contexts where the categories of data are known in advance. Our rigorous approach intentionally adds random noise to metrics in a way that maintains both users’ privacy and the overall accuracy of the aggregated data.

This paper is structured as follows: we introduce our method to produce anonymized metrics with differential privacy. We then explain how we post-process the anonymized metrics to generate the reports. Figure 1 summarizes the anonymization process.

Figure 1. System diagram of the metrics computation and anonymization process

1. Definitions

Location History users

The metrics in these reports are based on the data of Google users ("LH users") who have opted in to Location History [2], a feature which is off by default.

Differential Privacy [3]

Let ε be a positive real number and A be a randomized algorithm that computes a metric. In the context of this report, A is considered ε-differentially private if for all input datasets D_1 and D_2 such that D_2 can be obtained from D_1 by adding or removing a single user's data in a single day, and for all subsets S ⊆ im A:

$$\Pr[A(D_1) \in S] \le e^{\varepsilon} \cdot \Pr[A(D_2) \in S].$$

Granularity levels

The metrics are aggregated per day and per geographic area. There are three levels of geographic areas; in this paper, we call these granularity levels.

  • Granularity level 0 corresponds to metrics aggregated by country / region.
  • Granularity level 1 corresponds to metrics aggregated by top-level geopolitical subdivisions (e.g. US states).
  • Granularity level 2 corresponds to metrics aggregated at a higher-resolution granularity (e.g. US counties).

Granularity levels 1 and 2 are defined differently in different countries, to account for knowledge of local public-health needs. Note that in general, the geographic area represented gets smaller as the granularity level increases. No metrics are published for geographic regions smaller than 3 km².

2. Generating Anonymized Metrics

We are releasing aggregated, anonymized data that is designed to ensure that no personal data, including an individual’s location, movement, or contacts, can be derived from the resulting metrics. To that end, we anonymize the statistics with differential privacy. We query the underlying data using our open-source differential privacy library [4], which adds Laplace noise [5] to protect each metric with differential privacy.
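As an illustration only, the following sketch adds Laplace noise to a count and numerically checks the density-ratio bound from the definition of ε-differential privacy. It is not the open-source library's implementation: the helper names (`laplace_noise`, `laplace_pdf`, `noisy_count`) are hypothetical, the sampler uses ordinary floating-point arithmetic, and a production mechanism needs a hardened noise generator.

```python
import math
import random

def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) as the difference of two exponential samples."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def laplace_pdf(x: float, scale: float) -> float:
    """Density of a Laplace(0, scale) distribution at x."""
    return math.exp(-abs(x) / scale) / (2.0 * scale)

def noisy_count(true_count: int, epsilon: float, rng: random.Random) -> float:
    # Adding or removing one user changes a count by at most 1,
    # so the noise scale is sensitivity / epsilon = 1 / epsilon.
    return true_count + laplace_noise(1.0 / epsilon, rng)

epsilon = 0.11          # illustrative value
scale = 1.0 / epsilon

# Neighboring datasets: true counts 100 and 101. For every output y, the
# ratio of output densities should not exceed e^epsilon.
outputs = [90.0 + 0.5 * i for i in range(45)]
worst_ratio = max(
    laplace_pdf(y - 100, scale) / laplace_pdf(y - 101, scale) for y in outputs
)
assert worst_ratio <= math.exp(epsilon) + 1e-12

sample = noisy_count(100, epsilon, random.Random(0))
```

The check only samples a grid of outputs; the bound holds for all outputs because the two densities differ by a factor of at most exp(1/scale).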

2.1. Daily Visits in Public Places

We count the number of unique LH users who visited a public place of a given category in a given day at each granularity level. There are seven different categories derived from the data: retail, recreation, eateries (reported as part of “Retail & recreation”); groceries, pharmacies; transit; and parks. We add Laplace noise to each count according to Table 1.

For each location (at all geographic levels), each LH user can contribute at most once to each category. We also bound the contribution of each LH user to 4 〈category,location〉 pairs per day and per geographic level, using a process similar to the one described in this paper [6]: if an LH user contributes to more than 4 pairs in a given day and given geographic level, we randomly select 4 of them, and discard the others.

For example, suppose that on the same day, an LH user goes to public places in all 7 categories in two distinct neighboring countries. This makes a total of 14 〈category,location〉 pairs at country level. We would randomly discard 10 of these pairs when computing country-level statistics.
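The bounding step in this example can be sketched as follows. This is a minimal illustration, not the production pipeline: the function name `bound_contributions`, the tuple layout of `visits`, and the placeholder country codes "A" and "B" are assumptions made for the example.

```python
import random
from collections import defaultdict

MAX_PAIRS_PER_USER = 4  # per day and per geographic level, as stated in the text

def bound_contributions(visits, rng):
    """Keep at most MAX_PAIRS_PER_USER distinct (category, location) pairs per user.

    `visits` is an iterable of (user_id, category, location) tuples for one
    day and one granularity level (hypothetical input format).
    """
    pairs_by_user = defaultdict(set)
    for user_id, category, location in visits:
        # A set ensures each user contributes at most once to each pair.
        pairs_by_user[user_id].add((category, location))

    bounded = {}
    for user_id, pairs in pairs_by_user.items():
        ordered = sorted(pairs)
        if len(ordered) > MAX_PAIRS_PER_USER:
            # Randomly keep 4 pairs and discard the others.
            ordered = rng.sample(ordered, MAX_PAIRS_PER_USER)
        bounded[user_id] = ordered
    return bounded

categories = ["retail", "recreation", "eateries", "groceries",
              "pharmacies", "transit", "parks"]
# One user visits all 7 categories in two neighboring countries "A" and "B".
visits = [("u1", category, country) for category in categories
          for country in ("A", "B")]
kept = bound_contributions(visits, random.Random(0))
print(len(visits), len(kept["u1"]))  # 14 4
```

Of the 14 pairs, 10 are discarded, which matches the country-level example above.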

This process does not significantly affect data accuracy: in the US, at county level, 99% of LH users contribute 3 or fewer 〈category,place〉 pairs per day on average. Each daily place visit is counted once at each of the three granularity levels, so it is protected by differential privacy with ε = 0.11 + 0.11 + 0.22 = 0.44. These multiple metrics apply to the same dataset, so standard composition results apply, and the total daily contribution of each user (at most 4 pairs) is protected by differential privacy with a maximum of ε = 4 × 0.44 = 1.76.
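The budget arithmetic can be reproduced in a few lines. The sketch assumes simple sequential composition (budgets add up) and takes the per-level ε values from Table 1; the constant names are illustrative.

```python
# Per-granularity-level epsilon for daily visit counts (values from Table 1).
EPSILON_BY_LEVEL = {0: 0.11, 1: 0.11, 2: 0.22}
MAX_PAIRS_PER_USER = 4  # contribution bound per day and per geographic level

# One visit is counted once at each granularity level, so its budget is
# the sum over the three levels.
per_visit_epsilon = sum(EPSILON_BY_LEVEL.values())

# A user contributes at most 4 pairs per day at each level.
daily_epsilon = MAX_PAIRS_PER_USER * per_visit_epsilon

print(round(per_visit_epsilon, 2), round(daily_epsilon, 2))  # 0.44 1.76
```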

2.2. Residential

For the purposes of this analysis, we use signals like relative frequency, time and duration of visits to calculate metrics related to places of residence. We calculate an average amount of time spent at places of residence for LH users in hours. This computation is performed for each day and geographic area, using the same algorithm as the differentially private mean mechanism from our open-source library [7]. This mechanism works as follows:

  • We compute the total amount of time spent at places of residence in a given day and geographic area, in hours, by summing the per-user values after subtracting an offset of 12 hours, so that every individual value falls into the range [−12, 12]. We then add Laplace noise to this sum; the scale of the noise is indicated in Table 2. We denote the real sum s, and the noisy sum sn.
  • We compute the count of unique users who spent any time at residences in a given day and geographic area. We then add Laplace noise to this count; the scale of the noise is indicated in Table 2. We denote the real count c, and the noisy count cn.
  • Finally, we compute the ratio sn/cn for each day and each geographic area, add 12 as offset, and clamp it to the range [0, 24] hours/day.

For example, at county level, sn is obtained by first sampling a random number from a Laplace distribution of scale 109.1, and then adding that number to s. In Table 2, we also indicate the standard deviation σ of the noise added to each value.

Each user can contribute to at most one region per granularity level, which protects these metrics by differential privacy with a total budget of ε = 0.44 across all granularities. The differentially private mean mechanism and a proof of its privacy guarantees are described in [8] (Algorithm 2.4).
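The three steps of the mean mechanism can be sketched as below. This is a simplified illustration and not the library's code: the function name is hypothetical, the per-level ε is assumed to be split evenly between the sum and the count (consistent with the scales in Table 2), and the handling of a zero noisy count is an assumption made only to keep the example runnable.

```python
import random

def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) as the difference of two exponential samples."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def dp_mean_hours_at_residence(hours_per_user, epsilon, rng):
    """Noisy mean of hours at residence for one day and one geographic area.

    `hours_per_user` holds one value in [0, 24] per user. `epsilon` is the
    budget for this granularity level (e.g. 0.22 at level 2).
    """
    half = epsilon / 2.0  # assumed even split between sum and count

    # Offset by 12 so each user's value lies in [-12, 12].
    centered = [min(max(h, 0.0), 24.0) - 12.0 for h in hours_per_user]

    noisy_sum = sum(centered) + laplace_noise(12.0 / half, rng)    # sn
    noisy_count = len(centered) + laplace_noise(1.0 / half, rng)   # cn

    # Assumption: fall back to the midpoint if the noisy count is exactly zero.
    ratio = noisy_sum / noisy_count if noisy_count != 0 else 0.0

    # Add the offset back and clamp to a valid number of hours per day.
    return min(max(ratio + 12.0, 0.0), 24.0)

# 100,000 users who each spent 18 hours at home; the noise is small in comparison.
estimate = dp_mean_hours_at_residence([18.0] * 100_000, 0.22, random.Random(0))
```

With ε = 0.22 the noise scales are 12/0.11 ≈ 109.1 for the sum and 1/0.11 ≈ 9.09 for the count, matching the level 2 row of Table 2.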

2.3. Workplaces

For the purposes of this analysis, we use signals like relative frequency, time and duration of visits to calculate metrics related to places of residence and places of work of LH users. We calculate how many LH users spent more than 1 hour at their places of work. This computation is performed for each day and geographic area. Then, we add Laplace noise to each count according to Table 3.

The count is aggregated by places of residence of LH users. Since each user can contribute to at most one geographic area per granularity level, these metrics are protected by differential privacy with ε = 0.44.

3. Generating the Report From the Anonymized Metrics

The metrics described above are generated for each day, starting on 2020-01-01. They are then used to generate the percentage changes, relative to a baseline for the same day of the week, that are published in the reports. All operations described below use only the output of the differentially private mechanisms described in the previous section, so they do not consume any privacy budget.

Additional privacy protections

We discard all metrics for which the geographic region is smaller than 3 km², or for which the differentially private count of contributing users (after noise addition) is smaller than 100. Geographic regions smaller than 3 km² may be merged such that the union of their area is above the 3 km² threshold. This merging does not occur across country boundaries, except for the Vatican City and Italy.
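The two discard rules amount to a simple filter on already-noised values. In the sketch below, the function and constant names are hypothetical, and the merging of small regions is not modeled.

```python
MIN_AREA_KM2 = 3.0       # regions smaller than this are discarded
MIN_NOISY_USERS = 100.0  # threshold on the count after noise addition

def passes_additional_protections(area_km2: float, noisy_user_count: float) -> bool:
    """Return True if a metric survives both discard rules.

    Only the differentially private (noisy) user count is inspected, so this
    check consumes no additional privacy budget.
    """
    return area_km2 >= MIN_AREA_KM2 and noisy_user_count >= MIN_NOISY_USERS

print(passes_additional_protections(2.5, 5000.0))   # False: region too small
print(passes_additional_protections(40.0, 87.3))    # False: too few users
print(passes_additional_protections(40.0, 412.6))   # True
```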

3.1. Computing Percentage Changes from a Baseline

For each individual metric generated using the mechanisms described above, we compute the ratio between the metric for a given day d and the same metric computed for the baseline period. The reference baseline is defined in the following way.

  • We consider the 5-week range from 2020-01-03 through 2020-02-06. This range is fixed.
  • Within this 5-week range, we consider the 5 days with the same day of week as d. For example, if d is 2020-03-20, d is a Friday, so we consider the 5 Fridays in this 5-week range (Jan 3 to Jan 31, inclusive).
  • We compute the median of the differentially private metrics for these 5 baseline days.
  • This median metric is the baseline metric for d.

We then compute and publish the ratio between the metric for d and the baseline metric, as a percentage.
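The baseline selection can be sketched as follows. The function names and the dictionary input are assumptions for illustration, and the result is expressed as a signed percentage change, which is one reading of "ratio as a percentage". The sketch does not guard against a zero or negative noisy baseline.

```python
from datetime import date, timedelta
from statistics import median

BASELINE_START = date(2020, 1, 3)
BASELINE_END = date(2020, 2, 6)

def baseline_days(d: date) -> list:
    """Return the 5 days in the fixed baseline range with the same weekday as d."""
    days = []
    current = BASELINE_START
    while current <= BASELINE_END:
        if current.weekday() == d.weekday():
            days.append(current)
        current += timedelta(days=1)
    return days

def percent_change(noisy_metric_by_day: dict, d: date) -> float:
    """Percentage change of the noisy metric on day d versus its baseline.

    `noisy_metric_by_day` maps dates to differentially private metric values
    (hypothetical input format).
    """
    baseline = median(noisy_metric_by_day[day] for day in baseline_days(d))
    return 100.0 * (noisy_metric_by_day[d] / baseline - 1.0)

d = date(2020, 3, 20)  # a Friday
fridays = baseline_days(d)
print([day.isoformat() for day in fridays])
# ['2020-01-03', '2020-01-10', '2020-01-17', '2020-01-24', '2020-01-31']
```

Using the median rather than the mean makes the baseline less sensitive to a single unusual day in the baseline range, such as a public holiday.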

3.2. Removing Unreliable Metrics

In some regions, the noise added to obtain differential privacy can reduce the confidence that we are capturing a meaningful change, typically when there is not a lot of data for the metric. When, because of this uncertainty, the percentage change for one of these metrics has a 5% chance (or higher) of being wrong by more than ±10 absolute percentage points, we do not publish it and instead include an asterisk denoting that there is not enough data available to present privacy-safe information. More precisely:

  • Before releasing a ratio metric/baseline, we compute 97.5% confidence intervals for the metric and its baseline. Let [m_min, m_max] and [b_min, b_max] denote these respective confidence intervals.
  • We compute the ratios m_min/b_max and m_max/b_min.
  • If one of these ratios differs from the differentially private ratio by more than 10 absolute percentage points, we do not publish the corresponding percentage changes.

If the last condition is not satisfied, then the probability of being wrong by more than 10 absolute percentage points in each direction is lower than 2.5%. By union bound, this means that there is at most a 5% risk of being wrong by more than 10 absolute percentage points. Note that the confidence intervals are based on an already differentially private value and on public data (the scale and shape of the noise), so no privacy budget is consumed by this operation.
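A sketch of this check is given below. Several points are assumptions: the interval for the metric is taken as a symmetric two-sided interval derived from the Laplace tail, the half-width of the baseline interval is passed in by the caller (how the interval for a median of five noisy values is derived is not described here), and a non-positive lower bound on the baseline is treated as unreliable. All function names are hypothetical.

```python
import math

def laplace_half_width(scale: float, confidence: float = 0.975) -> float:
    """Half-width t such that P(|noise| <= t) = confidence for Laplace(0, scale).

    Uses the Laplace tail P(|noise| > t) = exp(-t / scale).
    """
    return scale * math.log(1.0 / (1.0 - confidence))

def is_reliable(metric: float, baseline: float,
                metric_half_width: float, baseline_half_width: float,
                max_error: float = 0.10) -> bool:
    """Return True if the ratio metric/baseline may be published."""
    m_min, m_max = metric - metric_half_width, metric + metric_half_width
    b_min, b_max = baseline - baseline_half_width, baseline + baseline_half_width
    if b_min <= 0:
        return False  # assumption: cannot bound the ratio from above

    ratio = metric / baseline
    lowest, highest = m_min / b_max, m_max / b_min
    # max_error = 0.10 corresponds to 10 absolute percentage points.
    return (ratio - lowest) <= max_error and (highest - ratio) <= max_error

half_width = laplace_half_width(1.0 / 0.11)  # noise scale 9.09
# Same half-width used for the baseline purely for illustration.
print(is_reliable(5000.0, 10000.0, half_width, half_width))  # True
print(is_reliable(150.0, 200.0, half_width, half_width))     # False
```

The second call fails because, with a half-width of about 33.5, the lowest plausible ratio is roughly 0.50 while the published ratio would be 0.75.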

4. Note on δ

We are generating a fixed set of metrics (all possible combinations of geographic regions, days within the periods, and public place categories), and we are also adding noise to zero-valued metrics. As such, the process outlined above is ε-differentially private with δ = 0.

5. Improving the Accuracy of Metrics Over Time

We are continuously making improvements to the underlying computation of the metrics to improve their accuracy over time. These updates can introduce a shift in their values, which can skew comparisons over time compared to the baseline period if they are not accounted for, since those improvements are not applied to the baseline period (to avoid republishing the data). To roll out these changes with minimal impact on the overall privacy budget, we use scaling factors. Rather than recomputing a metric directly, which would require a larger value of ε, we use the following process.

  1. We define groups of metrics on which the effect of the update is uniform. For example, for a specific metric and location, we could consider all days in a given period, or group days by weekday within this period (e.g. the Parks metric across all Tuesdays in June in the US, or the Workplaces metric across weekends in August in each region of granularity level 1).
  2. For each group, we sum the noisy metrics already generated with the previous computation logic. Call this sum sg.
  3. For each group, we sum the metrics recomputed for the same given period with the new computation logic, then we add noise to it with a smaller privacy budget (typically, 10%), proportional to the budget we use for the corresponding region granularity. Call this noisy sum sn.
  4. For dates following the given period, we multiply the metrics generated with the new logic by the scaling factor sn/sg (if we are scaling the baseline) or by its inverse sg/sn (if we are scaling the daily counts).

Grouping multiple metrics together to compute this scaling factor is the key insight that allows us to use a much smaller privacy budget for step 3, and reuse already generated metrics in step 2.
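Steps 2 to 4 can be sketched as follows for a group of count metrics. The function names are hypothetical, and the sensitivity of the group sum is left as a parameter because its value depends on how a single user-day can affect the group; the value 1 used in the example is an assumption for counts.

```python
import random

def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) as the difference of two exponential samples."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def scaling_factor(old_noisy_metrics, new_raw_metrics, group_epsilon,
                   sensitivity, rng):
    """Return sn / sg for one group of metrics.

    old_noisy_metrics: already published noisy values (previous logic).
    new_raw_metrics:   non-noisy values recomputed with the new logic.
    """
    s_g = sum(old_noisy_metrics)  # step 2: reuses released values, no new budget
    s_n = sum(new_raw_metrics) + laplace_noise(sensitivity / group_epsilon, rng)  # step 3
    return s_n / s_g

def rescale_daily(new_logic_noisy_metric: float, factor: float) -> float:
    """Step 4 when scaling daily counts: multiply by the inverse sg / sn."""
    return new_logic_noisy_metric / factor

level_epsilon = 0.11
group_epsilon = 0.10 * level_epsilon  # "typically 10%" of the level's budget

old = [10_000.0] * 10   # ten days under the previous logic
new = [11_000.0] * 10   # same days under the new logic (about 10% higher)
factor = scaling_factor(old, new, group_epsilon, sensitivity=1.0,
                        rng=random.Random(0))
comparable = rescale_daily(5_500.0, factor)  # close to 5,000 on the old scale
```

Because one noisy sum covers the whole group, the noise (scale about 90.9 here) is small relative to the group total, which is why a budget this small can still yield a usable factor.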

Table 1

Noise parameters used for the daily visits in public places metrics

Granularity level | Scale of Laplace noise     | Corresponding ε parameter
0                 | 1/0.11 ≈ 9.09 (σ ≈ 12.86)  | 0.11
1                 | 1/0.11 ≈ 9.09 (σ ≈ 12.86)  | 0.11
2                 | 1/0.22 ≈ 4.55 (σ ≈ 6.43)   | 0.22
Table 2

Noise parameters used for the residential metrics

Granularity level | Scale of Laplace noise: sum (total hours/day) | Scale of Laplace noise: count (number of users) | Corresponding ε parameter
0                 | 12/0.055 ≈ 218.2 (σ ≈ 308.6)                  | 1/0.055 ≈ 18.2 (σ ≈ 25.71)                      | 0.11
1                 | 12/0.055 ≈ 218.2 (σ ≈ 308.6)                  | 1/0.055 ≈ 18.2 (σ ≈ 25.71)                      | 0.11
2                 | 12/0.110 ≈ 109.1 (σ ≈ 154.3)                  | 1/0.110 ≈ 9.09 (σ ≈ 12.86)                      | 0.22
Table 3

Noise parameters used for the workplaces metrics

Granularity level | Scale of Laplace noise     | Corresponding ε parameter
0                 | 1/0.11 ≈ 9.09 (σ ≈ 12.86)  | 0.11
1                 | 1/0.11 ≈ 9.09 (σ ≈ 12.86)  | 0.11
2                 | 1/0.22 ≈ 4.55 (σ ≈ 6.43)   | 0.22

History

  • Posted November 03, 2020.