Comments of
Latanya Sweeney, Ph.D.,
Director, Laboratory for International Data Privacy
Assistant Professor of Computer Science and of Public Policy
Carnegie Mellon University
To the Department of Health and Human Services
On "Standards of Privacy of Individually Identifiable Health Information"

Thank you for this opportunity to provide comments regarding proposed amendments to Sec. 164.514(a)-(c) of the Health Insurance Portability and Accountability Act (HIPAA) Privacy Rule. Please make these comments part of the public record.

These comments pertain to: (1) the elimination of statistical de-identification in favor of a limited data set; and (2) the data elements that would constitute a limited data set. Within these comments, I refer to previous work I performed on determining the identifiability of individuals in the United States based on combinations of demographic values. An example of my findings is that 87% of the population of the United States is likely to be uniquely identified by {5-digit ZIP, gender, date of birth}. I also report on available, inexpensive computational solutions and offer some recommendations. The Privacy Rule need not limit itself to releases of data that either protect privacy or enable research; it can in fact accommodate both privacy and research. Computational solutions release data sets that are useful for research while maintaining scientific guarantees of privacy protection.

  1. Elimination of statistical de-identification

    The elimination of statistical de-identification in favor of a limited data set removes economic incentives for developing new technology that can resolve many privacy concerns, and it ignores computational solutions that are already available to address some of these issues. In fact, statistical de-identification should not be defined in terms of an individual expert (as the proposed modifications suggest) but should be grounded in scientific standards, which include computational solutions. Modifying the regulation to eliminate computational solutions altogether, presumably in favor of a limited data set, fails to recognize the ability of these systems to provide more useful data than is possible with limited data sets, and to provide scientific guarantees of anonymity, which limited data sets do not offer.

    1. Computational solutions

      There are several systems currently available that render field-structured data sufficiently anonymous. These are: Datafly [Sweeney, 1997], Statistics Netherlands' Mu-Argus [Hundepool and Willenborg, 1996], and k-Similar [Sweeney, 2001]. These systems are available for use, robust and inexpensive. There are also systems that de-couple identity from data, such as the one provided by PersonalPath Systems, Inc., and algorithms that address genetic data [Malin and Sweeney, 2000, 2001]. Given the economic opportunity, many other solutions will emerge as well. In fact, computer science research is currently underway to use cryptographic protocols and distributed data mining algorithms to protect privacy while enabling data sharing of all kinds of person-specific data for bioterrorism, counterterrorism and video surveillance purposes. These new tools will also be useful for rendering health information sufficiently anonymous. Modifications to the proposed Privacy Rule should not prohibit the use of these kinds of solutions.

      The Federal Register report on modifications to the proposed Privacy Rule [March 2002] notes that past commenters were largely silent on using the alternative statistical method to de-identify information. I suspect this is due in large part to a lack of knowledge of statistical solutions; the Register itself now refers to only two working papers. Computational solutions are even newer, but they are easy to use and produce better data quality than is possible with limited data sets. Some recognition of the merit of computational solutions has occurred: the Scrub System [Sweeney, 1996], which de-identifies text documents, won first prize from the American Medical Informatics Association in 1996, and the Datafly System received an award from the same organization in 1997.

    2. Limited data sets are not as good as computational solutions

      Limited data sets are defined by listing which data elements are (or are not) to be included. Such decisions are based on gross assumptions about the identifiability of the values associated with these data elements. There are many myths. Here are some facts. As noted earlier, {date of birth, gender, 5-digit ZIP} combine to uniquely identify about 87% (216 million of 248 million) of the population in the United States. About half of the U.S. population (132 million of 248 million, or 53%) is likely to be uniquely identified by only {place, gender, date of birth}, where place is basically the city, town, or municipality in which the person resides. And even at the county level, {county, gender, date of birth} are likely to uniquely identify 18% of the U.S. population. In general, few characteristics are needed to uniquely identify a person.
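      The uniqueness figures above come from census data, but the underlying measurement is simple to sketch. The following Python fragment, using invented records, counts how many records are unique with respect to the quasi-identifier combination {date of birth, gender, 5-digit ZIP}; the records and values are hypothetical and serve only to illustrate the computation.

```python
from collections import Counter

# Hypothetical records, each reduced to its quasi-identifiers:
# (date of birth, gender, 5-digit ZIP). Values are invented.
records = [
    ("1945-07-31", "M", "02138"),
    ("1945-07-31", "M", "02139"),
    ("1960-01-15", "F", "02139"),
    ("1960-01-15", "F", "02139"),
]

# A record is "unique" if no other record shares its combination.
counts = Counter(records)
unique = [r for r in records if counts[r] == 1]
fraction_unique = len(unique) / len(records)
print(fraction_unique)  # 0.5: the first two records are one-of-a-kind
```

      The same tally, run against population data rather than a toy list, yields the percentages cited in the text.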

      Computational solutions for rendering data sufficiently anonymous make decisions on a patient-by-patient basis, whereas limited data sets are defined by which data elements to include (or not) and by how specific their values may be. For example, about 12% of 54,805 voters in the City of Cambridge, Massachusetts are uniquely identified by {date of birth, city}. Knowing this statistic alone does not tell us which individuals comprise that 12%, so if a decision must be based on these data elements, as is the case with limited data sets, then either city or date of birth must be generalized or removed for everyone. Computational solutions, on the other hand, make decisions on a person-by-person basis: they review the data, recognize which people are uniquely identified, and modify only those people's values. Values associated with others can remain unchanged. For example, when William Weld was the Governor of Massachusetts, he lived in Cambridge, Massachusetts. Six people shared his date of birth; only three of them were men, and he was the only one in his 5-digit ZIP code. Knowing these details about him in particular makes it easier to protect his information by identifying which values are sensitive for him. Computational solutions provide this kind of protection to everyone.
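      The person-by-person approach can be sketched in a few lines. This is a deliberately minimal illustration with invented records, not the algorithm of any particular system; production tools such as Datafly iterate over several quasi-identifiers until a target level of anonymity is reached. Here, only records whose combination is unique have their date of birth generalized to a year of birth.

```python
from collections import Counter

def generalize_unique(records):
    """Generalize date of birth to year of birth, but only for records
    whose (dob, gender, zip5) combination is unique in the data set.
    A minimal sketch of per-record, rather than per-column, suppression."""
    counts = Counter(records)
    out = []
    for dob, gender, zip5 in records:
        if counts[(dob, gender, zip5)] == 1:
            out.append((dob[:4], gender, zip5))  # keep year only
        else:
            out.append((dob, gender, zip5))      # leave untouched
    return out

# Invented example data.
data = [
    ("1945-07-31", "M", "02138"),  # unique -> gets generalized
    ("1960-01-15", "F", "02139"),
    ("1960-01-15", "F", "02139"),  # a pair -> keeps full detail
]
print(generalize_unique(data))
# -> [('1945', 'M', '02138'), ('1960-01-15', 'F', '02139'), ('1960-01-15', 'F', '02139')]
```

      Note that the two non-unique records emerge with their full dates of birth intact; a limited data set would have coarsened that field for everyone.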

      As this example shows, computational solutions can provide more detail in released data, which makes the data far more useful for many of the purposes noted in the Federal Register report on modifications to the proposed Privacy Rule [March 2002]. These include: conducting and disseminating analyses that help hospitals make decisions about quality and efficiency improvements; providing greater geographic specificity, down to the 5-digit ZIP, for reporting injury or illness; and maintaining post-marketing surveillance registries through which healthcare providers report problems to the FDA. These more detailed data releases are possible because computational solutions allow data holders to specify which data elements are important to the researcher and then to modify other data elements (and records) as needed to achieve a given level of anonymity.

    3. Safe harbor provision has limitations too

      The safe harbor currently requires removal of all 18 enumerated identifiers, including direct identifiers such as name, street address and Social Security number, as well as other identifiers, such as birth date, admission date and discharge date, and 5-digit ZIP code.

      The safe harbor method has two significant shortcomings:

      • Some of the 18 identifiers are needed for adequate and proper data analysis. Removing them can distort the data severely and render it virtually useless for analysis. Several previous commenters have already voiced this concern, although there has been little consensus among them regarding which of the 18 identifiers are critical to data analysis.

      • Removing those 18 identifiers is not, in any case, an effective way to protect privacy, as the remaining data often can be re-identified by linking it with publicly available information outside the database, as is demonstrated below.

      For example, in the case of Southern Illinoisan, a division of Lee Enterprises, Incorporated v. Department of Public Health, State of Illinois, the data elements {month and year of diagnosis, type of cancer, 5-digit residential ZIP} were requested for some patients from the state's cancer registry under a Freedom of Information Act (FOIA) request. I was able to show by experimentation that such a release could be accurately re-identified to the patients who are the subjects of the information, using publicly available health data that would not be covered by the Privacy Rule. Here are some results based on a random sample of 23 patients: I identified 20 of the 23 sampled (87%), uniquely identified 18 of the 23 sampled (78%), and incorrectly identified 0 of the 23 sampled (0%). The availability of health information from non-covered entities makes it extremely difficult to determine which data elements in a limited data set are "safe."
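      The mechanics of such a re-identification are nothing more than a join on the shared fields. The sketch below uses entirely invented names, dates, and records; it illustrates how a release stripped of explicit identifiers can be matched against an outside data source that still carries names.

```python
# A "de-identified" release: {month/year of diagnosis, cancer type, zip5}.
# All records and names below are invented for illustration.
registry = [
    ("1998-03", "lung", "62901"),
    ("1998-04", "breast", "62901"),
]

# Outside data not covered by the Privacy Rule (e.g., discharge data
# or news accounts), which does carry names alongside the same fields.
public = [
    ("Alice Smith", "1998-03", "lung", "62901"),
    ("Bob Jones", "1998-04", "breast", "62901"),
    ("Carol White", "1998-04", "skin", "62901"),
]

# Join on the shared fields; a single candidate means a re-identification.
matches = {}
for month, cancer, zip5 in registry:
    names = [n for n, m, c, z in public if (m, c, z) == (month, cancer, zip5)]
    if len(names) == 1:
        matches[(month, cancer, zip5)] = names[0]

print(matches)  # both registry records link to exactly one name
```

      Nothing in the released fields is an "identifier" on its own; it is the combination, linked against outside data, that does the identifying.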

      As an example of the kinds of health information that are publicly available, 44 of the 50 states currently collect hospital discharge data, with about half of them providing a publicly available version. Unfortunately, most of these publicly available data sets include, for example, diagnoses, admission dates, and service dates, so these data elements could be used to re-identify data, even data released under the modifications proposed to the Privacy Rule.

      In response to commenters' concerns that removal of all 18 identifiers can render data unusable, HHS has requested comment on a modification that would permit the use and disclosure of some of those identifiers, including admission, discharge, and service dates; age; 5-digit ZIP; and, if applicable, date of death. While this would make the data more useful, it would also greatly increase the probability that the data could later be re-identified, further threatening patients' privacy.

      A more effective alternative to the safe harbor can be provided by computational solutions, which balance the need for data utility with the need for patient privacy. A safe harbor requires determining in advance, in a "one-size-fits-all" fashion, which data elements of all patients' records are to be routinely suppressed and which are to be released. In contrast, computational solutions evaluate each patient's individual record, determining which data element or combination of elements makes it possible to identify that individual and suppressing only the elements that make that particular individual identifiable.

      Here is an example. The most populated ZIP code in the United States is 60623, with approximately 112,167 people, yet so few people over the age of 55 live there that releases containing {date of birth, gender, 5-digit ZIP} for them tend to be unique. On the other hand, ZIP code 11794 has only about 5,418 residents, but most of them are between the ages of 19 and 24; this ZIP code belongs to the State University of New York at Stony Brook, which houses a state university campus. Despite the small number of people residing in the ZIP code, information on its residents between the ages of 19 and 24 can often be released with {date of birth, gender, 5-digit ZIP} intact while still protecting privacy. Knowing the specifics of what makes a person identifiable, and then making decisions based on that knowledge, allows far more detailed data to be released with scientific guarantees of privacy protection.
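      This kind of decision can be made cell by cell rather than for an entire column. In the sketch below, the population tallies per (ZIP, age, gender) cell are invented, and the threshold K is an arbitrary illustrative choice; the point is that full detail is released only where enough people share the combination, mirroring the contrast between ZIP codes 60623 and 11794 described above.

```python
# Hypothetical population tallies per (zip5, age, gender) cell.
# All counts are invented for illustration.
cells = {
    ("11794", 21, "M"): 640,  # many students of this age: detail is safe
    ("11794", 21, "F"): 655,
    ("11794", 67, "M"): 2,    # very few older residents: must generalize
    ("60623", 67, "F"): 3,
}

# Release full detail only where at least K people share the cell.
K = 5
releasable = sorted(cell for cell, n in cells.items() if n >= K)
print(releasable)
# -> [('11794', 21, 'F'), ('11794', 21, 'M')]
```

      The sparsely populated cells are the ones that would be generalized or suppressed; the densely populated cells can carry full detail without endangering anyone's privacy.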

  2. Data elements in a limited data set

    Various combinations of data elements were presented in the proposed modifications to the Privacy Rule. These are presented below, and the identifiability issues of each are discussed. It should be noted, however, that any data element can be used as the basis for re-identification, so restricting discussion to a few combinations of elements falsely implies that resolving the problem over those data elements (and removing the explicit identifiers) is sufficient. That is not usually the case. Nevertheless, in the next three subsections, I address the identifiability of the combinations of data elements noted in the proposed modifications.

    1. {admission date, service date, discharge date, date of death, age, 5-digit ZIP}

      It is important to note that this combination does not explicitly include gender, though in many cases gender can be inferred from the type of service, diagnosis code, or procedure code. The combination of {year of birth, 5-digit ZIP, gender} is likely to uniquely identify only 0.04% of the population of the United States. This is an aggregate figure for the entire country, however, and is not universally applicable. For example, 18.1% of the population of Iowa and 26.5% of the population of North Dakota are likely to be uniquely identified by these data elements.

    2. Geographical specification, 5-digit ZIP or other smaller than a state

      As noted earlier, {date of birth, gender, 5-digit ZIP} combine to uniquely identify about 87% (216 million of 248 million) of the population of the United States. About half of the U.S. population (132 million of 248 million, or 53%) is likely to be uniquely identified by only {place, gender, date of birth}, where place is basically the city, town, or municipality in which the person resides. And even at the county level, {county, gender, date of birth} are likely to uniquely identify 18% of the U.S. population.

    3. Date of birth and not age

      The Federal Register report on modifications to the proposed Privacy Rule [March 2002] was careful to note that while the specification calls for age, age values can be expressed in days or even hours, thereby being more specific than date of birth. Of course, when age is reported in years, the resulting values are more general than date of birth. Here are some facts related to age specifications.

      As noted earlier, {date of birth, gender, 5-digit ZIP} combine to uniquely identify about 87% (216 million of 248 million) of the population in the United States. However, {month and year of birth, gender, 5-digit ZIP} are likely to uniquely identify 3.7% of the U.S. population. And, {year of birth, gender, 5-digit ZIP} are likely to uniquely identify 0.04% (or about 105,016 people) of the U.S. population. In terms of more sensitive states, 0.89% (or 5,703 people) of the population of Iowa is likely to be uniquely identified by {year of birth, gender, 5-digit ZIP}.

Conclusion

Data anonymity is a newly emerging area of computer science that has already produced viable solutions to alleviate some privacy concerns when releasing data. The goal of data anonymity is to provide computational solutions for releasing data that are practically useful while providing scientific guarantees that the identities of the individuals who are the subjects of the data are protected. Available solutions are robust, inexpensive and effective. They can be used to provide health information to researchers while still protecting privacy. Resulting data releases are generally superior to limited data sets because data released by computational solutions tend to have many more details (making the release more useful) while maintaining scientific guarantees of privacy (providing better protection).


References

Hundepool, A. and Willenborg, L., "Mu-Argus and Tau-Argus: Software for Statistical Disclosure Control," Third International Seminar on Statistical Confidentiality, 1996.

Malin, B. and Sweeney, L., Determining the Identifiability of DNA Database Entries. Proceedings, AMIA Symposium, Nov 2000: 537-541.

Malin, B. and Sweeney, L., Re-Identification of DNA through an Automated Linkage Process. Proceedings, AMIA Symposium, Nov 2001: 423-427.

Sweeney, L., Weaving technology and policy together to maintain confidentiality. Journal of Law, Medicine and Ethics. 1997, 25:98-110.

Sweeney, L., Replacing Personally-Identifying Information in Medical Records, the Scrub System. In: Cimino, JJ, ed. Proceedings, Journal of the American Medical Informatics Association. Washington, DC: Hanley & Belfus, Inc., 1996:333-337.


Latanya Sweeney, Ph.D., 4/26/2002