Researchers are increasingly obtaining data from social networking websites, publicly-placed sensors, government records and other public sources. Much of this information appears public, at least to first impressions, and it is capable of being used in research for a wide variety of purposes with seemingly minimal legal restrictions. The insights about human behaviors we may gain from research that uses this data are promising. However, members of the research community are questioning the ethics of these practices, and at the heart of the matter are some difficult questions about the boundaries between public and private information. This workshop report, the second in a series, identifies selected questions and explores issues around the meaning of “public” in the context of using data about individuals for research purposes.
In general, the growth of big data sources have changed the threat landscape of privacy and statistics in at least three major ways. First, when surveys were initially founded as the principal source of statistical information, whether one participated in a survey was largely unknown. Now, as government record systems and corporate big data sources are increasingly used that include all or a large portion of a given universe, that privacy protection is eroded. Second, in the past, little outside information was generally available to match with published summaries. Now the ubiquity of auxiliary information enables many more inferences from summary data. Third, in the past, typical privacy attacks relied on linking outside data through well-known public characteristics -- PII or BII. Now, datasets can be linked through behavioral fingerprints.
The current state of the practice in privacy lags well behind the state of the art in this area. Most commercial organizations, and most NSOs in other countries continue to rely (at most) on traditional aggregation and suppression methods to protect privacy – with no formal analysis of privacy loss or of the utility of the information gathered. The U.S. Census Bureau, because of its size, institutional capacity, and strong reputation for privacy protection could establish leadership in modernizing privacy practices.
Vast quantities of data about individuals are increasingly being created by services such as mobile apps and online social networks and through methods such as DNA sequencing. These data are quite rich, containing a large number of fine-grained data points related to human biology, characteristics, behaviors, and relationships over time.
We recognize the exciting research opportunities enabled by new data sources and technologies for collecting, analyzing, and sharing data about individuals. With the ability to collect and analyze massive quantities of data related to human characteristics, behaviors, and interactions, researchers are increasingly able to explore phenomena in finer detail and with greater confidence. At the same time, a 2 major challenge for realizing the full potential of these recent advances will be protecting the privacy of human subjects. Approaches to privacy protection in common use in both research and industry contexts often provide limited realworld privacy protection. We believe institutional review boards (IRBs) and investigators require new guidance to inform their selection and implementation of appropriate measures for privacy protection in human subjects research. Therefore, we share many of the same concerns and rec
On September 24-25, 2013, the Privacy Tools for Sharing Research Data project at Harvard University held a workshop titled "Integrating Approaches to Privacy across the Research Data Lifecycle." Over forty leading experts in computer science, statistics, law, policy, and social science research convened to discuss the state of the art in data privacy research. The resulting conversations centered on the emerging tools and approaches from the participants’ various disciplines and how they should be integrated in the context of real-world use cases that involve the management of confidential research data. This workshop report, the first in a series, provides an overview of the long-term longitudinal study use case. Long-term longitudinal studies collect, at multiple points over a long period of time, highly-specific and often sensitive data describing the health, socioeconomic, or behavioral characteristics of human subjects. The value of such studies lies in part in their ability to link a set of behaviors and changes to each individual, but these factors tend to make the combination of observable characteristics associated with each subject unique and potentially identifiable. Using the research information lifecycle as a framework, this report discusses the defining features of long-term longitudinal studies and the associated challenges for researchers tasked with collecting and analyzing such data while protecting the privacy of human subjects. It also describes the disclosure risks and common legal and technical approaches currently used to manage confidentiality in longitudinal data. Finally, it identifies urgent problems and areas for future research to advance the integration of various methods for preserving confidentiality in research data.