Increasingly, governments and businesses are collecting, analyzing, and sharing detailed information about individuals over long periods of time. Vast quantities of data from new sources and novel methods for large-scale data analysis promise to yield deeper understanding of human characteristics, behavior, and relationships and advance the state of science, public policy, and innovation. At the same time, the collection and use of fine-grained personal data over time is associated with significant risks to individuals, groups, and society at large. In this article, we examine a range of long-term data collections conducted by researchers in social science, in order to identify the characteristics of these programs that drive their unique sets of risks and benefits. We also examine the practices that have been established by social scientists to protect the privacy of data subjects in light of the challenges presented in long-term studies. We argue that many uses of big data, across academic, government, and industry settings, have characteristics similar to those of traditional long-term research studies. We then discuss the lessons that can be learned from longstanding data management practices in research and potentially applied in the context of newly emerging data sources and uses.
Differential privacy is a formal mathematical framework for guaranteeing privacy protection when analyzing or releasing statistical data. Recently emerging from the theoretical computer science literature, differential privacy is now in initial stages of implementation and use in various academic, industry, and government settings.
This document is a primer on differential privacy. Using intuitive illustrations and limited mathematical formalism, this primer provides an introduction to differential privacy for non-technical practitioners, who are increasingly tasked with making decisions with respect to differential privacy as it grows more widespread in use. In particular, the examples in this document illustrate ways in which social science and legal audiences can conceptualize the guarantees provided by differential privacy with respect to the decisions they make when managing personal data about research subjects and informing them about the privacy protection they will be afforded.
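As a rough illustration of what a differentially private release can look like, the sketch below adds Laplace noise to a simple count. It is not taken from the primer: the Laplace mechanism is a standard textbook technique, and the names `laplace_noise` and `private_count` and the parameter `epsilon` as used here are illustrative assumptions. Floating-point sampling like this is known to be unsuitable for production use.

```python
import random


def laplace_noise(scale, rng=random):
    """Draw one sample from Laplace(0, scale).

    The difference of two independent exponential draws with rate
    1/scale follows a Laplace distribution with that scale.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(records, predicate, epsilon, rng=random):
    """Return a noisy count of the records that satisfy predicate.

    Adding or removing one record changes a count by at most 1
    (sensitivity 1), so the noise scale is 1 / epsilon.
    Hypothetical helper for illustration only.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    true_count = sum(1 for record in records if predicate(record))
    return true_count + laplace_noise(1.0 / epsilon, rng)


if __name__ == "__main__":
    ages = [23, 35, 41, 52, 67, 29, 38]
    # Smaller epsilon means more noise and stronger privacy protection.
    print(private_count(ages, lambda age: age >= 40, epsilon=0.5))
```

Each call returns a different value, so an analyst sees roughly how many records match without being able to tell with confidence whether any one individual's record was included.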
This project explores how multidimensional bio-psychological measures are used to understand the cognitive aspects of student learning in STEM (Science, Technology, Engineering and Math) focused educational games. Furthermore, we seek to articulate a method for how learning events can be automatically analyzed using these tools. Given the complexity and difficulty of finding externalized markers of learning as it happens, it is evident that more robust measures could benefit this process. The work reported here, with funding from a National Science Foundation grant (NSF DRL-1417456), aims to incorporate more diverse measures of behavior and physiology in order to create a more complete assessment of learning and cognition in a game-based environment. Tools used in this project include eye tracking systems, heart rate sensors, and tools for detecting electrodermal activity (EDA), temperature, and movement data. Findings indicated both the utility of more varied measures and the need for more precise tools for synchronization of diverse data streams.
Researchers are increasingly obtaining data from social networking websites, publicly-placed sensors, government records and other public sources. Much of this information appears public, at least to first impressions, and it is capable of being used in research for a wide variety of purposes with seemingly minimal legal restrictions. The insights about human behaviors we may gain from research that uses this data are promising. However, members of the research community are questioning the ethics of these practices, and at the heart of the matter are some difficult questions about the boundaries between public and private information. This workshop report, the second in a series, identifies selected questions and explores issues around the meaning of “public” in the context of using data about individuals for research purposes.
Gerrymandering requires illicit intent. We classify six proposed methods to infer the intent of a redistricting authority using a formal framework for causal inferences that encompasses the redistricting process from the release of census data to the adoption of a final plan. We argue all proposed techniques to detect gerrymandering can be classified within this formal framework. Courts have, at one time or another, weighed evidence using one or more of these methods to assess racial or partisan gerrymandering claims. We describe the assumptions underlying each method, raising some heretofore unarticulated critiques revealed by laying bare their assumptions. We then review how these methods were employed in the 2014 Florida district court ruling that the state legislature violated a state constitutional prohibition on partisan gerrymandering, and propose standards that advocacy groups and courts can impose upon redistricting authorities to ensure they are held accountable if they adopt a partisan gerrymander.
In the U.S., redistricting is deeply politicized and often synonymous with gerrymandering -- the manipulation of boundaries to promote the goals of parties, incumbents, and racial groups. In contrast, Mexico’s federal redistricting has been implemented nationwide since 1996 through automated algorithms devised by the electoral management body (EMB) in consultation with political parties. In this setting, parties interact strategically and generate counterproposals to the algorithmically generated plans in a closed-door process that is not revealed outside the bureaucracy. Applying geospatial statistics and large-scale optimization to a novel dataset that has never been available outside of the EMB, we analyze the effects of automated redistricting and partisan strategic interaction on representation. Our dataset comprises the entire set of plans generated by the automated algorithm, as well as all the counterproposals made by each political party during the 2013 redistricting process. Additionally, we inspect the 2006 map with new data and two proposals to replace it towards 2015 in search of partisan effects and political distortions. Our analysis offers a unique insight into the internal workings of a purportedly autonomous EMB and the partisan effects of automated redistricting on representation.
Wood A, O'Brien D, Altman M, Karr A, Gasser U, Bar-Sinai M, Nissim K, Ullman J, Vadhan S, Wojcik MJ.
On September 24-25, 2013, the Privacy Tools for Sharing Research Data project at Harvard University held a workshop titled "Integrating Approaches to Privacy across the Research Data Lifecycle." Over forty leading experts in computer science, statistics, law, policy, and social science research convened to discuss the state of the art in data privacy research. The resulting conversations centered on the emerging tools and approaches from the participants’ various disciplines and how they should be integrated in the context of real-world use cases that involve the management of confidential research data.
This workshop report, the first in a series, provides an overview of the long-term longitudinal study use case. Long-term longitudinal studies collect, at multiple points over a long period of time, highly-specific and often sensitive data describing the health, socioeconomic, or behavioral characteristics of human subjects. The value of such studies lies in part in their ability to link a set of behaviors and changes to each individual, but these factors tend to make the combination of observable characteristics associated with each subject unique and potentially identifiable.
Using the research information lifecycle as a framework, this report discusses the defining features of long-term longitudinal studies and the associated challenges for researchers tasked with collecting and analyzing such data while protecting the privacy of human subjects. It also describes the disclosure risks and common legal and technical approaches currently used to manage confidentiality in longitudinal data. Finally, it identifies urgent problems and areas for future research to advance the integration of various methods for preserving confidentiality in research data.
Gallinger M, Bailey J, Cariani K, Owens T, Altman M.
Research and practice in digital preservation require a solid foundation of evidence of what is being protected and what practices are being used. The National Digital Stewardship Alliance (NDSA) storage survey provides a rare opportunity to examine the practices of most major US memory institutions. The repeated, longitudinal design of the NDSA storage surveys makes it possible to detect trends within and among preservation institutions more reliably than typical surveys of digital preservation, which are based on one-time measures and convenience (Internet-based) samples. The survey was conducted in 2011 and in 2013. The results from these surveys have revealed notable trends, including continuity of practice within organizations over time, growth rates of content exceeding predictions, shifts in content availability requirements, and limited adoption of best practices for interval fixity checking and the Trusted Digital Repositories (TDR) checklist. Responses from new memory organizations increased the variety of preservation practice reflected in the survey responses.
In the last decade there has been a dramatic increase in attention from the scholarly communications and research community to open access (OA) and open data practices. These are potentially related, because journal publication policies and practices both signal disciplinary norms and provide direct incentives for data sharing and citation. However, there is little research evaluating the data policies of OA journals. In this study, we analyze the state of data policies in open access journals by employing random sampling of the Directory of Open Access Journals (DOAJ) and Open Journal Systems (OJS) journal directories, and applying a coding framework that integrates both previous studies and emerging taxonomies of data sharing and citation. This study, for the first time, reveals the low prevalence of data sharing policies and practices in OA journals, which differs from the findings of previous studies of commercial journals in specific disciplines.
We extend the estimation of the components of partisan bias (undue advantage conferred to some party in the conversion of votes into legislative seats) to single-member district systems in the presence of multiple parties. Extant methods to estimate the contributions to partisan bias from malapportionment, boundary delimitations, and turnout are limited to two-party competition. In order to assess the spatial dimension of multi-party elections, we propose an empirical procedure combining three existing approaches: a separation method (Grofman et al. 1997), a multi-party estimation method (King 1990), and Monte Carlo simulations of national elections (Linzer 2012). We apply the proposed method to the study of recent national lower chamber elections in Mexico. Analysis uncovers systematic turnout-based bias in favor of the former hegemonic ruling party that has been offset by district geography substantively helping one or both other major parties.
We analyze sixty-six Ohio congressional plans produced during the post-2010 census redistricting by the legislature and the public. The public drew many plans submitted for judging in a competition hosted by reform advocates, who awarded a prize to the plan that scored best on a formula composed of four permissive components: compactness, respect for local political boundaries, partisan fairness, and competition. We evaluate how the legislature’s adopted plan compares to these plans on the advocates’ criteria and our alternative set of criteria, which reveals the degree to which the legislature placed partisanship over these other criteria. Our evaluation reveals minimal trade-offs among the components of the overall competition’s scoring criteria, but we caution that the scoring formula may be sensitive to implementation choices among its components. Compared to the legislature’s plan, the reform community can get more of the four criteria they value and, importantly, can do so without sacrificing the state’s only African-American opportunity congressional district.
Vast quantities of data about individuals are increasingly being created by services such as mobile apps and online social networks and through methods such as DNA sequencing. These data are quite rich, containing a large number of fine-grained data points related to human biology, characteristics, behaviors, and relationships over time.
. Washington, DC: Department of Health and Human Services; 2016.
This comment is informed by research with collaborators through the Privacy Tools for Sharing Research Data project at Harvard University. In this broad, multidisciplinary project, we are exploring the privacy issues that arise when collecting, analyzing, and disseminating research datasets containing personal information. Our efforts are focused on translating the theoretical promise of new measures for privacy protection and data utility into practical tools and approaches. In particular, our work aims to help realize the tremendous potential from social science research data by making it easier for researchers to share their data using privacy protective tools.
. Social Science Research Network [Internet]. 2016.
In general, the growth of big data sources has changed the threat landscape of privacy and statistics in at least three major ways. First, when surveys were initially founded as the principal source of statistical information, whether one participated in a survey was largely unknown. Now, as government record systems and corporate big data sources that include all or a large portion of a given universe are increasingly used, that privacy protection is eroded. Second, in the past, little outside information was generally available to match with published summaries. Now the ubiquity of auxiliary information enables many more inferences from summary data. Third, in the past, typical privacy attacks relied on linking outside data through well-known public characteristics -- PII or BII. Now, datasets can be linked through behavioral fingerprints. The current state of the practice in privacy lags well behind the state of the art in this area. Most commercial organizations, and most NSOs in other countries, continue to rely (at most) on traditional aggregation and suppression methods to protect privacy, with no formal analysis of privacy loss or of the utility of the information gathered. The U.S. Census Bureau, because of its size, institutional capacity, and strong reputation for privacy protection, could establish leadership in modernizing privacy practices.
Many libraries, archives, and museums are now regularly acquiring, processing, and analyzing born-digital materials. Materials exist on a variety of source media, including flash drives, hard drives, floppy disks, and optical media. Extracting disk images (i.e., sector-by-sector copies of digital media) is an increasingly common practice. It can be essential to ensuring provenance, original order, and chain of custody. Disk images allow users to explore and interact with the original data without risk of permanent alteration. These replicas help institutions to safeguard against modifications to underlying data that can occur when a file system contained on a storage medium is mounted, or a bootable medium is powered up. Retention of disk images can substantially reduce preservation risks. Digital storage media become progressively more difficult (or impossible) to read over time, due to “bit rot,” obsolescence of media, and reduced availability of devices to read them. Simply copying the allocated files off a disk and discarding the storage carrier, however, can be problematic. The ability to access and render the content of files can depend upon the presence of other data that resided on the disk. These dependencies are often not obvious upon first inspection and may only be discovered after the original medium is no longer readable or available. Disk images also enable a wide range of potential access approaches, including dynamic browsing of disk images (Misra S, Lee CA, Woods K (2014) A Web Service for File-Level Access to Disk Images. Code4Lib Journal, 25) and emulation of earlier computing platforms. Disk images often contain residual data, which may consist of previously hidden or deleted files (Redwine G, et al. in Born digital: guidance for donors, dealers, and archival repositories. Council on Library and Information Resources, Washington, 2013). Residual data can be valuable for scholars interested in learning about the context of creation.
Traces of activities undertaken in the original environment—for example, identifying removable media connected to a host machine or finding contents of browser caches—can provide additional sources of information for researchers and facilitate the preservation of materials (Woods K, et al. in Proceedings of the 11th annual international ACM/IEEE joint conference on digital libraries, pp. 57–66, 2011 ). Digital forensic tools can be used to create disk images in a wide range of formats. These include raw files (such as those produced by the Unix tool dd). Quantifying successes and failures for many tools can require judgment calls by qualified digital curation professionals. Verifying a checksum for a file is a simple case; the checksums either match or are different. In the events described in the previous sections, however, the conditions for success are fuzzier. For example, fiwalk will often “successfully” complete whether or not it is able to extract a meaningful record of the contents of file system(s) on a disk image. Likewise, bulk_extractor will simply report items of interest it has discovered. Knowing whether this output is useful (and whether it has changed between separate executions of a given tool) depends on comparison of the output between the two runs, information not currently recorded in the PREMIS document. In the BitCurator implementation, events are often recorded as having completed, rather than as having succeeded, to avoid ambiguity. Future iterations of the implementation may include more nuanced descriptions of event outcomes.
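The "simple case" of checksum verification, and the comparison of tool output between two runs, can be sketched as follows. This is a minimal illustration using Python's standard `hashlib` module, not part of the BitCurator implementation; the function names and the choice of SHA-256 are assumptions made for the example.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in chunks so that
    large disk images do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_matches(path, expected_hex):
    """True if the file's digest equals the recorded digest.

    The outcome is binary: the checksums either match or differ.
    """
    return sha256_of_file(path) == expected_hex.strip().lower()


def outputs_changed(first_run_path, second_run_path):
    """True if two saved tool outputs differ byte for byte.

    Hypothetical helper: it only detects that something changed,
    not whether the change is meaningful.
    """
    return sha256_of_file(first_run_path) != sha256_of_file(second_run_path)
```

A byte-level comparison like `outputs_changed` will also flag differences that do not matter, such as embedded timestamps, which is one reason judging the usefulness of tool output still calls for a qualified curator.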
Software is a critical part of modern research and yet there is little support across the scholarly ecosystem for its acknowledgement and citation. Inspired by the activities of the FORCE11 working group focused on data citation, this document summarizes the recommendations of the FORCE11 Software Citation Working Group and its activities between June 2015 and April 2016. Based on a review of existing community practices, the goal of the working group was to produce a consolidated set of citation principles that may encourage broad adoption of a consistent policy for software citation across disciplines and venues. Our work is presented here as a set of software citation principles, a discussion of the motivations for developing the principles, reviews of existing community practice, and a discussion of the requirements these principles would place upon different stakeholders. Working examples and possible technical solutions for how these principles can be implemented will be discussed in a separate paper.
This is a Comment on the Department of Health and Human Services (HHS) Proposed Rule: Federal Policy for the Protection of Human Subjects
We recognize the exciting research opportunities enabled by new data sources and technologies for collecting, analyzing, and sharing data about individuals. With the ability to collect and analyze massive quantities of data related to human characteristics, behaviors, and interactions, researchers are increasingly able to explore phenomena in finer detail and with greater confidence. At the same time, a major challenge for realizing the full potential of these recent advances will be protecting the privacy of human subjects. Approaches to privacy protection in common use in both research and industry contexts often provide limited real-world privacy protection. We believe institutional review boards (IRBs) and investigators require new guidance to inform their selection and implementation of appropriate measures for privacy protection in human subjects research. Therefore, we share many of the same concerns and rec
The claims and protests caused by the deterioration of the political elite during the last decade show, among other things, the urgency of strengthening the linkage between citizens and their representatives. From our perspective, the use of information technology, as well as the generation and use of open data, offers an opportunity to improve the levels of governance and democratic consolidation in Mexico. In this area, the delimitation of electoral boundaries is key to improving political representation. Given the technicalities surrounding boundary delimitation processes (geographical, statistical, and informatic, among the most recognizable), it is easy to fall into the temptation of relegating redistricting to specialists and lose sight of its importance for democracy. In this paper we discuss how new technologies can help bring the design, analysis, and study of electoral cartography in line with the international standards of open government. Additionally, we describe how an open source web-based platform, available to any citizen, has great potential for increasing the levels of participation, transparency, communication, and accountability surrounding the redistricting process in the country.