Many libraries, archives, and museums are now regularly acquiring, processing, and analyzing born-digital materials. These materials exist on a variety of source media, including flash drives, hard drives, floppy disks, and optical media. Extracting disk images (i.e., sector-by-sector copies of digital media) is an increasingly common practice, and it can be essential to ensuring provenance, original order, and chain of custody. Disk images allow users to explore and interact with the original data without risk of permanent alteration. These replicas help institutions to safeguard against modifications to underlying data that can occur when a file system contained on a storage medium is mounted, or a bootable medium is powered up. Retention of disk images can substantially reduce preservation risks. Digital storage media become progressively more difficult (or impossible) to read over time, due to “bit rot,” obsolescence of media, and reduced availability of devices to read them. Simply copying the allocated files off a disk and discarding the storage carrier, however, can be problematic. The ability to access and render the content of files can depend upon the presence of other data that resided on the disk. These dependencies are often not obvious upon first inspection and may only be discovered after the original medium is no longer readable or available.

Disk images also enable a wide range of potential access approaches, including dynamic browsing of disk images (Misra S, Lee CA, Woods K (2014) A Web Service for File-Level Access to Disk Images. Code4Lib Journal, 25) and emulation of earlier computing platforms. Disk images often contain residual data, which may consist of previously hidden or deleted files (Redwine G, et al. in Born digital: guidance for donors, dealers, and archival repositories. Council on Library and Information Resources, Washington, 2013). Residual data can be valuable for scholars interested in learning about the context of creation.
Traces of activities undertaken in the original environment—for example, identifying removable media connected to a host machine or finding contents of browser caches—can provide additional sources of information for researchers and facilitate the preservation of materials (Woods K, et al. in Proceedings of the 11th annual international ACM/IEEE joint conference on digital libraries, pp. 57–66, 2011 ). Digital forensic tools can be used to create disk images in a wide range of formats. These include raw files (such as those produced by the Unix tool dd). Quantifying successes and failures for many tools can require judgment calls by qualified digital curation professionals. Verifying a checksum for a file is a simple case; the checksums either match or are different. In the events described in the previous sections, however, the conditions for success are fuzzier. For example, fiwalk will often “successfully” complete whether or not it is able to extract a meaningful record of the contents of file system(s) on a disk image. Likewise, bulk_extractor will simply report items of interest it has discovered. Knowing whether this output is useful (and whether it has changed between separate executions of a given tool) depends on comparison of the output between the two runs, information not currently recorded in the PREMIS document. In the BitCurator implementation, events are often recorded as having completed, rather than as having succeeded, to avoid ambiguity. Future iterations of the implementation may include more nuanced descriptions of event outcomes.
Software is a critical part of modern research and yet there is little support across the scholarly ecosystem for its acknowledgement and citation. Inspired by the activities of the FORCE11 working group focused on data citation, this document summarizes the recommendations of the FORCE11 Software Citation Working Group and its activities between June 2015 and April 2016. Based on a review of existing community practices, the goal of the working group was to produce a consolidated set of citation principles that may encourage broad adoption of a consistent policy for software citation across disciplines and venues. Our work is presented here as a set of software citation principles, a discussion of the motivations for developing the principles, reviews of existing community practice, and a discussion of the requirements these principles would place upon different stakeholders. Working examples and possible technical solutions for how these principles can be implemented will be discussed in a separate paper.
This comment is informed by research with collaborators through the Privacy Tools for Sharing Research Data project at Harvard University. In this broad, multidisciplinary project, we are exploring the privacy issues that arise when collecting, analyzing, and disseminating research datasets containing personal information. Our efforts are focused on translating the theoretical promise of new measures for privacy protection and data utility into practical tools and approaches. In particular, our work aims to help realize the tremendous potential of social science research data by making it easier for researchers to share their data using privacy-protective tools.
In general, the growth of big data sources has changed the threat landscape for privacy and statistics in at least three major ways. First, when surveys were established as the principal source of statistical information, whether one participated in a given survey was largely unknown. Now, as government record systems and corporate big data sources that include all or a large portion of a given universe are increasingly used, that privacy protection is eroded. Second, in the past, little outside information was generally available to match against published summaries. Now the ubiquity of auxiliary information enables many more inferences from summary data. Third, in the past, typical privacy attacks relied on linking outside data through well-known public characteristics such as PII or BII. Now, datasets can be linked through behavioral fingerprints.
The current state of practice in privacy lags well behind the state of the art in this area. Most commercial organizations, and most national statistical offices (NSOs) in other countries, continue to rely (at most) on traditional aggregation and suppression methods to protect privacy, with no formal analysis of privacy loss or of the utility of the information gathered. The U.S. Census Bureau, because of its size, institutional capacity, and strong reputation for privacy protection, could establish leadership in modernizing privacy practices.
Vast quantities of data about individuals are increasingly being created by services such as mobile apps and online social networks and through methods such as DNA sequencing. These data are quite rich, containing a large number of fine-grained data points related to human biology, characteristics, behaviors, and relationships over time.
We recognize the exciting research opportunities enabled by new data sources and technologies for collecting, analyzing, and sharing data about individuals. With the ability to collect and analyze massive quantities of data related to human characteristics, behaviors, and interactions, researchers are increasingly able to explore phenomena in finer detail and with greater confidence. At the same time, a major challenge for realizing the full potential of these recent advances will be protecting the privacy of human subjects. Approaches to privacy protection in common use in both research and industry contexts often provide limited real-world privacy protection. We believe institutional review boards (IRBs) and investigators require new guidance to inform their selection and implementation of appropriate measures for privacy protection in human subjects research. Therefore, we share many of the same concerns and recommendations.
The claims and protests provoked by the deterioration of the political elite over the last decade show, among other things, the urgency of strengthening the linkage between citizens and their representatives. From our perspective, the use of information technology, as well as the generation and use of open data, offers an opportunity to improve the levels of governance and democratic consolidation in Mexico. In this area, the delimitation of electoral boundaries is key to improving political representation. Given the technicalities surrounding boundary delimitation processes (geographical, statistical, and computational, among the most recognizable), it is easy to fall into the temptation of relegating redistricting to specialists and to lose sight of its importance for democracy. In this paper we discuss how new technologies can be used to bring the design, analysis, and study of electoral cartography in line with international standards of open government. Additionally, we describe how an open-source, web-based platform, available to any citizen, has great potential to increase the levels of participation, transparency, communication, and accountability surrounding the redistricting process in the country.