Blog

Nov 15, 1:07pm

Amazon recently announced integration of its core S3 service with its low-cost storage system, Glacier. This makes it possible to add lifecycle rules to S3 (or its reduced-redundancy store) that archive objects to Glacier based on age, date, and S3 bucket prefix.
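
As a concrete sketch of what such a rule looks like (expressed with the boto3 AWS SDK for Python, a later SDK than was current when this was written; the bucket name, prefix, and 90-day threshold are hypothetical), one might configure a transition along these lines:

    import boto3  # AWS SDK for Python

    s3 = boto3.client("s3")

    # Hypothetical rule: archive objects under the "logs/" prefix to Glacier
    # once they are 90 days old.
    s3.put_bucket_lifecycle_configuration(
        Bucket="example-archive-bucket",
        LifecycleConfiguration={
            "Rules": [
                {
                    "ID": "archive-old-logs-to-glacier",
                    "Filter": {"Prefix": "logs/"},
                    "Status": "Enabled",
                    "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
                }
            ]
        },
    )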

Regular incremental improvement and integration are hallmarks of Amazon’s modus operandi for its cloud services: Amazon has a pattern of announcing updates every few weeks that add services, integrate existing services, or (occasionally) lower prices, and it has introduced incremental improvements to the AWS platform more than a dozen times since Glacier was announced at the end of August.

Interestingly, Glacier is an apt metaphor for this low-cost service in that it not only signifies “cold” storage, but also signals a massive object making its way slowly but inexorably across a space, covering everything in its path.

Why Glacier is Important

Why is Glacier important? First, as James Hamilton (disclosure: James is VP and Distinguished Engineer at Amazon) aptly summarizes, Glacier provides the volume economics of multi-site, replicated, cold (near-line) storage to small-scale and medium-scale users. While do-it-yourself solutions based on automated tape libraries can still beat Glacier’s price by a huge margin, the sweet spot for this approach has shifted so that only very large enterprises are likely to beat Glacier’s price by rolling out their own solutions using tape libraries and the like.

Second, many businesses, and also many services, are built upon or backed up through AWS and S3. Amazon’s continued integration of Glacier into AWS will make it increasingly straightforward to integrate low-cost cold-storage replication into preservation services such as DuraCloud, backup services such as Zmanda, and even into simple software tools like Cyberduck.

Overall, I’m optimistic that this is a Good Thing, and will improve the likelihood of meaningful future access to digital content. However, there are a number of substantial issues to keep in mind when considering Glacier as part of a digital preservation solution.

Issue 1. Technical infrastructure does not guarantee long-term durability

Although some commenters have stated that Glacier will “probably outlive us all”, these claims are based on little evidence. The durability of institutions and services relies as much upon economic models, business models, organizational models, and organizational mission as upon technology. Based on the history of technology companies, one must consider that there is a substantial probability that Amazon itself will not be in existence in fifty years, and the future existence of any specific Amazon service is even more doubtful.

Issue 2. Lock-in and future cost projections

As Wired dramatically illustrated, the costs of retrieving all of one’s data from Glacier can be quite substantial. Further, as David Rosenthal has repeatedly pointed out, the long-term cost-competitiveness of preservation services depends “not on their initial pricing, but on how closely their pricing tracks the Kryder’s Law decrease in storage media costs”. He adds that “It is anyone’s guess how quickly Amazon will drop Glacier’s prices as the underlying storage media costs drop.” The importance of this future price uncertainty is magnified by the degree of lock-in exhibited by the Glacier service.

Issue 3. Correlated failures

Amazon claims a ‘design reliability’ of ‘99.999999999%’. This appears to be an extremely optimistic number without any formal published analysis backing it. The number appears to be based on a projection of theoretical failure rates for storage hardware (and such rates are wildly optimistic under production conditions), together with the (unrealistic) assumption that all such failures are statistically independent. Moreover, this ‘design reliability’ claim is unsupported (at time of writing) by Glacier’s terms of service, SLA, or customer agreement. To the contrary, the agreements appear to indemnify Amazon against any loss or damage, do not appear to offer a separate SLA for Glacier, and limit recovery under existing SLAs (for services such as S3) to a refund of fees for periods the service was unavailable. If Amazon were highly confident in the applicability of the quoted ‘design reliability’ to production settings, one might expect a stronger SLA. Despite these caveats, my guess is that Glacier will still turn out to be, in practice, substantially more reliable than the DIY solutions that most individual organizations can afford to implement entirely in-house.
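
A back-of-the-envelope sketch shows why the independence assumption matters so much (all numbers here are purely hypothetical and are not Amazon's): with three independent replicas each having an annual loss probability of one in a thousand, the chance of losing all three is one in a billion; but if a single correlated failure mode can destroy every copy at once with probability one in ten thousand, that mode dominates and the effective durability is several orders of magnitude worse.

    # Hypothetical illustration of independent vs. correlated failure modes.
    p_replica = 1e-3          # assumed annual loss probability of a single replica
    n_replicas = 3

    p_independent = p_replica ** n_replicas   # 1e-09: the optimistic "many nines" calculation
    p_correlated = 1e-4                       # assumed rate of a failure hitting all copies at once

    # Probability of loss in a year if either mode can occur.
    p_loss = p_correlated + (1 - p_correlated) * p_independent
    print(p_independent)   # 1e-09
    print(p_loss)          # ~1e-04, dominated by the correlated mode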

Nevertheless, as previously discussed (most recently at Digital Preservation 2012), a large part of risk mitigation for digital assets is to diversify against sources of correlated failure. Although implementation details have not been fully disclosed, Glacier does appear to diversify against some common risks to bits: primarily media failure, hardware failure, and localized natural disaster (such as fire or flood). This is good, but far from complete. A number of likely single-point (or highly correlated) vulnerabilities remain, including software failure (e.g., a bug in the AWS software for its control backplane might result in permanent loss that would go undetected for a substantial time, or cause other cascading failures analogous to those we’ve seen previously); legal threats (leading to account lock-out, such as this, or to deletion or content removal); and other institutional threats (such as a change in Amazon’s business model). It is critical that diversification against these additional failures be incorporated into a digital preservation strategy.

Preliminary Recommendations

To sum up, Glacier is an important service, and appears to be a solid option for cold storage, but institutions that are responsible for digital preservation and long-term access should not use the quoted design reliability in modeling likelihood of loss, nor rely on Glacier as the sole archival mechanism for their content.


Nov 14, 5:04pm

Lately, our DistrictBuilder software, a tool that allows people to easily participate in creating election districts, has gotten some additional attention. We recently received an Outstanding Software Development Award from the American Political Science Association (given by the Information Technology & Politics Section) and a Data Innovation Award given by the O’Reilly Strata Conference (for data with social impact). And just last week, we had the opportunity to present our work to the government of Mexico at the invitation of the Instituto Federal Electoral, as part of their International Colloquium on Redistricting.

During this presentation, I was able to reflect on the interplay of algorithms and public participation, and it became even clearer to me that applications like DistrictBuilder exemplify the ability of information science to improve policy and politics.

Redistricting in Mexico is particularly interesting, since it relies heavily on facially neutral geo-demographic criteria and optimization algorithms, which represents a different sort of contribution from information science. Thus, I found it especially valuable to consider the interplay between algorithmic approaches to problem solving and “wisdom of the crowd” approaches, particularly for problems in the public sphere.

It’s clear that complex optimization algorithms are an advance in redistricting in Mexico, and have an important role in public policy. However, they also have a number of limitations:

  • Algorithmic optimization solutions often depend on a choice of (theoretically arbitrary) ‘starting values’ from which the search for a solution begins (see the toy sketch following this list).
  • Quality algorithmic solutions typically rely on accurate input data.
  • Many optimization algorithms embed particular criteria or particular constraints into the algorithm itself.
  • Even where optimization algorithms are nominally agnostic to the criteria used for the goal, some criteria are more tractable than others; and some are more tractable for particular algorithms.
  • In many cases, when an algorithm yields a solution, we don’t know exactly (or even approximately, in any formal sense) how good that solution is.
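
As a toy illustration of the first limitation above, the following sketch (purely hypothetical: a one-dimensional “score” function stands in for a districting criterion) shows a simple hill-climbing search converging to different local optima from different starting values:

    # Toy illustration: hill-climbing results depend on the starting value.
    # The "score" function is a stand-in for a districting criterion; it has
    # two local optima, near x = -1 and x = +2.

    def score(x):
        return -(x + 1) ** 2 * (x - 2) ** 2

    def hill_climb(x, step=0.01, iters=10_000):
        for _ in range(iters):
            best = max((x - step, x, x + step), key=score)
            if best == x:
                break
            x = best
        return x

    for start in (-3.0, 0.5, 3.0):
        print(start, "->", round(hill_climb(start), 2))
    # Different starting values converge to different local optima.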

I argue that explicitly incorporating a human element is important for algorithmic solutions in the public sphere. In particular:

  • Use open documentation and open (non-patented or openly licensed) methods to enable external replication of algorithms.
  • Use open source to enable external verification of the implementation of particular algorithms.
  • Incorporate public input to improve the data (especially data describing local communities and circumstances) used in algorithm-driven policies.
  • Incorporate crowd-sourced solutions as candidate “starting values” for further algorithmic refinement.
  • Subject algorithmic output to crowd-sourced public review to verify the quality of the solutions produced.

You can see the slides, which include more detail and references, below. For more, refer to our PublicMapping project site.


Oct 25, 5:40pm

The workshop report from the UNC Curating for Data Quality workshop, in which I was delighted to participate, is now being made available. It contains many perspectives addressing a number of questions:

Data Quality Criteria and Contexts. What are the characteristics of data quality? What threats to data quality arise at different stages of the data life cycle? What kinds of work processes affect data quality? What elements of the curatorial process most strongly affect data quality over time? How do data types and contexts influence data quality parameters?

Human and Institutional Factors. What are the costs associated with different levels of data quality? What kinds of incentives and constraints influence efforts of different stakeholders? How does one estimate the continuum from critical to tolerable errors? How often does one need to validate data?

Tools for Effective and Painless Curation. What kinds of tools and techniques exist or are required to ensure that creators and curators address data quality?

Metrics. What are or should be the measures of data quality? How does one identify errors? How does one correct errors or mitigate their effects?

My current perspective, after reflecting on seven ‘quality’ frameworks from different disciplines that differ in complex and deep ways, is that the data quality criteria implied by the candidate frameworks are neither easily harmonized nor readily quantified. Thus, a generalized systematic approach to evaluating data quality seems unlikely to emerge soon. Fortunately, developing an effective approach to digital curation that respects data quality does not require a comprehensive definition of data quality. Instead, we can appropriately address “data quality” in curation by limiting our consideration to a narrower set of applied questions:

Which aspects of data quality are (potentially) affected by (each stage of) digital curation activity? And how do we keep data quality properties invariant at each curation stage?

A number of approaches seem particularly likely to bear fruit:

  1. Incorporate portfolio diversification in selection and appraisal.
  2. Support validation of preservation quality attributes such as authenticity, integrity, organization, and chain of custody throughout long-term preservation and use — from ingest through delivery and creation of derivative works.
  3. Apply semantic fingerprints for quality evaluation during ingest, format migration and delivery.
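
As a minimal sketch of the second of these (integrity validation; the file name here is hypothetical, and any strong cryptographic digest would serve), fixity information recorded at ingest can be re-verified after each later curation stage, such as format migration or delivery:

    import hashlib

    def fixity(path, algorithm="sha256"):
        # Compute a fixity value (cryptographic digest) for a file.
        h = hashlib.new(algorithm)
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return h.hexdigest()

    # Hypothetical workflow: record fixity at ingest, re-check at later stages.
    ingest_manifest = {"study-0001/data.csv": fixity("study-0001/data.csv")}

    def verify(manifest):
        return {path: fixity(path) == digest for path, digest in manifest.items()}

    # After migration, or before delivery:
    print(verify(ingest_manifest))   # {"study-0001/data.csv": True} if unchanged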

These approaches have the advantage of being independent of the content subject area, of the domain of measurement, and of the particular semantic content of objects and collections. Therefore, they are broadly applicable. By mitigating these broad-spectrum threats to quality, we can improve the overall quality of curated collections, and their expected value to target communities.

My extended thoughts are here:

You may also be interested in the other presentations from the workshop, which are posted on the Conference Site.


Oct 23, 12:37pm

I was pleased to participate in the 2012 PLN Community Meeting.

Over the last decade, replication has become a required practice for digital preservation. Now, Distributed Digital Preservation (DDP) networks are emerging as a vital strategy to ensure long-term access to the scientific evidence base and cultural heritage. A number of DDP networks are currently in production, including CLOCKSS, Data-PASS, MetaArchive, COPPUL, LuKII, PeDALS, Synergies, and DataONE, and new networks, such as DFC and DPN, are being developed.

These networks were created to mitigate the risk of content loss by diversifying across software architectures, organizational structures, geographic regions, as well as legal, political, and economic environments. And many of these networks have been successful at replicating a diverse set of content.

However, the point of the replication enterprise is recovery. Archival recovery is an even harder problem because one needs to validate not only that a set of objects is recoverable, but also that the recovered collection contains sufficient metadata and contextual information to remain interpretable! The difficulty of this was demonstrated by the AIHT exercise sponsored by the Library of Congress, in which many collections thought to be substantially “complete” could not be successfully re-ingested (i.e., recovered) by another archive, even in the absence of bit-level failures.

In a presentation co-authored with Jonathan Crabtree, we summarized some lessons learned from trial audits of several production distributed digital preservation networks. These audits were conducted using the open source SafeArchive system (www.safearchive.org), which enables automated auditing of a selection of TRAC criteria related to replication and storage. Analysis of the trial audits demonstrates the complexities of auditing modern replicated storage networks, and reveals common gaps between archival policy and practice, as well as gaps in the auditing tools we have available. Our presentation, below, focused on the importance of designing auditing systems to provide diagnostic information that can be used to investigate non-confirmations of audited policies. Tom Lipkis followed with specific planned and possible extensions to LOCKSS that would enhance diagnosis and auditing.

You may also be interested in the other presentations from the workshop, which are posted on the PLN2012 Website.


Sep 27, 2:27pm

I was pleased to participate in the NISO Forum on Tracking it Back to the Source: Managing and Citing Research Data.

A principled approach to data management involves modeling information through the lifecycle to assess stakeholder requirements at each stage, and then tracking management, use and impact of that information.

One of the complexities that lifecycle modeling reveals is the variety of different goals that are associated with data management – including orchestrating data for current use; protecting against disclosure; complying with contracts, regulation, law and policy; maximizing the overall value of held information assets; and ensuring short and long-term dissemination.

The most challenging aspects of data management are often associated with management across stages and among different actors. A number of tools and methods provide leverage, including:

  • Identifier systems – identification of information objects and actors, and structured use of these identifiers (identifiers, references, citations).
  • Metadata and tools (such as RCS or VCS) for tracking provenance — the relationship of delivered data to the history of inputs and modifications, and the actors responsible for these (see the sketch after this list).
  • Systems and methods for validating authenticity and chain of custody — assertions about the provenance and ownership (respectively) of information.
  • Systems and methods for auditing — verification of asserted system properties and policy compliance.
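
As a minimal sketch of the provenance-tracking item above (the field names, actor, and data below are invented for illustration and do not follow any particular metadata standard), each modification of a dataset can be recorded as a link from input fixity values to an output fixity value, along with the responsible actor and action:

    import hashlib, json, datetime

    def digest(data: bytes) -> str:
        return hashlib.sha256(data).hexdigest()

    def provenance_record(inputs, output, action, actor):
        # One link in a provenance chain: which inputs, changed by whom, producing what.
        return {
            "timestamp": datetime.datetime.utcnow().isoformat() + "Z",
            "actor": actor,
            "action": action,
            "inputs": [digest(d) for d in inputs],
            "output": digest(output),
        }

    raw = b"id,income\n1,50000\n2,62000\n"
    cleaned = raw.replace(b"62000", b"61000")  # hypothetical correction

    record = provenance_record([raw], cleaned, action="corrected row 2", actor="analyst:jdoe")
    print(json.dumps(record, indent=2))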

The presentation further examines data management and data citation from an information lifecycle approach:

You may also be interested in the other presentations from the workshop, which will soon be posted on the NISO Forum website.


Jul 31, 11:10am

My colleagues from the National Digital Stewardship Alliance working group on Infrastructure and I were pleased to lead a session on bit-level preservation at the 2012 annual Digital Preservation conference, hosted by the Library of Congress.

Bit-level preservation is far from a solved problem. While we know roughly what transformations and processes can be used to mitigate risk from major threats, there is considerable (and largely applied) research to be done to determine optimal and cost-effective levels of, and strategies for, replication, diversification, compression, and auditing; and to develop better (more reliable and valid) measures of risk.

The talk summarized the major risk factors and mitigation strategies, and noted some inter-relationships:

You may also be interested in other presentations from the conference. Bill LeFurgy has an informative blog post with highlights.

David Weinberger’s talk was particularly provocative, drawing on themes from his recent book Too Big to Know. He claims, essentially, that the increase in data, and even more the networking of information and people, changes the nature of knowledge itself. Knowledge has historically consisted of a series of stopping points (authoritative sources, authoritative texts); a corresponding set of institutions and practices to “filter out” bad information; and a physical representation constrained by the form, length, and relative unalterability of printed books. Now, Weinberger claims, knowledge is increasingly a process of filtering forward: of providing summaries and links to a much larger and more dynamic knowledge base. This redefines knowledge, changes the role of institutions (which cannot hope to contain all knowledge in an area), and implies that (a) filters are content, and (b) we are forced into awareness of the contingent and limited nature of filters, since there is always bad information and contradictory information available. Changes in knowledge also change the nature of expertise and science, both of which become less hierarchical, more diverse, and more linked.

If Weinberger is right, and I suspect he is (in large part), there are undiscussed implications for digital preservation. First, our estimates of the expected value of long-term access should go up if the overall value of knowledge is increased by the total context of knowledge available. Second, we need to go beyond preserving individual information objects, or even “complete” collections: value resides in the network as a whole, and in the filters being used. Maintaining our cultural heritage and scientific evidence base requires enabling historical access to this dynamic network of information.


Jun 11, 3:25pm

Since knowledge is not a private good, a pure market approach leads to under-provisioning. Planning for access to the scholarly record should include planning for long-term access beyond the life of a single institution. Important problems in scholarly communications, information science & scholarship increasingly require diverse multidisciplinary approaches.
My colleagues Lynne Herndon, Amy Brand and I were honored to be able to discuss the future of scholarly communication at  Georgetown University’s annual Scholarly Communication Symposium.

The video is below:

My slides are also available:


Mar 12, 10:08pm

NIH posted a request for information. The following is a response developed in collaboration with the Data-PASS partner organizations.

Response to request for information

Dr. Micah Altman
Director of Research — MIT Libraries, Massachusetts Institute of Technology
Head/Scientist, Program for Information Science
Non-Resident Senior Fellow, The Brookings Institution

Libbie Stephenson
Distinguished Librarian
Director, UCLA Social Science Data Archive

Writing on behalf of the Data Preservation Alliance for Social Sciences (http://data-pass.org)

INTRODUCTION

Thank you for the opportunity to submit comments for input into the deliberation of the committee.

The Data Preservation Alliance for the Social Sciences (http://Data-PASS.org) is a broad-based voluntary partnership of data archives dedicated to acquiring, cataloging and preserving social science data, and to developing and advocating best practices in digital preservation. The partners collaborate to acquire data at risk of being lost to the research community; to develop preservation and data sharing practices; and to create open infrastructure for collaborative data indexing, sharing, and preservation.

Collectively, the founding partners have over 200 years of combined experience in social science data sharing. These partners include the Inter-university Consortium for Political and Social Research, The Roper Center for Public Opinion Research, The Howard W. Odum Institute for Research in Social Science, the Electronic Records Section in the National Archives and Records Administration’s Research Services – Archival Operations, Washington, DC (RD-DC), the Institute for Quantitative Social Sciences at Harvard University (which contains both the Harvard-MIT Data Center and the Henry A. Murray Archive), and the Social Science Data Archive at the University of California, Los Angeles (UCLA).

Thus far, the partnership has identified thousands of at-risk research studies (collections of data) and acquired many of these for permanent preservation. These range from data collections created under NSF (National Science Foundation) and NIH (National Institutes of Health) grants, to surveys conducted by private research organizations, to state-level polling data, to data records created by governmental research or administrative programs. [Gutmann, et al, 2009]

A National Digital Stewardship Alliance Founding Member, the Data-PASS partnership works to archive social science data collections at risk of being lost; to catalog and promote access to data collections; to establish verifiable multi-institutional collaborative replication and stewardship of data; and to develop and advocate best practices in digital preservation.

EMERGING STANDARDS AND PRACTICES FOR SHARING AND MANAGING DATA

The task force may wish to take note of broad-based and thoughtful commentary on data sharing that has emerged from the research community, including the following:

  • The National Science Board’s draft report on Digital Research Data Sharing and Management [NSB 2011], which emphasizes the value of open access to data, and identifies key challenges to promoting wide access.
  • The NRC’s recent report on Communicating Science and Engineering Data in the Information Age, which develops a number of recommendations that, although directed at NCSES, are readily applicable to research data management, publication, and dissemination in general. Specifically, recommendations 3-1, 3-2, 3-3, and 3-4 together represent general good practice for data management and publication. Published results are more reliable when the underlying data is available, and when management of that data incorporates versioning, open formats and protocols, machine-actionable metadata, and systematic tracking of information provenance and modification from initial data collection through subsequent publications. [NRC 2011]
  • Numerous responses to the recent ANPRM on proposed changes to the common rule noted how overreaching and unsophisticated approaches to confidentiality can drastically erode the value of data sharing. Notably, two responses by data privacy and computer science researchers provide a roadmap for simultaneously increasing data sharing and privacy protections by leveraging advances in theoretical computer science, and by establishing mechanisms for accountability and transparency. [Sweeney, et al., 2010; Vadhan, et al. 2010]
  • Numerous responses to the recent OSTP Request for Information on Public Access to Digital Data Resulting from Federally Funded Scientific Research [OSTP 2011], which comment on the benefits of data access and draw attention to needs, protocols, and standards for open data access and interoperability. Notably, the responses of the National Digital Stewardship Alliance, the Data Preservation Alliance for the Social Sciences, Carnegie Mellon University, the University of California Libraries, and the Inter-university Consortium for Political and Social Research reference the need for, and successful exemplars of, community-based standards for open data dissemination, discovery, and preservation.


DIRECTED RESPONSES

RESEARCH INFORMATION LIFECYCLE


In his work on research data lifecycles, Charles Humphrey [2004] has provided an overview of this process applicable to a variety of research disciplines. “Life cycle models are shaping the way we study digital information processes. These models represent the life course of a larger system, such as the research process, through a series of sequentially related stages or phases in which information is produced or manipulated.“ Further, “well organised, well documented, preserved and shared data are invaluable to advance scientific inquiry and to increase opportunities for learning and innovation.”

At each stage of a research lifecycle, from when the project is first designed, through to data collection, analysis and publication, knowledge about the research and data is created. When the data can be shared for re-use and re-purposing, the relationship among the stages enables linkages among disparate data points to come to new understandings, new conclusions, and new ways of visualizing data relationships. When considering the management, integration, and analysis of large biomedical datasets, the roles and responsibilities of researchers and data management organizations need to be determined. Ideally, this should focus on documenting the stages in the research lifecycle, including:

  • Design of a research project
  • Data collection processes and instruments
  • Data organization in digital format
  • Documentation of data analysis process
  • Publication or sharing of results
  • Dissemination, sharing, and reuse
  • Preservation, long term conservation, and long term access


Bernstein, et al. [2011] suggest that any approach to documenting the research data life cycle must take into account gaps in the transfer of information about the data. An overlapping approach that looks toward the short- and long-term future should incorporate methods for catching information gaps: “There is no simple way to reach the goal of maintaining high volumes of important data in heterogeneous environments over many decades. Not only must a cross-generational communication system reach from the present to the future half a century from now, but from many points in time in the near future to many points in time in the distant future. No single current medium, no single current file format specification is likely to be sufficiently robust and adaptable to survive without major changes over half a century.”

The best approach is to consider overlapping or collaborative data management solutions. Examples of this approach are reflected in the work of the MaDAM project recently concluded in the UK [Poschen 2010]. This project aimed to develop tools and an infrastructure for life-cycle management of biomedical data. Shahand and colleagues [2011] have addressed research life cycles in bioinformatics and have carefully examined roles and responsibilities “to support a wide spectrum of user profiles, with different expertise and requirements.” Their work demonstrates that a “service-oriented architecture” will best support each of the phases of the research lifecycle and will enable the use and reuse of data by users with varying backgrounds.

CHALLENGES/ISSUES FACED BY THE EXTRAMURAL COMMUNITY – A SOCIAL SCIENCE PERSPECTIVE


Social scientists are increasingly using biomedical data in research. Hauser et al. [2010] have described “a growing tendency for social scientists to collect biological specimens such as blood, urine, and saliva as part of large-scale household surveys.” By combining social and behavioral measures with biological measures, researchers are able to answer questions and make new connections in their field of inquiry. As they note, “it becomes possible, for example, to estimate the distribution of a particular genetic variant within a representative sample of the general population and to correlate genetic variations with differences in human phenotypes.” [Hauser, 2010]

A number of government surveys, such as the National Health and Nutrition Examination Survey, collect biological samples, and the measures are recorded in resulting public-use statistical files. However, conducting surveys in which biological specimens are gathered presents new challenges for the individual social scientist; some are financial, legal, or ethical in nature, but others have to do with archiving and sharing data. Some areas have received a fair amount of attention, built on the experience gained by larger-scale non-governmental research projects, such as the Study of Women’s Health Across the Nation (SWAN). In terms of policies and suggested procedures, issues around gaining access to, collecting, storing, and using biomedical specimens have been addressed somewhat. However, the challenges in protecting privacy and confidentiality, informed consent, and data sharing are complex and will benefit from collaborative efforts to find solutions to problems faced by all researchers regardless of methodological approach or discipline.

Some best practices have emerged through experience. There are numerous documents describing how specimens should be handled in laboratories, and university offices for protection of human subjects have detailed required protocols. Professional societies have also developed recommendations regarding how to organize and manage bio-data repositories, such as the ISBER 2008 document Best Practices for Repositories: Collection, Storage, Retrieval and Distribution of Biological Materials for Research. However, as Hauser [2010] states, while the best practices so far developed are helpful, “they do not address questions that are most closely related to the design of the research itself, such as choice of biospecimen (e.g., blood, urine, saliva), choice of biomarker, and choice of assay”. Further, “the data archive (i.e., the collection of data derived from the specimens, as well as the data from the survey) is likely to be maintained separately from the specimens themselves, while documentation about the specimen collection and survey protocols may be archived in yet another location.” Because there are these multiple storage and access sites, social scientists face challenges in ensuring that all parts of a research project are equally preserved, secure and managed for the long term.

STANDARDS/PRACTICES FOR ACKNOWLEDGMENT OF THE USE OF DATA


Any information that is essential for the full understanding of a published work should be recognized as an essential part of the scholarly record. Where such information is not directly incorporated as an integral part of the publication itself, this integral material should be cited as evidence. And all information that is cited should be accessible to the scientific community.

As Altman & King [2007] point out, omitting data citation, or using ad-hoc footnotes, local id numbers, or other schemes, threatens the integrity of the scientific record:

The data cited [in an ad-hoc way, lacking minimal standards] may no longer exist, may not be available publicly, or may have never been held by anyone but the investigator. Data listed as available from the author are unlikely to be available for long and will not be available after the author retires or dies. Sometimes URLs are given, but they often do not persist. In recent years, a major archive renumbered all its acquisitions, rendering all citations to data it held invalid; identical data was distributed in different archives with different identifiers; data sets have been expanded or corrected and the old data, on which prior literature is based, was destroyed or renumbered and so is inaccessible; and modified versions of data are routinely distributed under the same name, without any standard for versioning. Copyeditors have no fixed rules, and often no rules whatsoever. Data are sometimes listed in the bibliography, sometimes in the text, sometimes not at all, and rarely with enough information to guarantee future access to the identical data set. Replicating published tables and figures, even without having to rerun the original experiment, is often difficult or impossible.

Science provides a succinct statement of this evidential principle in its General Information for Authors: “Citations to unpublished data and personal communications cannot be used to support claims in a published paper.” and “All data necessary to understand, assess, and extend the conclusions of the manuscript must be available to any reader of Science.” [Science 2011] Citation standards for data have been recently adopted by the American Sociological Association, the OECD, and over 20 institutional members of the DataCite coalition, and are emerging as a best practice in publishing. [Alter 2012; NDSA 2012]

As Alter [2012] notes:

Scientists who create digital data have a right to expect their contributions to be recognized through citations in publications based on those data. Citation has been the standard way of recognizing original scholarship for hundreds of years. As we noted above, academic careers are measured by citations, and proper citation of data would credit data producers for the impact of their work on science. Citations can also be linked to funding sources (e.g., grant numbers) in ways that can be captured to measure the impact of Federal investments on scientific productivity.

And also:

Assigning proper citations and persistent identifiers to data resources is critical to enabling reuse and verification of data, understanding and tracking the impact of research data, and creating a structure that recognizes and rewards data producers for their contributions to the scientific record. Many data archives and repositories now provide citations that should be used in publications based on the data, and many are also registering persistent identifiers for the data they manage. Data citations permit data to be integrated into the system of scholarly communications and to be picked up by the electronic citation services so that data usage can be tracked.

Citations themselves need not be complex. Altman-King [2007] provide a set of minimal citation elements, most of which have been incorporated in subsequent data citation approaches, and which ensure the reliability of the citation. A citation should include the following elements: author (or authoring entity), title (possibly a generic title), a date (or formal database version, if available), a persistent identifier (such as a DOI), and some form of fixity information (that can be used to validate data retrieved later).
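
For example (the author, title, identifier, and fixity value below are invented purely for illustration), a citation containing these elements might look like:

    Doe, Jane. 2011. “Example Household Survey, Wave 3” [hypothetical dataset], Version 2,
    doi:10.0000/FAKE-EXAMPLE (persistent identifier), UNF:5:examplefixityvalue== (fixity information).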

And publication of the citations is also straightforward. Citations to data should be treated as first-class references, in the same manner as citations to other publications. Authors should acknowledge the use of data by including citations to the data in the references section of their papers (where citations to other publications are also recorded). Treating data citations as first-class references provides attribution and recognition of the importance of data as an intellectual product. Journals and other publishers should require data citation as a prerequisite for publication, just as they require proper citation of other publications referenced as evidence. And cited data should be held to similar standards of durability and accessibility as other cited works integral to the understanding and reproduction of the published results.

Indeed, since science is not merely about behaving scientifically, but also requires a community of scholars competing and cooperating to pursue common goals, scholarly citation of data can be viewed as an instantiation of a central feature of the whole enterprise.


WAYS TO IMPROVE THE EFFICIENCY OF DATA ACCESS REQUESTS


To improve the efficiency of responses to data access requests, data sharing needs to be built into the research and publication workflow — and not treated as a supplemental activity to be performed after the research project has been largely completed. Over two decades ago, the path-breaking report Sharing Research Data, by the Committee on National Statistics [1985], identified this and related requirements in its core recommendations:

  • Recommendation 1: Sharing data should be a regular practice.
  • Recommendation 2: Investigators should share their data by the time of publication of initial major results of analyses of the data except in compelling circumstances.
  • Recommendation 3: Data relevant to public policy should be shared as quickly and widely as possible.
  • Recommendation 4: Plans for data sharing should be an integral part of a research plan whenever data sharing is feasible.


Increasingly, these recommendations have been recognized by data management requirements and policies. However, all too often these recommendations are not followed in practice. Strengthening the implementation of recommendation two, which requires simultaneous publication of data along with derived research results, so that policies include reporting and auditing, would likely yield substantial improvements in practice. The data citation standards and practices described above could readily be utilized to improve systematic reporting and tracking in this area.

In addition, inconsistent and unsophisticated treatment of information confidentiality and security has become a major stumbling block to efficient access to and use of research data. A series of reports by the National Research Council [2005, 2007, 2009, 2010] has reinforced the following key points:

  • One size does not fit all — multiple modes of access are needed to confidential data:
    “Recommendation 2: Data produced or funded by government agencies should continue to be made available for research through a variety of modes, including various modes of restricted access to confidential data and unrestricted access to public-use data altered in a variety of ways to maintain confidentiality.” [NRC 2005]
  • The complexity, detail, richness, and temporal extent of new forms of scientific data — such as geospatial traces (linked social-spatial data), long-term longitudinal studies, social networks, and rich genomic data — create significant uncertainties about the ability of traditional ‘anonymization’ methods and standards to protect the confidentiality of research participants.
  • The approach taken by the HIPAA Privacy Rule, which emphasizes ‘deidentification’ through suppression of data values, is a poor fit for much data, even within health research. The HIPAA approach is neither necessary nor sufficient to protect confidentiality. Moreover, the data suppression techniques used by the approach can severely impair the utility of the data, and lead to biased research results.


Furthermore, numerous responses to the recent ANPRM on proposed changes to the common rule, including extensively researched responses from Harvard [Barnes 2011], from 22 leading research organizations [COSSA 2011] and from leading computer scientists and privacy scientists [Sweeney 2011; Vadhan 2011] noted how overreaching and unsophisticated approaches to confidentiality can drastically erode the value of data sharing and reuse.

These responses to the recent ANPRM emphasized the points made in the NRC reports above:

  • Multiple modes of access to confidential information should be provided, including access to the original data through data enclaves or under restricted-use agreements.
  • The HIPAA approach, which emphasizes deidentification through removal of enumerated categories of information, is a poor fit for social science and behavioral research.


These responses emphasize that treatment of privacy risks requires a nuanced approach. Like treatment of other risks to subjects, treatment of privacy risks should be based on a scientifically informed analysis that includes the likelihood of such risks being realized, the extent and type of the harms that would result from realization of those risks, the availability and efficacy of technical and computational/statistical methods to mitigate risks, and the availability of legal remedies. The responses by data privacy and computer science researchers [Sweeney, et al., 2010; Vadhan, et al. 2010] provide a roadmap for simultaneously increasing data sharing and privacy protections by leveraging advances in theoretical computer science; creating legal mechanisms for accountability and transparency; and establishing a task force of privacy experts that would develop and update safe-harbor rules that would apply to emerging forms of data and disclosure limitation methods.

REFERENCES

Alter, G. 2012. Response to RFI: “Public Access to Digital Data Resulting From Federally Funded Scientific Research”, Office of Science and Technology Policy (response on behalf of the Inter-university Consortium for Political and Social Research). Available from: http://www.data-pass.org/sites/default/files/ICPSR%20Response%20to%20RFI%20Public%20Access%20to%20data.pdf

Altman, M., & King, G. 2007. “A Proposed Standard for the Scholarly Citation of Quantitative Data”. D-Lib Magazine, 13(3/4), Available from: http://www.dlib.org/dlib/march07/altman/03altman.html

Barnes 2011. Re: Human Subject Research Protections: Enhancing Protections for Research Subjects and Reducing Burden, Delay and Ambiguity for Investigators, Federal Register Vol 76, No. 143, July 26, 2011 (response on behalf of Harvard University). Available from: http://dataprivacylab.org/projects/irb/HarvardUniversity.pdf

Bernstein, H. J., Folk, M. J., Benger, W., Dougherty, M. T., Eliceiri, K. W. and Schnetter, E. (2011). Communicating Scientific Data from the Present to the Future. Dowling College position paper. Temporary URL: http://www.columbia.edu/~rb2568/rdlm/Bernstein_Dowling_RDLM2011.pdf

COSSA 2011. Social and Behavioral Science White Paper on Advance Notice of Proposed Rulemaking (ANPRM), Federal Register 44512-531 (July 26, 2011); ID Docket HHS-OPHS-2011-0005. Available from: http://dataprivacylab.org/projects/irb/COSSA.pdf

Humphrey, C. & Hamilton, E. 2004. “Is it Working? Assessing the Value of the Canadian Data Liberation Initiative.” Bottom Line, Vol. 17 (4), pp. 137-146. Available from: http://datalib.library.ualberta.ca/~humphrey/lifecycle-science060308.doc

Data-PASS 2011. “Response to Office of Science and Technology Policy Request for Information on Public Access to Digital Data Resulting from Federally Funded Scientific Research”. Available from: http://www.data-pass.org/sites/default/files/datapass-otsp-rfi-response.pdf

NDSA 2011. “Response to Office of Science and Technology Policy Request for Information on Public Access to Digital Data Resulting from Federally Funded Scientific Research”. Available from: http://digitalpreservation.gov/documents/NDSA_ResponseToOSTP.pdf

National Research Council. 2005. Expanding access to research data: Reconciling risks and opportunities. Washington, DC: The National Academies Press.

National Research Council. 2007. Putting people on the map: Protecting confidentiality with linked social­spatial data. Washington, DC: The National Academies Press.

National Research Council. 2009. Beyond the HIPAA privacy rule: enhancing privacy, improving health through research. Washington, DC: The National Academies Press.

National Research Council. 2010. Conducting biosocial surveys: Collecting, storing, accessing, and protecting biospecimens and biodata. Washington, DC: The National Academies Press.

NRC. 2011. Communicating Science and Engineering Data in the Information Age. National Academies Press. Available from: http://www.nap.edu/catalog.php?record_id=13282

NSB 2011. Digital Research Data Sharing and Management. (Draft) NSB-11-17. Available from: http://www.nsf.gov/nsb/publications/2011/nsb1124.pdf

Science staff. 2011. “General Information for authors.” Available from: http://www.sciencemag.org/site/feature/contribinfo/prep/gen_info.xhtml

OSTP. 2011. “Public Access to Digital Data: Public Comments”. Available from: http://www.whitehouse.gov/administration/eop/ostp/library/digitaldata

Poschen, M., et al. 2010. “User-Driven Development of a Pilot Data Management Infrastructure for Biomedical Researchers.” Available from: https://www.escholar.manchester.ac.uk/api/datastream?publicationPid=uk-ac-man-scw:117518&datastreamId=FULL-TEXT.PDF

Shahand, S., et al. 2011. “Front-ends to Biomedical Data Analysis on Grids.” Available from: http://www.bioinformaticslaboratory.nl/twiki/pub/EBioScience/EBioinfraGateUserDoc/ebioinfragate.pdf

Sweeney, L. , et al. 2010. “Comments from Data Privacy Researchers”. Available from: http://dataprivacylab.org/projects/irb/DataPrivacyResearchers.pdf

UK Data Archive. 2012. “Research Data Life-cycle.” Available from: http://www.data-archive.ac.uk/create-manage/life-cycle

Vadhan, S. , et al. 2010. “Re: Advance Notice of Proposed Rulemaking: Human Subjects Research Protections”. Available from: http://dataprivacylab.org/projects/irb/Vadhan.pdf


Feb 29, 7:08pm

NISO posted for comment a draft “Recommended Practice for Online Supplemental Journal Article Materials”. The following is my individual perspective on the report.

Response to request for public comments

Dr. Micah Altman
Director of Research; Head/Scientist, Program for Information Science — MIT Libraries, Massachusetts Institute of Technology
Non-Resident Senior Fellow, The Brookings Institution
(Writing in a personal capacity)

Introduction

Thank you for the opportunity to comment on these recommended practices. Supplemental materials have become an increasingly important part of the scholarly record. This report provides a thoughtful framework for developing systematic practices to publish and to steward this content.

As a practicing social scientist, I have attempted to replicate and extend research in my field, and have published on the challenges of reproducibility. [Altman, et al. 2003] In my role as an information scientist and administrator, I have led projects and contributed to community-wide efforts to build and maintain open infrastructure and standards for the documentation, dissemination and preservation of research data. [Altman et al. 2001; Altman & King 2007] My contribution is made from this perspective.

Preamble

A substantial proportion of supplementary materials are most naturally characterized as “data”. Data are often critical to fully understanding, evaluating, replicating, and verifying articles — perhaps more than other types of supplementary material. So the task force may wish to take note of broad-based and thoughtful commentary on data publishing that has emerged from the research community recently, such as the following:

  • The National Science Board’s draft report on Digital Research Data Sharing and Management, which emphasizes the values of and challenges to open access to data.
  • The NRC’s recently released prepublication report on Communicating Science and Engineering Data in the Information Age, which develops a number of recommendations that, although directed at NCSES, are readily applicable to research data management, publication, and dissemination in general. Specifically, recommendations 3-1, 3-2, 3-3, and 3-4 together represent good practice for data management and publication. Published results are more likely to be reliable when management of the data supporting them incorporates versioning, open formats and protocols, machine-actionable metadata, and management of provenance from data collection through publication. [NRC 2011]
  • Numerous responses to the recent OSTP Request for Information on Public Access to Digital Data Resulting from Federally Funded Scientific Research [OSTP 2011], which comment on the benefits of data access, and draw attention to needs, protocols, and standards for open data access and interoperability. Notably, the responses of the National Digital Stewardship Alliance, the Data Preservation Alliance for the Social Sciences, Carnegie Mellon University, the University of California Libraries, and the Inter-university Consortium for Political and Social Research reference the need for, and successful exemplars of, community-based standards for open data dissemination, discovery, and preservation.

Responses

The intent of these responses is not to dispute the overall purpose or framework, but to identify areas of emerging standardization around treatment of data that could be used to further refine and extend the recommendations.

Inconsistent treatment of data as evidence. 

The Draft’s general principle 1.3.2 states that practices must reflect the information future researchers will need to understand and build on articles today. And in most scientific articles, access to data is critical to enable another researcher to understand, assess, and extend the results. Yet 1.3.12 limits the scope of recommendations for the treatment of data to the case in which data are published as supplementary materials.

Related to this, the distinction made in the Draft between “integral”, “additional”, and “related” content conflates two different types of relationships among supporting information and core articles. In the Draft, “integral content” is defined as supplemental material (material included with the article) “that is essential for the full understanding of the work”; “additional content” is supplemental material that is useful for a deeper understanding of the article. Further, “related content” is content that is not included with the article/submission package. This tripartite categorization conflates evidentiary properties (whether or not one work is essential to the understanding of another) with locational/administrative properties (where the other work is/who manages it).

The Draft disclaims all publisher responsibility toward related content, does not include it in any stakeholder roles and responsibilities, and (for data) declares it out of scope. This is problematic in practice. Where data that is essential for full understanding is available, it is treated as part of the scholarly record. But if such data is managed, published, or disseminated separately from the article, it is ignored. And when integral information is not cited or not available, the integrity of the scholarly record is necessarily weakened.

The Draft should be revised so that principle 1.3.2 takes precedence: information that is essential for full understanding of a work should be understood to be an “integral” and crucial part of the scholarly record, regardless of where that information happens to be. If integral information is not submitted or managed along with a publication, it should still be cited as evidence; and information that is cited as evidence should be accessible to the scientific community. (Detailed requirements for data citation are elaborated in Altman-King 2007, as well as in many of the reports referenced in the preamble above.) Science provides a succinct statement of this principle in its General Information for Authors: “Citations to unpublished data and personal communications cannot be used to support claims in a published paper.” and “All data necessary to understand, assess, and extend the conclusions of the manuscript must be available to any reader of Science.” [Science 2011]

Short and Long-Term Access to Data

The current Draft claims that best practice is to treat rights for Integral Content in the same manner as the core article. The implication is that most supplementary data would remain restricted, which creates a barrier to extending research results and a barrier to research that integrates supplementary data from many articles. In contrast, the reports and comments referenced in the preamble above [NRC 2011; NSB 2011; OSTP 2011] emphasize the special value of open access to data. This suggests that the emerging best practice for integral data is open access and, at the very least, that integral data be distributed under terms “no more restrictive than” those of the core article.

The reports and comments above also note the critical importance of providing data in open formats, using open protocols, and accompanied by machine-actionable metadata. The current Draft does not include or reference these emerging best practices. In order to ensure that integral data can be meaningfully accessed (fulfilling its role in understanding, assessment, and extension), the practices identified in other data-sharing recommendations should also be adopted in the Draft.

REFERENCES

Altman, M., Gill, J., & McDonald, M. (2003). Numerical issues in statistical computing for the social scientist. New York: John Wiley & Sons

Altman, M., & King, G. (2007). A Proposed Standard for the Scholarly Citation of Quantitative Data. D-Lib Magazine, 13(3/4). Available from: http://www.dlib.org/dlib/march07/altman/03altman.html

Data-PASS 2011. “Response to Office of Science and Technology Policy Request for Information on Public Access to Digital Data Resulting from Federally Funded Scientific Research”. Available from: http://www.data-pass.org/sites/default/files/datapass-otsp-rfi-response.pdf

NDSA 2011. “Response to Office of Science and Technology Policy Request for Information on Public Access to Digital Data Resulting from Federally Funded Scientific Research”. Available from: http://digitalpreservation.gov/documents/NDSA_ResponseToOSTP.pdf

NRC. (2011). Communicating Science and Engineering Data in the Information Age. National Academies Press. Available from: http://www.nap.edu/catalog.php?record_id=13282

NSB (2011).  Digital Research Data Sharing and Management. (Draft) NSB-11-17. Available from: http://www.nsf.gov/nsb/publications/2011/nsb1124.pdf

Science staff. (2011) General Information for authors. Available from: http://www.sciencemag.org/site/feature/contribinfo/prep/gen_info.xhtml

OSTP. (2011) “Public Access to Digital Data: Public Comments”. Available from: http://www.whitehouse.gov/administration/eop/ostp/library/digitaldata


Jan 19, 10:38am

The National Science Board recently offered an opportunity to comment on the draft report on ‘Digital Research Data Sharing and Management’ by the task force on data policies. The following is my individual perspective on the report.

Response to request for public comments

Dr. Micah Altman
Senior Research Scientist, IQSS, Harvard U. (until 2/29)
Director of Research; Head/Scientist, Program for Information Science — MIT Libraries, Massachusetts Institute of Technology (as of 3/1/2012)
Non-Resident Senior Fellow, The Brookings Institution
(Writing in a personal capacity)

Introduction

Thank you for the opportunity to respond to this report. I believe this report will advance the discussion of research data sharing and management; it raises many thoughtful questions and makes recommendations that will have a positive impact on the conduct of scientific research.

As a practicing social scientist, my collaborators and I have attempted to replicate and extend research in my field, and have published on the challenges of reproducibility. [Altman, et al. 2003] And in my role as an administrator, I have led projects and contributed to community-wide efforts to build and maintain open infrastructure and standards for the documentation, dissemination and preservation of research data. [Altman et al. 2001; Altman & King 2007] My contribution is made with this perspective.

Preamble

The task force may wish to take note of broad-based and thoughtful commentary on data management that has emerged from the research community recently, such as the following:

  • Numerous responses to the recent ANPRM on proposed changes to the common rule commented on the relationship between data sharing and privacy. Notably, two responses by data privacy and computer science researchers provide a roadmap for simultaneously increasing data sharing and privacy protections by leveraging advances in theoretical computer science, and by establishing mechanisms for accountability and transparency in data sharing. [Sweeney, et al., 2010; Vadhan, et al. 2010]
  • Numerous responses to the recent OSTP Request for Information on Public Access to Digital Data Resulting from Federally Funded Scientific Research comment on the benefits of data access, and draw attention to protocols and standards for open data access and interoperability. Notably, the responses of the National Digital Stewardship Alliance and of the Data Preservation Alliance for the Social Sciences point to successful exemplars of community-based standards for open data dissemination, discovery, and preservation. [NDSA 2011, Data-PASS 2011]
  • The NRC’s recently released prepublication report on Communicating Science and Engineering Data in the Information Age (which supersedes the letter report cited by the task force) develops a number of recommendations that, although directed at NCSES, are readily applicable to research data management and dissemination in general. Specifically, recommendations 3-1, 3-2, 3-3, and 3-4 together represent good practice for data management in general: published results are more likely to be reliable when management of the data supporting them incorporates versioning, open formats and protocols, machine-actionable metadata, and management of provenance from data collection through publication. [NRC 2011]

Responses

The intent of these responses is not to dispute the recommendations of the task force, but to identify areas of emerging standardization that could be used to further refine and extend the recommendations.

Recommendation 1, and the discussion related to it, calls for NSF to provide leadership in policy development, notes the diversity of stakeholder communities, and cautions against one-size-fits-all solutions. This point is well-taken, as each discipline should be empowered to set priorities for embargo policies, documentation standards, and the like. Nevertheless, as the NDSA recommendations emphasize, some baseline requirements should be applied to all research data management:

Notwithstanding, there are still baseline conditions or requirements that apply to all data regardless of discipline, particularly as they relate to archiving and preservation. For most data, “open access” is needed not only for the short term, but for the long term. And scientific disciplines have focused primarily on short-term access. There are critical standards for metadata exchange, fixity information and verification, and persistent citation that can support long-term access to data, preservation, and the long-term reproducibility of public results. [NDSA 2011]

Recommendation 2 calls for grantees to make data, methods and techniques available to verify and extend figures, tables, findings, and conclusions. The recommendation also notes that data should be shared using persistent electronic identifiers.

This point is also well-taken, and would greatly accelerate scientific progress in many fields. The task force may also wish to consider the emerging body of work that demonstrates that scientific publications should, in addition to including persistent identifiers for data, treat references to data in a manner consistent with references to other scientific works — publications should include full citations to data in the standard reference section, and these should be indexed along with other references. [Data-PASS 2011; Altman & King 2007]

Recommendations 4 and 5, and the discussion related to them, emphasize the need for the stakeholders to convene and explore business models; the need for an expansion of sustainable data management; and the lack of sufficient standards and business models.

This is clearly right. Notwithstanding, there are a number of successful standards and models that have emerged in different communities, and which the task force may wish to consider as exemplars. Moreover, it is important to note that standards and business models are insufficient. In addition, as the NDSA response points out, it is critical that the capability for data management be demonstrated, rather than asserted:

“Memory institutions such as archives, libraries and museums have an extensive track record with these functions and collaborative organizations such as NDSA could serve the essential purpose of developing or implementing frameworks that thoroughly test and certify assertions.” [NDSA 2011]

REFERENCES

Altman, M., Gill, J., & McDonald, M. (2003). Numerical issues in statistical computing for the social scientist. New York: John Wiley & Sons

Altman, M., & King, G. (2007). A Proposed Standard for the Scholarly Citation of Quantitative Data. D-Lib Magazine, 13(3/4). Available from: http://www.dlib.org/dlib/march07/altman/03altman.html

Data-PASS 2011. “Response to Office of Science and Technology Policy Request for Information on Public Access to Digital Data Resulting from Federally Funded Scientific Research”. Available from: http://www.data-pass.org/sites/default/files/datapass-otsp-rfi-response.pdf

NDSA 2011. “Response to Office of Science and Technology Policy Request for Information on Public Access to Digital Data Resulting from Federally Funded Scientific Research”. Available from: http://digitalpreservation.gov/documents/NDSA_ResponseToOSTP.pdf

NRC. (2011). Communicating Science and Engineering Data in the Information Age. National Academies Press. Available from: http://www.nap.edu/catalog.php?record_id=13282

Sweeney, L. , et al. 2010. “Comments from Data Privacy Researchers”. Available from: http://dataprivacylab.org/projects/irb/DataPrivacyResearchers.pdf

Vadhan, S. , et al. 2010. “Re: Advance Notice of Proposed Rulemaking: Human Subjects Research Protections”. Available from: http://dataprivacylab.org/projects/irb/Vadhan.pdf

