Jun 26, 3:11pm

Three weeks ago, NIH issued a request for information to solicit comments on the development  an NIH Data Catalog as part of its overall Big Data to Knowledge (BD2K) Initiative.

The Data Preservation Alliance for Social Sciences issued a response to which I contributed. Two sections are of general interest to the library/stewardship community:

Common Data Citation Principles and Practices


While there are a number of different communities of practice around data citation, a number of common principles and practices can be identified. 


The editorial policy Science [see ] is an exemplar of two principles  for data citations: First, that published claims should cite the evidence and methods upon which they rely, and second, that things cited should be available for examination by the scientific community.  These principles have been recognized across a set of communities and expert reports, and are increasingly being adopted by number of other leading journals. [See Altman 2012; CODATA-ICSTI Task Group, 2013; and]


Implementation Considerations


Previous policies aiming to facilitate open access to research data have often failed to achieve their promise in implementation. Effective implementation requires standardizing core practices, aligning stakeholder incentives, reducing barriers to long-term access, and building in evaluation mechanisms.


A set of core recognized good practices have emerged that span fields. Good practice includes separating the elements of citation from the presentation; including in the elements identifier, title, author, and date information, and where at all possible version and fixity information; and listing data citations in the same place as citation to other works – typically in the references section. [See Altman-King 2006; Altman 2012; CODATA-ICTI Task Group 2013; ; ]


Although the incentives related to data citation and access are complex, there are a number of simple points of leverage. First, journals can both create positive incentives for sharing data by requiring that data be properly cited. Second, funders can require that only those outputs of research that comply with access and citation policies can be claimed as results from prior research.

You may read the full response on the Data-PASS site. 

Jun 20, 10:31am

Metdata can be defined variously as “data about data”, digital ‘breadcrumbs’, magic pixie dust, and “something that everyone now knows the NSA wants a lot of”. It’s all of the above.

Metadata is used to support decision and workflows, add value to objects (through enhancing discover, use, reuse, and integration), and to support evaluation and analysis. It’s not the whole story for any of these things, but it can be a big part.

This presentation, invited for a workshop on Open Access and Scholarly Books (sponsored by the Berkman Center and Knowledge Unlatched),  provides a very brief overview of metadata design principles, approaches to evaluation metrics, and some relevant standards and exemplars in scholarly publishing. It is intended to provoke discussion on approaches to evaluation of the use, characteristics, and value of OA publications.

Jun 11, 11:51am

Best practices aren’t.

The core issue is that there are few models for the systematic valuation of data:  We have no robust general proven ways of answering the question of how much  data X be worth to community Y at time Z. Thus the “bestness” (optimality) of practices are generally strongly dependent on operational context.. and the context of data sharing is currently both highly complex and dynamic Until there is systematic descriptive evidence that best practices are used, predictive evidence that best practices are associated with future desired outcomes, and causal evidence that the application of best practices yields improved outcomes, we will be unsure that practices are “best”.

Nevertheless, one should use established “not-bad” practices, for a number of reasons. First, to avoid practices that are clearly bad; second, because use of such practices acts to document operational and tacit knowledge; third because selecting practices can help to elicit the underlying assumptions under which practices are applied; and finally because not-bad practcies provide a basis for auditing, evaluation, and eventual improvement.

Specific not-bad practices for data sharing fall into roughly three categories :

  • Analytic practices: lifecycle analysis & requirements analysis
  • Policy practices for: data dissemination, licensing, privacy, availability, citation and reproducibility
  • Technical practices for sharing and reproducibility, including fixity, replication, provenance

This presentation at the Second Open Economics International Workshop (sponsored by the Sloan Foundation, MIT and OKFN) provides an overview of these and links to specific practices recommendations, standards, and tools:

Jan 10, 4:37pm

MIT has a wonderful tradition of offering a variety of short courses during the Winter hiatus between semesters, known as IAP (Independent Activities Period). These range from how-to sessions, forums, athletic endeavors, lecture series, films, tours, recitals and contests. The MIT Libraries are offering dozens of courses on data and information management (among other topics) — I participated in a roundtable session on data management.

IAP period seems like an opportunity pass on some of the invisible knowledge of the academy; things like project management for science; managing bibliographies; care and feeding of professional networks; maintaining your tenure file; responding to reviewers; turning a dissertation into a book; communicating your work to the public & media; or writing compelling proposals.

So, for this year’s session, I updated my long-running “Getting Funding for your Research Course” with new resources, statistics, and MIT-specific information. This short course focuses on the area of communicating research projects and ideas in the form of proposals for support. The slides are below:

I aim to convert this to a webinar this year. Also, many of the main points are summarized in an article I’d written a few years ago, “Funding, Funding“.

Jan 08, 1:33pm

Yes…at least in Virginia.

My collaborator Micheal McDonald and I are now just catching up to analyzing the data that resulted from the crowd-sourcing participative electoral mapping projects we were involved in. The results were surprisingly clear and consistent:

  • Students are quite capable of creating legal districting plans.
  • Student plans generally demonstrated a wider range of possibilities as compared to legislative plans.
  • The ‘best’ plan, as ranked by each individual criterion, was a student plan.
  • The student plans covered a larger set of possible tradeoffs among each criterion.
  • Student plans were generally better on pairs of criteria. 
  • Student plans were more competitive and had more partisan balance than any of the adopted plans. 

Most of these trends can be made clear through information visualization methods.

One of the visuals is below — it shows how each of the plans scored on different pairs of criteria. The blue elipses contain the student plans, the green ones contain the legislative plans, and the red lines show the commission plans. A tiny ‘A’ shows the adopted plan. In each mini-plot the top-right corner is where the theoretically best scores are.

(Click on the image below to enlarge)


Notice that blue elipses are almost always bigger, contain most of the green elipses (or at least the best-scoring parts) and extend further toward to the top-right corner. Go students!

More details on the data, evaluations, etc. in the article (pre-copy-edited version).

Dec 25, 12:15pm

The core ideas around distributed digital preservation are well-recognized in the library and archival communities. Geographically distributed replication of content has been clearly identified for at least a decade as a best practice for management of digital content, to which one must provide long-term access. The idea of using replication to reduce the risk of loss was explicit in the design of LOCKSS a decade ago. More recently, replication has been incorporated into best practices and standards. In order to be fully “trusted”, an organization must have a managed process for creating, maintaining, and verifying multiple geographically distributed copies of its collections — and this is now recognized as good community practice. Further, this requirement has been incorporated in Trustworthy Repositories Audit & Certification (TRAC) (see section C.1.2-C.1. in [8]), and in the newly released ISO standard that has followed it.

However, while geographic replication can be used to mitigate some risks (e.g. hardware failure), less technical risks such as curatorial error, internal malfeasance, economic failure, and organizational failure require that replications be diversified across distributed, collaborative organizations. The LOCKSS research team has also identified a taxonomy of potential single-points-of-failure (highly correlated risks), that at minimum, a trustworthy preservation system should mitigate against. These risks include media failure, hardware failure, software failure, communication errors, network failure, media and hardware obsolescence, software obsolescence, operator error, natural disaster, external attack, internal attack, economic failure and organizational failure [9].

There are a number of existing organizational/institutional efforts to mitigate threats to the digital scholarly record (and scientific evidence base) through replication of content. Of these, the longest continuously-running organization is LOCKSS. (Currently, the LOCKSS organization manages the Global LOCKSS Network, in which over one hundred libraries collaborate to replicating over 9000 e-journal titles from over five hundred publishers. Furthermore, there are ten different membership organizations currently using the LOCKSS system to host their own separate replication networks.) Each of these membership organizations (or quasi-organizations) represent a variation on the collaborative preservation model, comprises separate institutional members, and targets a different set of content.

Auditing is especially challenging for distributed digital preservation, and essential for four reasons:

1. Replication can prevent long-term loss of content only when loss or corruption of a copy is detected and repaired using other replicates — this is a form of low-level auditing. Without detection and repair, it is a statistical near-certainty that content will be lost to non-malicious threats. Surprisingly, one reduces risks far more effectively by having a few replicates and verifying their fixity very frequently, than one does by having less frequent auditing and more replicas — at least for the types of threats that affect individual (or small groups of) content objects at random (e.g. many forms of result of media failure, hardware, software, and curatorial error).

2. The point of the replication enterprise is recovery. Regular restore/recovery audits that test the ability to restore the entire collection or randomly selected collections/items are considered good practice even to establish the reliability of short term backups and are required by IT disaster-recovery planning frameworks. Archival recovery is an even harder problem because one needs to validate not only that a set of objects are recoverable, but that the collection recovered also contains sufficient metadata and contextual information to remain interpretable! A demonstration of the difficulty of this was the AIHT exercise sponsored by the Library of Congress, which demonstrated that many collections thought to be substantially “complete” could not be successfully re-ingested (i.e. recovered) by another archive even in the absence of bit-level failures. [28,29] Because DPN is planned as a dark repository, recovery must be demonstrated within the system — it cannot be demonstrated through active use.

3. Transparent, regular, and systematic auditing is one of the few available safeguards against insider/internal threats.

4. Auditing of compliance with higher-level replication policies is recognized as essential for managing risks generally . For scalability, auditing of policies that apply to the individual management of collections must be automated. In order to automate these policies, there must be a transparent and reliable mapping from the higher level policies governing collections and their contents to the information and controls provided by technical infrastructure mapped.

This presentation, delivered at CNI 2012, summarizes the lessons learned from trial audits of several production distributed digital preservation networks. These audits were conducted using the open source SafeArchive system, which enables automated auditing of a selection of TRAC criteria related to replication and storage. An analysis of the trial audits demonstrates both the complexities of auditing modern replicated storage networks, and reveals common gaps between archival policy and practice. Recommendations for closing these gaps are discussed, as are extensions that have been added to the SafeArchive system to mitigate risks in distributed digital preservation (DDP).

Nov 15, 1:07pm

Amazon recently announced integration of their core S3 service with their low-cost storage system, Glacier. This facilitates the ability to add rules to S3 (or their reduced redundancy store) based on age, date, and S3 bucket prefix.

Regular incremental improvement and integration is a signature of Amazon’s modus operandi for its cloud services: Amazon has a pattern of announcing updates every few weeks that add services, integrate existing services, or (occasionally) lower prices. And they have introduced incremental improvements to the AWS platform over a dozen times since Glacier was announced at the end of August.

Interestingly, Glacier is an apt metaphor for this low-cost service in that it not only signifies “cold” storage, but also signals a massive object making its way slowly but inexorably across a space, covering everything in its path.

Why Glacier is Important

Why is Glacier important? First, as James Hamilton (disclosure, James is VP and Distinguished Engineer at Amazon) aptly summarizes, Glacier provides the volume economics for multi-site, replicated, cold storage (near-line) to small-scale and medium-scale users. While do-it-yourself solutions based on automated tape libraries can still beat Glacier’s price by a huge margin, the sweet spot for this approach has been shifted out so that only very large enterprises are likely to beat the price on Glacier by rolling out their own solutions using tape libraries, etc.

Second, many businesses, and also many services, are built upon or backed up through AWS and S3. Amazon’s continued integration of Glacier into AWS will make it increasingly straightforward to integrate low-cost cold-storage replication into preservation services such as DuraCloud, backup services such as Zmanda, and even into simple software tools like Cyberduck.

Overall, I’m optimistic that this is a Good Thing, and will improve the likelihood of meaningful future access to digital content. However, there are a number of substantial issues to keep in mind when considering Glacier as part of a digital preservation solution.

Issue 1. Technical infrastructure does not guarantee long term durability

Although some commenters have stated that Glacier will “probably outlive us all“, these claims are based on little evidence. The durability of institutions and services relies as much upon economic models, business models, organizational models, and organizational mission as upon technology. Based on the history of technology companies, one must consider that there is a substantial probability that Amazon itself will not be in existence in fifty years, and the future existence of any specific Amazon service is even more doubtful.

Issue 2. Lock-in and future cost projections

As Wired dramatically illustrated, the costs of retrieving all of one’s data from Glacier can be quite substantial. Further, as David Rosenthal has repeatedly pointed out, the long-term cost-competitiveness of preservation services depends “not on their initial pricing, but on how closely their pricing tracks the Kryder’s Law decrease in storage media costs”. And that “It is anyone’s guess how quickly Amazon will drop Glacier’s prices as the underlying storage media costs drop.” The importance of this future price-uncertainty is magnified by the degree of lock-in exhibited by the Glacier service.

Issue 3. Correlated failures

Amazon claims a ‘design reliability’ of ‘99.999999999%‘. This appears to be an extremely optimistic number without any formal published analysis backing it. The number appears to be based on a projection of theoretical failure rates for storage hardware (and such rates are  wildly optimistic under production conditions), together with the (unrealistic) assumption that all such failures are statistically independent.  Moreover, this ‘design reliability’ claim is unsupported (at time of writing) by Glacier’s terms of service, SLA, or customer agreement. To the contrary, the agreements appear to indemnify Amazon against any loss of damage, does not appear to offer a separate SLA for Glacier, and limits recovery under existing SLA’s (for services, such as S3) to refund of fees for periods the service was unavailable. If Amazon were highly confident in the applicability of the quoted ‘design reliability’ to production settings, one might expect a stronger SLA. However, despite these caveats, my guess is that Glacier’s will still turn out to be, in practice, substantially more reliable than the DIY solutions that most individual organizations can afford to implement entirely in-house.

Nevertheless, as previously discussed (most recently at Digital Preservation 2012), a large part of risk mitigation for digital assets is to diversify against sources of correlated failure. Although implementation details are not complete, Glacier does appear to diversify against some common risks to bits — primarily media failure, hardware failure, and localized natural disaster (such as fire, flood). This is good, but far from complete. A number of likely single-point (or highly correlated) vulnerabilities remain, including software failure (e.g. a bug in the AWS software for its control backplane might result in permanent loss that would go undetected for a substantial time; or cause other cascading failures — analogous to those we’ve seen previously); legal threats (leading to account lock-out — such as this, deletion, or content removal); or other institutional threats (such as a change in Amazon’s business model). It is critical that diversification against these additional failures be incorporated into a digital preservation strategy.

Preliminary Recommendations

To sum up, Glacier is an important service, and appears to be a solid option for cold storage, but institutions that are responsible for digital preservation and long-term access should not use the quoted design reliability in modeling likelihood of loss, nor rely on Glacier as the sole archival mechanism for their content.

Nov 14, 5:04pm

Lately, our DistrictBuilder software, a tool that allows people to easily participate in creating election districts, has gotten some additional attention. We recently received an Outstanding Software Development Award from the American Political Science Association (given by the Information Technology & Politics Section) and a Data Innovation Award given by the O’Reilly Strata Conference (for data with social impact). And just last week, we had the opportunity to present our work to the government of Mexico at the invitation of the Instituto Federal Electoral, as part of their International Colloquium on Redistricting.

During this presentation, I was able to reflect on the interplay of algorithms and public participation. and it became even clearer to me that applications like DistrictBuilder exemplify the ability of information science to improve policy and politics.

Redistricting in Mexico is particularly interesting, since it relies heavily on facially neutral geo-demographic criteria and optimization algorithms, which represents a different sort of contribution from information science. Thus, it was particularly interesting to me to consider the interplay between algorithmic approaches to problem solving and “wisdom of crowd” approaches, especially for problems in the public sphere.

It’s clear that complex optimization algorithms are an advance in redistricting in Mexico, and have an important role in public policy. However, they also have a number of limitations:

  • Algorithmic optimization solutions often depend on a choice of (theoretically arbitrary) ‘starting values’ from which the algorithm starts its search for a solution.
  • Quality algorithmic solutions typically rely on accurate input data.
  • Many optimization algorithms embed particular criteria or particular constraints into the algorithm itself.
  • Even where optimization algorithms are nominally agnostic to the criteria used for the goal, some criteria are more tractable than others; and some are more tractable for particular algorithms.
  • In many cases, when an algorithm yields a solution, we don’t know exactly (or even approximately, in any formal sense) how good that solution is.

I argue that explicitly incorporating a human element is important for algorithmic solutions in the public sphere. In particular:

  • Use open documentation and open (non-patented, or open-licensed) to enable external replication of algorithms.
  • Use open source to enable external verification of the implementation of particular algorithms.
  • Incorporate public input to improve the data (especially describing local communities and circumstances) in algorithm driven policies.
  • Incorporate crowd-sourced solutions as candidate “starting values” for further algorithmic refinement.
  • Subject algorithmic output to crowd-sourced public review to verify the quality of the solutions produced.

You can see the slides, which include more detail and references below. For much such slides, refer to our PublicMapping project site.

Oct 25, 5:40pm

The workshop report from the UNC Curating for Data Quality workshop, in which I was delighted to participate, is now being made available. It contains many perspectives addressing a number of questions:

Data Quality Criteria and Contexts. What are the characteristics of data quality? What threats to data quality arise at different stages of the data life cycle? What kinds of work processes affect data quality? What elements of the curatiorial process most strongly affect data quality over time? How do data types and contexts influence data quality parameters?

Human and Institutional Factors. What are the costs associated with different levels of data quality? What kinds of incentives and constraints influence efforts of different stakeholders? How does one estimate the continuum from critical to tolerable errors? How often does one need to validate data?

Tools for Effective and Painless Curation. What kinds of tools and techniques exist or are required to insure that creators and curators address data quality?

Metrics. What are or should be the measures of data quality? How does one identify errors? How does one correct errors or mitigate their effects?

My current perspective, after reflecting on seven ‘quality’ frameworks from different disciplines that differ in complex and deep ways, is that the data quality criteria implied by the candidate frameworks are neither easily harmonized, nor readily quantified. Thus, a generalized systematic approach to evaluating data quality seems unlikely to emerge soon. Fortunately, developing an effective approach to digital curation that respects data quality does not require a comprehensive definition of data quality. Instead, we can appropriately address “data quality” in curation by limiting our consideration to a narrower applied questions:

Which aspects of data quality are (potentially) affected by (each stage of) digital curation activity? And how do we keep invariant data quality properties at each curation stage?

A number of approaches suggest seem particularly likely to bear fruit:

  1. Incorporate portfolio diversification in selection and appraisal.
  2. Support validation of preservation quality attributes such as authenticity, integrity, organization, and chain of custody throughout long-term preservation and use — from ingest through delivery and creation of derivative works.
  3. Apply semantic fingerprints for quality evaluation during ingest, format migration and delivery.

These approaches have the advantage of being independent of the content subject area, of the domain of measure, and of the particular semantics content of objects and collections. Therefore, they are broadly applicable. By mitigating these broad-spectrum threats to quality, we can improve the overall quality of curated collections, and their expected value to target communities.

My extended thoughts are here:

You may also be interested in the other presentations from the workshop, which are posted on the Conference Site.

Oct 23, 12:37pm

I was pleased to participate in the 2012 PLN Community Meeting.

Over the last decade, replication has become a required practice for digital preservation. Now, Distributed Digital Preservation (DDP) networks are emerging as a vital strategy to ensure long-term access to the scientific evidence base and cultural heritage. A number of DDP networks are currently in production, including CLOCKSS, Data-PASS, MetaArchive, COPPUL, Lukll, PeDALS, Synergies, Data One, and new networks, such as DFC and DPN are being developed.

These networks were created to mitigate the risk of content loss by diversifying across software architectures, organizational structures, geographic regions, as well as legal, political, and economic environments. And many of these networks have been successful at replicating a diverse set of content.

However, the point of the replication enterprise is recovery. Archival recovery is an even harder problem because one needs to validate not only that a set of objects is recoverable, but also that the collection recovered also contains sufficient metadata and contextual information to remain interpretable! A demonstration of the difficulty of this was the AIHT exercise sponsored by the Library of Congress, which demonstrated that many collections thought to be substantially “complete” could not be successfully re-ingested (i.e. recovered) by another archive, even in the absence of bit-level failures.

In a presentation co-authored with Jonathan Crabtree, we summarized some lessons learned from trial audits of several production distributed digital preservation networks. These audits were conducted using the open source SafeArchive system (, which enables automated auditing of a selection of TRAC criteria related to replication and storage. An analysis of the trial audits demonstrates both the complexities of auditing modern replicated storage networks, and reveals common gaps between archival policy and practice. It also reveals gaps in the auditing tools we have available. Our presentation, below, focused on the importance of designing auditing systems to provide diagnostic information that can be used to diagnose non-confirmations of audited policies. Tom Lipkis followed with specific planned and possible extensions in LOCKSS that would enhance diagnosis and auditing.

You may also be interested in the other presentations from the workshop, which are posted on the PLN2012 Website.