Jun 11, 11:51am

Best practices aren’t.

The core issue is that there are few models for the systematic valuation of data: we have no robust, general, proven ways of answering the question of how much data X will be worth to community Y at time Z. Thus the “bestness” (optimality) of practices is generally strongly dependent on operational context, and the context of data sharing is currently both highly complex and dynamic. Until there is systematic descriptive evidence that best practices are used, predictive evidence that best practices are associated with future desired outcomes, and causal evidence that the application of best practices yields improved outcomes, we will be unsure that practices are “best”.

Nevertheless, one should use established “not-bad” practices, for a number of reasons: first, to avoid practices that are clearly bad; second, because use of such practices acts to document operational and tacit knowledge; third, because selecting practices can help to elicit the underlying assumptions under which practices are applied; and finally, because not-bad practices provide a basis for auditing, evaluation, and eventual improvement.

Specific not-bad practices for data sharing fall into roughly three categories:

  • Analytic practices: lifecycle analysis & requirements analysis
  • Policy practices for: data dissemination, licensing, privacy, availability, citation and reproducibility
  • Technical practices for sharing and reproducibility, including fixity, replication, provenance

This presentation at the Second Open Economics International Workshop (sponsored by the Sloan Foundation, MIT and OKFN) provides an overview of these and links to specific practice recommendations, standards, and tools:

Jan 10, 4:37pm

MIT has a wonderful tradition of offering a variety of short courses during the Winter hiatus between semesters, known as IAP (Independent Activities Period). These include how-to sessions, forums, athletic endeavors, lecture series, films, tours, recitals and contests. The MIT Libraries are offering dozens of courses on data and information management (among other topics), and I participated in a roundtable session on data management.

The IAP period seems like an opportunity to pass on some of the invisible knowledge of the academy: things like project management for science; managing bibliographies; care and feeding of professional networks; maintaining your tenure file; responding to reviewers; turning a dissertation into a book; communicating your work to the public & media; or writing compelling proposals.

So, for this year’s session, I updated my long-running “Getting Funding for your Research Course” with new resources, statistics, and MIT-specific information. This short course focuses on the area of communicating research projects and ideas in the form of proposals for support. The slides are below:

I aim to convert this to a webinar this year. Also, many of the main points are summarized in an article I wrote a few years ago, “Funding, Funding”.

Jan 08, 1:33pm

Yes…at least in Virginia.

My collaborator Michael McDonald and I are just now catching up on analyzing the data that resulted from the crowd-sourced, participative electoral mapping projects we were involved in. The results were surprisingly clear and consistent:

  • Students are quite capable of creating legal districting plans.
  • Student plans generally demonstrated a wider range of possibilities as compared to legislative plans.
  • The ‘best’ plan, as ranked by each individual criterion, was a student plan.
  • The student plans covered a larger set of possible tradeoffs among the criteria.
  • Student plans were generally better on pairs of criteria. 
  • Student plans were more competitive and had more partisan balance than any of the adopted plans. 

Most of these trends can be made clear through information visualization methods.

One of the visuals is below; it shows how each of the plans scored on different pairs of criteria. The blue ellipses contain the student plans, the green ones contain the legislative plans, and the red lines show the commission plans. A tiny ‘A’ shows the adopted plan. In each mini-plot the top-right corner is where the theoretically best scores are.


Notice that the blue ellipses are almost always bigger, contain most of the green ellipses (or at least the best-scoring parts) and extend further toward the top-right corner. Go students!

More details on the data, evaluations, etc. in the article (pre-copy-edited version).

Dec 25, 12:15pm

The core ideas around distributed digital preservation are well-recognized in the library and archival communities. Geographically distributed replication of content has been clearly identified for at least a decade as a best practice for management of digital content, to which one must provide long-term access. The idea of using replication to reduce the risk of loss was explicit in the design of LOCKSS a decade ago. More recently, replication has been incorporated into best practices and standards. In order to be fully “trusted”, an organization must have a managed process for creating, maintaining, and verifying multiple geographically distributed copies of its collections — and this is now recognized as good community practice. Further, this requirement has been incorporated in Trustworthy Repositories Audit & Certification (TRAC) (see section C.1.2-C.1. in [8]), and in the newly released ISO standard that has followed it.

However, while geographic replication can be used to mitigate some risks (e.g. hardware failure), less technical risks such as curatorial error, internal malfeasance, economic failure, and organizational failure require that replications be diversified across distributed, collaborative organizations. The LOCKSS research team has also identified a taxonomy of potential single-points-of-failure (highly correlated risks), that at minimum, a trustworthy preservation system should mitigate against. These risks include media failure, hardware failure, software failure, communication errors, network failure, media and hardware obsolescence, software obsolescence, operator error, natural disaster, external attack, internal attack, economic failure and organizational failure [9].

There are a number of existing organizational/institutional efforts to mitigate threats to the digital scholarly record (and scientific evidence base) through replication of content. Of these, the longest continuously-running organization is LOCKSS. (Currently, the LOCKSS organization manages the Global LOCKSS Network, in which over one hundred libraries collaborate to replicate over 9000 e-journal titles from over five hundred publishers. Furthermore, there are ten different membership organizations currently using the LOCKSS system to host their own separate replication networks.) Each of these membership organizations (or quasi-organizations) represents a variation on the collaborative preservation model, comprises separate institutional members, and targets a different set of content.

Auditing is especially challenging for distributed digital preservation, and essential for four reasons:

1. Replication can prevent long-term loss of content only when loss or corruption of a copy is detected and repaired using other replicas; this is a form of low-level auditing. Without detection and repair, it is a statistical near-certainty that content will be lost to non-malicious threats. Surprisingly, one reduces risks far more effectively by having a few replicas and verifying their fixity very frequently than by having more replicas and less frequent auditing, at least for the types of threats that affect individual (or small groups of) content objects at random (e.g. many forms of media failure, hardware failure, software failure, and curatorial error).

2. The point of the replication enterprise is recovery. Regular restore/recovery audits that test the ability to restore the entire collection or randomly selected collections/items are considered good practice even to establish the reliability of short-term backups, and are required by IT disaster-recovery planning frameworks. Archival recovery is an even harder problem because one needs to validate not only that a set of objects is recoverable, but that the collection recovered also contains sufficient metadata and contextual information to remain interpretable! A demonstration of the difficulty of this was the AIHT exercise sponsored by the Library of Congress, which demonstrated that many collections thought to be substantially “complete” could not be successfully re-ingested (i.e. recovered) by another archive even in the absence of bit-level failures [28,29]. Because DPN is planned as a dark repository, recovery must be demonstrated within the system; it cannot be demonstrated through active use.

3. Transparent, regular, and systematic auditing is one of the few available safeguards against insider/internal threats.

4. Auditing of compliance with higher-level replication policies is recognized as essential for managing risks generally. For scalability, auditing of policies that apply to the individual management of collections must be automated. In order to automate these policies, there must be a transparent and reliable mapping from the higher-level policies governing collections and their contents to the information and controls provided by the technical infrastructure.
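
The detect-and-repair loop described in the first point can be sketched in a few lines. This is only an illustration under simplifying assumptions: replicas are in-memory byte strings keyed by hypothetical site names, and the digest held by a strict majority of copies is treated as correct. Production networks use considerably more careful protocols.

```python
import hashlib
from collections import Counter


def digest(content: bytes) -> str:
    """Return the SHA-256 hex digest used here as the fixity value."""
    return hashlib.sha256(content).hexdigest()


def audit_and_repair(replicas: dict[str, bytes]) -> list[str]:
    """Compare replica digests and overwrite minority copies with the majority copy.

    Returns the names of the replicas that were repaired. Raises ValueError
    when no digest is held by a strict majority, because this toy scheme then
    has no basis for deciding which copy is good.
    """
    digests = {name: digest(data) for name, data in replicas.items()}
    winner, votes = Counter(digests.values()).most_common(1)[0]
    if votes * 2 <= len(replicas):
        raise ValueError("no majority digest; manual intervention needed")
    good_copy = next(replicas[n] for n, d in digests.items() if d == winner)
    repaired = [n for n, d in digests.items() if d != winner]
    for name in repaired:
        replicas[name] = good_copy
    return repaired
```

Calling `audit_and_repair` on three copies where one has silently changed returns the name of the damaged copy and restores it. With only two copies that disagree, it refuses to guess.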

This presentation, delivered at CNI 2012, summarizes the lessons learned from trial audits of several production distributed digital preservation networks. These audits were conducted using the open source SafeArchive system, which enables automated auditing of a selection of TRAC criteria related to replication and storage. An analysis of the trial audits demonstrates both the complexities of auditing modern replicated storage networks, and reveals common gaps between archival policy and practice. Recommendations for closing these gaps are discussed, as are extensions that have been added to the SafeArchive system to mitigate risks in distributed digital preservation (DDP).
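
To make the idea of mapping a high-level replication policy onto checkable facts concrete, here is a small sketch. The `Policy` and `Copy` records and the `audit_collection` function are hypothetical names invented for this illustration; they are not the SafeArchive schema or API, and the thresholds are arbitrary.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Policy:
    min_copies: int               # verified copies required
    min_regions: int              # distinct regions required
    max_days_since_verified: int  # how stale a fixity check may be


@dataclass(frozen=True)
class Copy:
    host: str
    region: str
    last_verified: date


def audit_collection(policy: Policy, copies: list[Copy], today: date) -> list[str]:
    """Return human-readable findings; an empty list means the policy is met."""
    findings = []
    fresh = []
    for copy in copies:
        age = (today - copy.last_verified).days
        if age > policy.max_days_since_verified:
            findings.append(f"{copy.host}: fixity last verified {age} days ago")
        else:
            fresh.append(copy)
    if len(fresh) < policy.min_copies:
        findings.append(
            f"only {len(fresh)} verified copies, policy requires {policy.min_copies}"
        )
    regions = {copy.region for copy in fresh}
    if len(regions) < policy.min_regions:
        findings.append(
            f"only {len(regions)} regions, policy requires {policy.min_regions}"
        )
    return findings
```

Returning findings rather than a bare pass/fail is deliberate: an audit that only says "non-compliant" gives the operator nothing to act on.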

Nov 15, 1:07pm

Amazon recently announced integration of their core S3 service with their low-cost storage system, Glacier. This makes it possible to add rules to S3 (or their reduced redundancy store) that archive objects to Glacier based on age, date, and S3 bucket prefix.
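
As a rough illustration of what such a rule looks like, the sketch below builds a lifecycle configuration as a plain Python dictionary. The shape follows the present-day boto3 `put_bucket_lifecycle_configuration` request format, which postdates this announcement; the rule ID, prefix, and bucket name are placeholders.

```python
def glacier_lifecycle_rule(prefix: str, days: int,
                           rule_id: str = "archive-to-glacier") -> dict:
    """Build a lifecycle configuration that moves objects under `prefix`
    to the GLACIER storage class once they are `days` old."""
    if days < 0:
        raise ValueError("days must be non-negative")
    return {
        "Rules": [
            {
                "ID": rule_id,
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Transitions": [{"Days": days, "StorageClass": "GLACIER"}],
            }
        ]
    }


config = glacier_lifecycle_rule("logs/", 90)

# With boto3 installed and credentials configured, the dictionary could be
# applied to a (placeholder) bucket like this:
#   import boto3
#   boto3.client("s3").put_bucket_lifecycle_configuration(
#       Bucket="example-bucket", LifecycleConfiguration=config)
```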

Regular incremental improvement and integration is a signature of Amazon’s modus operandi for its cloud services: Amazon has a pattern of announcing updates every few weeks that add services, integrate existing services, or (occasionally) lower prices. And they have introduced incremental improvements to the AWS platform over a dozen times since Glacier was announced at the end of August.

Interestingly, Glacier is an apt metaphor for this low-cost service in that it not only signifies “cold” storage, but also signals a massive object making its way slowly but inexorably across a space, covering everything in its path.

Why Glacier is Important

Why is Glacier important? First, as James Hamilton (disclosure, James is VP and Distinguished Engineer at Amazon) aptly summarizes, Glacier provides the volume economics for multi-site, replicated, cold storage (near-line) to small-scale and medium-scale users. While do-it-yourself solutions based on automated tape libraries can still beat Glacier’s price by a huge margin, the sweet spot for this approach has been shifted out so that only very large enterprises are likely to beat the price on Glacier by rolling out their own solutions using tape libraries, etc.

Second, many businesses, and also many services, are built upon or backed up through AWS and S3. Amazon’s continued integration of Glacier into AWS will make it increasingly straightforward to integrate low-cost cold-storage replication into preservation services such as DuraCloud, backup services such as Zmanda, and even into simple software tools like Cyberduck.

Overall, I’m optimistic that this is a Good Thing, and will improve the likelihood of meaningful future access to digital content. However, there are a number of substantial issues to keep in mind when considering Glacier as part of a digital preservation solution.

Issue 1. Technical infrastructure does not guarantee long term durability

Although some commenters have stated that Glacier will “probably outlive us all”, these claims are based on little evidence. The durability of institutions and services relies as much upon economic models, business models, organizational models, and organizational mission as upon technology. Based on the history of technology companies, one must consider that there is a substantial probability that Amazon itself will not be in existence in fifty years, and the future existence of any specific Amazon service is even more doubtful.

Issue 2. Lock-in and future cost projections

As Wired dramatically illustrated, the costs of retrieving all of one’s data from Glacier can be quite substantial. Further, as David Rosenthal has repeatedly pointed out, the long-term cost-competitiveness of preservation services depends “not on their initial pricing, but on how closely their pricing tracks the Kryder’s Law decrease in storage media costs”. And that “It is anyone’s guess how quickly Amazon will drop Glacier’s prices as the underlying storage media costs drop.” The importance of this future price-uncertainty is magnified by the degree of lock-in exhibited by the Glacier service.
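
A little arithmetic shows why the rate of price decline matters more than the starting price. The sketch below sums the yearly cost of keeping a fixed volume in storage when the price falls by a constant fraction each year. The starting price and both decline rates are made-up illustrative values, not Amazon's prices or a forecast.

```python
def cumulative_cost(price_per_tb_year: float, annual_price_drop: float,
                    years: int) -> float:
    """Total cost of storing 1 TB for `years` years when the yearly price
    falls by the fraction `annual_price_drop` every year."""
    return sum(
        price_per_tb_year * (1.0 - annual_price_drop) ** year
        for year in range(years)
    )


# Hypothetical numbers: same starting price, different tracking of media costs.
fast_tracking = cumulative_cost(120.0, annual_price_drop=0.30, years=20)
slow_tracking = cumulative_cost(120.0, annual_price_drop=0.05, years=20)
print(f"price falls 30%/yr: {fast_tracking:.0f}")
print(f"price falls  5%/yr: {slow_tracking:.0f}")
```

Under these assumptions the slowly declining service costs several times as much over twenty years, even though both start at the same price.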

Issue 3. Correlated failures

Amazon claims a ‘design reliability’ of ‘99.999999999%’. This appears to be an extremely optimistic number without any formal published analysis backing it. The number appears to be based on a projection of theoretical failure rates for storage hardware (and such rates are wildly optimistic under production conditions), together with the (unrealistic) assumption that all such failures are statistically independent. Moreover, this ‘design reliability’ claim is unsupported (at time of writing) by Glacier’s terms of service, SLA, or customer agreement. To the contrary, the agreements appear to indemnify Amazon against any loss or damage, do not appear to offer a separate SLA for Glacier, and limit recovery under existing SLAs (for services such as S3) to a refund of fees for periods the service was unavailable. If Amazon were highly confident in the applicability of the quoted ‘design reliability’ to production settings, one might expect a stronger SLA. However, despite these caveats, my guess is that Glacier will still turn out to be, in practice, substantially more reliable than the DIY solutions that most individual organizations can afford to implement entirely in-house.

Nevertheless, as previously discussed (most recently at Digital Preservation 2012), a large part of risk mitigation for digital assets is to diversify against sources of correlated failure. Although implementation details are not complete, Glacier does appear to diversify against some common risks to bits — primarily media failure, hardware failure, and localized natural disaster (such as fire, flood). This is good, but far from complete. A number of likely single-point (or highly correlated) vulnerabilities remain, including software failure (e.g. a bug in the AWS software for its control backplane might result in permanent loss that would go undetected for a substantial time; or cause other cascading failures — analogous to those we’ve seen previously); legal threats (leading to account lock-out — such as this, deletion, or content removal); or other institutional threats (such as a change in Amazon’s business model). It is critical that diversification against these additional failures be incorporated into a digital preservation strategy.
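
The effect of the independence assumption can be seen with a toy model. Suppose each of n copies is lost with probability p, independently, but there is also a common-mode event (a software bug, an account lock-out) with probability c that destroys every copy at once. The probabilities below are invented for illustration and are not estimates for Glacier or any other service.

```python
def loss_probability(p_copy: float, n_copies: int, p_common: float = 0.0) -> float:
    """Probability that all copies are lost.

    Either the common-mode event happens (probability p_common), or it does
    not and every copy fails independently (probability p_copy ** n_copies).
    """
    return p_common + (1.0 - p_common) * p_copy ** n_copies


independent = loss_probability(0.01, 3)                  # independence assumed
correlated = loss_probability(0.01, 3, p_common=1e-4)    # small shared risk
more_copies = loss_probability(0.01, 6, p_common=1e-4)   # doubling the copies

print(f"independent: {independent:.2e}")
print(f"correlated:  {correlated:.2e}")
print(f"six copies:  {more_copies:.2e}")
```

Once a shared failure mode exists, it sets a floor on the loss probability that adding more copies inside the same system cannot lower. That is the case for diversifying across systems and institutions.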

Preliminary Recommendations

To sum up, Glacier is an important service, and appears to be a solid option for cold storage, but institutions that are responsible for digital preservation and long-term access should not use the quoted design reliability in modeling likelihood of loss, nor rely on Glacier as the sole archival mechanism for their content.

Nov 14, 5:04pm

Lately, our DistrictBuilder software, a tool that allows people to easily participate in creating election districts, has gotten some additional attention. We recently received an Outstanding Software Development Award from the American Political Science Association (given by the Information Technology & Politics Section) and a Data Innovation Award given by the O’Reilly Strata Conference (for data with social impact). And just last week, we had the opportunity to present our work to the government of Mexico at the invitation of the Instituto Federal Electoral, as part of their International Colloquium on Redistricting.

During this presentation, I was able to reflect on the interplay of algorithms and public participation, and it became even clearer to me that applications like DistrictBuilder exemplify the ability of information science to improve policy and politics.

Redistricting in Mexico is particularly interesting, since it relies heavily on facially neutral geo-demographic criteria and optimization algorithms, which represents a different sort of contribution from information science. Thus, it was particularly interesting to me to consider the interplay between algorithmic approaches to problem solving and “wisdom of crowd” approaches, especially for problems in the public sphere.

It’s clear that complex optimization algorithms are an advance in redistricting in Mexico, and have an important role in public policy. However, they also have a number of limitations:

  • Algorithmic optimization solutions often depend on a choice of (theoretically arbitrary) ‘starting values’ from which the algorithm starts its search for a solution.
  • Quality algorithmic solutions typically rely on accurate input data.
  • Many optimization algorithms embed particular criteria or particular constraints into the algorithm itself.
  • Even where optimization algorithms are nominally agnostic to the criteria used for the goal, some criteria are more tractable than others; and some are more tractable for particular algorithms.
  • In many cases, when an algorithm yields a solution, we don’t know exactly (or even approximately, in any formal sense) how good that solution is.

I argue that explicitly incorporating a human element is important for algorithmic solutions in the public sphere. In particular:

  • Use open documentation and open (non-patented, or open-licensed) methods to enable external replication of algorithms.
  • Use open source to enable external verification of the implementation of particular algorithms.
  • Incorporate public input to improve the data (especially describing local communities and circumstances) in algorithm driven policies.
  • Incorporate crowd-sourced solutions as candidate “starting values” for further algorithmic refinement.
  • Subject algorithmic output to crowd-sourced public review to verify the quality of the solutions produced.
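
The dependence on starting values, and the value of seeding a search with several candidate solutions, can be shown with a deliberately tiny example. The landscape below is an arbitrary list of scores standing in for plan quality; it is not a redistricting model, and `hill_climb` is a generic local search rather than any algorithm used in Mexico or in DistrictBuilder.

```python
# Toy landscape: position i has score LANDSCAPE[i]. Two peaks: index 1 and index 7.
LANDSCAPE = [1, 3, 2, 1, 0, 2, 5, 9, 4]


def score(position: int) -> int:
    return LANDSCAPE[position]


def neighbors(position: int) -> list[int]:
    """Adjacent positions that stay inside the landscape."""
    return [p for p in (position - 1, position + 1) if 0 <= p < len(LANDSCAPE)]


def hill_climb(start: int) -> int:
    """Move to the best neighbor until no neighbor improves the score."""
    current = start
    while True:
        best = max(neighbors(current), key=score)
        if score(best) <= score(current):
            return current
        current = best


def best_from(starts: list[int]) -> int:
    """Run the search from every candidate start and keep the best result."""
    return max((hill_climb(s) for s in starts), key=score)


print(hill_climb(0), hill_climb(5), best_from([0, 5]))
```

The search started at position 0 stalls on the small peak, while the one started at position 5 reaches the higher one. Treating human-drawn plans as extra starting points is the same idea: each one is a chance to begin the search in a different basin.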

You can see the slides, which include more detail and references, below. For more such slides, refer to our PublicMapping project site.

Oct 25, 5:40pm

The workshop report from the UNC Curating for Data Quality workshop, in which I was delighted to participate, is now being made available. It contains many perspectives addressing a number of questions:

Data Quality Criteria and Contexts. What are the characteristics of data quality? What threats to data quality arise at different stages of the data life cycle? What kinds of work processes affect data quality? What elements of the curatorial process most strongly affect data quality over time? How do data types and contexts influence data quality parameters?

Human and Institutional Factors. What are the costs associated with different levels of data quality? What kinds of incentives and constraints influence efforts of different stakeholders? How does one estimate the continuum from critical to tolerable errors? How often does one need to validate data?

Tools for Effective and Painless Curation. What kinds of tools and techniques exist or are required to ensure that creators and curators address data quality?

Metrics. What are or should be the measures of data quality? How does one identify errors? How does one correct errors or mitigate their effects?

My current perspective, after reflecting on seven ‘quality’ frameworks from different disciplines that differ in complex and deep ways, is that the data quality criteria implied by the candidate frameworks are neither easily harmonized, nor readily quantified. Thus, a generalized systematic approach to evaluating data quality seems unlikely to emerge soon. Fortunately, developing an effective approach to digital curation that respects data quality does not require a comprehensive definition of data quality. Instead, we can appropriately address “data quality” in curation by limiting our consideration to narrower applied questions:

Which aspects of data quality are (potentially) affected by (each stage of) digital curation activity? And how do we keep invariant data quality properties at each curation stage?

A number of approaches seem particularly likely to bear fruit:

  1. Incorporate portfolio diversification in selection and appraisal.
  2. Support validation of preservation quality attributes such as authenticity, integrity, organization, and chain of custody throughout long-term preservation and use — from ingest through delivery and creation of derivative works.
  3. Apply semantic fingerprints for quality evaluation during ingest, format migration and delivery.

These approaches have the advantage of being independent of the content subject area, of the domain of measure, and of the particular semantic content of objects and collections. Therefore, they are broadly applicable. By mitigating these broad-spectrum threats to quality, we can improve the overall quality of curated collections, and their expected value to target communities.
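
To give a feel for what a semantic fingerprint does, here is a toy version for tabular data: values are reduced to a canonical form before hashing, so that two files holding the same data in different formats produce the same fingerprint. The canonicalization rules here (seven significant digits, NFC-normalized and trimmed strings) are assumptions made for the sketch, and it does not implement any published fingerprint algorithm.

```python
import hashlib
import unicodedata


def canonical(value) -> str:
    """Reduce a cell to a format-independent string."""
    if isinstance(value, (int, float)):
        # Seven significant digits in exponential notation: 2.5 and 2.50 agree.
        return "n:" + format(float(value), ".6e")
    return "s:" + unicodedata.normalize("NFC", str(value)).strip()


def semantic_fingerprint(rows) -> str:
    """Hash the canonical form of every cell, preserving row and column order."""
    h = hashlib.sha256()
    for row in rows:
        for value in row:
            h.update(canonical(value).encode("utf-8"))
            h.update(b"\x1f")  # field separator
        h.update(b"\x1e")      # record separator
    return h.hexdigest()


before_migration = [[1, "Boston "], [2.50, "Qu\u00e9bec"]]
after_migration = [[1.0, "Boston"], [2.5, "Que\u0301bec"]]
print(semantic_fingerprint(before_migration) == semantic_fingerprint(after_migration))
```

A byte-level checksum would report these two tables as different; a fingerprint computed on canonical values reports them as the same, while still changing if any value changes beyond the chosen precision.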

My extended thoughts are here:

You may also be interested in the other presentations from the workshop, which are posted on the Conference Site.

Oct 23, 12:37pm

I was pleased to participate in the 2012 PLN Community Meeting.

Over the last decade, replication has become a required practice for digital preservation. Now, Distributed Digital Preservation (DDP) networks are emerging as a vital strategy to ensure long-term access to the scientific evidence base and cultural heritage. A number of DDP networks are currently in production, including CLOCKSS, Data-PASS, MetaArchive, COPPUL, LuKII, PeDALS, Synergies, and DataONE, and new networks, such as DFC and DPN, are being developed.

These networks were created to mitigate the risk of content loss by diversifying across software architectures, organizational structures, geographic regions, as well as legal, political, and economic environments. And many of these networks have been successful at replicating a diverse set of content.

However, the point of the replication enterprise is recovery. Archival recovery is an even harder problem because one needs to validate not only that a set of objects is recoverable, but also that the recovered collection contains sufficient metadata and contextual information to remain interpretable! A demonstration of the difficulty of this was the AIHT exercise sponsored by the Library of Congress, which demonstrated that many collections thought to be substantially “complete” could not be successfully re-ingested (i.e. recovered) by another archive, even in the absence of bit-level failures.

In a presentation co-authored with Jonathan Crabtree, we summarized some lessons learned from trial audits of several production distributed digital preservation networks. These audits were conducted using the open source SafeArchive system, which enables automated auditing of a selection of TRAC criteria related to replication and storage. An analysis of the trial audits demonstrates the complexities of auditing modern replicated storage networks, and reveals common gaps between archival policy and practice. It also reveals gaps in the auditing tools we have available. Our presentation, below, focused on the importance of designing auditing systems to provide information that can be used to diagnose non-conformance with audited policies. Tom Lipkis followed with specific planned and possible extensions in LOCKSS that would enhance diagnosis and auditing.

You may also be interested in the other presentations from the workshop, which are posted on the PLN2012 Website.

Sep 27, 2:27pm

I was pleased to participate in the NISO Forum on Tracking it Back to the Source: Managing and Citing Research Data.

A principled approach to data management involves modeling information through the lifecycle to assess stakeholder requirements at each stage, and then tracking management, use and impact of that information.

One of the complexities that lifecycle modeling reveals is the variety of different goals that are associated with data management – including orchestrating data for current use; protecting against disclosure; complying with contracts, regulation, law and policy; maximizing the overall value of held information assets; and ensuring short and long-term dissemination.

The most challenging aspects of data management are often associated with management across stages and among different actors. A number of tools and methods provide leverage, including:

  • Identifier systems – identification of information objects and actors, and structured use of these identifiers (identifiers, references, citations).
  • Metadata and tools (such as RCS or VCS) for tracking provenance — the relationship of delivered data to the history of inputs and modifications, and the actors responsible for these.
  • Systems and methods for validating authenticity and chain of custody — assertions about the provenance and ownership (respectively) of information.
  • Systems and methods for auditing — verification of asserted system properties, and policy compliance
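
As one illustration of how provenance tracking and auditing fit together, the sketch below keeps a hash-chained log: each record names an actor, an action, and the digest of the resulting data, and also embeds the hash of the previous record. The record fields and function names are invented for this example and do not correspond to any particular provenance standard or tool.

```python
import hashlib
import json


def _hash(body: dict) -> str:
    """Hash a record body in a stable (sorted-key) JSON serialization."""
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode("utf-8")).hexdigest()


def add_record(log: list[dict], actor: str, action: str, data: bytes) -> dict:
    """Append a provenance record that is chained to the previous one."""
    body = {
        "actor": actor,
        "action": action,
        "data_sha256": hashlib.sha256(data).hexdigest(),
        "prev_hash": log[-1]["record_hash"] if log else "",
    }
    record = dict(body, record_hash=_hash(body))
    log.append(record)
    return record


def verify(log: list[dict]) -> bool:
    """Check that no record was altered and that the chain is unbroken."""
    prev = ""
    for record in log:
        body = {k: v for k, v in record.items() if k != "record_hash"}
        if body["prev_hash"] != prev or _hash(body) != record["record_hash"]:
            return False
        prev = record["record_hash"]
    return True
```

Because each record commits to its predecessor, changing who did what at an earlier stage invalidates every later record, which is what lets an auditor check the asserted history rather than take it on trust.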

The presentation further examines data management and data citation from an information lifecycle approach:

You may also be interested in the other presentations from the workshop, which will soon be posted on the NISO Forum website.

Jul 31, 11:10am

My colleagues from the National Digital Stewardship Alliance working group on Infrastructure and I were pleased to lead a session on bit-level preservation at the 2012 annual Digital Preservation conference, hosted by the Library of Congress.

Bit-level preservation is far from a solved problem. While we know roughly what transformations and processes can be used to mitigate risk from major threats, there is considerable (and largely applied) research to be done to determine optimal/cost-effective levels and strategies for replication, diversification, compression, and auditing; and to develop better (more reliable & valid) measures of risk.

The talk summarized the major risk factors and mitigation strategies, and noted some inter-relationships:

You may also be interested in other presentations from the conference. Bill LeFurgy has an informative blog post with highlights.

David Weinberger’s talk was particularly provocative, drawing on themes from his recent book Too Big to Know. He claims, essentially, that the increase in data, and even more, the networking of information and people, changes the nature of knowledge itself. Knowledge has constituted a series of stopping points (authoritative sources, authoritative texts); a corresponding set of institutions and practices to “filter out” bad information; and a physical representation constrained by the form, length, and relative unalterability of printed books. Now, Weinberger claims, knowledge is increasingly a process of filtering forward: of providing summaries and links to a much larger and more dynamic knowledge base. This redefines knowledge, changes the role of institutions (which cannot hope to contain all knowledge in an area), and implies that (a) filters are content; (b) we are forced into awareness of the contingent and limited nature of filters, since there is always bad information and contradictory information available. Changes in knowledge also change the nature of expertise and science, both becoming less hierarchical, more diverse, more linked.

If Weinberger is right, and I suspect he is (in large part), there are undiscussed implications for digital preservation. First, our estimates of the expected value of long-term access should be going up if the overall value of knowledge is increased by the total context of knowledge available. Second, we need to go beyond preserving individual information objects, or even “complete” collections: value resides in the network as a whole, and in the filters being used. Maintaining our cultural heritage and scientific evidence base requires enabling historic access to this dynamic network of information.