Jul 25, 4:49pm

This presentation, invited for a workshop on data preservation for open science held at JCDL 2013, gives a brief tour of a large topic: how do we understand the types of data and software used in social science research? In this presentation I characterize the intellectual landscape across nine dimensions of data structure, content, measure, and use. I then use this framework to characterize three interesting use cases.

This illustrates some particular challenges for long-term access to and replication of social science research, including the use of “messy” human sensors; the wide mix of data types, structures, and sparsity; complex legal constraints; pervasive use of manual and computer-aided coding; use of niche commercial software and bespoke software; and very long-term access requirements.

Jul 19, 5:43pm

MIT has a wonderful tradition of offering a variety of short courses during the Winter hiatus, known as IAP. The MIT Libraries generally offer dozens of courses on data and information management (among other topics) during this time. Moreover, the Libraries also hold bonus IAP sessions in April and July.

So, for this year’s JulyAP, I updated a tutorial on managing confidential data, continuing to integrate the things we’ve learned in our privacy research project. This short course focuses on the practical side of managing confidential research data, but gives some pointers to the research areas.

The slides are below:

Jul 07, 9:23am

My collaborator Michael McDonald and I are now analyzing the data that resulted from the crowd-sourcing participative electoral mapping projects we were involved in, and from other such projects. Our earlier article, analyzing redistricting in Virginia, established that members of the public are capable of creating legal redistricting plans, and in many ways perform better than the legislature in doing so.

Our latest analysis, based on data from Congressional redistricting in Florida, reinforces this finding. Furthermore, it highlights some of the limits of reform efforts, and the structural tradeoffs among redistricting criteria.

The reforms to process in Florida, catalyzed by advances in information technology, enabled a dramatic increase in public participation in the redistricting process. This reform process in Florida can be considered a partial success: the adopted plan implements one of the most efficient observable trade-offs among the reformers’ criteria, primarily along the lines of racial representation, by creating an additional Black-majority district in the form of the current 5th Congressional District. This does not mean, however, that reform was entirely successful. The adopted plan is efficient, but is atypical of the plans submitted by the legislature and public. Based on the pattern of public submissions, and on contextual information, we suspect the adopted plan was drawn for partisan motivations. The public preference and good-government criteria might be better served by selecting one of the other efficient plans, which were much more competitive and less biased, at the cost of a reduction in majority-minority seats.

Most of these trends can be made clear through information visualization methods. The figure below draws a line representing the Pareto (efficient) frontier in two dimensions to illustrate the major trade-offs between number of partisan seats and other criteria.

One of the visuals is below: it shows the tradeoffs between partisan advantage and different representational criteria, based on the Pareto frontier of the publicly available plans.



These frontiers suggest that some criteria are more constraining on Democratic redistricters than on Republican redistricters. On the one hand, equipopulation is equally constraining on Democratic and Republican partisan seat creation. However, the data suggest a structural trade-off between Black-majority seats and Democratic seats.
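For readers who have not met the term, here is a minimal sketch of how a two-dimensional Pareto frontier can be picked out of a set of plans. The plan names and scores are invented for illustration and are not taken from the Florida data, and the two criteria (Democratic-leaning seats and Black-majority seats, both treated as "more is better" from one party's point of view) are only one possible pairing. The article's own computation may differ.

```python
from typing import NamedTuple


class Plan(NamedTuple):
    name: str
    dem_seats: int       # hypothetical score: Democratic-leaning seats
    black_majority: int  # hypothetical score: Black-majority districts


def dominates(a: Plan, b: Plan) -> bool:
    """True if a is at least as good as b on both criteria and strictly better on one."""
    at_least_as_good = (a.dem_seats >= b.dem_seats
                        and a.black_majority >= b.black_majority)
    strictly_better = (a.dem_seats > b.dem_seats
                       or a.black_majority > b.black_majority)
    return at_least_as_good and strictly_better


def pareto_frontier(plans: list) -> list:
    """Keep only the plans that no other plan dominates."""
    return [p for p in plans if not any(dominates(q, p) for q in plans)]


# Invented example scores, for illustration only.
plans = [
    Plan("A", 10, 3),
    Plan("B", 12, 2),
    Plan("C", 9, 3),
    Plan("D", 11, 2),
    Plan("E", 12, 1),
]
frontier = pareto_frontier(plans)
print([p.name for p in frontier])
```

In this toy data, plans A and B sit on the frontier: moving from A to B gains Democratic-leaning seats only by giving up a Black-majority seat, which is the kind of structural trade-off described above.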

More details on the data, evaluations, etc. appear in the article (uncorrected manuscript).

Jul 02, 11:01am

The structure and design of digital storage systems is a cornerstone of digital preservation. To better understand ongoing storage practices of organizations committed to digital preservation, the National Digital Stewardship Alliance conducted a survey of member organizations. This talk, presented as part of the Program on Information Science seminar series, discusses findings from this survey, common gaps, and trends in this area.

(I also have a little fun highlighting the hidden assumptions underlying Amazon Glacier’s reliability claims. For more on that see this earlier post.)

Jun 26, 3:11pm

Three weeks ago, NIH issued a request for information to solicit comments on the development of an NIH Data Catalog as part of its overall Big Data to Knowledge (BD2K) Initiative.

The Data Preservation Alliance for Social Sciences issued a response to which I contributed. Two sections are of general interest to the library/stewardship community:

Common Data Citation Principles and Practices


While there are a number of different communities of practice around data citation, a number of common principles and practices can be identified. 


The editorial policy of Science [see ] is an exemplar of two principles for data citations: first, that published claims should cite the evidence and methods upon which they rely, and second, that things cited should be available for examination by the scientific community. These principles have been recognized across a set of communities and expert reports, and are increasingly being adopted by a number of other leading journals. [See Altman 2012; CODATA-ICSTI Task Group, 2013; and]


Implementation Considerations


Previous policies aiming to facilitate open access to research data have often failed to achieve their promise in implementation. Effective implementation requires standardizing core practices, aligning stakeholder incentives, reducing barriers to long-term access, and building in evaluation mechanisms.


A set of core recognized good practices has emerged that spans fields. Good practice includes separating the elements of citation from the presentation; including in the elements identifier, title, author, and date information, and where at all possible version and fixity information; and listing data citations in the same place as citations to other works, typically in the references section. [See Altman-King 2006; Altman 2012; CODATA-ICSTI Task Group 2013; ; ]
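To make the separation of citation elements from presentation concrete, here is a minimal sketch. The class name, field names, rendering style, and the example values (including the placeholder identifier and fixity string) are all assumptions made for illustration; they do not reproduce any particular citation standard.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class DataCitation:
    """The elements of a data citation, independent of how they are displayed."""
    authors: Tuple[str, ...]
    title: str
    date: str
    identifier: str                # persistent identifier (placeholder below)
    version: Optional[str] = None  # include where at all possible
    fixity: Optional[str] = None   # e.g. a checksum of the cited data


def render(c: DataCitation) -> str:
    """One possible presentation; other styles can reuse the same elements."""
    names = "; ".join(c.authors)
    text = f'{names} ({c.date}). "{c.title}". {c.identifier}'
    if c.version is not None:
        text += f", version {c.version}"
    if c.fixity is not None:
        text += f", fixity {c.fixity}"
    return text + "."


# Hypothetical example values, not a real dataset.
example = DataCitation(
    authors=("Doe, J.", "Roe, R."),
    title="Example Survey",
    date="2013",
    identifier="doi:10.0000/example",
    version="2",
    fixity="sha256:abc123",
)
print(render(example))
```

Because the elements live in one structure, a journal can change its display style by swapping out the rendering function without touching the stored citation.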


Although the incentives related to data citation and access are complex, there are a number of simple points of leverage. First, journals can create positive incentives for sharing data by requiring that data be properly cited. Second, funders can require that only those outputs of research that comply with access and citation policies can be claimed as results from prior research.

You may read the full response on the Data-PASS site. 

Jun 20, 10:31am

Metadata can be defined variously as “data about data”, digital ‘breadcrumbs’, magic pixie dust, and “something that everyone now knows the NSA wants a lot of”. It’s all of the above.

Metadata is used to support decisions and workflows, to add value to objects (through enhancing discovery, use, reuse, and integration), and to support evaluation and analysis. It’s not the whole story for any of these things, but it can be a big part.

This presentation, invited for a workshop on Open Access and Scholarly Books (sponsored by the Berkman Center and Knowledge Unlatched), provides a very brief overview of metadata design principles, approaches to evaluation metrics, and some relevant standards and exemplars in scholarly publishing. It is intended to provoke discussion on approaches to evaluation of the use, characteristics, and value of OA publications.

Jun 11, 11:51am

Best practices aren’t.

The core issue is that there are few models for the systematic valuation of data: we have no robust, general, proven ways of answering the question of how much data X will be worth to community Y at time Z. Thus the “bestness” (optimality) of practices is generally strongly dependent on operational context, and the context of data sharing is currently both highly complex and dynamic. Until there is systematic descriptive evidence that best practices are used, predictive evidence that best practices are associated with future desired outcomes, and causal evidence that the application of best practices yields improved outcomes, we will be unsure that practices are “best”.

Nevertheless, one should use established “not-bad” practices, for a number of reasons. First, to avoid practices that are clearly bad; second, because use of such practices acts to document operational and tacit knowledge; third, because selecting practices can help to elicit the underlying assumptions under which practices are applied; and finally, because not-bad practices provide a basis for auditing, evaluation, and eventual improvement.

Specific not-bad practices for data sharing fall into roughly three categories:

  • Analytic practices: lifecycle analysis & requirements analysis
  • Policy practices for: data dissemination, licensing, privacy, availability, citation and reproducibility
  • Technical practices for sharing and reproducibility, including fixity, replication, provenance

This presentation at the Second Open Economics International Workshop (sponsored by the Sloan Foundation, MIT and OKFN) provides an overview of these and links to specific practices recommendations, standards, and tools:

Jan 10, 4:37pm

MIT has a wonderful tradition of offering a variety of short courses during the Winter hiatus between semesters, known as IAP (Independent Activities Period). These include how-to sessions, forums, athletic endeavors, lecture series, films, tours, recitals, and contests. The MIT Libraries are offering dozens of courses on data and information management (among other topics); I participated in a roundtable session on data management.

IAP seems like an opportunity to pass on some of the invisible knowledge of the academy: things like project management for science; managing bibliographies; care and feeding of professional networks; maintaining your tenure file; responding to reviewers; turning a dissertation into a book; communicating your work to the public and media; or writing compelling proposals.

So, for this year’s session, I updated my long-running “Getting Funding for your Research” course with new resources, statistics, and MIT-specific information. This short course focuses on communicating research projects and ideas in the form of proposals for support. The slides are below:

I aim to convert this to a webinar this year. Also, many of the main points are summarized in an article I’d written a few years ago, “Funding, Funding”.

Jan 08, 1:33pm

Yes…at least in Virginia.

My collaborator Michael McDonald and I are just now catching up on analyzing the data that resulted from the crowd-sourcing participative electoral mapping projects we were involved in. The results were surprisingly clear and consistent:

  • Students are quite capable of creating legal districting plans.
  • Student plans generally demonstrated a wider range of possibilities as compared to legislative plans.
  • The ‘best’ plan, as ranked by each individual criterion, was a student plan.
  • The student plans covered a larger set of possible tradeoffs among the criteria.
  • Student plans were generally better on pairs of criteria. 
  • Student plans were more competitive and had more partisan balance than any of the adopted plans. 

Most of these trends can be made clear through information visualization methods.

One of the visuals is below: it shows how each of the plans scored on different pairs of criteria. The blue ellipses contain the student plans, the green ones contain the legislative plans, and the red lines show the commission plans. A tiny ‘A’ shows the adopted plan. In each mini-plot the top-right corner is where the theoretically best scores are.



Notice that the blue ellipses are almost always bigger, contain most of the green ellipses (or at least the best-scoring parts), and extend further toward the top-right corner. Go students!

More details on the data, evaluations, etc. appear in the article (pre-copy-edited version).

Dec 25, 12:15pm

The core ideas around distributed digital preservation are well-recognized in the library and archival communities. Geographically distributed replication of content has been clearly identified for at least a decade as a best practice for management of digital content, to which one must provide long-term access. The idea of using replication to reduce the risk of loss was explicit in the design of LOCKSS a decade ago. More recently, replication has been incorporated into best practices and standards. In order to be fully “trusted”, an organization must have a managed process for creating, maintaining, and verifying multiple geographically distributed copies of its collections — and this is now recognized as good community practice. Further, this requirement has been incorporated in Trustworthy Repositories Audit & Certification (TRAC) (see section C.1.2-C.1. in [8]), and in the newly released ISO standard that has followed it.

However, while geographic replication can be used to mitigate some risks (e.g. hardware failure), less technical risks such as curatorial error, internal malfeasance, economic failure, and organizational failure require that replications be diversified across distributed, collaborative organizations. The LOCKSS research team has also identified a taxonomy of potential single points of failure (highly correlated risks) that, at minimum, a trustworthy preservation system should mitigate. These risks include media failure, hardware failure, software failure, communication errors, network failure, media and hardware obsolescence, software obsolescence, operator error, natural disaster, external attack, internal attack, economic failure and organizational failure [9].

There are a number of existing organizational/institutional efforts to mitigate threats to the digital scholarly record (and scientific evidence base) through replication of content. Of these, the longest continuously-running organization is LOCKSS. (Currently, the LOCKSS organization manages the Global LOCKSS Network, in which over one hundred libraries collaborate to replicate over 9000 e-journal titles from over five hundred publishers. Furthermore, there are ten different membership organizations currently using the LOCKSS system to host their own separate replication networks.) Each of these membership organizations (or quasi-organizations) represents a variation on the collaborative preservation model, comprises separate institutional members, and targets a different set of content.

Auditing is especially challenging for distributed digital preservation, and essential for four reasons:

1. Replication can prevent long-term loss of content only when loss or corruption of a copy is detected and repaired using other replicas; this is a form of low-level auditing. Without detection and repair, it is a statistical near-certainty that content will be lost to non-malicious threats. Surprisingly, one reduces risks far more effectively by having a few replicas and verifying their fixity very frequently than one does by having more replicas and less frequent auditing, at least for the types of threats that affect individual (or small groups of) content objects at random (e.g. many of the losses that result from media failure, hardware failure, software failure, and curatorial error).

2. The point of the replication enterprise is recovery. Regular restore/recovery audits that test the ability to restore the entire collection or randomly selected collections/items are considered good practice even to establish the reliability of short-term backups, and are required by IT disaster-recovery planning frameworks. Archival recovery is an even harder problem, because one needs to validate not only that a set of objects is recoverable, but also that the recovered collection contains sufficient metadata and contextual information to remain interpretable! A demonstration of the difficulty of this was the AIHT exercise sponsored by the Library of Congress, which showed that many collections thought to be substantially “complete” could not be successfully re-ingested (i.e. recovered) by another archive even in the absence of bit-level failures. [28,29] Because DPN is planned as a dark repository, recovery must be demonstrated within the system; it cannot be demonstrated through active use.

3. Transparent, regular, and systematic auditing is one of the few available safeguards against insider/internal threats.

4. Auditing of compliance with higher-level replication policies is recognized as essential for managing risks generally. For scalability, auditing of policies that apply to the individual management of collections must be automated. In order to automate these policies, there must be a transparent and reliable mapping from the higher-level policies governing collections and their contents to the information and controls provided by the technical infrastructure.
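The detect-and-repair cycle described in the first point can be sketched in a few lines. This is a deliberately simplified illustration, not how LOCKSS or SafeArchive work: it assumes the replicas are held in memory, keyed by hypothetical site names, and it treats the digest held by a strict majority of sites as correct. Production systems use more elaborate polling and typically also compare against independently recorded fixity information.

```python
import hashlib
from collections import Counter


def digest(data: bytes) -> str:
    """SHA-256 fixity value for one copy of an object."""
    return hashlib.sha256(data).hexdigest()


def audit_and_repair(replicas: dict) -> list:
    """Compare copies across sites and overwrite any that disagree with the majority.

    `replicas` maps a (hypothetical) site name to the bytes stored there.
    Returns the names of the sites that were repaired; mutates `replicas`.
    """
    digests = {site: digest(blob) for site, blob in replicas.items()}
    winner, votes = Counter(digests.values()).most_common(1)[0]
    if votes <= len(replicas) / 2:
        # Assumption: without a strict majority we refuse to guess.
        raise ValueError("no majority among replicas; manual review needed")
    good_copy = next(blob for site, blob in replicas.items()
                     if digests[site] == winner)
    damaged = [site for site in replicas if digests[site] != winner]
    for site in damaged:
        replicas[site] = good_copy
    return damaged


# Toy example: one of three copies has been silently corrupted.
replicas = {
    "site_a": b"content",
    "site_b": b"content",
    "site_c": b"c0ntent",
}
repaired = audit_and_repair(replicas)
print(repaired)
```

The sketch also shows why audit frequency matters: repair is only possible while a majority of good copies still exists, so damage has to be caught before it accumulates across sites.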

This presentation, delivered at CNI 2012, summarizes the lessons learned from trial audits of several production distributed digital preservation networks. These audits were conducted using the open source SafeArchive system, which enables automated auditing of a selection of TRAC criteria related to replication and storage. An analysis of the trial audits both demonstrates the complexities of auditing modern replicated storage networks and reveals common gaps between archival policy and practice. Recommendations for closing these gaps are discussed, as are extensions that have been added to the SafeArchive system to mitigate risks in distributed digital preservation (DDP).