Blog

Jul 19, 5:43pm

MIT has a wonderful tradition of offering a variety of short courses during the Winter hiatus, known as IAP. The MIT Libraries generally offer dozens of courses on data and information management (among other topics) during this time. The Libraries also hold bonus IAP sessions in April and July.

So, for this year’s JulyAP, I updated a tutorial on managing confidential data, continuing to integrate the things we’ve learned in our privacy research project. This short course focuses on the practical side of managing confidential research data, but gives some pointers to the research areas.

The slides are below:


Jul 07, 9:23am

My collaborator Michael McDonald and I are now analyzing the data that resulted from the crowd-sourced, participatory electoral mapping projects that we and others were involved in. Our earlier article, analyzing redistricting in Virginia, established that members of the public are capable of creating legal redistricting plans, and in many ways perform better than the legislature in doing so.

Our latest analysis, based on data from Congressional redistricting in Florida, reinforces this finding. Furthermore, it highlights some of the limits of reform efforts and the structural tradeoffs among redistricting criteria.

The reforms to the process in Florida, catalyzed by advances in information technology, enabled a dramatic increase in public participation in the redistricting process. This reform process can be considered a partial success: the adopted plan implements one of the most efficient observable trade-offs among the reformers’ criteria, primarily along the lines of racial representation, by creating an additional Black-majority district in the form of the current 5th Congressional District. This does not mean, however, that reform was entirely successful. The adopted plan is efficient, but it is atypical of the plans submitted by the legislature and the public. Based on the pattern of public submissions, and on contextual information, we suspect the adopted plan was drawn with partisan motivations. The public preference and good-government criteria might have been better served by selecting one of the other efficient plans, which were much more competitive and less biased, at the cost of one fewer majority-minority seat.

Most of these trends can be made clear through information visualization methods. The figure below draws a line representing the Pareto (efficient) frontier in two dimensions to illustrate the major trade-offs between the number of partisan seats and other criteria.

One of the visuals is below — it shows the trade-offs between partisan advantage and different representational criteria, based on the Pareto frontier of publicly available plans.

[Figure (click to enlarge): Pareto-frontier trade-offs between partisan advantage and other criteria (altman_mcdonald_fl)]

These frontiers suggest that some criteria are more constraining on Democratic redistricters than on Republican redistricters. Equipopulation, for example, is equally constraining on Democratic and Republican partisan seat creation. However, the data suggest a structural trade-off between Black-majority seats and Democratic seats.
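For readers who want to experiment with this kind of frontier analysis, here is a minimal sketch of computing a two-dimensional Pareto frontier over plan scores. This is not our analysis code; the plan names and scores are hypothetical placeholders, and the snippet only illustrates the dominance test behind the frontier.

```python
# Minimal sketch: compute a 2D Pareto (efficient) frontier over plan scores.
# The plan names and scores below are hypothetical placeholders, not Florida data.

plans = [
    {"id": "public-001", "dem_seats": 12, "black_majority_seats": 2},
    {"id": "public-002", "dem_seats": 11, "black_majority_seats": 3},
    {"id": "legislative-A", "dem_seats": 10, "black_majority_seats": 3},
    {"id": "adopted", "dem_seats": 10, "black_majority_seats": 4},
]

def pareto_frontier(points, criteria):
    """Return the points not dominated on the given criteria (higher assumed better)."""
    frontier = []
    for p in points:
        dominated = any(
            all(q[c] >= p[c] for c in criteria) and any(q[c] > p[c] for c in criteria)
            for q in points
        )
        if not dominated:
            frontier.append(p)
    return frontier

efficient = pareto_frontier(plans, ["dem_seats", "black_majority_seats"])
print([p["id"] for p in efficient])
# With these toy scores: ['public-001', 'public-002', 'adopted'];
# 'legislative-A' is dominated because 'public-002' matches or beats it on both criteria.
```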

More details on the data, evaluations, etc. appear in the article (uncorrected manuscript).


Jul 02, 11:01am

The structure and design of digital storage systems is a cornerstone of digital preservation. To better understand the ongoing storage practices of organizations committed to digital preservation, the National Digital Stewardship Alliance conducted a survey of member organizations. This talk, presented as part of the Program on Information Science seminar series, discusses findings from this survey, common gaps, and trends in this area.

(I also have a little fun highlighting the hidden assumptions underlying Amazon Glacier’s reliability claims. For more on that see this earlier post.)


Jun 26, 3:11pm

Three weeks ago, NIH issued a request for information to solicit comments on the development of an NIH Data Catalog as part of its overall Big Data to Knowledge (BD2K) Initiative.

The Data Preservation Alliance for the Social Sciences (Data-PASS) issued a response to which I contributed. Two sections are of general interest to the library/stewardship community:

Common Data Citation Principles and Practices

While there are a number of different communities of practice around data citation, several common principles and practices can be identified.

The editorial policy of Science [see http://www.sciencemag.org/site/feature/contribinfo/prep/ ] is an exemplar of two principles for data citation: first, that published claims should cite the evidence and methods upon which they rely; and second, that the things cited should be available for examination by the scientific community. These principles have been recognized across a set of communities and expert reports, and are increasingly being adopted by a number of other leading journals. [See Altman 2012; CODATA-ICSTI Task Group, 2013; and http://www.force11.org/AmsterdamManifesto]

Implementation Considerations

Previous policies aiming to facilitate open access to research data have often failed to achieve their promise in implementation. Effective implementation requires standardizing core practices, aligning stakeholder incentives, reducing barriers to long-term access, and building in evaluation mechanisms.

A set of core recognized good practices has emerged that spans fields. Good practice includes separating the elements of a citation from its presentation; including among the elements identifier, title, author, and date information and, wherever possible, version and fixity information; and listing data citations in the same place as citations to other works – typically in the references section. [See Altman-King 2006; Altman 2012; CODATA-ICSTI Task Group 2013; http://schema.datacite.org/ ; http://data-pass.org/citations.html ]
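As a purely illustrative sketch of what separating the elements from the presentation can look like in practice, consider the following; the field names, DOI, and UNF value are hypothetical placeholders rather than a real record.

```python
# Minimal sketch: keep citation *elements* as structured data, render *presentation* separately.
# All values are illustrative placeholders, not a real dataset record.

citation_elements = {
    "author": "Doe, Jane",
    "title": "Example Survey Dataset",
    "date": "2013",
    "identifier": "doi:10.1234/EXAMPLE",    # persistent identifier (placeholder)
    "version": "V2",                        # version information, where possible
    "fixity": "UNF:5:placeholderhash==",    # fixity information, where possible
}

def render_reference(e):
    """One possible presentation of the same elements, for a references section."""
    return (f"{e['author']} ({e['date']}). {e['title']} [{e['version']}]. "
            f"{e['identifier']}, {e['fixity']}.")

print(render_reference(citation_elements))
```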


Although the incentives related to data citation and access are complex, there are a number of simple points of leverage. First, journals can create positive incentives for sharing data by requiring that data be properly cited. Second, funders can require that only those outputs of research that comply with access and citation policies may be claimed as results from prior research.

You may read the full response on the Data-PASS site. 


Jun 20, 10:31am

Metadata can be defined variously as “data about data”, digital ‘breadcrumbs’, magic pixie dust, and “something that everyone now knows the NSA wants a lot of”. It’s all of the above.

Metadata is used to support decisions and workflows, to add value to objects (by enhancing discovery, use, reuse, and integration), and to support evaluation and analysis. It’s not the whole story for any of these things, but it can be a big part.

This presentation, invited for a workshop on Open Access and Scholarly Books (sponsored by the Berkman Center and Knowledge Unlatched), provides a very brief overview of metadata design principles, approaches to evaluation metrics, and some relevant standards and exemplars in scholarly publishing. It is intended to provoke discussion on approaches to evaluation of the use, characteristics, and value of OA publications.


Jun 11, 11:51am

Best practices aren’t.

The core issue is that there are few models for the systematic valuation of data: we have no robust, general, proven ways of answering the question of how much data X would be worth to community Y at time Z. Thus the “bestness” (optimality) of practices is generally strongly dependent on operational context, and the context of data sharing is currently both highly complex and dynamic. Until there is systematic descriptive evidence that best practices are used, predictive evidence that best practices are associated with future desired outcomes, and causal evidence that the application of best practices yields improved outcomes, we will be unsure that practices are “best”.

Nevertheless, one should use established “not-bad” practices, for a number of reasons: first, to avoid practices that are clearly bad; second, because the use of such practices acts to document operational and tacit knowledge; third, because selecting practices can help to elicit the underlying assumptions under which practices are applied; and finally, because not-bad practices provide a basis for auditing, evaluation, and eventual improvement.

Specific not-bad practices for data sharing fall into roughly three categories:

  • Analytic practices: lifecycle analysis and requirements analysis
  • Policy practices: for data dissemination, licensing, privacy, availability, citation, and reproducibility
  • Technical practices: for sharing and reproducibility, including fixity, replication, and provenance (a minimal fixity sketch appears below)
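The fixity sketch referenced above is a minimal, illustrative example of recording and later re-verifying checksums for a directory of shared data. The directory and manifest paths are hypothetical, and this is a sketch of the general practice rather than a recommendation of a particular tool.

```python
# Minimal fixity sketch: record SHA-256 digests for a set of files, then re-verify them later.
# File and manifest paths are hypothetical placeholders.
import hashlib
import json
from pathlib import Path

def digest(path, chunk_size=1 << 20):
    """Compute the SHA-256 digest of a file, reading it in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def write_manifest(data_dir, manifest_path="manifest.json"):
    """Record a digest for every file under data_dir."""
    manifest = {str(p): digest(p) for p in Path(data_dir).rglob("*") if p.is_file()}
    Path(manifest_path).write_text(json.dumps(manifest, indent=2))

def verify_manifest(manifest_path="manifest.json"):
    """Return the list of files whose current digest no longer matches the manifest."""
    manifest = json.loads(Path(manifest_path).read_text())
    return [path for path, expected in manifest.items() if digest(path) != expected]

# Example usage (hypothetical directory):
# write_manifest("shared_data/")
# print(verify_manifest())   # lists any files whose contents changed
```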

This presentation, at the Second Open Economics International Workshop (sponsored by the Sloan Foundation, MIT, and OKFN), provides an overview of these and links to specific practice recommendations, standards, and tools:


Jan 10, 4:37pm

MIT has a wonderful tradition of offering a variety of short courses during the Winter hiatus between semesters, known as IAP (Independent Activities Period). These include how-to sessions, forums, athletic endeavors, lecture series, films, tours, recitals, and contests. The MIT Libraries are offering dozens of courses on data and information management (among other topics) — I participated in a roundtable session on data management.

The IAP period seems like an opportunity to pass on some of the invisible knowledge of the academy: things like project management for science; managing bibliographies; care and feeding of professional networks; maintaining your tenure file; responding to reviewers; turning a dissertation into a book; communicating your work to the public and media; or writing compelling proposals.

So, for this year’s session, I updated my long-running “Getting Funding for your Research Course” with new resources, statistics, and MIT-specific information. This short course focuses on the area of communicating research projects and ideas in the form of proposals for support. The slides are below:

I aim to convert this to a webinar this year. Also, many of the main points are summarized in an article I’d written a few years ago, “Funding, Funding”.


Jan 08, 1:33pm

Yes…at least in Virginia.

My collaborator Michael McDonald and I are just now catching up on analyzing the data that resulted from the crowd-sourced, participatory electoral mapping projects we were involved in. The results were surprisingly clear and consistent:

  • Students are quite capable of creating legal districting plans.
  • Student plans generally demonstrated a wider range of possibilities as compared to legislative plans.
  • The ‘best’ plan, as ranked by each individual criterion, was a student plan.
  • The student plans covered a larger set of possible tradeoffs among each criterion.
  • Student plans were generally better on pairs of criteria. 
  • Student plans were more competitive and had more partisan balance than any of the adopted plans. 

Most of these trends can be made clear through information visualization methods.

One of the visuals is below — it shows how each of the plans scored on different pairs of criteria. The blue ellipses contain the student plans, the green ones contain the legislative plans, and the red lines show the commission plans. A tiny ‘A’ shows the adopted plan. In each mini-plot, the top-right corner is where the theoretically best scores are.

[Figure (click to enlarge): pairwise criterion scores for Virginia House plans (house)]

Notice that the blue ellipses are almost always bigger, contain most of the green ellipses (or at least their best-scoring parts), and extend further toward the top-right corner. Go students!

More details on the data, evaluations, etc. appear in the article (pre-copy-edited version).


Dec 25, 12:15pm

The core ideas around distributed digital preservation are well recognized in the library and archival communities. Geographically distributed replication of content has been clearly identified for at least a decade as a best practice for managing digital content to which one must provide long-term access. The idea of using replication to reduce the risk of loss was explicit in the design of LOCKSS a decade ago. More recently, replication has been incorporated into best practices and standards. In order to be fully “trusted”, an organization must have a managed process for creating, maintaining, and verifying multiple geographically distributed copies of its collections — this is now recognized as good community practice. Further, this requirement has been incorporated in Trustworthy Repositories Audit & Certification (TRAC) (see section C.1.2-C.1. in [8]), and in the newly released ISO standard that has followed it.

However, while geographic replication can be used to mitigate some risks (e.g. hardware failure), less technical risks such as curatorial error, internal malfeasance, economic failure, and organizational failure require that replicas be diversified across distributed, collaborative organizations. The LOCKSS research team has also identified a taxonomy of potential single points of failure (highly correlated risks) that, at a minimum, a trustworthy preservation system should mitigate. These risks include media failure, hardware failure, software failure, communication errors, network failure, media and hardware obsolescence, software obsolescence, operator error, natural disaster, external attack, internal attack, economic failure, and organizational failure [9].

There are a number of existing organizational/institutional efforts to mitigate threats to the digital scholarly record (and scientific evidence base) through replication of content. Of these, the longest continuously running organization is LOCKSS. (Currently, the LOCKSS organization manages the Global LOCKSS Network, in which over one hundred libraries collaborate to replicate over 9,000 e-journal titles from over five hundred publishers. Furthermore, ten different membership organizations currently use the LOCKSS system to host their own separate replication networks.) Each of these membership organizations (or quasi-organizations) represents a variation on the collaborative preservation model, comprises separate institutional members, and targets a different set of content.

Auditing is especially challenging for distributed digital preservation, and essential for four reasons:

1. Replication can prevent long-term loss of content only when loss or corruption of a copy is detected and repaired using other replicas — this is a form of low-level auditing. Without detection and repair, it is a statistical near-certainty that content will be lost to non-malicious threats. Surprisingly, one reduces risk far more effectively by having a few replicas and verifying their fixity very frequently than by having more replicas and auditing them less frequently — at least for the types of threats that affect individual (or small groups of) content objects at random (e.g. many forms of media failure and hardware, software, and curatorial error). A toy calculation illustrating this appears after this list.

2. The point of the replication enterprise is recovery. Regular restore/recovery audits that test the ability to restore the entire collection or randomly selected collections/items are considered good practice even to establish the reliability of short term backups and are required by IT disaster-recovery planning frameworks. Archival recovery is an even harder problem because one needs to validate not only that a set of objects are recoverable, but that the collection recovered also contains sufficient metadata and contextual information to remain interpretable! A demonstration of the difficulty of this was the AIHT exercise sponsored by the Library of Congress, which demonstrated that many collections thought to be substantially “complete” could not be successfully re-ingested (i.e. recovered) by another archive even in the absence of bit-level failures. [28,29] Because DPN is planned as a dark repository, recovery must be demonstrated within the system — it cannot be demonstrated through active use.

3. Transparent, regular, and systematic auditing is one of the few available safeguards against insider/internal threats.

4. Auditing of compliance with higher-level replication policies is recognized as essential for managing risks generally. For scalability, auditing of policies that apply to the management of individual collections must be automated. In order to automate these policies, there must be a transparent and reliable mapping from the higher-level policies governing collections and their contents to the information and controls provided by the technical infrastructure.
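The toy calculation promised in point 1 above is a back-of-the-envelope model, under the strong assumptions that per-replica failures are independent and that content is lost only if every replica fails within a single audit-and-repair interval. The 5% annual per-replica failure probability is an arbitrary illustrative assumption, not a measured rate for any real system.

```python
# Toy model of replication + auditing: content is lost only if *all* replicas fail
# within the same audit/repair interval. All numbers are illustrative assumptions.

def annual_loss_probability(replicas, audits_per_year, annual_failure_prob=0.05):
    # Probability that a single replica fails within one audit interval.
    q = 1 - (1 - annual_failure_prob) ** (1 / audits_per_year)
    # Loss in an interval requires every replica to fail before repair.
    loss_per_interval = q ** replicas
    # Probability of at least one such loss over a year of intervals.
    return 1 - (1 - loss_per_interval) ** audits_per_year

for replicas, audits_per_year, label in [
    (3, 365, "3 replicas, daily fixity audit"),
    (3, 12,  "3 replicas, monthly fixity audit"),
    (6, 1,   "6 replicas, annual fixity audit"),
]:
    p = annual_loss_probability(replicas, audits_per_year)
    print(f"{label}: ~{p:.1e} annual loss probability")

# Under these assumptions, three daily-audited replicas lose content less often
# than six annually audited ones.
```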

This presentation, delivered at CNI 2012, summarizes the lessons learned from trial audits of several production distributed digital preservation networks. These audits were conducted using the open source SafeArchive system, which enables automated auditing of a selection of TRAC criteria related to replication and storage. An analysis of the trial audits demonstrates both the complexities of auditing modern replicated storage networks, and reveals common gaps between archival policy and practice. Recommendations for closing these gaps are discussed, as are extensions that have been added to the SafeArchive system to mitigate risks in distributed digital preservation (DDP).


Nov 15, 1:07pm

Amazon recently announced integration of its core S3 service with its low-cost storage system, Glacier. This makes it possible to add lifecycle rules to S3 (or its reduced redundancy store) that move objects to Glacier based on age, date, and S3 bucket prefix.
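As a rough illustration of what such a rule looks like (using today's boto3 SDK rather than the original API; the bucket name, prefix, and 90-day threshold are hypothetical), a lifecycle rule transitioning objects to Glacier might be configured like this:

```python
# Sketch: add an S3 lifecycle rule that transitions objects under a prefix to Glacier
# after 90 days. Bucket name, prefix, and threshold are hypothetical placeholders.
import boto3

s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="example-archive-bucket",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-old-masters",
                "Filter": {"Prefix": "masters/"},
                "Status": "Enabled",
                "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
            }
        ]
    },
)
```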

Regular incremental improvement and integration is a signature of Amazon’s modus operandi for its cloud services: Amazon has a pattern of announcing updates every few weeks that add services, integrate existing services, or (occasionally) lower prices. And they have introduced incremental improvements to the AWS platform over a dozen times since Glacier was announced at the end of August.

Interestingly, Glacier is an apt metaphor for this low-cost service in that it not only signifies “cold” storage, but also signals a massive object making its way slowly but inexorably across a space, covering everything in its path.

Why Glacier is Important

Why is Glacier important? First, as James Hamilton (disclosure: James is VP and Distinguished Engineer at Amazon) aptly summarizes, Glacier provides the volume economics of multi-site, replicated, cold (near-line) storage to small- and medium-scale users. While do-it-yourself solutions based on automated tape libraries can still beat Glacier's price by a huge margin, the sweet spot for that approach has shifted so that only very large enterprises are likely to beat Glacier's price by rolling out their own tape-library solutions.

Second, many businesses, and also many services, are built upon or backed up through AWS and S3. Amazon’s continued integration of Glacier into AWS will make it increasingly straightforward to integrate low-cost cold-storage replication into preservation services such as DuraCloud, backup services such as Zmanda, and even into simple software tools like Cyberduck.

Overall, I’m optimistic that this is a Good Thing, and will improve the likelihood of meaningful future access to digital content. However, there are a number of substantial issues to keep in mind when considering Glacier as part of a digital preservation solution.

Issue 1. Technical infrastructure does not guarantee long-term durability

Although some commenters have stated that Glacier will “probably outlive us all”, these claims are based on little evidence. The durability of institutions and services relies as much upon economic models, business models, organizational models, and organizational mission as upon technology. Based on the history of technology companies, one must consider that there is a substantial probability that Amazon itself will not be in existence in fifty years, and the future existence of any specific Amazon service is even more doubtful.

Issue 2. Lock-in and future cost projections

As Wired dramatically illustrated, the costs of retrieving all of one’s data from Glacier can be quite substantial. Further, as David Rosenthal has repeatedly pointed out, the long-term cost-competitiveness of preservation services depends “not on their initial pricing, but on how closely their pricing tracks the Kryder’s Law decrease in storage media costs”. And that “It is anyone’s guess how quickly Amazon will drop Glacier’s prices as the underlying storage media costs drop.” The importance of this future price-uncertainty is magnified by the degree of lock-in exhibited by the Glacier service.

Issue 3. Correlated failures

Amazon claims a ‘design reliability’ of ‘99.999999999%‘. This appears to be an extremely optimistic number, without any formal published analysis backing it. The number appears to be based on a projection of theoretical failure rates for storage hardware (and such rates are wildly optimistic under production conditions), together with the (unrealistic) assumption that all such failures are statistically independent. Moreover, this ‘design reliability’ claim is unsupported (at the time of writing) by Glacier’s terms of service, SLA, or customer agreement. To the contrary, the agreements appear to indemnify Amazon against any loss or damage, do not appear to offer a separate SLA for Glacier, and limit recovery under existing SLAs (for services such as S3) to a refund of fees for periods the service was unavailable. If Amazon were highly confident in the applicability of the quoted ‘design reliability’ to production settings, one might expect a stronger SLA. Despite these caveats, my guess is that Glacier will still turn out to be, in practice, substantially more reliable than the DIY solutions that most individual organizations can afford to implement entirely in-house.
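To see why the independence assumption matters, here is a toy calculation with made-up numbers; the per-copy failure probability, number of copies, and common-mode probability are all illustrative assumptions, not figures from Amazon.

```python
# Toy illustration: correlated (common-mode) failures dominate loss probability.
# All numbers below are made-up assumptions, not Amazon's actual figures.

p_copy = 1e-4      # assumed annual failure probability of a single copy
copies = 3         # assumed number of independently stored copies

independent_loss = p_copy ** copies          # all copies fail independently
print(f"independent-failure model: {independent_loss:.1e}")   # vanishingly small (~1e-12)

# Now add one shared risk (e.g. a control-plane software bug that silently
# corrupts every copy), with an assumed probability of 1e-6 per year.
p_common = 1e-6
total_loss = 1 - (1 - independent_loss) * (1 - p_common)
print(f"with a common-mode risk:   {total_loss:.1e}")          # ~1e-6, dominated by the shared risk
```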

Nevertheless, as previously discussed (most recently at Digital Preservation 2012), a large part of risk mitigation for digital assets is to diversify against sources of correlated failure. Although implementation details are not complete, Glacier does appear to diversify against some common risks to bits — primarily media failure, hardware failure, and localized natural disaster (such as fire, flood). This is good, but far from complete. A number of likely single-point (or highly correlated) vulnerabilities remain, including software failure (e.g. a bug in the AWS software for its control backplane might result in permanent loss that would go undetected for a substantial time; or cause other cascading failures — analogous to those we’ve seen previously); legal threats (leading to account lock-out — such as this, deletion, or content removal); or other institutional threats (such as a change in Amazon’s business model). It is critical that diversification against these additional failures be incorporated into a digital preservation strategy.

Preliminary Recommendations

To sum up, Glacier is an important service, and appears to be a solid option for cold storage, but institutions that are responsible for digital preservation and long-term access should not use the quoted design reliability in modeling likelihood of loss, nor rely on Glacier as the sole archival mechanism for their content.

