Mar 27, 5:45pm

Scholarly publishers, research funders, universities, and the media are increasingly scrutinizing research outputs. Of major concern is the integrity, reliability, and extensibility of the evidence on which published findings are based. A flood of new funder mandates, journal policies, university efforts, and professional society initiatives aims to make this data verifiable, reliable, and reusable: If “data is the new oil”, we need data management to prevent ‘fires’, ensure ‘high-octane’, and enable ‘recycling’.

In March, I had the pleasure of being the inaugural speaker in a new lecture series, initiated by the Libraries at Washington University in St. Louis, dedicated to the topics of data reproducibility, citation, sharing, privacy, and management.

In the presentation embedded below, I provide an overview of the major categories of new initiatives to promote research reproducibility, reliability, and reuse and related state of the art in informatics methods for managing data.

This blog post provides some wider background for the presentation, and a recap of its recommendations. The approaches can be roughly divided into three categories. The first approach focuses on tools for reproducible computation, ranging from “statistical documents” (incorporating Knuth’s [1992] concept of literate programming) to workflow systems and reproducible computing environments [for example, Buckheit & Donoho 1995; Schwab et al. 2000; Leisch & Rossini 2003; Deelman & Gil 2006; Gentleman & Temple Lang 2007]. With few exceptions [notably, Freire et al. 2006], this work focuses primarily on “simple replication” or “reproduction”: replicating exactly a precise set of results from an exact copy of the original data made at the time of research.

Current leading examples of tools that support reproducible computation include:


The second approach focuses on data sharing methods and tools [see for example, Altman et al. 2001; King 2007; Anderson et al. 2007; Crosas 2011]. [1] This approach focuses more generally on helping researchers to share data, both for replication and for broader reuse, including secondary uses and use in teaching. Increasingly, work in this area [e.g. Gutmann et al. 2009; Altman & King 2007] focuses on enabling long-term and interdisciplinary access to data, which requires that the researchers’ tacit knowledge about data formats, measurement, structure, and provenance be more explicitly documented.

Current leading examples of informatics tools that support data sharing include:


The third approach focuses on the norms, practices, and licensing associated with data sharing, archiving, and replication, and the related incentives embedded in scholarly communication [Pienta 2007; Hamermesh 2007; Altman & King 2007; King 2007; Hedstrom et al. 2008; McCullough 2009; Stodden 2009]. This approach seeks to create the necessary conditions to enable data sharing and reuse, and to examine and align incentives around citation, data sharing, and peer review to encourage replicability and reusability.

Current leading examples of informatics tools that support richer citation, evaluation, open science, and review include:


Many Tools, Few Solutions

In this area, there are many useful tools, but few that offer a complete solution, even for a specialized community of practice. All three approaches are useful, and there are several general observations to be made about them. First, tools for replicable research such as VisTrails, MyExperiment, Wings, and StatDocs are characterized by their use of a specific, controlled software framework and their ability to facilitate near-automatic replication. The complexity of these tools, and their small user and maintenance base, means that we cannot rely on them to exist and function in five to ten years: they cannot ensure long-term access. Because they focus only on results and not on capturing practices, descriptive metadata, and documentation, they allow exact replication without providing the contextual information necessary for broader reuse. Finally, these tools are heterogeneous across subdisciplines and largely incompatible, so they do not as yet offer a broadly scalable solution.

Second, tools and practices for data management have the potential to broadly increase data sharing and the impact of related publications. However, although these tools are becoming easier to use, they still require extra effort from the researcher. Moreover, since this additional effort often comes near (or past) the conclusion of the main research project (and only after acceptance of an article and preparation for final publication), it is perceived as a burden, and often honored in the breach.

Third, incentives for replication have been weak in many disciplines, and journals are a key factor. The reluctance of journal editors to publish articles either confirming or non-confirming replications weakens authors’ incentives to create replicable work. Lack of formal provenance and attribution practices for data also weakens accountability, raises barriers to conducting replication and reuse, reduces incentive to disseminate data for reuse, and increases the ambiguity of replication studies, making them difficult to study.

Furthermore, new forms of evidence complicate replication and reuse. In most scientific disciplines, the amount of data potentially available for research is increasing non-linearly. In addition, changes in technology and society are greatly affecting the types and quantities of potential data available for scientific analysis, especially in the social sciences. This presents substantial challenges to the future replicability and reusability of research. Traditional data archives currently consist almost entirely of numeric tabular data from noncommercial sources. New forms of data differ from tabular data in size, format, structure, and complexity. Left in its original form, this sort of data is difficult or impossible for scholars outside of the project that generated it to interpret and use. This is a barrier to integrative and interdisciplinary research, but also a significant obstacle to providing long-term access, which becomes practically impossible as the tacit knowledge necessary to interpret the data is forgotten. To enable broad use and to secure long-term access requires more than simply storing the individual bits of information – it requires establishing and disseminating good data management practices. [Altman & King 2007]

How research libraries can jump-start the process.

Many research libraries should consider at least three steps:

1. Create a dataverse hosted by the Harvard Dataverse Network. This provides free, permanent storage and dissemination, with bit-level preservation insured by Harvard’s endowment. The dataverse can be branded, curated, and controlled by the library, so it enables libraries to maintain relationships with their patrons and provide curation services with minimal effort. (And since DVN is open-source, a library can always move from the hosted service to one it runs itself.)

2. Link to DMPTool from your library’s website, and consider joining DMPTool as an institution, especially if you use Shibboleth (Internet2) to authorize your users. You’ll be in good company: according to a recent ARL survey, 75% of ARL libraries are now at least linking to DMPTool. Increasing researchers’ use of DMPTool provides early opportunities for conversation with libraries around data, enables libraries to offer service at a time when it is salient to the researcher, and provides information that can be used to track and evaluate data management planning needs.

3. Publish a “libguide” focused on helping researchers get more credit for their work. This is a subject of intense interest, and the library can provide information about trends and tools in the area of which researchers (especially junior researchers) may not be aware. Some possible topics to include: data citation, researcher identifiers, and impact metrics.






Altman, M., L. Andreev, M.  Diggory, M. Krot, G. King, D. Kiskis, A. Sone, S. Verba,  A Digital Library for the Dissemination and Replication of Quantitative Social Science Research, Social Science Computer Review 19(4):458-71. 2001.

Altman, M. and G. King. “A Proposed Standard for the Scholarly Citation of Quantitative Data”, D-Lib Magazine 13(3/4).  2007.

Anderson, R., W. H. Greene, B. D. McCullough and H. D. Vinod. “The Role of Data/Code Archives in the Future of Economic Research,” Journal of Economic Methodology. 2007.

Buckheit, J. and D.L. Donoho, WaveLab and Reproducible Research, in A. Antoniadis (ed.) Wavelets and Statistics, Springer-Verlag. 1995.

Crosas, M., The Dataverse Network®: An Open-Source Application for Sharing, Discovering and Preserving Data, D-lib Magazine 17(1/2). 2011.

Hamermesh, D.S. “Viewpoint: Replication in Economics,” Canadian Journal of Economics. 2007.

Deelman, E. and Y. Gil (Eds.). Final Report on Workshop on the Challenges of Scientific Workflows. 2006.

Freire, J., C. T. Silva, S. P. Callahan, E. Santos, C. E. Scheidegger, and H. T. Vo. Managing rapidly-evolving scientific workflows. In International Provenance and Annotation Workshop (IPAW), LNCS 4145, 10-18, 2006.

Gentleman R., R. Temple Lang. Statistical Analyses and Reproducible Research,  Journal of Computational and Graphical Statistics 16(1): 1-23. 2007.

Gutmann M., M. Abrahamson, M. Adams, M. Altman, C.  Arms,  K.  Bollen, M. Carlson, J. Crabtree, D. Donakowski, G. King, J. Lyle, M. Maynard, A. Pienta, R. Rockwell, L. Timms-Ferrara, C. Young,  “From Preserving the Past to Preserving the Future: The Data-PASS Project and the challenges of preserving digital social science data”, Library Trends 57(3):315-337. 2009.

Hedstrom, M., J. Niu, and K. Marz. “Incentives for Data Producers to Create ‘Archive/Ready’ Data: Implications for Archives and Records Management”, Proceedings of the Society of American Archivists Research Forum. 2008.

King, G. “An Introduction to the Dataverse Network as an Infrastructure for Data Sharing.” Sociological Methods and Research, 32(2), 173–199. 2007.

Knuth, D.E., Literate Programming, CSLI Lecture Notes 27. Center for the Study of Language and Information. Stanford, CA. 1992.

Leisch F., and A.J. Rossini, Reproducible Statistical Research, Chance 16(2): 46-50. 2003.

McCullough, B.D., Open Access Economics Journals and the Market for Reproducible Economic Research, Economic Analysis & Policy 39(1). 2009.

Pienta, A., LEADS Database Identifies At-Risk Legacy Studies, ICPSR Bulletin 27(1) 2006.

Schwab, M., M. Karrenbach, and J. Claerbout, Making Scientific Computations Reproducible, Computing in Science and Engineering 2: 61-67. 2000.

Stodden, V. The Legal Framework for Reproducible Scientific Research: Licensing and Copyright, Computing in Science and Engineering 11(1):35-40. 2009.



[1] Also see, for example, the CRAN reproducible research task view and the Reproducible Research tools page.

Mar 27, 2:53pm

The FTC has been hosting a series of seminars on consumer privacy, on which it has requested comments. The most recent seminar explored privacy issues related to mobile device tracking. As the seminar summary points out …

In most cases, this tracking is invisible to consumers and occurs with no consumer interaction. As a result, the use of these technologies raises a number of potential privacy concerns and questions.

The presentations raised an interesting and important combination of questions about how to promote business and economic innovation while protecting individual privacy.  I have submitted a comment on these changes with some proposed recommendations.

To summarize (quoting from the submitted comment):

Knowledge of an individual’s location history and associations with others has the potential to be used in a wide variety of harmful ways. … [Furthermore], since all physical activity has a unique spatial and temporal context, location history provides a linchpin for integrating multiple sources of data that may describe an individual. [R]esearch shows that human mobility patterns are highly predictable and that these patterns have unique signatures, making them highly identifiable– even in the absence of associated identifiers or hashes. Moreover, locational traces are difficult or impossible to render non-identifiable using traditional masking methods.

I invite you to read the full comment here:

This comment drew heavily on previous comments on proposed OSHA regulation made with my colleagues at the Berkman Center, David O’Brien and Alexandra Woods. It was made on behalf of the Privacy Tools for Research Project, of which we are a part, and has benefitted from extensive commentary by the other project collaborators.
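
The identifiability claim in the quoted summary can be illustrated with a toy calculation. The following is a minimal sketch, not the analysis from the comment: the traces, the (cell, hour) representation of a location point, and the function names are all hypothetical. It measures how often a small set of observed points from one person's trace matches no one else in the dataset.

```python
from itertools import combinations


def matching_users(traces, observed):
    """Return the users whose trace contains every observed (cell, hour) point."""
    return [user for user, points in traces.items() if set(observed) <= points]


def fraction_unique(traces, k):
    """Fraction of k-point samples, over all users, that single out exactly one user."""
    total = 0
    unique = 0
    for user, points in traces.items():
        for sample in combinations(sorted(points), k):
            total += 1
            if matching_users(traces, sample) == [user]:
                unique += 1
    return unique / total if total else 0.0


# Hypothetical toy data: three people who share a morning location.
TRACES = {
    "a": {("cell1", 8), ("cell2", 9), ("cell3", 18)},
    "b": {("cell1", 8), ("cell4", 9), ("cell3", 18)},
    "c": {("cell1", 8), ("cell2", 9), ("cell5", 18)},
}

for k in (1, 2, 3):
    print(k, round(fraction_unique(TRACES, k), 3))
```

Even in this tiny example, the share of samples that single out one person rises as more points are observed, and no names or device identifiers are involved. Real traces are far longer, which is why masking identifiers alone offers little protection.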


Mar 08, 12:03pm

OSHA has proposed a set of changes to the current tracking of workplace injuries and illnesses.

Currently, information about workplace injuries and illnesses must be recorded, but only on paper. Further, most of this information is never reported: OSHA receives detailed information only when it conducts an investigation, and receives summary records from only a small percentage of employers who are selected to participate in the annual survey. (Additionally, BLS receives a sample of this information in order to produce specific statistics for its “Survey of Occupational Injuries and Illnesses”.)

OSHA proposes three changes. The first change would require establishments to regularly submit the information that they are already required to collect and maintain (quarterly submission of detailed information for larger establishments, and annual submission of summary information from any establishment with more than twenty employees that is already required to maintain these records). The second change makes this process digital: submissions would be electronic, instead of on paper. And the third change would be to make the data collected public: searchable, and downloadable in machine-actionable (.csv) form.

 These proposed changes raise an interesting and important combination of questions about how to promote government (and industry) transparency while protecting individual privacy. My colleagues at the Berkman Center, David O’Brien, Alexandra Woods, and I have submitted an extensive comment on these changes with some proposed recommendations. This comment is made on behalf of the Privacy Tools for Research Project, of which we are a part, and has benefitted from extensive commentary by the other project collaborators.

To summarize (quoting from the conclusions of the comment):

We argue that workplace injury and illness records should be made more widely available because releasing these data has substantial potential individual, research, policy, and economic benefits. However, OSHA has a responsibility to apply best practices to manage data privacy and mitigate potential harms to individuals that might arise from data release.

The complexity, detail, richness, and emerging uses for data create significant uncertainties about the ability of traditional ‘anonymization’ and redaction methods and standards alone to protect the confidentiality of individuals. Generally, one size does not fit all, and tiered modes of access – including public access to privacy-protected data and vetted access to the full data collected – should be provided.

Such access requires thoughtful analysis with expert consultation to evaluate the sensitivity of the data collected and risks of re-identification and to design useful and safe release mechanisms.

I invite you to read the full comment here:

Mar 05, 3:21pm

Sound, reproducible scholarship rests upon a foundation of robust, accessible data.  For this to be so in practice as well as theory, data must be accorded due importance in the practice of scholarship and in the enduring scholarly record.  In other words, data should be considered legitimate, citable products of research.

A few days ago I was  honored to officially announce the  Data Citation Working Group’s Joint Declaration of Data Citation Principles  at IDCC 2014, from which the above quote is taken.

This Joint Declaration identifies guiding principles for the scholarly citation of data. The recommendation is a collaborative work with CODATA, FORCE 11, DataCite and many other individuals and organizations. And in the week since it was released, it has already garnered over twenty institutional endorsements.

Some slides introducing the principles are here:

To summarize, from 1977 through 2009 there were three phases of development in the area of data citation.

  • The first phase of development focused on the role of citation to facilitate description and information retrieval. This phase introduced the principles that data in archives should be described as works rather than media, using author, title, and version.
  •  The second phase of development extended citations to support data access and persistence. This phase introduced the principles that research data used by publication should be cited, that those citations should include persistent identifiers, and that the citations should be directly actionable on the web.
  •  The third phase of development focused on using citations for verification and reproducibility. Although verification and reproducibility had always been among the motivations for data archiving, they had not been a focus of citation practice. This phase introduced the principles that citations should support verifiable linkage of data and published claims, and it started the trend towards wider integration with the publishing ecosystem.

And over the last five years the importance and urgency of scientific data management and access has been recognized more broadly. The culmination of this trend, thus far, is an increasingly widespread consensus among researchers and funders of research that data is a fundamental product of research and therefore a citable product. The fourth and current phase of data citation development focuses on integration with the scholarly research and publishing ecosystem. This includes integration of data citation in standardized ways within publications, catalogs, tool chains, and larger systems of attribution.
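
To make the elements introduced across these phases concrete, here is a minimal sketch of a data citation record. It is an illustration only, not a format prescribed by the Declaration: the class, field names, and example DOI are hypothetical, and a SHA-256 digest of the file bytes stands in for whatever fingerprinting scheme an archive actually uses.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class DataCitation:
    authors: tuple      # phase 1: describe data as a work (author, title, version)
    year: int
    title: str
    version: str
    doi: str            # phase 2: persistent identifier (hypothetical value below)
    fingerprint: str    # phase 3: digest linking the citation to the exact data

    def resolver_url(self):
        """Phase 2: make the citation directly actionable on the web."""
        return "https://doi.org/" + self.doi

    def render(self):
        names = "; ".join(self.authors)
        return (f"{names} ({self.year}). {self.title}, version {self.version}. "
                f"{self.resolver_url()} [sha256:{self.fingerprint[:12]}]")

    def verifies(self, data):
        """Phase 3: check that the bytes in hand are the bytes that were cited."""
        return hashlib.sha256(data).hexdigest() == self.fingerprint


def cite(authors, year, title, version, doi, data):
    """Build a citation and compute the fingerprint from the data itself."""
    digest = hashlib.sha256(data).hexdigest()
    return DataCitation(tuple(authors), year, title, version, doi, digest)


data = b"id,value\n1,2\n"
citation = cite(["Doe, J."], 2014, "Example Survey Data", "1.0",
                "10.1234/example", data)
print(citation.render())
print(citation.verifies(data))
```

A digest of raw bytes changes whenever the file format changes, so an archive that migrates formats would need a fingerprint computed over the data values rather than the file.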

Read the full recommendation here, along with examples, references and endorsements:

 Joint Declaration of Data Citation Principles

Feb 13, 2:34pm

Our guest speaker, Cavan Capps, who is the Big Data Lead at the U.S. Census Bureau, presented this talk as part of the Program on Information Science Brown Bag Series.

Cavan Capps is the U.S. Census Bureau’s Lead on Big Data processing. In that role he is focusing on new Big Data sources for use in official statistics, best practice private sector processing techniques and software/hardware configurations that may be used to improve statistical processes and products. Previously, Mr. Capps initiated, designed and managed a multi-enterprise, fully distributed, statistical network called the DataWeb.

Capps provided the following summary of his talk.

Big Data provides both challenges and opportunities for the official statistical community. The difficult issues of privacy, statistical reliability, and methodological transparency will need to be addressed in order to make full use of Big Data in the official statistical community. Improvements in statistical coverage at small geographies, new statistical measures, more timely data at perhaps lower costs are the potential opportunities. This talk will provide an overview of some of the research being done by the Census Bureau as it explores the use of “Big Data” for statistical agency purposes.

And he has also described the U.S. Census Bureau’s efforts to incorporate big data in this article:

What struck me most about Capps’ talk and the overall project is how many disciplines have to be mastered to find an optimal solution.

  • deep social science knowledge (especially economics, sociology, psychology, political science) is needed to design the right survey measures, and to come up with theoretically and substantively coherent alternative measures;
  • carefully designed machine learning algorithms are needed to extract actionable information from non-traditional data sources;
  • advances in statistical methodology are needed to guide adaptive survey design, to make reliable inferences over dynamic social networks, and to measure and correct for bias in measures generated from non-traditional data sources and non-probability samples;
  • large scale computing is needed to do this all in real time;
  • information privacy science is required to ensure that the results that are released (at scale, and in real time) continue to maintain the public trust; and
  • information science methodology is required to ensure the quality, versioning, authenticity, provenance and reliability that is expected of the US Census.

This is indeed a complex project. And also given the diversity of  areas implicated, a stimulating one — it has resonated with many different projects and conversations at MIT.

Feb 07, 8:49am

When is “free to use” not free? … when it doesn’t establish  specific rights.

NISO offered a recent opportunity to comment on the draft recommendation on ‘Open Access Metadata and Indicators’. The following is my commentary on the draft recommendation. You may also be interested in reading the other commentaries on this draft.

Response to request for public comments on ‘Open Access Metadata and Indicators’

Dr. Micah Altman
Director of Research; Head/Scientist, Program on Information Science, MIT Libraries, Massachusetts Institute of Technology
Non-Resident Senior Fellow, The Brookings Institution


Thank you for the opportunity to respond to this report. Metadata and indicators for Open Access publications are an area in which standardization would benefit the scholarly community. As a practicing social scientist, a librarian, and as MIT’s representative to NISO, I have worked extensively with (and contributed to the development of) metadata schemas, open and closed licenses, and open access publications. My contribution is made with this perspective.



The scope of the work originally approved by NISO members was to develop metadata and visual indicators that would enable a user to determine whether a specific article is openly accessible, and what other rights are available. The current draft is more limited in two ways: First, the draft does not address visual indicators. Second, the metadata field proposed in the draft signals only the license available without providing information on specific usage rights.

The first limitation is a pragmatic limitation of scope. Common, well-structured metadata is a pre-condition for systematic and reliable visual indicators. These indicators may be developed later by NISO or other communities.

The second limitation in scope is more fundamental and problematic. Metadata indicating licenses is less directly actionable.  A user can take direct actions (e.g., to read, mine, disseminate, or reuse content) based on knowledge of rights, but cannot take such actions knowing only the URI of the license — unless the user has independently determined what rights are associated with every license encountered. Moreover, different users may (correctly or incorrectly) interpret the same license as implicating different sets of rights. This creates both additional effort and risk for users, which greatly limits the potential value of the proposed practice.

The implicit justification for this limitation of scope is not clearly argued. However, it seems to be based on the claims, made on page 2, that “This is a contentious area where political views on modes of access lead to differing interpretations of what constitutes ‘open access’” and “Considering the political, legal, and technical issues involved, the working group agreed that a simple approach for transmitting a minimal set of information would be preferred.” [1]

This is a mistake, in my judgment, since contention (or simply uncertainty) over the definition of “open access” notwithstanding, there is a well-established and well-defined set of core criteria that apply to open licenses; these include: the 10 criteria comprising the Open Source Initiative’s Open Source Definition [2]; the 11 criteria comprising the Open Knowledge Foundation Open Definition (many of which are reused from the OSI criteria) [3]; and the four license properties defined by Creative Commons. [4] These criteria currently are readily applicable to dozens of existing independently-created open licenses [5], which have been applied to millions of works. Although these properties may not be comprehensive, and there is no universal agreement over which of these properties constitute “open access”, any plausible definition of Open Access should include at least one of these properties. Thus a metadata schema that could be used to signal these properties is feasible, and could be reliably used to indicate useful rights. These rights elements should be added to complement the proposed license reference tag, which could be used to indicate other rights not covered in the schema.
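
As an illustration of why explicit rights metadata is more actionable than a bare license URI, consider the following minimal sketch. It is not part of the NISO draft: the profile table, the action names, and the `permits` function are hypothetical, and the mapping from the four Creative Commons properties to permitted actions is deliberately simplified.

```python
# Hypothetical registry: license URI -> the four Creative Commons properties.
PROFILES = {
    "https://creativecommons.org/publicdomain/zero/1.0/": {
        "attribution": False, "share_alike": False,
        "non_commercial": False, "no_derivatives": False,
    },
    "https://creativecommons.org/licenses/by/4.0/": {
        "attribution": True, "share_alike": False,
        "non_commercial": False, "no_derivatives": False,
    },
    "https://creativecommons.org/licenses/by-nc-nd/4.0/": {
        "attribution": True, "share_alike": False,
        "non_commercial": True, "no_derivatives": True,
    },
}

KNOWN_ACTIONS = {"read", "disseminate", "adapt"}


def permits(license_uri, action, commercial=False):
    """Return True/False for a known license, or None when the URI is unknown.

    Simplified on purpose: 'non_commercial' blocks any commercial use and
    'no_derivatives' blocks 'adapt'; attribution and share-alike conditions
    are recorded in the profile but not enforced here.
    """
    profile = PROFILES.get(license_uri)
    if profile is None:
        return None  # the user must read and interpret the license text
    if action not in KNOWN_ACTIONS:
        return False
    if commercial and profile["non_commercial"]:
        return False
    if action == "adapt" and profile["no_derivatives"]:
        return False
    return True


print(permits("https://creativecommons.org/licenses/by/4.0/", "adapt"))
print(permits("https://creativecommons.org/licenses/by-nc-nd/4.0/", "adapt"))
print(permits("http://example.org/custom-license", "read"))
```

The `None` result for an unregistered URI is the situation the draft leaves every user in: with only a license reference, each reader has to work out the rights independently.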

Design Issues

The proposed draft lists nine motivating use cases. The general selection of use cases appears appropriate. However, success is not clearly defined for each use case, making the claim that the use cases are satisfied arguable. Moreover, the free_to_read element does not adequately address the use cases to which it is a proposed solution. The free_to_read element is defined as meaning that “content can be read or viewed by any user without payment or authentication” (pg 4); its purpose is to “provide a very simple indication of the status of the content without making statements about any additional re-use rights or restrictions.” No other formal definition of usage rights or conditions for this element is provided in the draft.

Under the stated definition, rights other than reading could be curtailed in any variety of ways, including for example a restriction on the right to review, criticize or comment upon the material. Thus the rights implied by the free_to_read element are less than the minimal criteria provided by any plausible open access license. Furthermore, this element is claimed to comprise part of the solution to compliance evaluation use cases (use cases 5.8 and 5.9 in the draft). It cannot support that purpose: compliance auditing relies upon well-defined criteria, and the free_to_read definition is fatally ambiguous. The free_to_read element should be removed from the draft and replaced by a metadata attribute indicating ‘access’ rights, as defined by the open criteria listed above. [6]

Technical Issues

Finally, there are a number of proposed changes to the technical implementation:

  • There should be a declared namespace for the proposed XML license elements, so they can be used in a structured way in XML documents without including them separately in multiple schema definitions.
  • Semantic markup (e.g. RDF) is required so that these elements may be used in non-XML metadata.
  • A schema should be supplied that formally and unambiguously defines: which elements and attributes are required, which are repeatable, what datatypes (e.g. date formats) are allowable, and any implicit default values.
  • It should be made explicit that license_ref may refer to waivers (e.g. CC0) as well as licenses.
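
The namespace and schema recommendations above can be sketched as follows. This is an illustration under stated assumptions, not the draft's actual schema: the namespace URI, the `right` element, the `start_date` attribute, and the validation rules are all hypothetical; only the `license_ref` element name comes from the draft.

```python
import xml.etree.ElementTree as ET
from datetime import date

# Hypothetical namespace URI; a real one would be declared by the standard.
NS = {"oa": "http://example.org/ns/open-access-metadata"}

RECORD = """<article xmlns:oa="http://example.org/ns/open-access-metadata">
  <oa:license_ref start_date="2014-02-07">http://creativecommons.org/publicdomain/zero/1.0/</oa:license_ref>
  <oa:right>read</oa:right>
  <oa:right>disseminate</oa:right>
</article>"""

# Hypothetical controlled vocabulary for the rights elements.
ALLOWED_RIGHTS = {"read", "mine", "disseminate", "reuse"}


def parse_record(xml_text):
    """Parse one record, enforcing the rules a formal schema would state."""
    root = ET.fromstring(xml_text)

    refs = root.findall("oa:license_ref", NS)
    if not refs:
        raise ValueError("at least one license_ref is required")

    licenses = []
    for ref in refs:  # license_ref is repeatable
        start = ref.get("start_date")  # optional; must be an ISO 8601 date
        licenses.append({
            "uri": (ref.text or "").strip(),
            "start_date": date.fromisoformat(start) if start else None,
        })

    rights = [(r.text or "").strip() for r in root.findall("oa:right", NS)]
    unknown = set(rights) - ALLOWED_RIGHTS
    if unknown:
        raise ValueError(f"unknown rights: {sorted(unknown)}")

    return {"licenses": licenses, "rights": rights}


record = parse_record(RECORD)
print(record["licenses"][0]["uri"])
print(record["rights"])
```

Because the elements live in their own namespace, the same fragment can be embedded in any host XML document, and the example `license_ref` points at the CC0 waiver rather than a license.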


[1] The draft also raises the concern, on page 4, that “no legal team is going to agree to allow any specific use based on metadata unless they have agreed that the license allows.” The evidence for this assertion is unclear. However, even if true, it can easily be addressed by using the criteria above to design standard rights metadata profiles for each license to complement the metadata attributes associated with individual documents. A legal team can then vet the profiles associated with a license, could certify a registry responsible for maintaining such profiles, or could agree to accept profiles available from the license’s authors.





[6] Alternatively, if the definition of metadata elements is simply beyond the capacity of this project, free_to_read should simply be replaced by a license_ref instance that includes a URI to a well-known license that establishes free_to_read. The latter could even be designed for this purpose. This would at least remove the ambiguity of the free_to_read condition, and further simplify the schema.

Jan 31, 4:14pm

“Could new Maker Spaces together with a reinforced commitment to learning-by-doing create the next generation of tinkerers, fluent in advanced manufacturing and rapid prototyping techniques?” [1]

Rapid fabrication resonates particularly well with “mens et manus” , the MIT philosophy of combining learning and doing. And Neil Gershenfeld has noted that  MIT has had a long-standing joke that a student is allowed to graduate only when their thesis can walk out of the printer. 

For the last year, the Institute has been thoughtfully reflecting on the future of education, and how “doing” will remain a part of it. One exciting vision involves organizing around a combination of academic villages and maker spaces that catalyze and combine on-line activities, in-person interactions and hands-on experiences.

My colleague Matt Bernhardt was prevailed upon to give an overview of some of the key technologies that promise to enable this future. Matt, who is the Libraries’ current Web Developer, was trained as an architect, founded and ran a fabrication space at the University of Ohio, and is acting as an expert advisor. We collaborated to organize a workshop summarizing the current generation of rapid fabrication technologies as an IAP session and as part of the Program on Information Science Brown Bag Series.

Matt’s excellent talk provided a general overview of the digitization-fabrication cycle and the broad categories of technologies for 3-d scanning and for rapid fabrication: subtractive, deformative, and additive methods, and their variants. His talk also provided exemplars of the state-of-the-practice in additive fabrication technologies, emerging methods, and the range of capabilities (e.g. for scale, materials, precision) currently available in practice. (His slides are embedded below.)

For thousands of years, libraries have had a major role in the discovery, management, and sharing of information. Rapid fabrication can be seen, in a way, as offering the ability to materialize information. So the question of what roles the libraries might take on in supporting fabrication and managing the intellectual assets produced and used is of natural interest from a library and information science point of view. And it is not just of theoretical interest: a recent survey by the Gardner-Harvey Library found that a substantial proportion of libraries were providing or planning to provide at least some support for “Maker Spaces”.

As  a complement to Matt’s talk, I outlined in the presentation below how fabrication fits into the research information life cycle, and some of the models for library support:

Clearly this area is of interest to MIT affiliates. The IAP talk rapidly reached its cap of 35 registrants, with nearly 70 more on the wait list, and participants  in the session discussed exploring how rapid fabrication can be used in a variety of ongoing research and scholarly projects, such as collaborative design of a satellite, rapid development of robots, modeling human anatomy for biomedical engineering, and fashion!

More generally, fabrication technologies are now used in production for medical implants, prosthetics, teaching aids, information visualization, research on rare/fragile objects, architecture, art, and advanced manufacturing. And use for creating custom pharmaceuticals, “printing” or “faxing” biological systems, and for printing active objects with embedded sensors and electronics is on the horizon. Having a dissertation “walk out of the printer” won’t be a joke for much longer.

As Lipson and Kurman [3] astutely point out, advances in fabrication technologies are rapidly lowering a number of barriers faced by researchers (and others). These barriers had previously made it prohibitively difficult for most individuals, researchers, or organizations to manufacture objects without substantial investment in manufacturing skills and equipment; to manufacture complex objects; to offer a wide variety of different objects; to easily customize and individualize manufacturing; to manufacture objects locally, or on-site; to manufacture objects with little lead time (or just-in-time); or to easily and precisely replicate physical objects. Furthermore, as they point out, additive fabrication technologies open up new forms of design (“design spaces”) such as localized design (based on local conditions and needs), reactive design (where objects are manufactured that collect sensor information that is then used to manufacture better objects), generative design (physical objects based on mathematical patterns and processes), and the application of sample-remix-and-burn to physical objects.

Increasingly, fabrication is becoming part of various stages of the research lifecycle. These technologies may be used early on as part of prototyping for research interventions or to embed sensors for research data collection; or later on as part of analysis or research collaboration (e.g. by materializing models for examination and sharing). And, naturally, these technologies produce intellectual assets (sensor data and digitizations, models and methods) that are potentially valuable to other researchers for future reuse and replication. The Library may have a useful role to play in managing these assets.

And this is only the beginning. Current technologies allow control over shape. Emerging technologies (as Matt’s talk shows) are beginning to allow control over material composition. And as any avid science-fiction reader could tell you, control over the behavior of matter is next, and a real replicator should be able to print a table that can turn into a chair at the press of a button. (Or, for aficionados of ’70s TV, a floor wax that can turn itself into a dessert topping.)

Libraries have a number of core competencies that are complementary to fabrication.

  • Libraries have special competency in managing information. Fabrication technologies make information material and help make material objects into information.
  • Libraries support the research process. Use of fabrication technologies requires a core set of skills and knowledge (such as databases of models) that is not in the sole domain of any one discipline.
  • Libraries promote literacy broadly. And the use of fabrication technologies promotes design, science, technology, engineering, art, and mathematics.
  • Libraries are responsible for maintaining the scholarly record. The digitizations, designs, and models produced as part of rapid fabrication approaches can constitute unique & valuable parts of the scholarly record. 
  • Libraries create physical spaces designed for research and learning. Successful ‘makerspaces’ bring together accessible locations; thoughtfully designed space; curated hardware & software; skilled staff;  local information management; and global ‘reference’ knowledge.

The seminars provoked a lively discussion, and this is a promising area for further experiments and pilot projects. The Program has invested in a MakerBot and 3d scanner for use in further exploration and pilot projects, and our program intern is currently conducting a review of existing websites, policies, and documentation in support of rapid fabrication at other libraries.


[1] Institute-Wide Task Force on the Future of MIT Education, Preliminary Report.


[3] Lipson & Kurman, 2013. Fabricated. Wiley.

Jan 18, 9:19am

What’s new in managing confidential research data this year?

For MIT’s Independent Activities Period (IAP), the Program on Information Science regularly leads a practical workshop on managing confidential data. This is in part a result of research through the Privacy Tools project. As I was updating the workshop for this semester, I had an opportunity to reflect upon what’s new on the pragmatic side of managing confidential information.

Most notably, because of the publicity surrounding the NSA, more people (and in higher places) are paying attention. (And as an information scientist I note that one benefit of the NSA scandal is that everyone now recognizes the term “metadata”.)

Also, generally, personal information continues to become more available, and it is increasingly easy to link information to individuals. New laws, regulations, and policies governing information privacy continue to emerge, increasing the complexity of management. Trends in information collection and management (cloud storage, “big” data, and debates about the right to limit access to published but personal information) complicate data management, and make traditional approaches to managing confidential data decreasingly effective.

On the pragmatic side, new privacy laws continue to emerge at the state level. Probably the most notable is the California “right to be forgotten” for teens. This year California became the first state to pass a law (“The Privacy Rights for California Minors in the Digital World”) that gives (some) individuals the right to remove (some) content they have posted online.

The California law takes effect next year (Jan 1, 2015), by which time we’re likely to see new information privacy initiatives in some other states. This year we are also likely to see the release of specific data sharing requirements from federal funders (as a result of the OSTP “Holdren Memo”, NIH’s Big Data to Knowledge initiative, and related efforts); from journals; and from professional societies. Farther off in the wings loom the possibility of a general right to be forgotten law in the EU; changes to how the “common rule” evaluates information risks and controls (on which subject the NAS recently issued a new set of recommendations); and possible “sectoral” privacy laws targeted at “revenge-porn”, “mug-shot” databases, mobile-phone data, or other issues du jour.

This creates an interesting tension and will require increasingly sophisticated approaches that can provide both privacy and appropriate access. From a policy point of view, one possible way of setting this balance is by using “least restrictive terms” language; the OKF’s open economic principles may provide a viable approach.

In a purely operational sense, the biggest change in confidential data management for researchers is the wider availability of “safe-sharing” services for exchanging research data within remote collaborations:
  • On the do-it-yourself front, the increasing flexibility of the FISMA-certified Amazon Web Services GovCloud makes running a remote, secure research computing environment easier and more economical. This is still complex and expensive to maintain, however, and one still has to trust Amazon, although the FISMA certifications make that trust better justified.
  • The second widely used option, combining file-sharing services like Dropbox with encrypted filesystems like TrueCrypt, also received a boost this year with the success of a crowdfunded effort to independently audit the TrueCrypt source. This is good news, and the transparency and verifiability of TrueCrypt is its big strength. The approach remains limited in practice to secure publishing of information: it doesn’t support simultaneous remote updates (not unless you like filesystem corruption), multiple keys for different users or portions of the filesystem, key distribution, etc.
  • A number of simpler solutions have emerged this year.
    – BitTorrent Sync provides “secure” P2P replication and sharing based on a secret private key.
    – SpiderOak Hive;; and BoxCryptor all offer zero-knowledge cloud storage with client-side encrypted data sharing. The ease of use and functionality of these systems for secure collaboration is very attractive compared to the other available solutions. BoxCryptor offers an especially wide range of enterprise features, such as key distribution, revocation, and master and group-key-chaining, that would make managing sharing among heterogeneous groups easier. However, the big downside is the amount of “magic” in these systems. None are open source, nor are any sufficiently well documented (at least externally) or certified (no FISMA there) to engender trust among us untrusting folk. (Although SpiderOak in particular seems to have a good reputation for trustworthiness, and the others no doubt have pure hearts, I’d rest easier with the ability to audit source code, peer-reviewed algorithms, etc.)

For those interested in the meat of the course, which gives an overview of legal, policy, information technology/security, research design, and statistical pragmatics, the new slides are here:

Jan 08, 2:43pm

In December, my colleagues from NDSA and I had the pleasure of attending CNI to present the 2014 National Agenda for Digital Stewardship and to lead a discussion of priorities for 2015. We were gratified to have the company of a packed room of engaged attendees, who participated in a thoughtful and lively discussion.

For those who were unable to attend CNI, the presentation is embedded below.

(Additionally, the Agenda will be discussed this Spring at NERCOMP, in a session I am leading especially for Higher Education IT leaders; at IASSIST in a poster session, represented by Coordination Committee member Jonathan Crabtree; and at IDCC, in a poster session represented by Coordination Committee member Helen Tibbo.)

Discussions of the Agenda at CNI were a first step in the input gathering for the next version of the Agenda. In January, NDSA will start an intensive and systematic process of revising the Agenda for priorities in 2015 and beyond. We expect to circulate these revisions for peer and community review in April and present a final or near-final version (depending on review comments) at the annual Digital Preservation conference in July.

Part of the discussions at our CNI session echoed selected themes in Cliff Lynch’s opening plenary “Perspective” talk, particularly his statements that:

  • We [as a stewardship community] don’t know how well we’re doing with our individual preservation efforts, in general. We don’t have an inventory of the class of content that is out there, what is covered, and where the highest risks are.

  • There is a certain tendency to “go after the easy stuff”, rather than what’s at risk – our strategy needs to become much more systematic.

In our discussion session these questions were amended and echoed in different forms:

  • What are we doing in the stewardship community, and especially what are we doing well?

  • What makes for collaborative success, and how do we replicate that?

I was gratified that Cliff’s questions resonated well with the summary we’d articulated in the current edition of the National Agenda. The research section, in particular, lays out key questions about information value, risk assessment, and success evaluation, and outlines the types of approaches that are most likely to lead to the development of a systematic, general evidence base for the stewardship community. Moreover, the Agenda calls attention to many examples of things we are doing well.

That said, a question that was posed at our session, and that I heard echoed repeatedly in side conversations during CNI, was “Where are we (as a group, community, project, etc.) getting stuck in the weeds?”

This question is phrased in a way that attracts negative answers — a potentially positive and constructive rephrasing is: What levels of analysis are most useful for the different classes of problems we face in the stewardship community?

As an information scientist and a formally (and mathematically) trained social scientist, I tend to spend a fair amount of time thinking about and building models of individual, group, and institutional behaviors, tracing the empirical implications of these models, and designing experiments (or seeking natural experiments) that have the potential to provide evidence to distinguish among competing plausible models. In this general process, and in approaching interventions, institutions, and policies generally, I’ve found the following levels of abstraction perennially useful:

The first level of analysis concerns local engineering problems, in which one’s decisions neither affect the larger ecosystem nor provoke strategic reactions by other actors. For example, the digital preservation problem of selecting algorithms for fixity, replication, and verification to yield cost-effective protection against hardware and media failures is best treated at this level in most cases. For this class of problem, the tools of decision theory [1] (of which “cost-benefit” analysis is a subset), economic comparative statics, statistical forecasting, Monte Carlo simulation, and causal inference [2] are helpful.
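
To make this level concrete, here is a minimal sketch (in Python) of how Monte Carlo simulation and a simple expected-cost comparison might be combined to choose a replication and audit schedule. Everything in it is an illustrative assumption rather than anything drawn from the Agenda or NDSA guidance: the function names, the independent-failure model, the assumption that each audit fully repairs failed replicas, and all of the cost and failure-rate numbers.

```python
import random


def simulate_loss_probability(replicas, annual_failure_rate, audits_per_year,
                              years=10, trials=20_000, seed=0):
    """Estimate the probability that an object is lost within `years`.

    Hypothetical model: in each audit interval every replica fails
    independently with probability annual_failure_rate / audits_per_year,
    and each audit restores all failed replicas from a surviving copy.
    The object is lost if every replica fails within the same interval.
    """
    rng = random.Random(seed)  # seeded so that runs are repeatable
    p_fail = annual_failure_rate / audits_per_year
    intervals = years * audits_per_year
    losses = 0
    for _ in range(trials):
        for _ in range(intervals):
            if all(rng.random() < p_fail for _ in range(replicas)):
                losses += 1
                break
    return losses / trials


def expected_cost(replicas, audits_per_year, loss_probability, years=10,
                  storage_cost=1.0, audit_cost=0.1, loss_cost=1000.0):
    """Running costs plus the probability-weighted cost of losing the object.

    All cost figures are made-up units chosen only for illustration.
    """
    running = years * replicas * (storage_cost + audit_cost * audits_per_year)
    return running + loss_probability * loss_cost


if __name__ == "__main__":
    # Compare a small grid of options and report the expected cost of each.
    for replicas in (1, 2, 3):
        for audits in (1, 4, 12):
            p = simulate_loss_probability(replicas, 0.05, audits)
            cost = expected_cost(replicas, audits, p)
            print(f"replicas={replicas} audits/yr={audits} "
                  f"P(loss)={p:.4f} expected cost={cost:.1f}")
```

Because nothing in this model reacts to the institution's choice, a plain simulation plus cost comparison is enough; that is the property that distinguishes the first level from the levels that follow.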

The second level concerns tactical problems, in which other actors react and adapt to your decisions (e.g., to compete, or to avoid compliance), but the ecosystem (market, game structure, rules, norms) remains essentially fixed. For example, the problem that a single institution faces in setting (internal/external) prices (fees/fines) or usage and service policies is a strategic one. For tactical problems, applying the tools described above is likely to yield misleading results, and some form of modeling is essential: models from game theory, microeconomics, behavioral economics, mechanism design, and sociology are often most appropriate. Causal inference remains useful, but must be combined with modeling.
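
A toy illustration of why first-level tools can mislead here, using the fee-setting example: the sketch below (in Python) compares a forecast that holds borrower behavior fixed with one in which borrowers respond to the fine. The function names, the behavioral rule, and the numbers are all hypothetical assumptions made up for illustration, not an empirical model.

```python
def naive_revenue(fine, late_returns_observed):
    """Forecast that assumes borrower behavior does not change with the fine."""
    return fine * late_returns_observed


def strategic_revenue(fine, lateness_values):
    """Forecast in which each borrower returns late only if the (hypothetical)
    value they place on keeping the item longer exceeds the fine."""
    late = [value for value in lateness_values if value > fine]
    return fine * len(late)


if __name__ == "__main__":
    # Hypothetical values that five borrowers place on keeping an item late.
    values = [1, 2, 3, 5, 8]
    observed_late = len(values)  # with a negligible fine, all five are late
    for fine in (1, 2, 4, 6):
        print(f"fine={fine}: naive forecast={naive_revenue(fine, observed_late)}, "
              f"with borrower response={strategic_revenue(fine, values)}")
```

The naive forecast grows without limit as the fine rises, while the forecast that models borrower reactions peaks and then falls, which is the kind of divergence that makes explicit modeling necessary at this level.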

The Agenda itself is not aimed at these two levels of analysis; however, many of the NDSA working groups’ projects and interests are at the first, local-engineering level: NDSA publications such as content case studies, the digital preservation in a box toolkit, and the levels of preservation taxonomy may provide guidance for first-level decisions. Many of the other working group outputs such as storage, web, and staffing surveys, although they do not describe tactical models, do provide baseline information and peer comparisons to inform decisions at the tactical level.

The third level is systems design (in this case legal-socio-technical systems). In these types of problems, the larger environment (market, game structure, rules, norms) can be changed in meaningful ways. Systems analysis involves understanding an entire system of simultaneous decisions and their interactions and/or designing an optimal (or at least improved) system. Examples of systems analysis are common in theory: any significant government regulation/legislation should be based on systems analysis. For institutional scale systems analysis, a number of conceptual tools are useful, particularly market design and market failure [3]; constitutional design [4]; and the co-design of institutions and norms to manage “commons” [5].

Working at this level of analysis is difficult to do well: One must avoid the twin sins of getting lost in the weeds (too low a level of analysis for the problem) and having one’s head in the clouds (thinking at such a level of generality that analysis cannot be practically applied, or worse, is vacuous). Both the Agenda and Cliff’s landscape talk are aimed at this level of analysis and manage to avoid both sins to a reasonable degree.

Academics often do not go beyond this level of designing systems that would be optimal (or at least good) and stable if actually implemented. However, it’s exceedingly rare that a single actor (or unified group of actors) has the opportunity to design entire systems at institutional scale; notable examples are the authoring of constitutions, and (perhaps) the use of intellectual property law to create new markets.

Instead, policy makers, funders, and other actors with substantial influence at the institutional level are faced with a fourth level of analysis, represented by the question “Where do I put attention, pressure, or resources to create sustainable positive change?” And system design alone doesn’t answer this: Design is essential for identifying where one wants to go, but policy analysis and political analysis are required to understand what actions to take in order to get (closer to) there.

This last question, of the “where do we push now” variety, is what I’ve come to expect (naturally) from my boss, and from other leaders in the field. When pressed, I’ve thus far managed to come up with (after due deliberation) some recommendations (or, at least, hypotheses) for action, but these generally seem like the hardest level of solution to get right, or even to assess as good or bad. I think the difficulty comes from having to hold a coherent high-level vision (from the systems design level) while simultaneously getting back “down into the details” (though not “in the weeds”) to understand the current arrangements and limitations of power, resources, capacity, mechanism, attention, knowledge, and stakeholders.

Notwithstanding, although we started by aiming more at systems design than policy intervention, some recommendations of the policy intervention sort are to be found in the current Agenda. I expect that this year’s revisions, and the planned phases of external review and input, will add more breadth to these recommendations, but that it will require years more of reflection, iteration, and refinement to identify specific policy recommendations across the entire breadth of issues covered by the Agenda at the systems level.


[1] For an accessible broad overview of decision theory, game theory, and related approaches see M. Peterson [2009], An Introduction to Decision Theory. For a classic introduction to policy applications see Stokey & Zeckhauser 1978, A Primer for Policy Analysis.

[2] There are many good textbooks on statistical inference, ranging from the very basic, accessible, and sensible Problem Solving by Chatfield (1995) to the sophisticated and modern Bayesian Data Analysis, 3rd edition, by Gelman et al. (2013). There are relatively few good textbooks on causal inference: Judea Pearl’s (2009) Causality: Models, Reasoning and Inference, 2nd edition, is as definitive as a textbook can be, but challenging; Counterfactuals and Causal Inference: Methods and Principles for Social Research, Morgan & Winship’s textbook, is more accessible.

[3] Market failure is a broad topic, and most articles and even books address only some of the potential conditions for functioning markets. A good, accessible overview is Stiglitz’s Regulation and Failure, but it doesn’t cover all areas. For information stewardship policy the economics of non-consumptive goods is particularly relevant (see Foray, The Economics of Knowledge (2006)), and increasing returns and path dependence are particularly important in social/information network economies (see Arthur 1994, Increasing Returns and Path Dependence in the Economy).

[4] See Lijphart, Arend. “Constitutional design for divided societies.” Journal of Democracy 15.2 (2004): 96-109; and Shugart, Matthew Soberg, and John M. Carey. Presidents and assemblies: Constitutional design and electoral dynamics. Cambridge University Press, 1992.

[5] The late Lin Ostrom’s work was fundamental in this area. See for example, Ostrom, Elinor. Understanding institutional diversity. Princeton University Press, 2009.

Dec 13, 9:26am

This talk, presented as part of the Program on Information Science seminar series, reflects on lessons learned about open data, public participation, technology, and data management from conducting crowd-sourced election mapping efforts.

This topic is discussed in detail in previous posts, most recently here.

We’ve also written quite a few articles, book chapters, etc. on the topic.