May 01, 5:10pm

More and more frequently, research and scholarly publications involve collaboration, and co-authorship in most scholarly fields is rapidly increasing in frequency and complexity. Yet co-authorship patterns vary across discipline; the interpretation of author ordering is fraught with ambiguity; and richer textual descriptions of authorship are uncommon, and difficult to interpret, aggregate, and evaluate at scale.

There is a growing interest across stakeholders in scholarship in increasing the transparency of research contributions and in identifying the specific roles that contributors play in creating scholarly works. Understanding and accommodating emerging forms of co-authorship is critical for managing intellectual property, publication ethics, and effective evaluation.

The slides below provide an overview of the research questions in this area, and was originally presented as part of the Program on Information Science Brown Bag Series. Our upcoming panel at the Society for Scholarly Publishing, will to explore these issues further. 


Apr 17, 11:26am

Whose articles cite a body of work? Is this a high-impact journal? How might others assess my scholarly impact? Citation analysis is one of the primary methods used to answer these questions. Academics, publishers, and funders often study the patterns of citations in the academic literature in order to explore the relationships among researchers, topics, and publications, and to measure the impact of articles,  journals, and individuals.

MIT has a wonderful tradition of offering a variety of short courses during the Winter hiatus, known as IAP. The MIT Libraries generally offers dozens of courses on data and information management (among other topics) during this time. Moreover, the Libraries also hold bonus IAP sessions in April and July.

For IAPril, the Program on Information Science has introduced a new course on citation and bibliometric analysis, this provides a review of and citation and altmetric data; best-of-class free/open tools for analysis; and classes of measures.

The full (and updated) slides are below:


In addition a short summary of MIT resources, as summarized in a presentation to library Liaisons is available here:

Registration is available on the class site:

Apr 09, 3:53pm

I was honored to present some of our research at a panel on election reform at this year’s Harvard Law and Policy Review symposium .

As a summary of this the Harvard Law and Policy Review’s Notice and Comment Blog  has published a concise summary of recommendation for redistricting reform by my collaborator Michael McDonald and I, entitled:

Create Real Redistricting Reform through Internet-Scale Independent Commissions

To quote from the HLPR summary:

Twenty-first century redistricting should incorporate  transparency at internet speed and scale  – open source, open data, open process (see here for in-depth recommendations) — and twenty-first century redistricting should incorporate internet technology for twenty-first century participation: direct access to the redistricting process; access to legal-strength mapping tools; and the integration of  crowd-sourcing to create maps, identify communities and neighborhoods, collect and correct data, and gather and analyze public commentary.

There are few policy arenas where the public can fashion legitimate proposals that rival what their elected officials enact. Redistricting is among them, so why not enable greater public participation in this critical democratic process?

Read the rest of these recommendations on the Harvard Law and Policy Review — Notice and Comment site


(And n a related topic  this previous post summarizes some of our research on  crowd sourced mapping for open government .)

Mar 31, 10:56am

Big data has huge implications for privacy, as summarized in our commentary below:

Both the government and third parties have the potential to collect extensive (sometimes exhaustive), fine grained, continuous, and identifiable records of a person’s location, movement history, associations and interactions with others, behavior, speech, communications, physical and medical conditions, commercial transactions, etc. Such “big data” has the ability to be used in a wide variety of ways, both positive and negative. Examples of potential applications include improving government and organizational transparency and accountability, advancing research and scientific knowledge, enabling businesses to better serve their customers, allowing systematic commercial and non-commercial manipulation, fostering pervasive discrimination, and surveilling public and private spheres.

On January 23, 2014, President Obama asked John Podesta to develop in 90 days, a ‘comprehensive review’ on big data and privacy.

This lead to a series of workshop on big data and  technology at MIT, and on social cultural & ethical dimensions at NYU, with a third planned to discuss legal issues at Berkeley. A number of colleagues from our Privacy Tools for Research  project and from the BigData@CSAIL projects have contributed to these workshops  and raised many thoughtful issues (and the workshop sessions are  online and well worth watching).

EPIC, ARL and 22 other privacy organizations requested an opportunity to comment on the report, and OSTP later allowed for a 27-day commentary period during which brief comments would be accepted by e-mail. (Note that the original RFI provided by OSTP is, at the time of this writing, a broken link, so we have posted a copy. )They requested commenters provide specific answers to five extraordinarily broad questions:

  1. What are the public policy implications of the collection, storage, analysis, and use of big data? For example, do the current U.S. policy framework and privacy proposals for protecting consumer privacy and government use of data adequately address issues raised by big data analytics? 
  2. What types of uses of big data [are most important]… could measurably improve outcomes or productivity with further government action, funding, or research? What types of uses of big data raise the most public policy concerns? Are there specific sectors or types of uses that should receive more government and/or public attention?
  3. What technological trends or key technologies will affect the collection, storage, analysis and use of big data? Are there particularly promising technologies or new practices for safeguarding privacy while enabling effective uses of big data?
  4. How should the policy frameworks or regulations for handling big data differ between the government and the private sector? Please be specific as to the type of entity and type of use (e.g., law enforcement, government services, commercial, academic research, etc.). 
  5. What issues are raised by the use of big data across jurisdictions, such as the adequacy of current international laws, regulations, or norms? 

 My colleagues at the Berkman Center, David O’Brien, Alexandra Woods, Salil Vadhan and I have submitted responses to these questions that outline a broad, comprehensive, and systematic framework for analyzing these types of questions and taxonomize a variety of modern technological, statistical, and cryptographic approaches to simultaneously providing privacy and utility. This comment is made on behalf of the Privacy Tools for Research Project, of which we are a part, and has benefitted from extensive commentary by the other project collaborators.

Much can be improved in how big data and is currently treated. To summarize (quoting from the conclusions of the comment):

Addressing privacy risks requires a sophisticated approach, and the privacy protections currently used for big data do not take advantage of advances in data privacy research or the nuances these provide in dealing with different kinds of data and closely matching sensitivity to risk.

I invite you to read the full comment .

Mar 27, 5:45pm

Scholarly publishers, research funders, universities, and the media, are increasingly scrutinizing research outputs. Of major concern is the integrity, reliability, and extensibility of the evidence on which published findings are based. A flood of new funder mandates, journal policies, university efforts, and professional society initiatives aim to make this data verifiable, reliable, and reusable: If “data is the new oil”, we need data management to prevent ‘fires’, ensure ‘high-octane’, and enable ‘recycling’.

In March, I had the pleasure of being the inaugural speaker in a new lecture series ( initiated by the Libraries at the Washington University in St. Louis Libraries — dedicated to the topics of data reproducibility, citation, sharing, privacy, and management.

In the presentation embedded below, I provide an overview of the major categories of new initiatives to promote research reproducibility, reliability, and reuse and related state of the art in informatics methods for managing data.

This blog post provides some wider background for the presentation, and a recap of its recommendations. The approaches can be roughly divided into three categories. The first approach focuses on tools for reproducible computation ranging from “statistical documents” (incorporating Knuth’s [1992] concept of literate programming) to workflow systems and reproducible computing environments [for example, Buckheit & Donoho 1995; Schwab et al. 2000; Leisch & Rossini 2003; Deelman & Gils 2006; Gentleman & Temple-Lang 2007] With few exceptions [notably, Freire, et al. 2006] this focuses primarily on “simple replication” or “reproduction” –replicating exactly a precise set of result from an exact copy of original data made at the time of research.

Current leading examples of tools that support reproducible computation include:


The second approach focuses on data sharing methods and tools [see for example, Altman et al 2001; King 2007; Anderson et al., 2007; Crosas 2011]. [1]  This approaches more generally on helping researchers to share — both for replication and for broader reuse – including secondary uses and use in teaching. Increasingly work in this area [e.g. Gutmann 2009; Altman-King 2007] focuses on issues of enabling long-term and interdisciplinary access to data – this requires that the researchers’ tacit knowledge about data formats, measurement, structure and provenance be more explicitly documented.

Current leading examples of informatics tools that support data sharing include:


The third approach focuses on the norms, practices and licensing associated with data sharing archiving and replication and the related incentives embedded in scholarly communication [Pienta 2007; Hamermesh 2007; Altman & King 2007; King 2007; Hedstrom et al. 2008; McCullough 2009; Stodden 2009].  This approach seeks to create the necessary conditions to enable data sharing and reuse, and to examine and align citations around citation, data sharing, and peer review to encourage replicability and reusability.

Current leading examples of informatics tools that support richer citation, evaluation, open science, and review include:


Many Tools, Few Solutions

In this area, there are many useful tools, but few solutions that offer a complete solution – even for a specialized community of practice. All three approaches are useful, and here are several general observations to be made about them. First, tools for replicable research such as VisTrails, MyExperiment, Wings, and StatDocs are characterized by their use of a specific and controlled defined software framework and their ability to facilitate near automatic replication. The complexity of these tools, and their small user and maintenance base means that we cannot rely on them to exist and function in five-ten years – they cannot ensure long term access. Because they focus only on results and not on capturing practices, descriptive metadata and documentation, they allow exact replication without providing the contextual information necessary for broader reuse.  Finally these tools are heterogeneous across subdisciplines, and largely incompatible, they do not as yet offer a broadly scalable solution.

Second, tools and practices for data management have the potential to broadly increase data sharing and the impact of related publications However, although these tools are becoming easier to use, they still require an extra effort for the researcher. Moreover, since additional effort often comes near (or past) the conclusion of the main research project (and only after acceptance of an article and preparation for final publication) it is perceived as a burden, and often honored in the breach.

Third, incentives for replication have been weak in many disciplines – and journals are a key factor. The reluctance of journal editors to publish articles either confirming or non-confirming replications work authors’ incentives to create replicable work.  Lack of formal provenance and attribution practices for data also weakens accountability, raises barriers to conducting replication and reuse, reduces incentive to disseminate data for reuse, and increases the ambiguity of replication studies, making them difficult to study.

Furthermore, new forms of evidence complicate replication and reuse. In most scientific disciplines, the amount of data potentially available for research is increasing non-linearly.  In addition, changes in technology and society are greatly affecting the types and quantities of potential data available for scientific analysis, especially in the social sciences. This presents substantial challenges to the future replicability and reusability of research. Traditional data archives currently consist almost entirely of numeric tabular data from noncommercial sources. New forms of data differ from tabular data in size, format, structure, and complexity. Left in its original form, this sort of data is difficult or for scholars outside of the project that generated it to interpret and use. This is a barrier to integrative and interdisciplinary research, but also a significant obstacle to providing long-term access, which becomes practically impossible as the tacit knowledge necessary to interpret the data is forgotten. To enable broad use and to secure long term access requires more than simply storing the individual bits of information – it requires establishing and disseminating good data management practices. [Altman & King 2007]

How research libraries can jump-start the process.

Many research libraries should consider at least three steps:

1. Create a dataverse hosted by the Harvard Dataverse Network ( ). This provides free, permanent storage, dissemination, with bit-level preservation insured by Harvard’s endowment.  The dataverse can be branded, curated, and controlled by the library – so it enables libraries to maintain relationship with their patrons, and provide curation services, with minimal effort. (And since DVN is open-source, a library can always move from the hosted service to one they run themselves.

2. Link to DMPTool ( from your libraries website. And consider joining DMPTool as an institution – especially if you use Shibboleth (Internet2) to authorize your users.   You’ll be in good company — according to a recent ARL survey 75% of  ARL libraries are now at least linking to DMPTool.  Increasing researchers use  of DMPtool provides early opportunities for conversation with libraries around data, enables libraries to offer service at a time when it is salient to the researcher , and provides a information which can be used to track and evaluate data management planning needs.

3. Publish a “libguide” focused on helping researchers get more credit for their work.  This is a subject of intense interest, and the library can provide information about trends and tools in the area that researchers (especially junior researchers) of which researchers may not be aware. Some possible topics to include: Data citation (e.g. the ); researcher identifiers (e.g., ); and impact metrics ( .






Altman, M., L. Andreev, M.  Diggory, M. Krot, G. King, D. Kiskis, A. Sone, S. Verba,  A Digital Library for the Dissemination and Replication of Quantitative Social Science Research, Social Science Computer Review 19(4):458-71. 2001.

Altman, M. and G. King. “A Proposed Standard for the Scholarly Citation of Quantitative Data”, D-Lib Magazine 13(3/4).  2007.

Anderson, R.  W. H. Greene, B. D. McCullough and H. D. Vinod. “The Role of Data/Code Archives in the Future of Economic Research,” Journal of Economic Methodology. 2007.

Buckheit, J. and D.L. Donoho,Wavelan and Reproducible Research, in A. Antoniadis (ed.) Wavelets and Statistics, Springer-Verlag. 1995.

Crosas, M., The Dataverse Network®: An Open-Source Application for Sharing, Discovering and Preserving Data, D-lib Magazine 17(1/2). 2011.

D.S. Hamermesh, “Viewpoint: Replication in Economics,” Canadian Journal of Economics. 2007.

Deelman, E. Y. Gil, (Eds.). Final Report on Workshop on the Challenges of Scientific Workflows.  2006. <>

Freire, J., C. T. Silva, S. P. Callahan, E. Santos, C. E. Scheidegger, and H. T. Vo. Managing rapidly-evolving scientific workflows. In International Provenance and Annotation Workshop (IPAW), LNCS 4145, 10-18, 2006.

Gentleman R., R. Temple Lang. Statistical Analyses and Reproducible Research,  Journal of Computational and Graphical Statistics 16(1): 1-23. 2007.

Gutmann M., M. Abrahamson, M. Adams, M. Altman, C.  Arms,  K.  Bollen, M. Carlson, J. Crabtree, D. Donakowski, G. King, J. Lyle, M. Maynard, A. Pienta, R. Rockwell, L. Timms-Ferrara, C. Young,  “From Preserving the Past to Preserving the Future: The Data-PASS Project and the challenges of preserving digital social science data”, Library Trends 57(3):315-337. 2009.

Hedstrom, Margaret, Jinfang Niu, Kaye Marz,. “Incentives for Data Producers to Create “Archive/Ready” Data: Implications for Archives and Records Management”, Proceedings of the Society of American Archivists Research Forum. 2008.

King, G. “An Introduction to the Dataverse Network as an Infrastructure for Data Sharing.” Sociological Methods and Research, 32(2), 173–199. 2007.

Knuth, D.E., Literate  Programming, CLSI Lecture Notes 27. Center for the Study of Language and Information.  Stanford, Ca. 1992.

Leisch F., and A.J. Rossini, Reproducible Statistical Research, Chance 16(2): 46-50. 2003.

McCullough, B.D., Open Access Economics Journals and the Market for Reproducible Economic Research, Economic Analysis & Policy 39(1). 2009.

Pienta, A., LEADS Database Identifies At-Risk Legacy Studies, ICPSR Bulletin 27(1) 2006.

Schwab, M., M. Karrenbach, and J. Claerbout, Making Scientific Computations Reproducible, Computing in Science and Engineering 2: 61-67. 2000.

Stodden, V.The Legal Framework for Reproducible Scientific Research: Licensing and Copyright, Computing in Science and Engineering 11(1):35-40. 2009.



[1] Also see for example the CRAN reproducible research task view:;  and the Reproducible Research tools page:

Mar 27, 2:53pm

FTC  has been hosting a series of seminars on consumer privacy, on which it has requested comments. The most recent seminar explored privacy issues related to mobile device tracking. As the seminar summary points out …

In most cases, this tracking is invisible to consumers and occurs with no consumer interaction. As a result, the use of these technologies raises a number of potential privacy concerns and questions.

The presentations raised an interesting and important combination of questions about how to promote business and economic innovation while protecting individual privacy.  I have submitted a comment on these changes with some proposed recommendations.

To summarize (quoting from the submitted the comment):

Knowledge of an individual’s location history and associations with others has the potential to be used in a wide variety of harmful ways. … [Furthermore], since all physical activity has a unique spatial and temporal context, location history provides a linchpin for integrating multiple sources of data that may describe an individual. [R]esearch shows that human mobility patterns are highly predictable and that these patterns have unique signatures, making them highly identifiable– even in the absence of associated identifiers or hashes. Moreover, locational traces are difficult or impossible to render non-identifiable using traditional masking methods.

I invite you to read the full comment here:

This comment drew heavily on previous comments on proposed OSHA regulation made with colleagues at the Berkman Center, David O’Brien, Alexandra Woods, was made on behalf of the Privacy Tools for Research Project, of which we are a part, and has benefitted from extensive commentary by the other project collaborators.


Mar 08, 12:03pm

OSHA has proposed a set of set of changes to current tracking of workplace injuries and illnesses.

Currently information about workplace injuries and illnesses must be recorded, but only on paper. Further most of this information is never reported —  OSHA only receives detailed information  when it conducts an investigation, and receives  summary records from  only a small percentage of employers who are selected to participate in the annual survey. (Additionally BLS receives a sample of this information in order to produce specific statistics for  its “Survey of Occupational Injuries and Illnesses”

OSHA proposes three changes. The first change would require establishment to regularly  submit the information that they are already required to collect  and maintain (quarterly submission of detailed information for larger establishment, and annual submission of summary information from any establishment with more than twenty employees that is already required to maintain these records) . The second change makes this process digital — submissions would be electronic, instead of on paper. And the third change would be to make the data collected public — searchable, and downloadable in machine-actionable (.csv) form.

 These proposed changes raise an interesting and important combination of questions about how to promote government (and industry) transparency while protecting individual privacy. My colleagues at the Berkman Center, David O’Brien, Alexandra Woods, and I have submitted an extensive comment on these changes with some proposed recommendations. This comment is made on behalf of the Privacy Tools for Research Project, of which we are a part, and has benefitted from extensive commentary by the other project collaborators.

To summarize (quoting from the conclusions of the comment):

We argue that workplace injury and illness records should be made more widely available because releasing these data has substantial potential individual, research, policy, and economic benefits. However, OSHA has a responsibility to apply best practices to manage data privacy and mitigate potential harms to individuals that might arise from data release.

The complexity, detail, richness, and emerging uses for data create significant uncertainties about the ability of traditional ‘anonymization’ and redaction methods and standards alone to protect the confidentiality of individuals. Generally, one size does not fit all, and tiered modes of access – including public access to privacy-protected data and vetted access to the full data collected – should be provided.

Such access requires thoughtful analysis with expert consultation to evaluate the sensitivity of the data collected and risks of re-identification and to design useful and safe release mechanisms.

I invite you to read the full comment here:

Mar 05, 3:21pm

Sound, reproducible scholarship rests upon a foundation of robust, accessible data.  For this to be so in practice as well as theory, data must be accorded due importance in the practice of scholarship and in the enduring scholarly record.  In other words, data should be considered legitimate, citable products of research.

A few days ago I was  honored to officially announce the  Data Citation Working Group’s Joint Declaration of Data Citation Principles  at IDCC 2014, from which the above quote is taken.

This Joint Data Citation Principles identifies guiding principles for the scholarly citation of data. This recommendation is a s collaborative work with CODATA, FORCE 11, DataCite and many other individuals and organizations.   And in the week since it has been released, it has already garnered over twenty institutional endorsements.

Some slides introducing the principles are here:

To summarize, from 1977 through 2009 there were three phases of development in the area of data citation.

  • The first phase of development focused on the role of citation to facilitate description and information retrieval. This phase introduced the principles that data in archives should be described as works rather than media, using author, title, and version.
  •  The second phase of development extended citations to support data access and persistence. This phase introduced the principles that research data used by publication should be cited, that those citations should include persistent identifiers, and that the citations should be directly actionable on the web.
  •  The third phase of development focused on using citations for verification and reproducibility. Although verification and reproducibility had always been one of the motivations for data archiving – it had not been a focus of citation practice. This phase introduced the principles that citations should support verifiable linkage of data and published claims, and it started the trend towards wider integration with the publishing ecosystem

And over the last five years the importance and urgency of scientific data management and access has been  recognized more broadly. The culmination of this trend toward increasing recognition, thus far, is an increasingly widespread consensus by researchers and funders of research that data is a fundamental product of research and therefore a citable product. The fourth and current phase of data development work focuses on integration with the scholarly research and publishing ecosystem. This includes integration of data citation in standardized ways within publication, catalogs, tool chains, and larger systems of attribution.

Read the full recommendation here, along with examples, references and endorsements:

 Joint Declaration of Data Citation Principles

Feb 13, 2:34pm

Our guest speaker,  Cavan Capps,  who is Big Data Lead services presented this  talk as part of the Program on Information Science Brown Bag Series.

Cavan Capps is the U.S. Census Bureau’s Lead on Big Data processing. In that role he is focusing on new Big Data sources for use in official statistics, best practice private sector processing techniques and software/hardware configurations that may be used to improve statistical processes and products. Previously, Mr. Capps initiated, designed and managed a multi-enterprise, fully distributed, statistical network called the DataWeb.

Capps provided the following summary of his talk.

Big Data provides both challenges and opportunities for the official statistical community.  The difficult issues of privacy, statistical reliability, and methodological transparency will need to be addressed in order to make full use of Big Data in the official statistical community.  Improvements in statistical coverage at small geographies, new statistical measures, more timely data at perhaps lower costs are the potential opportunities. This talk will provides an overview of some of the research being done by the Census Bureau as it explores the use of “Big Data” for statistical agency purposes.

And he has also described the U.S. census efforts to incorporate big data in in this article:

What struck me most about Capps’ talk and the overall project is how many disciplines have to be mastered to find an optimal solution.

  • Deep social science knowledge (especially economics, sociology, psychology, political science) to design the right survey measures, and to come up with theoretically and substantively coherent alternative measures;
  • carefully designed machine learning algorithms are needed to extract actionable information from non-traditional data sources; 
  • advances in statistical methodology are needed to guide adaptive survey design; make reliable inferences over dynamic social networks; and to measure and correct for bias in measures generated from non-traditional data sources and non-probability samples;
  • large scale computing is needed to do this all in real time;
  • information privacy science is required to ensure the results that are released (at scale, and in real time) continue to maintain the public trust; And…
  • information science methodology is required to ensure the quality, versioning, authenticity, provenance and reliability that is expected of the US Census.

This is indeed a complex project. And also given the diversity of  areas implicated, a stimulating one — it has resonated with many different projects and conversations at MIT.

Feb 07, 8:49am

When is “free to use” not free? … when it doesn’t establish  specific rights.

NISO offered a recent opportunity to comment on the draft recommendation on ‘Open Access Metadata and Indicators’. The following is my commentary on the draft recommendation. You may also be interested in reading the other commentaries on this draft.

Response to request for public comments on on ‘Open Access Metadata and Indicators’

Dr. Micah Altman
Director of Research; Head/Scientist, Program on Information Research — MIT Library, Massachusetts Institute of Technology
Non-Resident Senior Fellow, The Brookings Institution


Thank you for the opportunity to respond to this report. Metadata and indicators for Open Access publications are an area in which standardization would benefit the scholarly community. As a practicing social scientist, a librarian, and as MIT’s representative to NISO, I have worked extensively with (and contributed to the development of) metadata schemas, open and closed licenses, and open access publications. My contribution is made with this perspective.



The scope of the work originally approved by NISO members was to develop metadata and visual indicators that would enable a user to determine whether a specific article is openly accessible, and what other rights are available. The current draft is more limited in in two ways: First, the draft does not address visual indicators. Second, the metadata field proposed in the draft signals only the license available without providing information on specific usage rights.

The first limitation is a pragmatic limitation of scope. Common, well-structure metadata is a pre-condition for systematic and reliable visual indicators. These indicators may be developed later by NISO or other communities.

The second limitation in scope is more fundamental and problematic. Metadata indicating licenses is less directly actionable.  A user can take direct actions (e.g., to read, mine, disseminate, or reuse content) based on knowledge of rights, but cannot take such actions knowing only the URI of the license — unless the user has independently determined what rights are associated with every license encountered. Moreover, different users may (correctly or incorrectly) interpret the same license as implicating different sets of rights. This creates both additional effort and risk for users, which greatly limits the potential value of the proposed practice.

The implicit justification for this limitation of scope is not clearly argued. However it seems to be based the claims, made on page 2, that “This is a contentious area where political views on modes of access lead to differing interpretations of what constitutes ‘open access’” and  “Considering the political, legal, and technical issues involved, the working group agreed that a simple approach for transmitting a minimal set of information would be preferred.”  [1]

This is a mistake, in my judgment, since contention (or simply uncertainty) over the definition of “open access” notwithstanding,  there is a well-established and well-defined set of core criteria that apply to open licenses; these include: the 10 criteria comprising the Open Source Initiatives Open Source Definition [2]; the 11 criteria comprising the Open Knowledge Foundation Open Definition (many of which are reused from the OSI criteria) [3]; and the four license properties defined by creative commons. [4] These criteria currently are readily applicable to dozens of existing independently-created open licenses [5], which have been applied to millions of works. Although these properties may not be comprehensive, nor is there a universal agreement over which of these properties constitute “open access”, any plausible definition of Open Access should include at least one of these properties. Thus a metadata schema that could be used to signal these properties is feasible, and could be reliably used to indicate useful rights. These right elements should be added to complement the proposed license reference tag, which could be used to indicate other rights not covered in the schema.

Design Issues

The proposed draft lists nine motivating use cases. The general selection of use cases appears appropriate. However, the definition of success for each use case is not clearly defined, making the claim that use cases are satisfied arguable. Moreover, the free_to_read element does adequately address the use cases to which it is a proposed solution. The free_to_read element is defined as meaning that “content can be read or viewed by any user without payment or authentication” (pg 4) the purpose is to  “provide a very simple indication of the status of the content without making statements about any additional re-use rights or restrictions.” No other formal definition of usage rights or conditions for this element is provided in the draft.

Under the stated definition, rights  other than reading could be curtailed in any variety of ways —  including for example a restriction on the right to review, criticize or comment upon the material.  Thus the rights implied by the free_to_read  element are less then the minimal criteria provided by any plausible open access license. Furthermore, this element is claimed to comprise part of the solution to compliance evaluation use cases (use cases 5.8 and 5.9 in the draft). It cannot support that purpose – compliance auditing relies upon well-defined criteria, and the free_to_read  definition is fatally ambiguous. The free_to_read element should be removed from the draft. It should be replaced by a metadata attribute indicating readability as defined by metadata indicating ‘access’ rights, as defined by the open criteria listed above. [6]

Technical Issues

  • Finally, there are a number of changes to the technical implementation proposed:
  • There should be a declared namespace for the proposed XML license elements, so they can be used in a structured way in XML documents without including them separately in multiple schema definitions
  •  Semantic markup (e.g. RDF) is required so that these elements may be used in non-XML metadata
  •  A schema should be supplied that formally and unambiguously defines: which elements and attributes are required, which are repeatable, what datatypes (e.g. date formats) are allowable, and any implicit default values
  • Make explicit that license_ref may  refer to waivers (e.g. CC0) as well as licenses.


[1] The draft also raises the concern, on page 4, that “no legal team is going to agree to allow any specific use based on metadata unless they have agreed that the license allows.” The evidence for this assertion is unclear. However, even if true, it can easily be addressed by using the criteria above to design standard rights metadata profiles for each license to complement the metadata attributes associated with individual document. A legal team can then vet the profiles associated with a license, could certify a registry responsible for maintaining such profiles, or could agree to accept profiles available from the licenses authors.





[6] Alternatively, if the definition of metadata elements is simply beyond the capacity of this project, free_to_use should simply be replaced by a license_ref instance that includes a URI to a well known license that  established free_to_read. The latter could even be designed for this purpose. This would at least remove the ambiguity of the free_to_read condition, and further simplify the schema.