Sep 05, 3:21pm

My collaborator Michael McDonald and I have been analyzing the data that resulted from the crowd-sourced participatory electoral mapping projects we were involved in, and from other public redistricting efforts; this blog includes two earlier articles from this line of research. In this research article, to appear in the Proceedings of the 47th Annual Hawaii International Conference on System Sciences (IEEE/Computer Society Press), we reflect on initial lessons learned about public participation and technology from the last round of U.S. electoral mapping.

Three major factors influenced the effectiveness of efforts to increase public input into the political process through crowdsourcing. First, open electoral mapping tools were a practical necessity to enable substantially greater levels of public participation. Second, the interest and capacity of local grassroots organizations were critical to catalyzing the public to engage using these tools. Finally, the permeability of government authorities to public input was needed for such participation to have a significant effect.

The impermeability of government to public input in a democratic state can take a number of more-or-less subtle forms, each of which was demonstrated in the last round of electoral mapping. Authorities can blatantly resist public input by providing no recognized channel for it; by creating a nominal channel but leaving it devoid of funding or process; or by procedurally accepting input but substantively ignoring it.

Authorities can also resist public participation and transparency indirectly through the way they make essential information available to the public. For example, mapping authorities that do not wish to have the potential political consequences of their plans easily evaluated publicly will not provide election results merged with census geography, although they assuredly use such merged information for internal evaluation of their plans. Redistricting authorities may purposefully restrict the scope of the information they make available. For example, a number of states chose to make available boundaries and information related to the approved plan only. Another subtle way by which authorities can hinder transparency is by releasing plan information in a non-machine-readable format. An even more subtle, but substantial, barrier is the interface through which representations of plans are made available.

This resistance appears to have been, in large part, effective. Public participation increased by an order of magnitude in the last round of redistricting. However, except in a few exemplary cases, visible direct effects on policy outcomes appear modest. You can find more details in the article.

Aug 11, 7:30am

Data citation supports attribution, provenance, discovery, and persistence. It is not (and should not be) sufficient for all of these things, but it's an important component. In the last two years, there have been several major efforts to standardize data citation practices, build citation infrastructure, and analyze data citation practices.

This session, presented as part of the Program on Information Science seminar series, examines data citation from an information lifecycle approach: what are the use cases, requirements, and research opportunities? The session also discusses emerging infrastructure and standardization efforts around data citation.

A number of principles have emerged for citation. The most central is that data citations should be treated consistently with citations to other objects: data citations should at least provide the minimal core elements expected in other modern citations, should be included in the references section along with citations to other objects, and should be indexed in the same way.

Adoption of data citation by journals can provide positive and sustainable incentives for more reproducible science and more complete attribution. This would act to brighten the dark matter of science, revealing connections among evidence bases that are not now visible through citations of articles.

Jul 30, 11:08am

Digital stewardship is vital for the authenticity of public records, the reliability of scientific evidence, and the enduring accessibility of our cultural heritage. Knowledge of ongoing research, practice, and organizational collaborations has been distributed widely across disciplines, sectors, and communities of practice. A few days ago I was honored to officially announce the NDSA’s National Agenda for Digital Stewardship at Digital Preservation 2013. The Agenda identifies the highest-impact opportunities to advance the state of the art, the state of practice, and the state of collaboration within the next 3-5 years.

The 2014 Agenda integrates the perspective of dozens of experts and hundreds of institutions, convened through the Library of Congress. It outlines the challenges and opportunities related to digital preservation activities in four broad areas: Organizational Roles, Policies, and Practices; Digital Content Areas; Infrastructure Development; and Research Priorities.

Slides and video of the short (5-min) talk below:

Read the full report here:

Jul 25, 4:49pm

This presentation, invited for a workshop on data preservation for open science held at JCDL 2013, gives a brief tour of a large topic: how do we understand the types of data and software used in social science research? In this presentation I characterize the intellectual landscape across 9 dimensions of data structure, content, measure, and use. I then use this framework to characterize three interesting use cases.

This illustrates some particular challenges for long-term access to and replication of social science research, including the use of “messy” human sensors; the wide mix of data types, structures, sparsity; complex legal constraints; pervasive use of manual and computer-aided coding; use of niche commercial software and bespoke software; and very long-term access requirements.

Jul 19, 5:43pm

MIT has a wonderful tradition of offering a variety of short courses during the Winter hiatus, known as IAP. The MIT Libraries generally offers dozens of courses on data and information management (among other topics) during this time. Moreover, the Libraries also hold bonus IAP sessions in April and July.

So, for this year’s JulyAP, I updated a tutorial on managing confidential data, continuing to integrate the things we’ve learned in our privacy research project. This short course focuses on the practical side of managing confidential research data, but gives some pointers to the research areas.

The slides are below:

Jul 07, 9:23am

My collaborator Michael McDonald and I are now analyzing the data that resulted from the crowd-sourced participatory electoral mapping projects we were involved in, and from others. Our earlier article, analyzing redistricting in Virginia, established that members of the public are capable of creating legal redistricting plans, and in many ways perform better than the legislature in doing so.

Our latest analysis, based on data from Congressional redistricting in Florida, reinforces this finding. Furthermore, it highlights some of the limits of reform efforts, and the structural tradeoffs among redistricting criteria.

The reforms to process in Florida, catalyzed by advances in information technology, enabled a dramatic increase in public participation in the redistricting process. This reform process in Florida can be considered a partial success: the adopted plan implements one of the most efficient observable trade-offs among the reformers’ criteria, primarily along the lines of racial representation, by creating an additional Black-majority district in the form of the current 5th Congressional District. This does not mean, however, that reform was entirely successful. The adopted plan is efficient, but is atypical of the plans submitted by the legislature and public. Based on the pattern of public submissions, and on contextual information, we suspect the adopted plan was drawn for partisan motivations. The public preference and good-government criteria might be better served by the selection of one of the other efficient plans, which were much more competitive and less biased, at the cost of a reduction of the majority-minority seat.

Most of these trends can be made clear through information visualization methods. The figure below draws a line representing the Pareto (efficient) frontier in two dimensions to illustrate the major trade-offs between number of partisan seats and other criteria.

One of the visuals is below. It shows the tradeoffs between partisan advantage and different representational criteria, based on the Pareto frontier of the publicly available plans.

These frontiers suggest that some criteria are more constraining on Democratic redistricters than on Republican redistricters. On the one hand, equipopulation is equally constraining on Democratic and Republican partisan seat creation. On the other hand, the data suggest a structural trade-off between Black-majority seats and Democratic seats.
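
For readers who want to experiment with this kind of analysis, here is a minimal sketch of how a two-dimensional Pareto frontier can be extracted from a set of plans. It is an illustration only, not the code used in the article: the plan names and scores are hypothetical, and both criteria are assumed to be scored so that higher is better.

```python
from typing import List, Tuple

# (plan id, score on criterion A, score on criterion B); higher is better on both.
Plan = Tuple[str, float, float]


def pareto_frontier(plans: List[Plan]) -> List[Plan]:
    """Return the plans that no other plan dominates.

    A plan is dominated if some other plan is at least as good on both
    criteria and strictly better on at least one of them.
    """
    frontier = []
    for name, a, b in plans:
        dominated = any(
            (a2 >= a and b2 >= b) and (a2 > a or b2 > b)
            for _, a2, b2 in plans
        )
        if not dominated:
            frontier.append((name, a, b))
    # Sort along criterion A so the frontier can be drawn as a line.
    return sorted(frontier, key=lambda p: p[1])


# Hypothetical scores, purely for illustration (not data from the article).
plans = [
    ("plan_a", 10, 1),
    ("plan_b", 9, 2),
    ("plan_c", 8, 2),  # dominated by plan_b
    ("plan_d", 7, 3),
    ("plan_e", 7, 1),  # dominated by plan_a
]
frontier = pareto_frontier(plans)
print([name for name, _, _ in frontier])
```

Plans on the frontier represent the efficient trade-offs: moving from one to another improves one criterion only at the cost of the other.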

More details on the data, evaluations, etc. appear in the article (uncorrected manuscript).

Jul 02, 11:01am

The structure and design of digital storage systems is a cornerstone of digital preservation. To better understand ongoing storage practices of organizations committed to digital preservation, the National Digital Stewardship Alliance conducted a survey of member organizations. This talk, presented as part of the Program on Information Science seminar series, discusses findings from this survey, common gaps, and trends in this area.

(I also have a little fun highlighting the hidden assumptions underlying Amazon Glacier’s reliability claims. For more on that see this earlier post.)

Jun 26, 3:11pm

Three weeks ago, NIH issued a request for information to solicit comments on the development of an NIH Data Catalog as part of its overall Big Data to Knowledge (BD2K) Initiative.

The Data Preservation Alliance for Social Sciences issued a response to which I contributed. Two sections are of general interest to the library/stewardship community:

Common Data Citation Principles and Practices


While there are a number of different communities of practice around data citation, a number of common principles and practices can be identified. 


The editorial policy of Science [see ] is an exemplar of two principles for data citations: first, that published claims should cite the evidence and methods upon which they rely, and second, that things cited should be available for examination by the scientific community. These principles have been recognized across a set of communities and expert reports, and are increasingly being adopted by a number of other leading journals. [See Altman 2012; CODATA-ICSTI Task Group, 2013; and]


Implementation Considerations


Previous policies aiming to facilitate open access to research data have often failed to achieve their promise in implementation. Effective implementation requires standardizing core practices, aligning stakeholder incentives, reducing barriers to long-term access, and building in evaluation mechanisms.


A set of core recognized good practices has emerged that spans fields. Good practice includes separating the elements of citation from the presentation; including in the elements identifier, title, author, and date information, and where at all possible version and fixity information; and listing data citations in the same place as citations to other works, typically in the references section. [See Altman-King 2006; Altman 2012; CODATA-ICSTI Task Group 2013; ; ]
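
As a rough illustration of separating citation elements from their presentation, the sketch below stores the elements in a small record and renders them through a separate function. The field names, the rendering style, and the example values (including the identifier and fixity string) are all hypothetical and do not follow any particular standard.

```python
from dataclasses import dataclass, asdict
from typing import Optional, Tuple


@dataclass(frozen=True)
class DataCitation:
    """Citation elements only; no formatting decisions live here."""
    authors: Tuple[str, ...]
    date: str
    title: str
    identifier: str
    version: Optional[str] = None  # include where at all possible
    fixity: Optional[str] = None   # e.g. a digest of the cited data


def render_plain(c: DataCitation) -> str:
    """One possible presentation; a journal style could substitute its own."""
    parts = [f"{'; '.join(c.authors)} ({c.date}). {c.title}.", c.identifier]
    if c.version:
        parts.append(f"Version {c.version}.")
    if c.fixity:
        parts.append(f"Fixity: {c.fixity}.")
    return " ".join(parts)


# Hypothetical example values, not a real dataset.
citation = DataCitation(
    authors=("Doe, J.", "Roe, R."),
    date="2013",
    title="Example Survey Data",
    identifier="doi:10.0000/example",
    version="2",
    fixity="sha256:abc123",
)
print(render_plain(citation))
print(asdict(citation))  # the same elements, ready for machine indexing
```

Because the elements are kept apart from the rendering, the same record can feed a references section, an index, or a different citation style without re-entering anything.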


Although the incentives related to data citation and access are complex, there are a number of simple points of leverage. First, journals can create positive incentives for sharing data by requiring that data be properly cited. Second, funders can require that only those outputs of research that comply with access and citation policies can be claimed as results from prior research.

You may read the full response on the Data-PASS site. 

Jun 20, 10:31am

Metadata can be defined variously as “data about data”, digital ‘breadcrumbs’, magic pixie dust, and “something that everyone now knows the NSA wants a lot of”. It’s all of the above.

Metadata is used to support decisions and workflows, to add value to objects (through enhancing discovery, use, reuse, and integration), and to support evaluation and analysis. It’s not the whole story for any of these things, but it can be a big part.

This presentation, invited for a workshop on Open Access and Scholarly Books (sponsored by the Berkman Center and Knowledge Unlatched),  provides a very brief overview of metadata design principles, approaches to evaluation metrics, and some relevant standards and exemplars in scholarly publishing. It is intended to provoke discussion on approaches to evaluation of the use, characteristics, and value of OA publications.

Jun 11, 11:51am

Best practices aren’t.

The core issue is that there are few models for the systematic valuation of data: we have no robust, general, proven ways of answering the question of how much data X will be worth to community Y at time Z. Thus the “bestness” (optimality) of practices is generally strongly dependent on operational context, and the context of data sharing is currently both highly complex and dynamic. Until there is systematic descriptive evidence that best practices are used, predictive evidence that best practices are associated with future desired outcomes, and causal evidence that the application of best practices yields improved outcomes, we will be unsure that practices are “best”.

Nevertheless, one should use established “not-bad” practices, for a number of reasons: first, to avoid practices that are clearly bad; second, because use of such practices acts to document operational and tacit knowledge; third, because selecting practices can help to elicit the underlying assumptions under which practices are applied; and finally, because not-bad practices provide a basis for auditing, evaluation, and eventual improvement.

Specific not-bad practices for data sharing fall into roughly three categories:

  • Analytic practices: lifecycle analysis & requirements analysis
  • Policy practices for: data dissemination, licensing, privacy, availability, citation and reproducibility
  • Technical practices for sharing and reproducibility, including fixity, replication, provenance
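
To make the fixity and replication items a little more concrete, here is a minimal sketch of recording a digest for a data file and checking replicas against it. The choice of SHA-256, the function names, and the replica site names are assumptions made for illustration; they are not taken from the presentation.

```python
import hashlib
from typing import Dict


def sha256_digest(data: bytes) -> str:
    """Fixity value for a byte string, computed with SHA-256."""
    return hashlib.sha256(data).hexdigest()


def verify_replicas(recorded: str, replicas: Dict[str, bytes]) -> Dict[str, bool]:
    """Compare each replica's digest against the recorded fixity value."""
    return {
        site: sha256_digest(content) == recorded
        for site, content in replicas.items()
    }


# Hypothetical data: one intact replica and one that has silently changed.
original = b"id,value\n1,42\n"
recorded = sha256_digest(original)  # stored alongside the data at deposit time
replicas = {
    "site_a": original,
    "site_b": b"id,value\n1,43\n",
}
status = verify_replicas(recorded, replicas)
print(status)
```

A mismatch only shows that a copy differs from what was deposited; deciding which copy to repair from still depends on having independent replicas and a trustworthy record of the digest.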

This presentation at the Second Open Economics International Workshop (sponsored by the Sloan Foundation, MIT and OKFN) provides an overview of these and links to specific practices recommendations, standards, and tools: