Nov 23, 9:09am

My colleague Nancy McGovern, Head of Curation and Preservation Services, presented this as part of the Program on Information Science Brown Bag Series.

DIPIR employs qualitative and quantitative data collection to investigate the reuse of digital data in the quantitative social sciences, archaeology, and zoology. Its main research focus is on significant properties.

The team has also recently published an evaluation of researchers’ perceptions of what constitutes a trustworthy repository. In DIPIR’s sample, perceptions of trust were influenced by transparency, metadata quality, data cleaning, and reputation with colleagues. Notably absent are such things as certifications and sustainable business models. Also, as in most studies of trust in this area, the context of “trust” is left open — the factors that make an entity trustworthy as a source of information are different from those that might cause one to trust an entity to preserve data deposited with it. Since researchers tend to use data repositories for both, it’s difficult to tease these apart.

Oct 31, 3:05pm

Personal information is ubiquitous, and it is becoming increasingly easy to link information to individuals. Laws, regulations, and policies governing information privacy are complex, but most intervene through either access restriction or anonymization at the time of data publication.
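As a toy illustration of the anonymization approach (the field names and generalization rules here are hypothetical, not drawn from any specific legal standard), one can drop direct identifiers and coarsen quasi-identifiers before publication:

```python
# Toy de-identification sketch: drop direct identifiers and
# generalize quasi-identifiers before publishing a record.
# Field names and rules are hypothetical illustrations.

DIRECT_IDENTIFIERS = {"name", "ssn", "email"}

def generalize(record):
    """Coarsen quasi-identifiers: bin age into decades, truncate ZIP."""
    out = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}
    if "age" in out:
        out["age"] = f"{(out['age'] // 10) * 10}s"   # 37 -> "30s"
    if "zip" in out:
        out["zip"] = str(out["zip"])[:3] + "**"      # "02139" -> "021**"
    return out

record = {"name": "Jane Doe", "age": 37, "zip": "02139", "diagnosis": "flu"}
print(generalize(record))
# {'age': '30s', 'zip': '021**', 'diagnosis': 'flu'}
```

The weakness of this traditional approach is exactly what the post describes: once published records can be linked against other ubiquitous data sources, coarsened quasi-identifiers may still re-identify individuals.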

Trends in information collection and management — cloud storage, “big” data, and debates about the right to limit access to published but personal information — complicate data management and make traditional approaches to managing confidential data decreasingly effective.

This session, presented as part of the Program on Information Science seminar series, examines trends in information privacy. It also discusses emerging approaches and research around managing confidential research information throughout its lifecycle, drawing in part on research from the Privacy Tools project.

Sep 14, 1:01pm

Much of what we know about scholarly communication and the “science of science” relies on the “scholarly record” of journal publications, monographs, and books, and upon the patterns of findings, evidence, and collaborations that analysis of this record reveals. In contrast, research data, in its current state, represents a type of “scholarly dark matter” that underlies the currently visible evidentiary relationships among publications. Improved data citation practices have the potential to make this dark matter visible.

Yesterday the Data Science Journal published a special issue devoted to data citation: Out of Cite, Out of Mind: The Current State of Practice, Policy, and Technology for the Citation of Data. This is a comprehensive review of data citation principles, practices, infrastructure, policy, and research. I’m very pleased to have contributed to researching and writing this document as part of the CODATA-ICSTI Task Group on Data Citation Standards and Practices.

This is a rapidly evolving area, and representatives from the CODATA-ICSTI task group, Force 11, the Research Data Alliance, and a number of other groups have formed a synthesis group, which is developing an integrated statement of principles to promote broad adoption of a consistent policy for data citation across disciplines and venues.

Sep 05, 3:21pm

My collaborator Michael McDonald and I have been analyzing the data that resulted from the crowd-sourced participatory electoral mapping projects we were involved in, as well as other public redistricting efforts; this blog includes two earlier articles from this line of research. In this research article, to appear in the Proceedings of the 47th Annual Hawaii International Conference on System Sciences (IEEE/Computer Society Press), we reflect on initial lessons learned about public participation and technology from the last round of U.S. electoral mapping.

Three major factors influenced the effectiveness of efforts to increase public input into the political process through crowdsourcing. First, open electoral mapping tools were a practical necessity for enabling substantially greater levels of public participation. Second, the interest and capacity of local grassroots organizations were critical to catalyzing the public to engage using these tools. Finally, the permeability of government authorities to public input was needed for such participation to have a significant effect.

The impermeability of government to public input in a democratic state can take a number of more-or-less subtle forms, each of which was demonstrated in the last round of electoral mapping: authorities blatantly resist public input by providing no recognized channel for it; by creating a nominal channel, but leaving it devoid of funding or process; or by procedurally accepting input, but substantively ignoring it.

Authorities can also resist public participation and transparency indirectly through the way they make essential information available to the public. For example, mapping authorities that do not wish the potential political consequences of their plans to be easily evaluated in public will not provide election results merged with census geography — although they assuredly use such merged information for internal evaluation of their plans. Redistricting authorities may also purposefully restrict the scope of the information they make available; a number of states chose to release only the boundaries and information related to the approved plan. Another subtle way authorities can hinder transparency is by releasing plans in a non-machine-readable format. An even more subtle, but substantial, barrier is the interface through which representations of plans are made available.

This resistance appears to have been, in large part, effective. Public participation increased by an order of magnitude in the last round of redistricting; however, except in a few exemplary cases, visible direct effects on policy outcomes appear modest. You can find more details in the article.

Aug 11, 7:30am

Data citation supports attribution, provenance, discovery, and persistence. It is not (and should not be) sufficient for all of these things, but it’s an important component. In the last two years, there have been several major efforts to standardize data citation practices, build citation infrastructure, and analyze data citation practices.

This session, presented as part of the Program on Information Science seminar series, examines data citation from an information-lifecycle perspective: what are the use cases, requirements, and research opportunities? It also discusses emerging infrastructure and standardization efforts around data citation.

A number of principles have emerged for data citation — the most central is that data citations should be treated consistently with citations to other objects: data citations should at least provide the minimal core elements expected in other modern citations; should be included in the references section along with citations to other works; and should be indexed in the same way.
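As a rough illustration of those minimal core elements, a data citation can be assembled from the same fields a reference manager would index for any other work. The element set and formatting below are my own sketch, not a prescribed standard:

```python
# Assemble a data citation from minimal core elements.
# The element set and rendering below are illustrative, not a standard.

def format_data_citation(author, year, title, version, repository, identifier):
    """Render a data citation as a reference-list style string."""
    return (f"{author} ({year}). {title} [Data set], "
            f"Version {version}. {repository}. {identifier}")

print(format_data_citation(
    author="Smith, J.",
    year=2013,
    title="Example Survey of Public Opinion",
    version="2.1",
    repository="Example Data Archive",
    identifier="doi:10.0000/example.1234",
))
# Smith, J. (2013). Example Survey of Public Opinion [Data set],
#   Version 2.1. Example Data Archive. doi:10.0000/example.1234
```

Because the output mirrors a conventional reference entry, such citations can sit in the references section and be indexed alongside citations to articles and books.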

Adoption of data citation by journals can provide positive and sustainable incentives for more reproducible science and more complete attribution. This would act to brighten the dark matter of science — revealing connections among evidence bases that are not now visible through citations of articles alone.

Jul 30, 11:08am

Digital stewardship is vital to the authenticity of public records, the reliability of scientific evidence, and the enduring accessibility of our cultural heritage. Knowledge of ongoing research, practice, and organizational collaborations has been distributed widely across disciplines, sectors, and communities of practice. A few days ago I was honored to officially announce the NDSA’s National Agenda for Digital Stewardship at Digital Preservation 2013. The Agenda identifies the highest-impact opportunities to advance the state of the art, the state of practice, and the state of collaboration within the next 3-5 years.

The 2014 Agenda integrates the perspective of dozens of experts and hundreds of institutions, convened through the Library of Congress. It outlines the challenges and opportunities related to digital preservation activities in four broad areas: Organizational Roles, Policies, and Practices; Digital Content Areas; Infrastructure Development; and Research Priorities.

Slides and video of the short (5-min) talk below:

Read the full report here:

Jul 25, 4:49pm

This presentation, invited for a workshop on data preservation for open science held at JCDL 2013, gives a brief tour of a large topic: how we understand the types of data and software used in social science research. In the presentation I characterize the intellectual landscape across nine dimensions of data structure, content, measure, and use, and then apply this framework to three interesting use cases.

This illustrates some particular challenges for long-term access to and replication of social science research, including the use of “messy” human sensors; the wide mix of data types, structures, and sparsity; complex legal constraints; pervasive use of manual and computer-aided coding; reliance on niche commercial software and bespoke software; and very long-term access requirements.

Jul 19, 5:43pm

MIT has a wonderful tradition of offering a variety of short courses during the Winter hiatus, known as IAP. The MIT Libraries generally offers dozens of courses on data and information management (among other topics) during this time. Moreover, the Libraries also hold bonus IAP sessions in April and July.

So, for this year’s JulyAP, I updated a tutorial on managing confidential data, continuing to integrate the things we’ve learned in our privacy research project. This short course focuses on the practical side of managing confidential research data, but gives some pointers to the research areas.

The slides are below:

Jul 07, 9:23am

My collaborator Michael McDonald and I are now analyzing the data that resulted from the crowd-sourced participatory electoral mapping projects we were involved in, and others. Our earlier article, analyzing redistricting in Virginia, established that members of the public are capable of creating legal redistricting plans, and in many ways performed better than the legislature in doing so.

Our latest analysis, based on data from Congressional redistricting in Florida, reinforces this finding. Furthermore, it highlights some of the limits of reform efforts, and the structural trade-offs among redistricting criteria.

The reforms to the process in Florida, catalyzed by advances in information technology, enabled a dramatic increase in public participation in the redistricting process. This reform process can be considered a partial success: the adopted plan implements one of the most efficient observable trade-offs among the reformers’ criteria, primarily along the lines of racial representation, by creating an additional Black-majority district in the form of the current 5th Congressional District. This does not mean, however, that reform was entirely successful. The adopted plan is efficient, but it is atypical of the plans submitted by the legislature and the public. Based on the pattern of public submissions, and on contextual information, we suspect the adopted plan was drawn with partisan motivations. The public preference and good-government criteria might be better served by the selection of other efficient plans — ones that were much more competitive and less biased, at the cost of the loss of a majority-minority seat.

Most of these trends can be made clear through information visualization methods. The figure below draws a line representing the Pareto (efficient) frontier in two dimensions to illustrate the major trade-offs between the number of partisan seats and other criteria.

One of the visuals is below — it shows the trade-offs between partisan advantage and different representational criteria, based on the Pareto frontier of publicly available plans.

(Click on the image below to enlarge)


These frontiers suggest that some criteria are more constraining on Democratic redistricters than on Republican redistricters. On the one hand, equipopulation is equally constraining on Democratic and Republican partisan seat creation. However, the data suggest a structural trade-off between Black-majority seats and Democratic seats.
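The frontier plots above rest on a simple computation: a plan is on the Pareto frontier if no other plan is at least as good on both criteria and strictly better on one. A minimal sketch, with made-up plan scores and assuming higher is better on both axes:

```python
# Compute the 2-D Pareto (efficient) frontier of redistricting plans.
# Plan names and scores are hypothetical; assume higher is better
# on both criteria.

def pareto_frontier(plans):
    """Return the plans not dominated by any other plan."""
    frontier = []
    for name, x, y in plans:
        dominated = any(
            (ox >= x and oy >= y) and (ox > x or oy > y)
            for oname, ox, oy in plans if oname != name
        )
        if not dominated:
            frontier.append(name)
    return frontier

# (plan, partisan seats, competitiveness score) -- hypothetical values
plans = [("A", 5, 2), ("B", 4, 4), ("C", 3, 5), ("D", 3, 3), ("E", 4, 1)]
print(pareto_frontier(plans))
# ['A', 'B', 'C']
```

Here plan D is dominated by B (worse on both axes) and E by A, so only A, B, and C lie on the frontier; drawing a line through those undominated plans gives the kind of trade-off curve shown in the figure.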

More details on the data, evaluations, etc. appear in the article (uncorrected manuscript).

Jul 02, 11:01am

The structure and design of digital storage systems is a cornerstone of digital preservation. To better understand the ongoing storage practices of organizations committed to digital preservation, the National Digital Stewardship Alliance conducted a survey of member organizations. This talk, presented as part of the Program on Information Science seminar series, discusses findings from this survey, common gaps, and trends in this area.

(I also have a little fun highlighting the hidden assumptions underlying Amazon Glacier’s reliability claims. For more on that see this earlier post.)