Jan 18, 9:19am

What’s new in managing confidential research data this year?

For MIT’s independent activities periods (IAP) the Program on Information Science regularly leads a practical workshop on managing confidential data.  This is in part a result of research through the Privacy Tools project.  As I was updating the workshop for this semester, I had an opportunity to reflect upon what’s new on the pragmatic side of managing confidential information.

Most notably, because of the publicity surrounding the NSA, more people (and in higher places) are paying attention.  (And as an information scientist I note that one benefit of the NSA scandal is that everyone now recognizes the term “metadata”).

Also, generally, personal information continues to become more available  and  increasingly easy to link information to individuals. New laws, regulations and policies  governing information privacy continue to emerge, increasing the complexity of management. . Trends in information collection and management — cloud storage, “big” data,  and debates about the right to limit access to published but personal information complicate data management, and make traditional approaches to managing confidential data decreasingly effective.

On the pragmatic side, new privacy laws continue to emerge at the state level. Probably the most notable is the California “right to be forgotten”  — for teens. This year California became the  the first state to pass a law  (“The Privacy Rights for California Minors in the Digital World”)  that gives (some) individuals the right to remove (some) content they have posted online.
The California law takes effect next year (Jan 1, 2015) — by which time we’re likely to see new information privacy initiatives in some other states. This year wa are also likely to see the release of specific  data sharing requirements from federal funders (as a result of the OSTP “Holdren Memo”, NIH’s big data to knowledge initiative, and related efforts); from journals and from professional societies. Farther off in the wings looms the possibility of a general right to be forgotten law in the EU; changes to how the “common rule” evaluates information risks and controls (on which subject the NAS recently issued a new set of recommendations); and possible “sectoral” privacy laws targeted at “revenge-porn”, “mug-shot” databases, mobile-phone data, or other issues-de-jour.
This creates an interesting tension and will require increasingly sophisticated approaches that can provide both privacy and  appropriate access.  From a policy point of view one possible way of setting this balance is by using “least restrictive terms” language — the OKF’s open economic principles may provide a viable approach.
In a purely operational sense — the biggest change in confidential data management for researchers is the wider availability of “safe-sharing” services for exchanging research data within remote collaborations:
  • On the do-it-yourself front. The increasing flexibility of the FISMA-certified Amazon Web Services  GovCloud makes running a remote, secure research computing environment easier and more economical. Although this still complex and expensive to maintain, and one still has to trust Amazon — although the FISMA certifications make that trust better justified.
  • The second widely used option — combining file-sharing services like DropBox with encrypted filesystems like TrueCrypt also received a boost this year, with the success of a crowdfunded effort to independently audit the TrueCrypt source. This is good news, and the transparency and verifiability of TrueCrypt is its big strength. The approach  remains limited  in practice to secure publishing of information — it doesn’t support simultaneous remote updates (not unless you like filesystem corruption); multiple keys for different users or portions of the filesystem; key distribution — etc.
  • A number of simpler solutions have emerged this year.
    – Bittorrent Sync provides “secure” P2P replication and sharing based on a secret private key.
    – SpiderOak Hive;; and BoxCryptor all offer zero-knowledge cloud-storage, client-side encrypted data sharing. The ease of use and functionality of these systems for secure collaboration is very attractive compared to the other available solutions. BoxCryptor offers an especially wide a range of enterprise features such as  key distribution, revocation, master and group-key-chaining, and other enterprise features, that would make managing sharing among heterogenous groups easier. However, the big downside is the amount of “magic” in these systems. None are open source, nor are any sufficiently well documented (at least externally) or certified (no FISMA, there) to engender trust among us untrusting folk…  ( Although   SpiderOak in particular seems to have a good reputation for trustworthiness…  and the others no doubt have pure hearts, I’d rest easier with the ability to audit source codes, peer-reviewed algorithms, etc.)

For those interested in the meat of the course, which gives an overview of legal, policy, information technology/security, research design, and statistical pragmatics, the new slides are here:

Jan 08, 2:43pm

In December, my colleagues from NDSA and I had the pleasure of attending CNI to present the 2014 National Agenda for Digital Stewardship and to lead a discussion of priorities for 2015. We were gratified to have the company of a packed room of engaged attendees, who participated in a thoughtful and lively discussion.

For those who were unable to attend CNI, the presentation is embedded below.

(Additionally, the Agenda will be discussed this Spring at NERCOMP, in a session I am leading especially for Higher Education IT leaders; at IASSIST in a poster session, represented by Coordination Committee member Jonathan Crabtree; and at IDCC, in a poster session represented by Coordination Committee member Helen Tibbo.)

Discussions of the Agenda at CNI were a first step in the input gathering for the next version of the Agenda. In January, NDSA will start an intensive and systematic process of revising the Agenda for priorities in 2015 and beyond. We expect to circulating these revisions for peer and community review in April and present a final or near-final version (depending on review comments) at the annual Digital Preservation conference in July.

Part of the discussions at our CNI session echoed selected themes in Cliff Lynch’s opening plenary “Perspective” talk, particularly his statements that:

  • We [as a stewardship community] don’t know how well we’re doing with our individual preservation efforts, in general. — We don’t have an inventory of the class of content that is out there, what is covered, and where the highest risks are.

  • There is a certain tendency to “go after the easy stuff”, rather than what’s at risk – our strategy needs to become much more systematic.

In our discussion session these questions were amended and echoed in different forms:

  • What are we doing in the stewardship community, and especially what are we doing well?

  • What makes for collaborative success, and how do we replicate that?

I was gratified that Cliff’s questions resonated well with the summary we’d articulated in the current edition of the National Agenda. The research section, in particular, lays out key questions about information value, risk assessment, and success evaluation, and outlines the types of approaches that are most likely to lead to the development of a systematic, general evidence base for the stewardship community. Moreover, the Agenda calls attention to many examples of things we are doing well.

That said, a question that was posed at our session, and that I heard echoed repeatedly at side conversations during CNI, was “Where are we (as a group, community, project, etc.)  getting stuck in the weeds?”

This question is phrased in a way that attracts negative answers — a potentially positive and constructive rephrasing is: What levels of analysis are most useful for the different classes of problems we face in the stewardship community?

As an information scientist and a formally (and mathematically) trained social scientist, I tend to spend a fair amount of time thinking about and building models of individual, group, and institutional behaviors, tracing the empirical implications of these models, and designing experiments (or seeking natural experiments) that have the potential to provide evidence to distinguish among competing plausible models. In this general process, and in approaching interventions, institutions, and policies generally, I’ve found the following levels of abstraction perennially useful:

The first level of analysis concerns local engineering problems, in which one’s decisions neither affect the larger ecosystem nor provoke strategic reactions by other actors. For example, the digital preservation problem of selecting algorithms for fixity, replication, and verification to yield cost-effective protection against hardware and media failures is best treated at this level in most cases. For this class of problem, the tools of decision theory [1] (of which “cost-benefit” analysis is a subset), economic comparative statics, statistical forecasting, monte-carlo simulation, and causal inference [2] are helpful.

The second level concerns tactical problems, in which other actors react and adapt to your decisions ( e.g., to compete, or to avoid compliance), but the ecosystem (market, game structure, rules, norms) remains essentially fixed. For example, the problem that a single institution faces in setting (internal/external) prices (fees/fines) or usage and service policies; is a strategic one. For tractional problems, applying the tools described above is likely to yield misleading results, and some form of modeling is essential — models  from game theory, microeconomics, behavioral economics, mechanism design, and sociology are often most appropriate. Causal inference remains useful, but must be combined with modeling.

The Agenda itself is not aimed at these two levels of analysis; however, much of the NDSA working groups‘ projects and interests are at the first, local-engineering level:  NDSA publications such as content case studies, the digital preservation in a box toolkit, and the levels of preservation taxonomy may provide guidance for first-level decisions. Many of the the other working group outputs such as storage, web, and staffing surveys, although they do not describe tactical models, do provide baseline information and peer comparisons to inform decisions at the tactical level.

The third level is systems design (in this case legal-socio-technical systems) — in these types of problems, the larger environment (market, game structure, rules, norms) can be changed in meaningful ways.  Systems analysis involves understanding an entire system of simultaneous decisions and their interactions and/or designing an optimal (or at least improved) system. Examples of systems analysis are common in theory: any significant government regulation/legislation should be based on systems analysis. For institutional scale systems analysis, a number of conceptual tools are useful, particularly market design and market failure [3]; constitutional  design [4]; and the co-design of institutions and norms to manage “commons” [5].

Working at this level of analysis is difficult to do well: One must avoid the twin sins of getting lost in the weeds (too low a level of analysis for the problem) and having one’s head in the clouds (thinking at such level of generality that analysis cannot be practically applied, or worse, is vacuous). Both the Agenda and Cliff’s landscape talk are aimed at this level of analysis and manage to avoid both sins to a reasonable degree.

Academics often do not go beyond this level of designing systems that would be optimal (or at least good) and stable if actually implemented. However, it’s exceedingly rare that a single actor (or unified group of actors) has the opportunity to design entire systems at institutional scale– notable examples are the authoring of constitutions, and (perhaps) the use of intellectual property law to create new markets.

Instead, policy makers, funders and other actors with substantial influence at the institutional level are faced with a fourth level of analysis —  represented by the question of “Where do I put attention, pressure, or resources to create sustainable positive change”? And system-design alone doesn’t answer this: Design is essential for identifying where one wants to go, but policy analysis and political analysis are required to understand what actions to take in order to get (closer to) there.

This last question, of the “where do we push now” variety, is what I’ve come to expect (naturally) from my boss, and from other leaders in the field. When pressed, I’ve thus far managed to come up with (after due deliberation) some recommendations (or, at least, hypotheses) for action, but these generally seem like the hardest level of solution to get right, or even to assess as good or bad. I think the difficulty comes from having to have both a coherent high-level vision (from the systems design level) while simultaneously getting back “down into the details” (though not, “in the weeds”) to understand the current arrangements and limitations of power, resources, capacity, mechanism, attention, knowledge, and stakeholders.

Notwithstanding, although we started by aiming more at systems design than policy intervention, some recommendations of the policy intervention sort are to be found in the current Agenda. I expect that this year’s revisions, and the planned phases of external review and input, will add more breadth to these recommendations, but that it will require years more of reflection, iteration, and refinement to identify specific policy recommendation across the entire breadth of issues covered by the Agenda at the systems level.


[1] For an accessible broad overview of decision theory, game theory, and related approaches see M. Peterson [2009], An Introduction to Decision Theory . For a classic introduction to policy applications see Stokey & Zeckhauser 1978, A Primer for Policy Analysis.

[2] There are many good textbooks on statistical inference, ranging from the very basic, accessible and sensible Problem Solving by Chatfield (1995) to the sophisticated and modern Bayesian Data Analysis 3rd edition, by Gelman et. al (2013). There are relatively few good textbooks on causal inference — Judea Pearl’s (2009) Causality: Models, Reasoning and Inference 2nd edition is as definitive as a textbook can be, but challenging; Counterfactuals and Causal Inference: Methods and Principles for Social Research, Morgan & Winship’s textbook, is more accessible.

[3] Market failure is a broad topic, and most articles and even books address only some of the potential conditions for functioning markets. A, good, accessible overview is Stiglitz’s Regulation and Failure , but it doesn’t cover all areas. For information stewardship policy the economics of non-consumptive goods is particularly relevant — see Foray, The Economics of Knowledge (2006); and increasing returns and path dependence are particularly important in social/information network economies — see Arthur 1994, Increasing Returns and Path Dependence in the Economy.

[4] See Lijphart, Arend. “Constitutional design for divided societies.” Journal of democracy 15.2 (2004): 96-109. and Shugart, Matthew Soberg, and John M. Carey. Presidents and assemblies: Constitutional design and electoral dynamics. Cambridge University Press, 1992.

[5] The late Lin Ostrom work was fundamental in this area. See for example, Ostrom, Elinor. Understanding institutional diversity. Princeton University Press, 2009.

Dec 13, 9:26am

This talk, presented as part of the the Program on Information Science seminar series,  reflects on lessons learned about open data, public participation, technology, and data management from conducting crowd-sourced election mapping efforts.

This topic is discussed in detail in previous posts, most recently here .

We’ve also written quite a few articles, book chapters, etc. on the topic.

Nov 23, 9:09am

My colleague, Nancy McGovern,  who is Head of Curation and Preservation services presented this  as part of the Program on Information Science Brown Bag Series.

DIPIR  employs qualitative and quantitative data collection to investigate the reuse of digital data in  quantitative social sciences, archaeology, and zoology. It’s main research focus is on significant properties.

The team has also recently published an evaluation of the perception of researchers about what constitutes a trustworthy repository.  In DIPIR’s  sample, perceptions of trust were influenced by transparency, metadata quality, data cleaning, and  reputation with colleagues. Notably absent are such things as certifications, sustainable business models, etc. Also, as in most studies of trust in this area,  the context of “trust”  is left open — the factors that make an entity trustworthy as a source of  information are different than those that might make cause one to trust an entity to preserve deposited with it. Since researchers tend to use data repositories for both, its difficult to tease these apart.

Oct 31, 3:05pm

Personal information is ubiquitous and it is becoming increasingly easy to link information to individuals. Laws, regulations and policies  governing information privacy are complex, but most intervene through either access or  anonymization at the time of data publication.

Trends in information collection and management — cloud storage, “big” data,  and debates about the right to limit access to published but personal information complicate data management, and make traditional approaches to managing confidential data decreasingly effective.

This session presented as part of the the Program on Information Science seminar series,  examines trends information privacy. And the session will also discuss emerging approaches and research around managing confidential research information throughout its lifecycle. This is in part a result of research through the Privacy Tools project.

Sep 14, 1:01pm

Much of what we know about scholarly communication and the “science of science” relies on  the scholarly record”of journal publications, monographs, and books;  and  upon the patterns of findings, evidence, and collaborations that analysis of this record reveals. In contrast, research data, in its current state, represents a type of  ’scholarly dark matter’ that underlies the current visible evidentiary relationships among publications. Improved data citation practices have the potential to make this dark matter visible.

Yesterday the Data Science Journal  published a special issue devoted to data citations: Out of Cite, Out of Mind: The Current State of Practice, Policy, and Technology for the Citation of Data. This  is a comprehensive review of data citations principles, practices, infrastructure, policy and research. And I’m very pleased to have contributed to writing and researching  this document  as part of the  CODATA-ICSTI Task Group on Data Citation Standards and Practices. 

This is a rapidly evolving area, and representatives from the CODATA-ICSTI task group,  Force 11, the Research Data Alliance and a number of other groups, have formed a  synthesis group  which is developing an integrated statement of principles  to promote broad adoption of a consistent policy for data citation across disciplines and venues.

Sep 05, 3:21pm

My collaborator Michael McDonald  and I have been analyzing the data that resulted from the crowd-sourcing participative electoral mapping projects we were involved in and other public redistricting efforts, and this blog includes two earlier articles from this line of research. In this research article, to appear in the Proceedings of the 47th Annual Hawaii International Conference on System Sciences (IEEE/Computer Society Press) we reflect on initial lessons learned about public participation and technology from the last round of U.S. electoral mapping.

Three major factors influenced the effectiveness of efforts to increase public input into the political process through crowdsourcing. First, open electoral mapping tools were a practical necessity to enable substantially greater levels increase public participation.  Second, the interest and capacity of local grassroots organizations was critical to catalyzing the public to engage using these tools.  Finally, the permeability of government authorities to public input was needed for such participation to have a significant effect.

The impermeability of government to public input in a democratic state can take a number of more-or-less subtle forms, each of which was demonstrated in the last round of electoral mapping: Authorities blatantly resist public input by providing no recognized channel for it; or by creating a nominal channel, but leaving it devoid of funding or process; or procedurally accepting input, but substantively ignoring it

Authorities can also resist public participation and transparency indirectly through the way they make essential information available to the public. For example, mapping authorities that do not wish to have potential political consequences of their plans easily evaluated publicly will not provide election results merged with census geography — although they assuredly use such merged information for internal evaluation of their plans.  Redistricting authorities may purposefully restrict the scope of the information they make available. For example, a number of states chose to make available boundaries and information related to the approved plan only. Another subtle way by which authorities can hinder transparency is by releasing information plans in a non-machine readable format. An even more subtle, but substantial barrier is the interface through which representations of plans are made available.

This resistance appears to have been in large part, effective. Public participation increased by an order of magnitude in the last round of redistricting. However, except in a few exemplary cases, visible direct effects on policy outcomes appears modest. You can find more details, in the article.

Aug 11, 7:30am

Data citation supports  attribution, provenance, discovery, provenance, and persistence. It is not (and should not be) sufficient for all of these things, but its an important component. In the last 2 years, there have been several major efforts to standardize data citation practices, build citation infrastructure, and  analyze data citation practices.

This session presented as part of the the Program on Information Science seminar series,  examines data citation from an information lifecycle approach: what are the use cases, requirements and research opportunities. And the session will also discuss emerging infrastructure and standardization efforts around data citation.

A number of principles have emerged for citation — the most central is that data citations should be treated consistently with citations to other objects:Data citations should at least provide the minimal core elements expected in other modern citations; should be included in the references section along with citations to other elements; and indexed in the same way.

Adoption of data citation by journals can  provide positive and sustainable incentives for more reproducible science and more complete attribution. This would act to brighten the dark matter of science — revealing connections among evidence bases that are not now visible through citations of articles.

Jul 30, 11:08am

Digital stewardship is vital for the authenticity of public records, the reliability of scientific evidence, and the enduring accessibility to our cultural heritage. Knowledge of ongoing research, practice, and organizational collaborations has been distributed widely across disciplines, sectors, and communities of practice.  A few days ago I was  honored to officially announce the NDSA’s National Agenda for Digital Stewardship at Digital Preservation 2013.  This identifies the highest-impact opportunities to advance the state of the art; the state of practice; and the state of collaboration within the next 3-5 years.

The 2014 Agenda integrates the perspective of dozens of experts and hundreds of institutions, convened through the Library of Congress. It outlines the challenges and opportunities related to digital preservation activities in four broad areas: Organizational Roles, Policies, and Practices; Digital Content Areas; Infrastructure Development; and Research Priorities.

Slides and video of the short (5-min) talk below:

Read the full report here:

Jul 25, 4:49pm

This presentation, invited for a workshop  on data preservation for open science,  held at JCDL 2013, gives a brief tour of a large topic — how do we understand the types of data and software used in social science research. In this presentation I characterize the intellectual landscape  across 9 dimensions of data structure, content, measure, and use. I then use this framework to characterize  three interesting use cases.

This illustrates some particular challenges for long-term access to and replication of social science research, including the use of “messy” human sensors; the wide mix of data types, structures, sparsity; complex legal constraints; pervasive use of manual and computer-aided coding; use of niche commercial software and bespoke software; and very long-term access requirements.