Feb 13, 10:39am

Kendra Albert, who has served as a research associate at Harvard Law School, as an intern at the Electronic Frontier Foundation, and as a fellow at the Berkman Center for Internet & Society, and who is now completing her J.D. at Harvard Law, presented this talk as part of the Program on Information Science Brown Bag Series.

Kendra brings a fresh perspective developed through collaborating with librarians and archivists on projects such as EFF’s response to DMCA 1201 and our PrivacyTools project.

In her talk, Kendra discusses the intersection of law, librarianship, and advocacy, focusing on the following question:

Archival institutions and libraries are often on the front lines of battles over ownership of digital content and the legality of ensuring copies are preserved. How can institutions devoted to preservation use their expertise to advocate for users? 


A number of themes ran through Kendra’s presentation:

  • Libraries have substantial potential to affect law and policy by advocating for legal change
  • Libraries enjoy a position of trust as an information source, and as an authority on long-term access for posterity
  • Intellectual property law that is created for the purpose of limiting present use may have substantial unintended consequences for long-term access and cultural heritage

Reflecting on Kendra’s talk, and on the subsequent discussions…

The courts have sometimes recognized preservation as having value: explicitly in formulating DMCA exceptions, and implicitly, in adopting … But the gap between the private value of the content to its controller in the short term and its value to the public in the long term is both a strength and a weakness for preservation efforts.

For example, Kendra’s talk noted that the lack of a market for older games is an important factor in determining that distribution of that content is fair use, which works in favor of preservation. The talk also mentioned that game companies’ short-term focus on the next release was a barrier to collaborating on preservation activities. These points seem to me connected: the companies would become interested if there were a market, but this would, in turn, weaken the fair use consideration. Effective public preservation efforts must walk a tightrope: supporting access and use that is of value, while neither impinging on private value in the short term nor creating so much of a market for access that there is political pressure to re-privatize the market.

Furthermore, it is well recognized that institutional legal counsel tends to be conservative, both to minimize risks to the institution as a whole and to avoid the risk of setting precedent with bad cases. It is clear from Kendra’s talk that librarians dealing with projects that use intellectual property in new ways should both engage with their institution’s legal counsel early in the process and have some independent legal expertise on the library team in order to generate possible new approaches.

For more information you can see some of the outputs of Kendra’s work here:


Jan 07, 12:44pm

Marguerite Avery, who is a Research Affiliate in the program, reflects on changes in the research and scholarly publishing ecosystem.

Some thoughts on “Shaking It Up: How to Thrive in – and Change – the Research Ecosystem” some weeks later – proof that we’ve embraced the need for change and of just how far the conversation has evolved

When I got together with Amy Brand (Digital Science) and Chris Erdman (Center for Astrophysics library at Harvard) in August to kickstart planning for the workshop that would become Shaking It Up, our goal was to continue an earlier conversation started by Tony Hey and Lee Dirks (Microsoft Research) at a 2011 workshop on eScience, Transforming Scholarly Communication. Its objective, according to my notes, was to present a roadmap for action with a four- to five-year time horizon, specifying recommendations for the whole scholarly communication ecosystem. With the halfway mark already passed, this was an opportune time to check in, take stock of the evolving scholarly communication ecosystem, and raise key questions: How are we doing? Where are we on this path? What’s working? What’s still broken? What progress have we made?


Providing definitive answers to these open-ended questions was decidedly out of scope, so we focused on business models – one of the major obstacles in the evolution of the scholarly communication ecosystem. The willingness to consider alternatives to traditional models, coupled with the proliferation of startups in this space, demonstrated some progress in changing the attitudes and expectations of scholars and researchers, and of other stakeholders. Yet the greater institutional forces lagged behind with a willful yet fully understandable deference to an aging knowledge infrastructure. Our theme would be hacking business models, and we set out to assemble those people in scholarly communication who landed somewhere between the disruptive innovation of Clay Christensen and the creative destruction of Joseph Schumpeter. Our delegates would report on new pathways, models, and formats in scholarly communication, and we could get a snapshot of our progress at this midpoint check-up.


The lapsed historian in me insists that we cannot assess our current standpoint without some historical context, thus a recap of the 2011 meeting follows. Transforming Scholarly Communication was an ambitious event in scope and presentation. The workshop spanned three days, with one each devoted to raising issues, demonstrations from the field, and the drafting of a report. Invited participation was capped at 75, and participants were assigned to one of six groups: platforms, resources, review, recognition, media, and literature. (See the table below for descriptions of each group.) The opening remarks from Phil Bourne (as gleaned from a rough transcript on the workshop Tumblr) focused on the benefits of open science beyond the (already contentious) end result of free content. Much more provocative was a shift in thought from considering scholarly communication simply as a product to treating it as a process – and a complex one at that. Bourne stated: “Open science is more tha[n] changing how we interface with the final product; it is interfacing with the complete scientific process – motivation, ideas, hypotheses, experiments to test the hypotheses, data generated, analysis of that data, conclusions, and awareness.” With this refocusing on processes, collaboration becomes visible and thus can be assessed and valued.

This emphasis on action bleeds over from scholarly communication itself into the necessity of cultivating a movement to effect change. “We need to define ways to recruit a movement – it will take more than tools to do so – are there clear wins for all concerned? If so, what are they? Platforms to disseminate scholarship, new reward systems, knowledge discovery from open access content…” Although it was not identified as such, Bourne gave recognition to the scholarly communication ecosystem. This allows us to think of new “products” and to pay attention to the infrastructure. The revolutionary aspect of his proposal wasn’t so much the free content, but what we consider to be content and how it would be published.

The six categories – platforms, media, literature, review, resources, and recognition – ran the gamut of the processes (and products) of the scholarly communication ecosystem. Each group was to consider the following issues: essential elements in successful systems, failure modes of unsuccessful experiments, promising future technologies, key unsolved problems, and interdependencies with other topics. (The notes from each section are accessible in full on the Tumblr page.) The focus and recommendations varied with each topic: the Resources group, with its extensive listing of tools for each stage of the research process, was particularly concerned with the differences between formats and the lack of universal search capabilities across platforms and topics; the Platforms group lamented the low participation rates for developing tools and worried over the blurring between platforms for personal and professional use; and the Recognition group cautioned that new tools should augment rather than supplant established forms of published communication.


Theme         Description
Platforms     project collaboration software, “smart” laboratory software, provenance systems
Media         production, distribution, archiving (e.g. video, 3-D modeling, databases)
Literature    publications based on text and still images (including creation, reviewing, dissemination, archiving, reproducibility)
Review        standard publication-based systems, alternative rating systems, etc.
Resources     seamless technologies for literature and data (literature/data search engines; cloud-based, group sharing, adjustable permissions, integration with search)
Recognition   how can we best enable cooperation and adoption?

Six themes for 2011 eScience workshop


As one of only three traditional publishers in the room, I was focused on our role within scholarly communication – what was our value-add to the process? I identified four functions – authority (how do we know whom to trust?), discoverability (how will your work find its audience?), recognition (will your community acknowledge your contribution as valid?), and community/collaboration (will your audience engage with your work?). The scholarly communication ecosystem generates an enormous volume of content, from the granular and descriptive (data sets and procedures / processes), through the observational (tweets, blog posts), to the reflective and declarative (journal articles and books with hypotheses/arguments). It clearly doesn’t make sense for traditional publishers to participate in every stage; however, that doesn’t mean this content should not be published. And by published, I mean made readily available for peer critique and consumption. (Which leads into a type of peer review, with readers assessing and determining value.)

As I had been one of three participants assigned to float from group to group, I was charged with making a summary statement. My closing remarks echoed Bourne’s identification of a more holistic approach to scholarly communication in terms of product and process. “Clearly scholarly communication needs to move beyond words and print. Fields of inquiry have changed dramatically since the dawn of scholarly publishing hundreds of years ago. Research encompasses subjects requiring massive data sets, highly complex procedures, and massive collaboration.” I advocated for the inclusion of “ancillary material” – all of the content that we could neither print nor incorporate into traditional texts and instead push to the web, such as data sets, video, audio, workflows and processes, color images, and dynamic processes – and acknowledged these would require a new model of authoritative digital publication.

And despite the frustration voiced by many participants with the shortcomings of the current publishing process, an indelible reverence remained for the work of publishers.  “As Sharon Traweek reminds us, the scholarly journal article has remained a stable, enduring standard of scholarship. She wisely reminds us of the difference between documenting process and documenting results. David Shotton echoed this, by describing a peer reviewed journal article as a line in the sand upon which knowledge is built and captures a scientist’s understanding of his or her work at a particular moment. Perhaps we need to distinguish between scholarship and research, with scholarship being the codified results, and research representing the process and observations. And while these tools are two halves of the same whole, each requires a different approach and different tools to reach different audiences.” At this meeting, I caught a glimpse of an evolving role for scholarly publishers, but without a clear path forward.


How the conversation has evolved over the last three years.


Shaking It Up illuminated a number of specific issues for us to consider in scholarly communication reform, but how do these compare from 2011 to 2014? (I won’t offer a summary of the workshop – see accounts from Roger Schonfeld and Lou Woodley; and the Digital Science report will be released soon.) For starters, the emphasis on infrastructure increased dramatically. Whereas in 2011 the word infrastructure appears only three times(!!!) in the meeting documentation, it undergirds the entire 2014 discussion as well as being the subject of the keynote. In “Mind the Gap: Scholarly Cyberinfrastructure is the Third Rail of Funders, Institutions, and Researchers: Why Do We Keep Getting It Wrong and What Can We Do About It?” CrossRef’s Geoff Bilder offered energetic provocations on the flaws in scholarly communication cyberinfrastructures (“Why are we so crap at infrastructure?”). He lamented the inclination of scholars and researchers to forge their own systems specifically tailored to their research agendas and outputs at the expense of interoperability and persistence. While these forays represent brief moments of progress and do serve to push the conversation and expectations, the unique systems (and their languishing trapped content) ultimately do not meet the broader needs of the scholarly and research communities over time. He advocated for a shared, interoperable platform created by professionals who think about infrastructure. Overall this demonstrates a significant change in how we think about our research objects and environment.


(It’s difficult not to view this as a metacommentary on the evolution of scholarly communication itself, specifically with original digital publications. These often suffer a similar fate due to the particular creation and design carefully tied to the specific research project; while this may yield a brilliant instance of digital publication, its long-term fate is tenuous due to a highly specialized format, platform, and/or mode of access.)

Another welcome departure was the level of community engagement on scholarly communication issues. If the 2011 meeting participants considered themselves representative of only 10% of their communities, they would no longer hold a minority opinion or awareness. The importance of open access, intellectual property, and copyright is recognized across disciplines, if still unevenly.  And not only are individuals and thematic communities taking on these issues, but so are institutions. The number of scholarly communication officers (or dedicated offices) at universities is growing, as is the adoption of open access mandates and the use of institutional repositories on campuses. (And of course funding agencies and other sponsoring organizations have long been sympathetic to open access.) Although the depth of knowledge and understanding of such issues needs continued improvement, these concerns are no longer relegated to the fringe but have become scholarly mainstream.

And as our more science-themed meeting unfolded in Cambridge, MA and virtually, other discussions were happening in parallel. Leuphana University’s Hybrid Publishing Lab in Lüneburg, Germany hosted the Post-Digital Scholar Conference. This humanities-focused meeting “focused on the interplay between pixels and print, and discussed open and closed modes of knowledge, in order to seek out what this elusive thing could be: post-digital knowledge.” It’s fair to say that many scholars and researchers agree with the need for change and on the areas in which change is most needed (for example, see this previous talk for thoughts on where change is needed in university publishing). As many scholars grow increasingly comfortable experimenting with these possibilities, the level of concern becomes more granular as we reflect upon the experience of these new possibilities and how they resonate in the professional space. What are the barriers here?


Addressing Access, Attribution, Authority – and Format  


While the original six themes [platforms | media | literature | review | resources | recognition] proved incredibly useful for discovery and early identification of the issues and challenges of transforming scholarly communication, I’ve sharpened my categories of analysis. Based on my many conversations with scholars and researchers over the years on the publication of digital scholarship and new formats, I’ve determined that the concerns and issues facing new digital publication formats fall across three categories: access, attribution, and authority. If the problem is that traditional scholarly publishing cannot accommodate research objects and research results in the digital space, then why haven’t we experienced the groundswell of creative activity and innovation we’ve seen in other areas of scholarly communication such as search, sharing, and collaboration? The barriers to participation – setting aside the obvious issues such as time, technical skill, and/or resources – are access, attribution, and authority. In other words, innovative digital publications are viewed as a professional gamble, which only a few brave souls are willing to take at this time due to the current affordances of the scholarly communication ecosystem.


Let me say more about these three points:

  • To elaborate on access: scholars are legitimately concerned by the prospect of creating an innovative digital publication only to have it inaccessible to its audience; this could happen either immediately or over time as a proprietary platform goes fallow due to an end to funding, the creator/keeper of the platform moving on to new projects, and libraries – the preservation endpoint – being unequipped and unprepared to handle these digital orphans. Bilder’s infrastructure concerns speak precisely to this point.
  • To elaborate on attribution: scholars publish work not only to share their research results and to advance the conversations in their fields and beyond, but also, and of equal importance, to be acknowledged as the authors of the work: to receive credit for a publication in the eyes of the academy.
  • To elaborate on authority: speaking of receiving credit from one’s peers and adjudicators (e.g. tenure and promotion committees), a publication carries more weight if published with an established press of solid reputation. Of course these presses are not publishing anything beyond standard books and journal articles, and most anything digital is really just electronic – print content marked up for reading on a device – and not truly embracing the affordances of the digital space.


I was gratified to hear echoes of these points throughout the day. Accessing Content: New Thinking / New Business Models for Accessing Research Literature addressed challenges of new formats and access barriers (e.g. Eric Hellman of Gluejar raised the problem of libraries including open access titles within their catalogs, alongside presentations from ReadCube and Mendeley). Attribution and authority were themes within the Measuring Content: Evaluation, Metrics, and Measuring Impact panel, which grappled with these issues in developing tools for alternative assessment (e.g. altmetrics), with presentations from UberResearch and Plum Analytics. The last panel, Publishing Content: The Article of the Future, offered alternative visions for the future of scholarly communication (albeit while still adhering to the book-journal binary with ‘article’ of the future). These participants posed the greatest challenge to traditional format expectations by embracing the affordances of the web: Authorea and Annotopia, among others, offer tools that mesh readily with aspects of the existing infrastructure while simultaneously threatening other elements of the scholarly communication ecosystem. And this is what we’d hoped to accomplish with Shaking It Up – to identify an array of business models and the possibilities for changing existing structures and/or developing new ones to accommodate the changing needs of scholarship.

Where do we go from here?

So while the observation that so much has changed and yet so much remains the same seems cliché, there really couldn’t be another analysis when you think about the disparate factors at play: scientific research (communities, practices, and research results) is evolving at warp speed, while university administrations and scholarly publishing entities have an entrenched commitment to the persistence of traditional content dissemination, from the product to the supporting infrastructure for publishing, which undergirds the tenure and promotion process. I applaud the incremental change pushed by start-ups and distinct projects exploring facets of this massive issue – and it is a massive issue with an infrastructure, many moving parts, and continually evolving research methods and results – as these demonstrate the changing needs of the communities, that this change is embraced by users, and that other possibilities exist beyond the established systems. But it’s not enough. To move the needle, we need a critical mass of scholars publishing authoritative digital publications in an array of formats. Otherwise, these stay on the fringe. Just like that guy using cooking oil as automobile fuel.


And how can we change these systems? Stay tuned for the next blog post.

Dec 22, 9:30am

My colleague, Jana Dambrogio, Thomas F. Peterson (1957) Conservator, MIT Libraries, presented this talk as part of the Program on Information Science Brown Bag Series. Jana is an expert in physical preservation, having worked in the preservation field for 15 years as a conservator, consultant, and teaching professional at the US National Archives, the United Nations, and the Vatican Secret Archives. We are now pleased to have her at MIT.

In her  talk, below, Jana discusses two research projects to preserve artifactual knowledge in MIT Libraries’ special collections —  including work to reengineer ‘letterlocking’ methods and the  broken spines of historical books.

A number of themes ran through Jana’s presentation:

  • Conserving physical objects requires ensuring that their integrity is maintained for access and interpretation.
  • Effective conservation may require applied research to reengineer the original processes used to produce historical works, in order to understand what information was conveyed by the choice of that process.

Reflecting on Jana’s talk, I see connections between her physical conservation research and information science more generally…

First, the information associated with a work is not simply that embedded within it, although the embedded information is often the focus when creating digital surrogates. Both digital and physical objects may carry with them information about their method of production, provenance, security, authenticity, history of use, and affordances of use. It is useful to model each of these types of information, even if one chooses not to spend equal amounts of effort in capturing each.

Second, new fabrication technologies, such as 3-D printing, are making the boundaries between physical and digital more permeable. Patrons may learn of the affordances of a work through a fabricated surrogate, for example. Furthermore, the scanning and digitization processes that are being used in association with rapid fabrication may also be used in conservation practice as part of the reengineering process — Jana’s presentation describes working with researchers at MIT to do just this…

Finally, collaboration with educational and research users is increasingly important in understanding the potential for information associated with each work, and thus in guiding selection and conservation to create a portfolio of works that is likely to be of future educational and research value. As in digital curation, we can’t offer access to everything, for everyone, forever, so modeling the information associated with each work, and its future uses, is critical to making rational decisions.



Dec 08, 3:51pm

Marguerite Avery,  who is a Research Affiliate in the program, presented the talk below  as part of Shaking It Up —  a one-day workshop on the changing state of the research ecosystem jointly sponsored by Digital Science, MIT, Harvard and Microsoft.

For the past ten years, Margy was Senior Acquisitions Editor at The MIT Press where she acquired scholarly, trade, and reference work in Science and Technology Studies, Information Science, Communications, and Internet Studies. She joined the research program in September to collaborate on explorations of new forms of library publishing.

Her talk focuses on current challenges around the accessibility of scholarly content and on a scan of innovative new models that aim to address them.

A number of themes ran through the talk:

  • The two formats published by the vast majority of university presses, books and journals, increasingly compromise the ability of the press to capture and publish modern research.
  • The time to publish is also increasingly out of sync with the pace of research — publication occurs too slowly.
  • Existing business models and price points are a significant barrier for university presses that do wish to move to different formats or more agile publication models

As a follow-on, we are collaborating to analytically unpack the “university press” model, and identify the minimum necessary characteristics for a sustainable publisher of scholarship. Some preliminary thoughts on a short list include:

  • A process to ensure durability of the published work, possibly through supporting organizations such as HathiTrust, the Internet Archive, Portico, SSRN, LOCKSS, or CLOCKSS
  • A mechanism to persistently and uniquely identify works, likely through ISBNs (supported by Bowker) and DOIs (supported by CrossRef)
  • Metadata and mechanisms supporting metadata discoverability, e.g. MARC records, LC catalog entries, WorldCat entries, ONIX feeds
  • Mechanisms for supporting content discoverability and previewing, e.g. through Google Books, Google Scholar, Amazon, Books in Print
  • A business process to broker and process purchases and subscriptions
  • A way to select quality content and to signal the quality of the selection
  • A process to establish and maintain an acquisition pipeline
  • A production workflow
  • Marketing channels

Matching these necessary criteria to new forms of scholarship, which are accompanied by new affordances and barriers, promises to be an interesting and challenging task.

Oct 30, 10:19am

My colleague, Stephen Griffin, who is Visiting Professor and Mellon Cyberscholar at the University of Pittsburgh, School of Information Sciences, presented this talk as part of the Program on Information Science Brown Bag Series. Steve is an expert in Digital Libraries and has a broad perspective on the evolution of library and information science, having had a 32-year career at the National Science Foundation (NSF) as a Program Director in the Division of Information and Intelligent Systems. Steve led the Interagency Digital Libraries Initiatives and International Digital Libraries Collaborative Research Programs, which supported many notable digital library projects (including my first large research project).

In his talk, below, Steve discusses how research libraries can play a key and expanded role in enabling digital scholarship and creating the supporting activities that sustain it.

In his abstract, Steve describes his talk as follows:

Contemporary research and scholarship is increasingly characterized by the use of large-scale datasets and computationally intensive tasks.  A broad range of scholarly activities is reliant upon many kinds of information objects, located in distant geographical locations expressed in different formats on a variety of mediums.  These data can be dynamic in nature, constantly enriched by other users and automated processes.

Accompanying data-centered approaches to inquiry have been cultural shifts in the scholarly community that challenge long-standing assumptions that underpin the structure and mores of academic institutions, as well as to call into question the efficacy and fairness of traditional models of scholarly communication.  Scholars are now demanding new models of scholarly communication that capture a comprehensive record of workflows and accommodate the complex and accretive nature of digital scholarship.  Computation and data-intensive digital scholarship present special challenges in this regard, as reproducing the results may not be possible based solely on the descriptive information presented in traditional journal publications.  Scholars are also calling for greater authority in the publication of their works and rights management.

Agreement is growing on how best to manage and share massive amounts of diverse and complex information objects.  Open standards and technologies allow interoperability across institutional repositories.  Content level interoperability based on semantic web and linked open data standards is becoming more common.   Information research objects are increasingly thought of as social as well as data objects – promoting knowledge creation and sharing and possessing qualities that promote new forms of scholarly arrangements and collaboration.  These developments are important to advance the conduct and communication of contemporary research.  At the same time, the scope of problem domains can expand, disciplinary boundaries fade and interdisciplinary research can thrive.

This talk will present alternative paths for expanding the scope and reach of digital scholarship and robust models of scholarly communication necessary for full reporting.  Academic research libraries will play a key and expanded role in enabling digital scholarship and creating the supporting activities that sustain it.  The overall goals are to increase research productivity and impact, and to give scholars a new type of intellectual freedom of expression.

From my point of view, a number of themes ran through Steve’s presentation:

  • Grand challenges in computing have shifted focus from computing capacity to managing and understanding information; … and repositories have shifted from simple discovery towards data integration.
  • As more information has become available the space of problems we can examine expands from natural sciences to other scientific areas — especially to a large array of problems in social science and humanities; but
    …  research funding is shifting further away from social sciences and humanities.
  • Reproducibility has become a crisis in the sciences; and
    … reproducibility requires a comprehensive record of the research process and scholarly workflow
  • Data sharing and support for replication still occur primarily at the end of the scientific workflow
    … accelerating the research cycle requires integrating sharing of data and analysis in much earlier stages of workflow, towards a continually open research process.

Steve’s talk includes a number of recommendations for libraries. First and foremost, in my view, is that libraries will need to act as partners with scientists in their research, in order to support open science, accelerated science, and the integration of information management and sharing workflows into earlier stages of the research process. I agree with this wholeheartedly and have made it a part of the mission of our Program.

The talk suggests a set of specific priorities for libraries. I don’t think one set of priorities will fit all research libraries, because pursuit of projects is necessarily, and appropriately, opportunistic, and depends on the competitive advantage of the institutions involved and the needs of local stakeholders. However, I would recommend adding rapid fabrication, scholarly evaluation, crowdsourcing, library publishing, and long-term access generally to the list of priorities in the talk.

Steve’s talk makes the point that libraries will need to act as partners with scientists in their research in order to support accelerated science that integrates information management and reproducibility into earlier stages of the research process.

Oct 27, 8:07pm

This talk, presented as a guest lecture in Ron Rivest’s and Charles Stewart’s class on Elections and Technology, reflects on the use of technology in redistricting, and lessons learned about open data, public participation, technology, and data management from conducting crowd-sourced election mapping efforts.

Some observations:

  • On technical implementation: There is still a substantial gap between the models and methods used in the technology stack and those used in mapping and elections. The domain of electoral geography deals with census and administrative units, legally defined relationships among units, and randomized interventions, whereas GIS deals with polygons, layers, and geospatial relationships. These concepts often map onto each other, with some exceptions, and in political geography one can run into a lot of problems if one doesn’t pay attention to the exceptions. For example, spatial contiguity is often the same as legal contiguity, but not always, and implementing the “not always” part implies a whole set of separate data structures, algorithms, and interfaces.
  • On policy & transparency: We often assume that transparency is satisfied by making the rules (the algorithm) clear, and the inputs to the rules (the data) publicly available. In election technology, however, code matters too: it’s impossible to verify or correct the implementation of an algorithm without the code. The form of the data matters as well: transparent data must contain complete information, in accessible formats, available through a standard API, accompanied by documentation and evidence of authenticity.
  • On policy & participation: Redistricting plans are a form of policy proposal. Technology is necessary to enable richer participation in redistricting, because it enables individuals to make complete, coherent alternative proposals to those offered by the legislature. Technology is not sufficient, however: although the courts sometimes pay attention to these publicly submitted maps, legislatures have strong incentives to act in self-interested ways. Institutional changes are needed before fully participative redistricting becomes a reality.
  • On policy implementation: engagement with existing grass-roots organizations and the media was critical for participation. Don’t assume that if you build it, anyone will come…
  • On methodology: Crowd-sourcing enables us to sample from plans that are discoverable by humans — this is really useful as  unbiased random-sampling of legal redistricting plans is not feasible. By crowd-sourcing large sets of plans we can examine the achievable trade-offs among redistricting criteria, and conduct a “revealed preference” analysis  to determine legislative intent.
  • Ad-hoc, miscellaneous, preliminary observations: Field experiments in this area are hard. There are a lot of moving parts to manage: creating the state of the practice, while meeting the timeline of politics, while working to keep the methodology (etc.) clean enough to analyze later. And always remember Kranzberg’s first law: technology is neither good nor bad; nor is it neutral.
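
To make the “not always” part of the contiguity point concrete, here is a minimal sketch of one way it could be handled. It is an illustration under my own assumptions, not the data model used in our mapping software: the unit names, the `legal_links` list (pairs treated as connected by rule, such as an island attached to the mainland) and the `legal_breaks` list (pairs that touch on the map but do not count as connected) are all hypothetical.

```python
from collections import deque


def merge_adjacency(spatial, legal_links, legal_breaks):
    """Build a legal adjacency map from a spatial one plus exception lists."""
    merged = {unit: set(neighbors) for unit, neighbors in spatial.items()}
    for a, b in legal_links:  # connected by rule, even if not touching
        merged.setdefault(a, set()).add(b)
        merged.setdefault(b, set()).add(a)
    for a, b in legal_breaks:  # touching, but not legally connected
        merged.get(a, set()).discard(b)
        merged.get(b, set()).discard(a)
    return merged


def is_contiguous(units, adjacency):
    """Breadth-first search: True if all units form one connected group."""
    units = set(units)
    if not units:
        return True
    start = next(iter(units))
    seen = {start}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for neighbor in adjacency.get(current, ()):
            if neighbor in units and neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen == units


# Hypothetical units: A, B, C touch in a row; "island" touches nothing.
spatial = {"A": {"B"}, "B": {"A", "C"}, "C": {"B"}, "island": set()}
legal = merge_adjacency(
    spatial,
    legal_links=[("C", "island")],  # assumed rule: island belongs with C
    legal_breaks=[("A", "B")],      # assumed rule: A and B meet only at a point
)

print(is_contiguous(["B", "C", "island"], spatial))  # spatial answer
print(is_contiguous(["B", "C", "island"], legal))    # legal answer
```

Even in this toy version, the exceptions need their own inputs and their own merge step, which is the point of the observation above.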

We’ve also written quite a few articles, book chapters, etc. that expand on many of these points.


Oct 03, 10:30am

Personal information continues to become more available, increasingly easy to link to individuals, and increasingly important for research. New laws, regulations and policies governing information privacy continue to emerge, increasing the complexity of management. Trends in information collection and management (cloud storage, “big” data, and debates about the right to limit access to published but personal information) complicate data management, and make traditional approaches to managing confidential data decreasingly effective.

The slides below provide an overview of the changing landscape of information privacy, with a focus on the possible consequences of these changes for researchers and research institutions.
This talk was originally presented as part of the Program on Information Science Brown Bag Series.

Across the emerging examples of big data and privacy, a number of different challenges recur that appear to be novel to big data, and which drew the attention of the attending experts. In our privacy research collaborations we have started to assign names to these privacy problems for easy reference:

  1. The “data density” problem — many forms of “big” data used in computational social science measure more attributes, contain more granularity and provide richer and more complex structure than traditional data sources. This creates a number of challenges for traditional confidentiality protections including:
    1. Since big data often has quite different distributional properties from “traditional data”, traditional methods of generalization and suppression cannot be used without sacrificing large amounts of utility.
    2. Traditional methods concentrate on protecting tabular data. However, computational social science increasingly makes use of text, spatial traces, networks, images and data in a wide variety of heterogeneous structures.
  2. The “data exhaust” problem – traditional studies of humans focused on data collected explicitly for that purpose. Computational social science increasingly uses data that is collected for other purposes. This creates a number of challenges, including:
    1. Access to “data exhaust” cannot easily be limited by the researcher – although a researcher may limit access to their own copy, the exhaust may be available from commercial sources; or similar measurements may be available from other exhaust streams. This increases the risk that any sensitive information linked with the exhaust streams can be reassociated with an individual.
    2. Data exhaust often produces fine-grained observations of individuals over time. Because of regularities in human behavior, patterns in data exhaust can be used to ‘fingerprint’ an individual – enabling potential reidentification even in the absence of explicit identifiers or quasi-identifiers.
  3. The “it’s only ice cream” problem – traditional approaches to protecting confidential data focus on protecting “sensitive” attributes, such as measures of disfavored behavior, or “identifying” attributes, such as gender or weight.  Attributes such as “favorite flavor of ice cream” or “favorite foreign movie” would not traditionally be protected – and could even be disclosed in an identified form. However the richness, variety, and coverage of big data used in computational social science substantially increases the risk that any ‘nonsensitive’ attribute could, in combination with other  publicly available, nonsensitive information, be used to identify an individual. This makes it increasingly difficult to predict and ameliorate the risks to confidentiality associated with release of the data.
  4. The “doesn’t stay in Vegas” problem – in traditional social science research, most of the information used was obtained and used within approximately the same context – accessing information outside of its original context was often quite costly.  Increasingly, computational social science uses information that was shared in a local context for a small audience, but is available in a global context, and to a world audience.  This creates a number of challenges, including:
    1. The scope of the consent, whether implied or express, of the individuals being studied using new data sources may be unclear. In addition, the terms of service and privacy policies of commercial services may not clearly disclose third-party research uses.
    2. Data may be collected over a long period of time under evolving terms of service and expectations.
    3. Data may be collected across a broad variety of locations – each of which may have different expectations and legal rules regarding confidentiality.
    4. Future uses of the data and concomitant risks are not apparent at the time of collection, when notice and consent may be given.
  5. The “algorithmic discrimination” problem – in traditional social science, models for analysis and decision-making were human-mediated. The use of big data with many measures, and/or complex models (e.g. machine-learning models) or models lacking formal inferential definitions (e.g. many clustering models), can lead to algorithmic discrimination that is neither intended by nor immediately discernable to the researcher.
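
The “it’s only ice cream” problem can be illustrated with a small sketch. The records, attribute names and values below are invented for illustration only. The function measures what fraction of records is singled out by a given combination of attributes.

```python
from collections import Counter


def fraction_unique(records, attributes):
    """Fraction of records whose values on `attributes` match no other record."""
    keys = [tuple(record[a] for a in attributes) for record in records]
    counts = Counter(keys)
    unique = sum(1 for key in keys if counts[key] == 1)
    return unique / len(records)


# Hypothetical, individually "nonsensitive" attributes.
records = [
    {"ice_cream": "vanilla", "movie": "Amelie", "zip3": "021"},
    {"ice_cream": "vanilla", "movie": "Ran", "zip3": "021"},
    {"ice_cream": "vanilla", "movie": "Amelie", "zip3": "100"},
    {"ice_cream": "mint", "movie": "Ran", "zip3": "021"},
]

for attrs in (["ice_cream"], ["ice_cream", "movie"], ["ice_cream", "movie", "zip3"]):
    print(attrs, fraction_unique(records, attrs))
```

In this toy dataset, one record in four is unique on ice cream flavor alone, half are unique once the favorite movie is added, and every record is unique on all three attributes. Real datasets are larger, but the direction of the effect is the same: each additional attribute can only split groups further.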

Our forthcoming working papers from the Privacy Tools for Sharing Research Data project explore these issues in more detail.

Sep 23, 2:57pm

My colleague Ben Lewis, who is system architect and project manager for WorldMap, created at the Center for Geographic Analysis at Harvard, presented this talk as part of the Program on Information Science Brown Bag Series. Ben is an expert in GIS systems and platforms and has developed many interesting tools in this area.

In his talk, below, Ben discusses the WorldMap platform, which is claimed to be the largest open source collaborative mapping system in the world, with over 13,000 map layers contributed by thousands of users from around the world. Researchers may upload large spatial datasets to the system, create data-driven visualizations, edit data, and control access. Users may keep their data private, share it in groups, or publish to the world. Ben also discussed current work to create and maintain a global registry of map services, which would take us a step closer to one-stop access for public geospatial data.

A number of themes ran through Ben’s presentation:

  • Space-time coordinates are an organizing facet for a huge variety of human and natural information: everything that happens, happens at a particular time and place.
  • Most of the geospatial web cannot be discovered through standard search engines. A major goal of Ben’s projects is to expose this “dark geoweb”, which he estimates to comprise millions of map layers.
  • Libraries need to be increasingly savvy about space in choosing and developing platforms for discovery and analysis, so that their clients can benefit from advances in GIS services and platforms and geospatial collections.

Sep 17, 10:03am

This talk was sponsored by the MIT Postdoctoral Association with support from the Office of the Vice President for Research.

In the rapidly changing world of research and scholarly communications researchers are faced with a rapidly growing range of options to publicly disseminate, review, and discuss research—options which will affect their long-term reputation. Junior scholars must be especially thoughtful in choosing how much effort to invest in dissemination and communication, and what strategies to use.

In this talk, I briefly discuss a review of bibliometric and scientometric studies of quantitative research impact, a sampling of influential qualitative writings offering advice in this area, and an environmental scan of emerging researcher profile systems. Based on this review, and on professional experience on dozens of review panels, I suggest some steps junior researchers may consider when disseminating their research and participating in public review and discussion.

My somewhat idiosyncratic recommendations fall into three categories (tactical, strategic, and “next steps”):

Tactical Recommendations

  • Identify and use opportunities to communicate:
    • Accept invited talks, where practical
    • Announce when you will be speaking, teaching
    • Share your presentations, writings, and data
  • Create a scholarly identity
    • Obtain an ORCID, domain name, Twitter handle, LinkedIn profile, Google Scholar profile
    • Create a short bio and longer CV
    • Develop a research theme, and signature idea
  • Communicate broadly
    • Publish writings as Open Access when possible
    • Publish data and software as open data and open source
    • Use social media (LinkedIn, Twitter) to announce new publications, teaching, speaking
  • Develop communications skills early
    • Take writing lessons early
    • Take public speaking lessons early
  • Monitor your impact
    • Monitor news, citation, social media metrics, and altmetrics that reflect the impact of your work
    • Keep records
    • Do this systematically, regularly, but not reactively or obsessively
  • Focus on Clarity and Significance
    • Do research that is important to you and that you think is important to the world
    • When writing about your research, work to maximize clarity – including in abstracts, titles, and citations
  • Give credit generously
    • Cite software you use
    • Cite data on which your analyses rely
    • Don’t be afraid to cite your own work
    • Discuss authorship early, and document contributions publicly

Unordered Strategic Recommendations

  • Do research that is important to you and that you think is important to the world
  • Manage your research program – find a core theme, a signature idea, and regularly review comparative strengths, comparative weaknesses, timely opportunities and future threats
  • Collaborate with people you respect, and like working with, start with small steps
  • Take a positive and sustained interest in the work and career of others, this is the foundation of professional networking
  • Make a moderate, but systematic effort to understand and monitor the institutions within which your work is embedded.
  • Identify your core strengths. Build a career around those.
  • Identify the weaknesses that are continual stumbling blocks. Make them good enough.
  • Pay attention to your world: exercise, sleep, diet, stress, relationships
  • Don’t manage your time – manage your life: know your values, choose your priorities, monitor your progress
  • Align your career with your core values

Ten Things to try right now…

Identify yourself 

1.  Register for an ORCID identifier

2. Register for information hubs: LinkedIn, SlideShare, and a domain name of your own

3. Register for Twitter

Describe yourself …
write these and post to your LinkedIn and ORCID profiles

4. Write and share a 1-paragraph bio

5. Describe your research program in 2 paragraphs

6. Create a CV


7. Share (on Twitter & LinkedIn) news about something you did or published; an upcoming event in which you will participate; interesting news and publications in your field

8. Make writing, data, publications, and software available as Open Access (through your institutional repository, SlideShare, Dataverse, FigShare)

Check and record these things regularly, but not too frequently (once a month is enough), and there is no need to react or adjust immediately

9. Set up tracking of your citations, mentions, and topics you are interested in using Google Scholar and Google Alerts.

10. Find your Klout score and H-index.
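
For readers who want to check the number a profile system reports, the H-index is easy to compute by hand: it is the largest h such that h of your publications have at least h citations each. A minimal sketch follows; the citation counts in the example are made up.

```python
def h_index(citations):
    """Largest h such that h publications have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank
        else:
            break
    return h


# Hypothetical citation counts for five papers.
print(h_index([10, 8, 5, 4, 3]))  # → 4
```

Note that the result depends entirely on which corpus of citations is counted, so different services can legitimately report different values for the same person.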

In the full presentation, I show how to gather impact data, review findings from bibliometric research on how to increase impact by choosing titles, venues, and the like; and consider the advice for success given by the scores of books I’ve scanned on this topic.

The full presentation is available here:


Jul 22, 11:13am

To summarize, altmetrics  should  build on existing statistical and social science methods for developing reliable measures. The draft white paper from the NISO altmetrics project suggests many interesting potential action items, but does not yet incorporate, suggest or reference a framework for systematic definition or evaluation of  metrics.

NISO offered a recent opportunity to comment on the draft recommendation on their ‘Altmetrics Standards Project’. MIT is a non-voting NISO member, and I am the current ‘representative’ to NISO. The following is my commentary on the draft recommendation. You may also be interested in reading the other commentaries on this draft.

Response to request for public comments on ‘NISO Altmetrics Standards Project White Paper’

Scholarly metrics should be broadly understood as measurement constructs applied to the domain of scholarship and research (broadly, any form of rigorous enquiry): its outputs, actors, impacts (i.e. broader consequences), and the relationships among them. Most traditional formal scholarly metrics, such as the H-Index, Journal Impact Factor, and citation count, are relatively simple summary statistics applied to the attributes of a corpus of bibliographic citations extracted from a selection of peer-reviewed journals. The Altmetrics movement aims to develop more sophisticated measures, based on a broader set of attributes, and covering a deeper corpus of outputs.

As the Draft aptly notes, in general our current scholarly metrics, and the decision systems around them are far from rigorous: “Unfortunately, the scientific rigor applied to using these numbers for evaluation is often far below the rigor scholars use in their own scholarship.” [1]

The Draft takes a step towards a more rigorous understanding of altmetrics. Its primary contribution is to suggest a set of potential action items to increase clarity and understanding.

However, the Draft does not yet identify the key elements of a rigorous (or systematic) foundation for defining scholarly metrics, their properties, and quality. Nor does the Draft identify key research in evaluation and measurement that provides a potential foundation. The aim of these comments is to start to fill this structural gap.

Informally speaking, good scholarly metrics are fit for use in a scholarly incentive system. More formally, most scholarly metrics are parts of larger evaluation and incentive systems, where the metric is used to support descriptive and predictive/causal inference, in support of some decision.

Defining metrics formally in this way also helps to clarify what characteristics of metrics are important for determining their quality and usefulness.

– Characteristics supporting any inference. Classical test theory is well developed in this area. [2] A useful metric supports some form of inference, and reliable inference requires reliability. [3] Informally, good metrics should yield similar results across repeated measurements of the same purported phenomenon.
– Characteristics supporting descriptive inference. Since an objective of most incentive systems is descriptive, good measures must have appropriate measurement validity. [4] In informal terms, all measures should be internally consistent, and the metric should be related to the concept being measured.
– Characteristics supporting prediction or intervention. Since the objective of most incentive systems is both descriptive and predictive/causal inference, good measures must aid accurate and unbiased inference. [5] In informal terms, the metric should demonstrably be able to increase the accuracy of predicting something relevant to scholarly evaluation.
– Characteristics supporting decisions. Decision theory is well developed in this area [6]: The usefulness of metrics is dependent on the cost of computing the metric, and the value of the information that the metric produces. The value of the information depends on the expected value of the optimal decisions that would be produced with and without that information. In informal terms, good metrics provide information that helps one avoid costly mistakes, and good metrics cost less than the expected cost of the mistakes one avoids by using them.
– Characteristics supporting evaluation systems. This is a more complex area, but the fields of game theory and mechanism design are most relevant. [7] Measures that are used in a strategic context must be resistant to manipulation, by (a) requiring extensive resources to manipulate, (b) requiring extensive coordination across independent actors to manipulate, or (c) incentivizing truthful revelation. Trust engineering is another relevant area: characteristics such as transparency, monitoring, and punishment of bad behavior, among other systems factors, may have substantial effects. [8]
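
The decision-theoretic characteristic can be made concrete with a small worked sketch. Everything in the example is an assumption chosen for illustration: a reviewer decides whether to fund a proposal whose true quality is “strong” or “weak”, and a metric gives a noisy “high” or “low” signal. The value of the metric is the expected payoff of the best decision made with the signal, minus the expected payoff of the best decision made without it.

```python
def expected_value_without_info(prior, payoff):
    """Expected payoff of the single best action under the prior alone."""
    return max(
        sum(prior[state] * payoff[action][state] for state in prior)
        for action in payoff
    )


def expected_value_with_info(prior, payoff, likelihood):
    """Expected payoff when the best action is chosen after seeing the signal."""
    signals = {sig for state in likelihood for sig in likelihood[state]}
    total = 0.0
    for sig in signals:
        # Joint probability of each state together with this signal.
        joint = {s: prior[s] * likelihood[s].get(sig, 0.0) for s in prior}
        total += max(
            sum(joint[s] * payoff[action][s] for s in prior)
            for action in payoff
        )
    return total


# Hypothetical numbers, chosen only to illustrate the calculation.
prior = {"strong": 0.5, "weak": 0.5}
payoff = {
    "fund": {"strong": 10.0, "weak": -10.0},
    "decline": {"strong": 0.0, "weak": 0.0},
}
likelihood = {  # P(signal | state) for a metric that is right 80% of the time
    "strong": {"high": 0.8, "low": 0.2},
    "weak": {"high": 0.2, "low": 0.8},
}

value_of_metric = (
    expected_value_with_info(prior, payoff, likelihood)
    - expected_value_without_info(prior, payoff)
)
print(round(value_of_metric, 6))
```

With these made-up numbers the metric is worth about 3 payoff units per decision, so under this framing it is worth computing only if it costs less than that. A metric whose signal is independent of the true state has a value of zero under the same calculation.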

The above characteristics comprise a large part of the scientific basis for assessing the quality and usefulness of scholarly metrics. They are necessarily abstract, but closely related to the categories of action items already in the report, in particular to Definitions, Research Evaluation, Data Quality, and Grouping. Specifically, we recommend adding the following action items respectively:

– [Definitions] Develop specific definitions of altmetrics that are consistent with best practice in the social-science field on the development of  measures
– [Research evaluation] – Promote evaluation of the construct and predictive validity  of individual scholarly metrics, compared to  the best available evaluations of scholarly impact.
– [Data Quality and Gaming] – Promote the evaluation and documentation of the reliability of measures, their predictive validity, cost of computing, potential value of information, and susceptibility to manipulation based on the resources available, incentives, or collaboration among parties.

[1] NISO Altmetrics Standards Project White Paper, Draft 4, June 6 2014;  page 8
[2] See chapters 5-7 in Raykov, Tenko, and George A. Marcoulides. Introduction to psychometric theory. Taylor & Francis, 2010.
[3] See chapter 6 in Raykov, Tenko, and George A. Marcoulides. Introduction to psychometric theory. Taylor & Francis, 2010.
[4] See chapter 7 in Raykov, Tenko, and George A. Marcoulides. Introduction to psychometric theory. Taylor & Francis, 2010.
[5] See Morgan, Stephen L., and Christopher Winship. Counterfactuals and causal inference: Methods and principles for social research. Cambridge University Press, 2007.
[6] See Pratt, John Winsor, Howard Raiffa, and Robert Schlaifer. Introduction to statistical decision theory. MIT Press, 1995.
[7] See chapter 7 in Fudenberg, Drew, and Jean Tirole. Game theory. MIT Press, 1991.
[8] Schneier, Bruce. Liars and outliers: enabling the trust that society needs to thrive. John Wiley & Sons, 2012.