Sep 23, 2:57pm

My colleague Ben Lewis, who is system architect and project manager for WorldMap, which was created at the Center for Geographic Analysis at Harvard, presented this talk as part of the Program on Information Science Brown Bag Series. Ben is an expert in GIS systems and platforms and has developed many interesting tools in this area.

In his talk, below, Ben discusses the WorldMap platform, which is claimed to be the largest open source collaborative mapping system in the world, with over 13,000 map layers contributed by thousands of users from around the world. Researchers may upload large spatial datasets to the system, create data-driven visualizations, edit data, and control access. Users may keep their data private, share it in groups, or publish to the world. Ben also discussed current work to create and maintain a global registry of map services, which would take us a step closer to one-stop access for public geospatial data.

A number of themes ran through Ben’s presentation:

  • Space-time coordinates are an organizing facet for a huge variety of human and natural information: everything that happens, happens at a particular time and place.
  • Most of the geospatial web cannot be discovered through standard search engines. A major goal of Ben’s projects is to expose this “dark geoweb”, which he estimates to comprise millions of map layers.
  • Libraries need to be increasingly savvy about space in choosing and developing platforms for discovery and analysis, so that their clients can benefit from advances in GIS services and platforms and geospatial collections.

Sep 17, 10:03am

This talk was sponsored by the MIT Postdoctoral Association with support from the Office of the Vice President for Research.

In the rapidly changing world of research and scholarly communications, researchers face a growing range of options to publicly disseminate, review, and discuss research, and these choices will affect their long-term reputation. Junior scholars must be especially thoughtful in choosing how much effort to invest in dissemination and communication, and what strategies to use.

In this talk, I briefly review a number of bibliometric and scientometric studies of quantitative research impact, a sampling of influential qualitative writings offering advice in this area, and an environmental scan of emerging researcher profile systems. Based on this review, and on professional experience on dozens of review panels, I suggest some steps junior researchers may consider when disseminating their research and participating in public review and discussion.

My somewhat idiosyncratic recommendations fall into three categories: tactical, strategic, and “next steps”.

Tactical Recommendations

  • Identify and use opportunities to communicate:
    • Accept invited talks, where practical
    • Announce when you will be speaking, teaching
    • Share your presentations, writings, and data
  • Create a scholarly identity
    • Obtain an ORCID, domain name, Twitter handle, LinkedIn profile, Google Scholar profile
    • Create a short bio and longer CV
    • Develop a research theme, and signature idea
  • Communicate broadly
    • Publish writings as Open Access when possible
    • Publish data and software as open data and open source
    • Use social media (LinkedIn, Twitter) to announce new publications, teaching, speaking
  • Develop communications skills early
    • Take writing lessons early
    • Take public speaking lessons early
  • Monitor your impact
    • Monitor news, citation, social media metrics, and altmetrics that reflect the impact of your work
    • Keep records
    • Do this systematically, regularly, but not reactively or obsessively
  • Focus on clarity and significance
    • Do research that is important to you and that you think is important to the world
    • When writing about your research, work to maximize clarity – including in abstracts, titles, and citations
  • Give credit generously
    • Cite software you use
    • Cite data on which your analyses rely
    • Don’t be afraid to cite your own work
    • Discuss authorship early, and document contributions publicly

Unordered Strategic Recommendations

  • Do research that is important to you and that you think is important to the world
  • Manage your research program – find a core theme, a signature idea, and regularly review comparative strengths, comparative weaknesses, timely opportunities and future threats
  • Collaborate with people you respect and like working with; start with small steps
  • Take a positive and sustained interest in the work and careers of others; this is the foundation of professional networking
  • Make a moderate, but systematic effort to understand and monitor the institutions within which your work is embedded.
  • Identify your core strengths. Build a career around those.
  • Identify the weaknesses that are continual stumbling blocks. Make them good enough.
  • Pay attention to your world: exercise, sleep, diet, stress, relationships
  • Don’t manage your time – manage your life: know your values, choose your priorities, monitor your progress
  • Align your career with your core values

Ten Things to try right now…

Identify yourself 

1.  Register for an ORCID identifier

2. Register for information hubs: LinkedIn, SlideShare, and a domain name of your own

3. Register for Twitter

Describe yourself …
Write these and post them to your LinkedIn and ORCID profiles

4. Write and share a 1-paragraph bio

5. Describe your research program in 2 paragraphs

6. Create a CV


7. Share (on Twitter & LinkedIn) news about something you did or published; an upcoming event in which you will participate; interesting news and publications in your field

8. Make writings, data, publications, and software available as Open Access (through your institutional repository, SlideShare, Dataverse, FigShare)

Check and record these things regularly, but not too frequently (once a month); there is no need to react or adjust immediately

9. Set up tracking of your citations, mentions, and topics you are interested in using Google Scholar and Google Alerts

10. Find your Klout score and H-index

In the full presentation, I show how to gather impact data, review findings from bibliometric research on how to increase impact by choosing titles, venues, and the like; and consider the advice for success given by the scores of books I’ve scanned on this topic.

The full presentation is available here:


Jul 22, 11:13am

To summarize, altmetrics should build on existing statistical and social science methods for developing reliable measures. The draft white paper from the NISO altmetrics project suggests many interesting potential action items, but does not yet incorporate, suggest, or reference a framework for the systematic definition or evaluation of metrics.

NISO recently offered an opportunity to comment on the draft recommendation of its ‘Altmetrics Standards Project’. MIT is a non-voting NISO member, and I am the current ‘representative’ to NISO. The following is my commentary on the draft recommendation. You may also be interested in reading the other commentaries on this draft.

Response to request for public comments on ‘NISO Altmetrics Standards Project White Paper’

Scholarly metrics should be broadly understood as measurement constructs applied to the domain of scholarship and research (broadly, any form of rigorous enquiry): its outputs, actors, and impacts (i.e. broader consequences), and the relationships among them. Most traditional formal scholarly metrics, such as the H-Index, Journal Impact Factor, and citation count, are relatively simple summary statistics applied to the attributes of a corpus of bibliographic citations extracted from a selection of peer-reviewed journals. The Altmetrics movement aims to develop more sophisticated measures, based on a broader set of attributes, and covering a deeper corpus of outputs.
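To illustrate how simple these traditional summary statistics are, here is a minimal sketch that computes a citation count and an H-index from a list of per-paper citation counts. The function names and the sample data are hypothetical, chosen only for illustration; real services differ mainly in which corpus of citations they draw on.

```python
def citation_count(citations):
    """Total citations across all papers."""
    return sum(citations)


def h_index(citations):
    """Largest h such that at least h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, cites in enumerate(ranked, start=1):
        if cites >= rank:
            h = rank
        else:
            break
    return h


# Hypothetical per-paper citation counts for one researcher.
papers = [25, 8, 5, 3, 3, 1, 0]
print("citations:", citation_count(papers))
print("h-index:", h_index(papers))
```

Both statistics are functions of a single attribute (citations per paper), which is the sense in which they are "relatively simple" compared with measures built on a broader set of attributes.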

As the Draft aptly notes, in general our current scholarly metrics, and the decision systems around them, are far from rigorous: “Unfortunately, the scientific rigor applied to using these numbers for evaluation is often far below the rigor scholars use in their own scholarship.” [1]

The Draft takes a step towards a more rigorous understanding of altmetrics. Its primary contribution is to suggest a set of potential action items to increase clarity and understanding.

However, the Draft does not yet identify the key elements of a rigorous (or systematic) foundation for defining scholarly metrics, their properties, and their quality. Nor does the Draft identify key research in evaluation and measurement that provides a potential foundation. The aim of these comments is to start to fill this structural gap.

Informally speaking, good scholarly metrics are fit for use in a scholarly incentive system. More formally, most scholarly metrics are parts of larger evaluation and incentive systems, where the metric is used to support descriptive and predictive/causal inference, in support of some decision.

Defining metrics formally in this way also helps to clarify what characteristics of metrics are important for determining their quality and usefulness.

– Characteristics supporting any inference. Classical test theory is well developed in this area. [2] A useful metric supports some form of inference, and reliable inference requires reliability. [3] Informally, good metrics should yield similar results across repeated measurements of the same purported phenomenon.
– Characteristics supporting descriptive inference. Since an objective of most incentive systems is descriptive, good measures must have appropriate measurement validity. [4] In informal terms, all measures should be internally consistent, and the metric should be related to the concept being measured.
– Characteristics supporting prediction or intervention. Since the objective of most incentive systems is both descriptive and predictive/causal inference, good measures must aid accurate and unbiased inference. [5] In informal terms, the metric should demonstrably be able to increase the accuracy of predicting something relevant to scholarly evaluation.
– Characteristics supporting decisions. Decision theory is well developed in this area [6]: The usefulness of metrics is dependent on the cost of computing the metric, and the value of the information that the metric produces. The value of the information depends on the expected value of the optimal decisions that would be produced with and without that information. In informal terms, good metrics provide information that helps one avoid costly mistakes, and good metrics cost less than the expected cost of the mistakes one avoids by using them.
– Characteristics supporting evaluation systems. This is a more complex area, but the fields of game theory and mechanism design are most relevant. [7] Measures that are used in a strategic context must be resistant to manipulation, either by (a) requiring extensive resources to manipulate, (b) requiring extensive coordination across independent actors to manipulate, or (c) incentivizing truthful revelation. Trust engineering is another relevant area: characteristics such as transparency, monitoring, and punishment of bad behavior, among other systems factors, may have substantial effects. [8]
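The decision-theoretic criterion in the list above (a metric is worth using when it costs less than the mistakes it helps avoid) can be made concrete with a small worked example. The sketch below is illustrative only: the funding decision, the prior, the payoffs, and the metric's accuracy are all hypothetical numbers, and the function names are not drawn from any library.

```python
def expected_value_without_metric(prior, payoff):
    """Expected payoff of the single best action chosen under the prior alone."""
    return max(
        sum(prior[state] * payoff[action][state] for state in prior)
        for action in payoff
    )


def expected_value_with_metric(prior, payoff, likelihood):
    """Expected payoff when the best action is chosen after seeing the metric.

    likelihood[signal][state] is P(signal | state).
    """
    total = 0.0
    for signal in likelihood:
        # Joint probability P(signal, state); proportional to the posterior.
        joint = {s: prior[s] * likelihood[signal][s] for s in prior}
        total += max(
            sum(joint[s] * payoff[action][s] for s in prior)
            for action in payoff
        )
    return total


# Hypothetical setup: decide whether to fund a project of unknown impact.
prior = {"high": 0.3, "low": 0.7}
payoff = {
    "fund": {"high": 100.0, "low": -40.0},
    "decline": {"high": 0.0, "low": 0.0},
}
# Hypothetical metric that reports "positive" 80% of the time for
# high-impact work and 20% of the time for low-impact work.
likelihood = {
    "positive": {"high": 0.8, "low": 0.2},
    "negative": {"high": 0.2, "low": 0.8},
}

value_of_information = (
    expected_value_with_metric(prior, payoff, likelihood)
    - expected_value_without_metric(prior, payoff)
)


def worth_using(metric_cost):
    """A metric is worth computing only if it costs less than the value it adds."""
    return metric_cost < value_of_information


print(round(value_of_information, 2), worth_using(5.0))
```

Under these assumed numbers, the metric changes the decision only when it reports "negative", and its value is the expected loss avoided in that case.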

The above characteristics comprise a large part of the scientific basis for assessing the quality and usefulness of scholarly metrics. They are necessarily abstract, but closely related to the categories of action items already in the report, in particular to Definitions; Research Evaluation; Data Quality; and Grouping. Specifically, we recommend adding the following action items, respectively:

– [Definitions] Develop specific definitions of altmetrics that are consistent with best practice in the social-science field on the development of measures.
– [Research Evaluation] Promote evaluation of the construct and predictive validity of individual scholarly metrics, compared to the best available evaluations of scholarly impact.
– [Data Quality and Gaming] Promote the evaluation and documentation of the reliability of measures, their predictive validity, cost of computing, potential value of information, and susceptibility to manipulation based on the resources available, incentives, or collaboration among parties.

[1] NISO Altmetrics Standards Project White Paper, Draft 4, June 6, 2014; page 8.
[2] See chapters 5-7 in Raykov, Tenko, and George A. Marcoulides. Introduction to psychometric theory. Taylor & Francis, 2010.
[3] See chapter 6 in Raykov, Tenko, and George A. Marcoulides. Introduction to psychometric theory. Taylor & Francis, 2010.
[4] See chapter 7 in Raykov, Tenko, and George A. Marcoulides. Introduction to psychometric theory. Taylor & Francis, 2010.
[5] See Morgan, Stephen L., and Christopher Winship. Counterfactuals and causal inference: Methods and principles for social research. Cambridge University Press, 2007.
[6] See Pratt, John Winsor, Howard Raiffa, and Robert Schlaifer. Introduction to statistical decision theory. MIT Press, 1995.
[7] See chapter 7 in Fudenberg, Drew, and Jean Tirole. Game theory. MIT Press, 1991.
[8] See Schneier, Bruce. Liars and outliers: enabling the trust that society needs to thrive. John Wiley & Sons, 2012.

Jul 17, 9:12am

My colleague (Merrick) Lex Berman, who is Web Service Manager & GIS Specialist at the Center for Geographic Analysis at Harvard, presented this talk as part of the Program on Information Science Brown Bag Series. Lex is an expert in applications related to digital humanities, GIS, and Chinese history, and has developed many interesting tools in this area.

In his talk, Lex notes how the library catalog has evolved from the description of items in physical collections into a wide-reaching net of services and tools for managing both physical collections and networked resources: the line between descriptive metadata and actual content is becoming blurred. Librarians and catalogers are now in the position of being not only docents of collections, but innovators in digital research, and this opens up a number of opportunities for retooling library discovery tools. His presentation surveyed methods and projects that have extended traditional catalogs of libraries and museums into online collections of digital objects in the humanities, focusing on projects that use historical place names and geographic identifiers for linked open data.

A number of themes ran through Lex’s presentation. One theme is the unbinding of information: how collections are split into pieces that can be repurposed, but which also need to be linked to their context to remain understandable. Another theme is that knowledge is no longer bounded. Footnotes and references are no longer stopping points; from the point of view of the user, all collections are unbounded, and the line between references to information and the information itself has become increasingly blurred. A third theme is the pervasiveness of information about place and space: all human activity takes place within a specific context of time and space, and implicit references to places appear throughout the library catalog, such as in the titles and descriptions of works. A fourth theme is that user expectations are changing: users expect instant, machine-readable information, geospatial information, mapping, and faceting as a matter of course.

Lex suggested a number of entry points for libraries to investigate and pilot spatial discovery:

  • Build connections to existing catalogs, which already have implicit reference to space and place
  • Expose information through simple APIs and formats, like GeoRSS
  • Use and contribute to open services like gazetteers


Jul 08, 8:11am

Tracking scholarly outputs has always been a part of the academic enterprise. However, the dramatic increase in publication and collaboration over the last three decades is driving new, more scalable approaches. A central challenge to understanding the rapidly growing scholarly universe is the problem of collecting complete and unambiguous data on who (among researchers, scholars, students and other members of the enterprise) has contributed in what ways to what outputs (e.g., articles, data, software, patents) with the support of which institutions (e.g. as funders, host institutions, publishers). In short, a full understanding of research requires those involved in the research enterprise to use public, reliable identifiers.

In June, I had the pleasure of speaking on a panel at the “Twelfth Annual ARIES EMUG Users Group Meeting” that aimed to provide an overview of the major new trends in the area of scholarly identifiers.

The presentation embedded below provides an overview of ORCID researcher identifiers; their role in integrating systems for managing, evaluating, and tracking scholarly outputs; and the broader integration of researcher identifiers with publication, funder, and institutional identifiers.

Most of the credit for the presentation itself is due to ORCID Executive Director Laurel Haak, who developed the majority of the presentation materials: those which describe ORCID and developments around it. And there are indeed many ORCID-related developments to relate.

My additions attempt to sum up the larger context in which ORCID and researcher identifiers play a key role.

It has been widely remarked that the sheer number of publications and researchers has grown dramatically over the last three decades. And it is not simply the numbers that are changing. Authors are changing: increasingly students, “citizen-scientists”, software developers, data curators and others author, or make substantial intellectual contributions to, scholarly works. Authorship is changing: science, and the creation of scientific outputs, involves wider collaborations and a wider potential variety of research roles. Scholarly works are changing: recognized outputs of scholarship include not only traditional research articles and books, but also datasets, nano-publications, software, videos, and dynamic “digital scholarship”. And evaluation is changing to reflect the increasing volume, granularity, and richness of measures available, and the increasing sophistication of statistical and computational methods for network and textual analysis.

The tools, methods, and infrastructure for tracking, evaluating, attributing, and understanding patterns of scholarship are under pressure to adapt to these changes. ORCID is part of this: it is a key tool for adapting to changes in the scale and nature of scholarly production. It is a community-based system for researcher identification, built on standardized definitions, open source, an open API, and open data.

ORCID provides a mechanism for robust identification of researchers. It aims to solve the problem of understanding the “who” in research. Increasingly, ORCID is also integrating with solutions to address the “which” and “what”.
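One small, concrete contribution to robustness is built into the identifier itself: an ORCID iD is 16 characters long, and the final character is a check digit computed with the ISO 7064 MOD 11-2 algorithm, so most typing errors can be caught before any lookup. The sketch below is a minimal illustration of that check; the function names are hypothetical, and it validates only the format and check digit, not whether the iD is actually registered.

```python
def orcid_check_digit(base_digits):
    """Compute the ISO 7064 MOD 11-2 check character for the first 15 digits."""
    total = 0
    for ch in base_digits:
        total = (total + int(ch)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def is_well_formed_orcid(orcid):
    """Return True if the iD has 16 characters and a matching check digit."""
    compact = orcid.replace("-", "")
    if len(compact) != 16:
        return False
    base, check = compact[:15], compact[15].upper()
    if any(ch not in "0123456789" for ch in base):
        return False
    return orcid_check_digit(base) == check


# 0000-0002-1825-0097 is the sample iD used in ORCID's own documentation.
print(is_well_formed_orcid("0000-0002-1825-0097"))
```

A check like this can sit in a submission form or import script so that malformed identifiers are rejected locally, before they reach a registry query.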

Effective sustained long-term integration of multiple domains requires work at multiple levels:

  • At the abstract level, integration involves the coordination of vocabularies, schemas, taxonomies or ontologies that link or cross domain boundaries.
  • At the systems level, integration requires accessible APIs that provide hooks to access domain-specific identifiers, linkages, or content.
  • At the user level, integration requires human-computer interface design that exposes domain-specific information and leverages it to increase ease of use and data integrity; support and documentation also need to be available.
  • At the organizational level, integration requires engagement with the evolution of standards and implementations in other domains, and with the organizations driving them. Especially in this rapidly changing ecosystem, one must frequently monitor integration points to anticipate or mitigate incompatible changes.

ORCID is making rapid progress in integrating with systems that address the “which” of research. ORCID iDs are now integrated into manuscript management systems, publishers’ workflows, and CrossRef DOI indexing, with the result that these iDs are increasingly part of the core metadata associated with publications.

ORCID now uses standard Ringold identifiers to identify institutions such as employers. (Ringold identifiers are in the process of being mapped to ISNI institutional identifiers as well, which will further integrate ORCID and ISNI.) These institutional identifiers are seamlessly integrated into the ORCID UI, which helps users of ORCID auto-complete institutional names and increases data integrity. These IDs are part of the ORCID schema and are exposed through the open API. And ORCID engages with Ringold on an institutional level so that institutional identifiers can be added at the request of ORCID members.

Similarly, ORCID now uses FundRef identifiers to identify funding agencies and awards. These too are integrated at points in the UI, schema, and API. Search and link wizards can push FundRef identifiers into the ORCID registry along with other information about each award.

Full integration of data across the next generation of the scholarly ecosystem will involve more of the “what” of research. This includes associating publication, institutional, and individual identifiers with a wider variety of scholarly outputs, including data sets and software; and developing standardized information about the types of relationships among outputs, institutions, and people, particularly the many different types and degrees of contribution that members of collaborations make to research and to its products.

ORCID has been taking steps in this direction, including providing a DataCite search-and-link wizard for datasets; working to expand the work types supported in the ORCID schemas; working with the community to develop and enhance existing schemas and workflows; and working with CASRAI to develop an approach to embedding researcher identifiers into peer review. This is just the tip of the iceberg, however, and the scholarly ecosystem has a considerable way to go before it will reflect the many emerging forms of scholarly outputs and the roles that contributors take in relation to these.

Jun 18, 8:01am

Analysis of dozens of publicly created redistricting plans shows that map-making technology can improve political representation and detect a gerrymander.

In 2012, President Obama won the vote in Ohio by three percentage points, while Republicans held a 13-to-5 majority in Ohio’s delegation to the U.S. House. After redistricting in 2013, Republicans held 12 of Ohio’s House seats while Democrats held four. As is typical in these races, few were competitive; the average margin of victory was 32 points. Is this simply a result of demography, the need to create a majority-minority district, and the constraints traditional redistricting principles impose on election lines, or did the legislature intend to create a gerrymander?

To find out… read more at TechTank

May 09, 10:01am

My collaborator Dr. Mercè Crosas, who is Director of Data Science at the Institute for Quantitative Social Science (IQSS) at Harvard, presented this talk as part of the Program on Information Science Brown Bag Series.

The Dataverse software provides multiple workflows for data publishing to support a wide range of data policies and practices established by journals, as well as data sharing needs from various research communities. This talk will describe these workflows from the user experience and from the system’s technical implementation.

Dr. Crosas discussed the portfolio of tools being developed at IQSS, including Dataverse, DataTags, Rbuild, TwoRavens, and Zelig. (The Program on Information Science is pleased to be collaborating in some of these efforts, including a project to integrate Dataverse and OJS, and to develop a wide set of privacy tools that connect with DataTags; see the Program project portfolio.) She argued that data publishing should be based on three ‘pillars’:
  1. Trusted data repositories that guarantee long-term access
  2. Mechanisms for formal data citation
  3. Sufficient information to understand and reuse the data (e.g. metadata, documentation, code)

It is interesting to consider the extent to which current tools and practices support these pillars. There have been many recent efforts focused on trusted repositories and data citation, and many of the projects described by Dr. Crosas promise to advance the state of the practice in these areas.

Determining what is sufficient information for understanding and reuse, and making that information easy to extract from the research process, to record in structured ways, and to expose and present to different audiences, seems to be an area with many opportunities and few broad solutions. It is also interesting to consider the extent to which the workflows for publishing data integrate with the workflows for managing data during research, prior to publication. I discuss some of these tools and integration points in a previous post.

May 01, 5:10pm

More and more frequently, research and scholarly publications involve collaboration, and co-authorship in most scholarly fields is rapidly increasing in frequency and complexity. Yet co-authorship patterns vary across disciplines; the interpretation of author ordering is fraught with ambiguity; and richer textual descriptions of authorship are uncommon, and difficult to interpret, aggregate, and evaluate at scale.

There is a growing interest across stakeholders in scholarship in increasing the transparency of research contributions and in identifying the specific roles that contributors play in creating scholarly works. Understanding and accommodating emerging forms of co-authorship is critical for managing intellectual property, publication ethics, and effective evaluation.

The slides below provide an overview of the research questions in this area, and were originally presented as part of the Program on Information Science Brown Bag Series. Our upcoming panel at the Society for Scholarly Publishing will explore these issues further.


Apr 17, 11:26am

Whose articles cite a body of work? Is this a high-impact journal? How might others assess my scholarly impact? Citation analysis is one of the primary methods used to answer these questions. Academics, publishers, and funders often study the patterns of citations in the academic literature in order to explore the relationships among researchers, topics, and publications, and to measure the impact of articles,  journals, and individuals.

MIT has a wonderful tradition of offering a variety of short courses during the Winter hiatus, known as IAP. The MIT Libraries generally offers dozens of courses on data and information management (among other topics) during this time. Moreover, the Libraries also hold bonus IAP sessions in April and July.

For IAPril, the Program on Information Science has introduced a new course on citation and bibliometric analysis. It provides a review of citation and altmetric data; best-of-class free/open tools for analysis; and classes of measures.

The full (and updated) slides are below:


In addition, a short summary of MIT resources, as presented to library liaisons, is available here:

Registration is available on the class site:

Apr 09, 3:53pm

I was honored to present some of our research at a panel on election reform at this year’s Harvard Law and Policy Review symposium.

As a summary of this presentation, the Harvard Law and Policy Review’s Notice and Comment Blog has published a concise set of recommendations for redistricting reform by my collaborator Michael McDonald and me, entitled:

Create Real Redistricting Reform through Internet-Scale Independent Commissions

To quote from the HLPR summary:

Twenty-first century redistricting should incorporate  transparency at internet speed and scale  – open source, open data, open process (see here for in-depth recommendations) — and twenty-first century redistricting should incorporate internet technology for twenty-first century participation: direct access to the redistricting process; access to legal-strength mapping tools; and the integration of  crowd-sourcing to create maps, identify communities and neighborhoods, collect and correct data, and gather and analyze public commentary.

There are few policy arenas where the public can fashion legitimate proposals that rival what their elected officials enact. Redistricting is among them, so why not enable greater public participation in this critical democratic process?

Read the rest of these recommendations on the Harvard Law and Policy Review — Notice and Comment site


(And on a related topic, this previous post summarizes some of our research on crowd-sourced mapping for open government.)