Blog

Jul 22, 11:13am

To summarize, altmetrics should build on existing statistical and social science methods for developing reliable measures. The draft white paper from the NISO altmetrics project suggests many interesting potential action items, but does not yet incorporate, suggest, or reference a framework for systematic definition or evaluation of metrics.

NISO recently offered an opportunity to comment on the draft recommendation from its ‘Altmetrics Standards Project’. MIT is a non-voting NISO member, and I am the current ‘representative’ to NISO. The following is my commentary on the draft recommendation. You may also be interested in reading the other commentaries on this draft.

Response to request for public comments on ‘NISO Altmetrics Standards Project White Paper’

Scholarly metrics should be broadly understood as measurement constructs applied to the domain of scholarship/research (broadly, any form of rigorous enquiry): its outputs, actors, impacts (i.e., broader consequences), and the relationships among them. Most traditional formal scholarly metrics, such as the H-index, Journal Impact Factor, and citation count, are relatively simple summary statistics applied to the attributes of a corpus of bibliographic citations extracted from a selection of peer-reviewed journals. The altmetrics movement aims to develop more sophisticated measures, based on a broader set of attributes and covering a deeper corpus of outputs.

As the Draft aptly notes, in general our current scholarly metrics and the decision systems around them are far from rigorous: “Unfortunately, the scientific rigor applied to using these numbers for evaluation is often far below the rigor scholars use in their own scholarship.” [1]

The Draft takes a step towards a more rigorous understanding of altmetrics. Its primary contribution is to suggest a set of potential action items to increase clarity and understanding.

However, the Draft does not yet identify the key elements of a rigorous (or systematic) foundation for defining scholarly metrics, their properties, and their quality. Nor does the Draft identify key research in evaluation and measurement that provides a potential foundation. The aim of these comments is to start to fill this structural gap.

Informally speaking, good scholarly metrics are fit for use in a scholarly incentive system. More formally, most scholarly metrics are parts of larger evaluation and incentive systems, where the metric is used to support descriptive and predictive/causal inference, in support of some decision.

Defining metrics formally in this way also helps to clarify what characteristics of metrics are important for determining their quality and usefulness.

– Characteristics supporting any inference. Classical test theory is well developed in this area. [2] A useful metric supports some form of inference, and reliable inference requires reliability. [3] Informally, good metrics should yield similar results across repeated measurements of the same purported phenomenon.
– Characteristics supporting descriptive inference. Since an objective of most incentive systems is description, good measures must have appropriate measurement validity. [4] In informal terms, all measures should be internally consistent, and the metric should be related to the concept being measured.
– Characteristics supporting prediction or intervention. Since the objective of most incentive systems is both descriptive and predictive/causal inference, good measures must aid accurate and unbiased inference. [5] In informal terms, the metric should demonstrably increase the accuracy of predicting something relevant to scholarly evaluation.
– Characteristics supporting decisions. Decision theory is well developed in this area [6]: the usefulness of a metric depends on the cost of computing it and on the value of the information it produces. The value of the information depends on the expected value of the optimal decisions that would be made with and without that information. In informal terms, good metrics provide information that helps one avoid costly mistakes, and good metrics cost less than the expected cost of the mistakes one avoids by using them. (A toy calculation follows this list.)
– Characteristics supporting evaluation systems. This is a more complex area, but the fields of game theory and mechanism design are most relevant. [7] Measures that are used in a strategic context must be resistant to manipulation — either (a) requiring extensive resources to manipulate, (b) requiring extensive coordination across independent actors to manipulate, or (c) incenting truthful revelation. Trust engineering is another relevant area — characteristics such as transparency, monitoring, and punishment of bad behavior, among other systems factors, may have substantial effects. [8]
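To make the reliability and value-of-information characteristics concrete, here is a minimal sketch in Python. All numbers (the citation-like scores, the prior, the payoffs, and the metric’s accuracy) are invented for illustration and are not drawn from the white paper; the sketch simply computes a test-retest reliability coefficient and the expected value of the information a metric contributes to a toy fund/no-fund decision.

```python
# Toy illustration only: every number below is an assumption for exposition.
from statistics import correlation

# Reliability: the same metric computed twice on the same six outputs
# should yield similar results (first characteristic above).
first_pass = [12, 40, 7, 55, 23, 31]
second_pass = [14, 38, 9, 53, 25, 29]
reliability = correlation(first_pass, second_pass)  # near 1.0 means consistent

# Value of information for a fund / no-fund decision (decision characteristic).
p_high = 0.3                                  # prior P(proposal is high impact)
payoff_fund = {"high": 100.0, "low": -20.0}   # payoff of funding, by true state
payoff_skip = 0.0                             # payoff of not funding
accuracy = 0.8                                # P(metric reads "high" | high) = P("low" | low)

def ev_without_metric():
    ev_fund = p_high * payoff_fund["high"] + (1 - p_high) * payoff_fund["low"]
    return max(ev_fund, payoff_skip)

def ev_with_metric():
    # Enumerate the metric's possible readings and act optimally on each.
    total = 0.0
    for reading in ("high", "low"):
        p_read_given_high = accuracy if reading == "high" else 1 - accuracy
        p_read_given_low = (1 - accuracy) if reading == "high" else accuracy
        p_read = p_high * p_read_given_high + (1 - p_high) * p_read_given_low
        p_high_given_read = p_high * p_read_given_high / p_read
        ev_fund = (p_high_given_read * payoff_fund["high"]
                   + (1 - p_high_given_read) * payoff_fund["low"])
        total += p_read * max(ev_fund, payoff_skip)
    return total

value_of_information = ev_with_metric() - ev_without_metric()
print(f"reliability = {reliability:.2f}")
print(f"value of information = {value_of_information:.1f}")
# The metric is worth using only if its value of information exceeds its cost.
```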

The above characteristics comprise a large part of the scientific basis for assessing the quality and usefulness of scholarly metrics. They are necessarily abstract, but closely related to the categories of action items already in the report, in particular Definitions, Research Evaluation, Data Quality, and Grouping. Specifically, we recommend adding the following action items, respectively:

– [Definitions] Develop specific definitions of altmetrics that are consistent with best practice in the social sciences on the development of measures.
– [Research evaluation] Promote evaluation of the construct and predictive validity of individual scholarly metrics, compared to the best available evaluations of scholarly impact.
– [Data Quality and Gaming] Promote the evaluation and documentation of the reliability of measures, their predictive validity, cost of computation, potential value of information, and susceptibility to manipulation given the resources, incentives, and opportunities for collaboration available to interested parties.

[1] NISO Altmetrics Standards Project White Paper, Draft 4, June 6, 2014; page 8.
[2] See chapters 5-7 in Raykov, Tenko, and George A. Marcoulides. Introduction to psychometric theory. Taylor & Francis, 2010.
[3] See chapter 6 in Raykov, Tenko, and George A. Marcoulides. Introduction to psychometric theory. Taylor & Francis, 2010.
[4] See chapter 7 in Raykov, Tenko, and George A. Marcoulides. Introduction to psychometric theory. Taylor & Francis, 2010.
[5] See Morgan, Stephen L., and Christopher Winship. Counterfactuals and causal inference: Methods and principles for social research. Cambridge University Press, 2007.
[6] See Pratt, John Winsor, Howard Raiffa, and Robert Schlaifer. Introduction to statistical decision theory. MIT press, 1995.
[7] See ch. 7 in Fudenberg, Drew, and Jean Tirole. Game Theory. MIT Press, Cambridge, Massachusetts, 1991.
[8] Schneier, Bruce. Liars and outliers: enabling the trust that society needs to thrive. John Wiley & Sons, 2012.


Jul 17, 9:12am

My colleague (Merrick) Lex Berman, who is Web Service Manager & GIS Specialist at the Center for Geographic Analysis at Harvard, presented this talk as part of the Program on Information Science Brown Bag Series. Lex is an expert in applications related to digital humanities, GIS, and Chinese history, and has developed many interesting tools in this area.

In his talk, Lex noted how the library catalog has evolved from the description of items in physical collections into a wide-reaching net of services and tools for managing both physical collections and networked resources: the line between descriptive metadata and actual content is becoming blurred. Librarians and catalogers are now in the position of being not only docents of collections but innovators in digital research, and this opens up a number of opportunities for retooling library discovery tools. His presentation surveyed methods and projects that have extended traditional catalogs of libraries and museums into online collections of digital objects in the humanities, focusing on projects that use historical place names and geographic identifiers for linked open data.

A number of themes ran through Lex’s presentation. One theme is the unbinding of information — how collections are split into pieces that can be repurposed, but which also need to be linked to their context to remain understandable. Another theme is that knowledge is no longer bounded: footnotes and references are no longer stopping points; from the point of view of the user, all collections are unbounded; and the line between references to information and the information itself has become increasingly blurred. A third theme was the pervasiveness of information about place and space — all human activity takes place within a specific context of time and space, and implicit references to places appear throughout the library catalog, such as in the titles and descriptions of works. A fourth theme is that user expectations are changing: users expect instant, machine-readable information, geospatial information, mapping, and faceting as a matter of course.

Lex suggested a number of entry points for libraries to investigate and pilot spatial discovery:

  • Build connections to existing catalogs, which already have implicit reference to space and place
  • Expose information through simple APIs and formats, like GeoRSS (see the sketch after this list)
  • Use and contribute to open services like gazetteers
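As one concrete, entirely hypothetical illustration of the second entry point, the sketch below builds a single GeoRSS-Simple item for an invented catalog record. The record title, URL, place name, and coordinates are placeholders of my own, not material from Lex’s talk; only the GeoRSS-Simple point encoding ("lat lon" in a georss:point element) follows the published convention.

```python
# Minimal sketch: expose one catalog record with a place reference as a
# GeoRSS-Simple item. The record values below are invented placeholders.
import xml.etree.ElementTree as ET

GEORSS = "http://www.georss.org/georss"
ET.register_namespace("georss", GEORSS)

record = {
    "title": "Gazetteer of a Chinese province (hypothetical record)",
    "link": "https://catalog.example.edu/record/12345",
    "lat": 30.25,
    "lon": 120.17,
}

rss = ET.Element("rss", version="2.0")
channel = ET.SubElement(rss, "channel")
item = ET.SubElement(channel, "item")
ET.SubElement(item, "title").text = record["title"]
ET.SubElement(item, "link").text = record["link"]
# GeoRSS-Simple encodes a point as "latitude longitude" in one element.
ET.SubElement(item, f"{{{GEORSS}}}point").text = f'{record["lat"]} {record["lon"]}'

print(ET.tostring(rss, encoding="unicode"))
```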



Jul 08, 8:11am

Tracking scholarly outputs has always been a part of the academic enterprise. However, the dramatic increase in publication and collaboration over the last three decades is driving new, more scalable approaches. A central challenge to understanding the rapidly growing scholarly universe is the problem of collecting complete and unambiguous data on who (among researchers, scholars, students, and other members of the enterprise) has contributed in what ways to what outputs (e.g., articles, data, software, patents) with the support of which institutions (e.g., as funders, host institutions, publishers). In short, a full understanding of research requires those involved in the research enterprise to use public, reliable identifiers.

In June, I had the pleasure of speaking on a panel at the “Twelfth Annual ARIES EMUG Users Group Meeting” that aimed to provide an overview of the major new trends in the area of scholarly identifiers.

The presentation embedded below provides an overview of ORCID researcher identifiers; their role in integrating systems for managing, evaluating, and tracking scholarly outputs; and the broader integration of researcher identifiers with publication, funder, and institutional identifiers.

Most of the credit for the presentation itself is due to ORCID Executive Director Laure Haak who developed the majority of the presentation materials — those which describe ORCID and developments around it. And there are indeed many ORCID-related developments to relate.

My additions attempt to sum up the larger context in which ORCID and researcher identifiers play a key role.

It has been widely remarked that the sheer number of publications and researchers has grown dramatically over the last three decades. And it is not simply the numbers that are changing. Authors are changing — increasingly, students, “citizen-scientists”, software developers, data curators, and others author, or make substantial intellectual contributions to, scholarly works. Authorship is changing — science, and the creation of scientific outputs, involves wider collaborations and a wider potential variety of research roles. Scholarly works are changing — recognized outputs of scholarship include not only traditional research articles and books, but also datasets, nano-publications, software, videos, and dynamic “digital scholarship”. And evaluation is changing to reflect the increasing volume, granularity, and richness of available measures, and the increasing sophistication of statistical and computational methods for network and textual analysis.

The tools, methods, and infrastructure for tracking, evaluating, attributing, and understanding patterns of scholarship are under pressure to adapt to these changes. ORCID is part of this — it is a key tool for adapting to changes in the scale and nature of scholarly production. It is a community-based system for researcher identification, built on standardized definitions, open source software, an open API, and open data.

ORCID provides a mechanism for robust identification of researchers — it aims to solve the problem of understanding the “who” in research. Increasingly, ORCID is also integrating with solutions that address the “which” and the “what”.

Effective sustained long-term integration of multiple domains requires work at multiple levels:

  • At the abstract level, integration involves the coordination of vocabularies, schemas, taxonomies or ontologies that link or cross domain boundaries.
  • At the systems level, integration requires accessible APIs that provide hooks to access domain-specific identifiers, linkages, or content (a sketch of one such hook follows this list).
  • At the user level, integration requires human-computer-interface design that exposes domain-specific information and leverages it to increase ease of use and data integrity; supporting documentation must also be available.
  • At the organizational level, integration requires engagement with the evolution of standards and implementations, and with the organizations driving these, in other domains. Especially in this rapidly changing ecosystem, one must frequently monitor integration points to anticipate or mitigate incompatible changes.
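As a systems-level example, the sketch below pulls one public record through ORCID’s public API. The API version in the URL and the JSON field paths are my assumptions about the current public interface and should be checked against the ORCID API documentation before use; the iD shown is the example identifier used in ORCID’s own documentation.

```python
# Sketch only: endpoint version and JSON field paths below are assumptions
# to verify against the current ORCID public API documentation.
import json
import urllib.request

ORCID_ID = "0000-0002-1825-0097"  # example iD used in ORCID's documentation
url = f"https://pub.orcid.org/v3.0/{ORCID_ID}/record"
request = urllib.request.Request(url, headers={"Accept": "application/json"})

with urllib.request.urlopen(request) as response:
    record = json.load(response)

# Pull the public name out of the record (field paths are assumptions).
name = record.get("person", {}).get("name", {}) or {}
given = (name.get("given-names") or {}).get("value")
family = (name.get("family-name") or {}).get("value")
print(given, family)
```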

ORCID is making rapid progress in integrating with systems that address the “which” of research. ORCID iDs are now integrated into manuscript management systems, publishers’ workflows, and CrossRef DOI indexing, with the result that these iDs are increasingly part of the core metadata associated with publications.

ORCID now uses standard Ringgold identifiers to identify institutions such as employers. (Ringgold identifiers are in the process of being mapped to ISNI institutional identifiers as well — which will further integrate ORCID and ISNI.) These institutional identifiers are seamlessly integrated into the ORCID UI, which helps users of ORCID auto-complete institutional names and increases data integrity. These IDs are part of the ORCID schema and are exposed through the open API. And ORCID engages with Ringgold at an institutional level so that institutional identifiers can be added on the request of ORCID members.

Similarly, ORCID now uses FundRef identifiers to identify funding agencies and awards. These too are integrated at points in the UI, schema, and API. Search-and-link wizards can push FundRef identifiers into the ORCID registry along with other information about each award.

Full integration of data across the next-generation of the scholarly ecosystem will involve more of the “what” of research. This includes associating publication, institutional, and individual identifiers with a wider variety of scholarly outputs, including data sets and software; and developing standardized information about the types of relationships among outputs, institutions, and people — particularly the many different types and degrees of contribution that members of collaborations make to research and to its products.
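A hypothetical sketch of what such a standardized “who / which / what” assertion might look like, expressed as a plain data structure, is below. The field names and role vocabulary are my own illustration rather than an existing ORCID or CASRAI schema, and the identifier values are placeholders (the ORCID iD is the documentation example; the DOIs use test/placeholder prefixes).

```python
# Hypothetical linked contribution record; all values are placeholders.
contribution = {
    "output":      {"type": "dataset", "doi": "10.5072/EXAMPLE"},      # what
    "contributor": {"orcid": "0000-0002-1825-0097",                    # who
                    "roles": ["data curation", "software"]},
    "institution": {"ringgold": "000000"},                             # which (host)
    "funder":      {"funder_id": "10.13039/EXAMPLE",                   # which (funder)
                    "award": "ABC-1234567"},
}
print(contribution["contributor"]["roles"])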

ORCID has been taking steps in this direction, including a DataCite search-and-link wizard for datasets; working to expand the work types supported in the ORCID schemas, and working with the community to develop and enhance existing schemas and workflows; and working with CASRAI to develop an approach to embedding researcher identifiers into peer review. This is just the tip of the iceberg, however, and the scholarly ecosystem has a considerable way to go before it reflects the many emerging forms of scholarly outputs and the roles that contributors take in relation to them.


Jun 18, 8:01am

Analysis of dozens of publicly created redistricting plans shows that map-making technology can improve political representation and detect a gerrymander.

In 2012, President Obama won the vote in Ohio by three percentage points, while Republicans held a 13-to-5 majority in Ohio’s delegation to the U.S. House. After redistricting in 2013, Republicans held 12 of Ohio’s House seats while Democrats held four. As is typical in these races, few were competitive; the average margin of victory was 32 points. Is this simply a result of demography, the need to create a majority-minority district, and the constraints traditional redistricting principles impose on election lines—or did the legislature intend to create a gerrymander?

To find out… read more at TechTank


May 09, 10:01am

My collaborator, Dr. Mercè Crosas, who is Director of Data Science at the Institute for Quantitative Social Science (IQSS) at Harvard, presented this talk as part of the Program on Information Science Brown Bag Series.

The Dataverse software provides multiple workflows for data publishing to support a wide range of data policies and practices established by journals, as well as data sharing needs from various research communities. This talk will describe these workflows from the user experience and from the system’s technical implementation.
Dr. Crosas discussed the portfolio of tools being developed at IQSS, including Dataverse, DataTags, Rbuild, TwoRavens, and Zelig. (The Program on Information Science is pleased to be collaborating in some of these efforts, including a project to integrate Dataverse and OJS, and to develop a wide set of privacy tools that connect with DataTags — see the Program project portfolio.) She argued that data publishing should be based on three ‘pillars’:
  1. Trusted data repositories that guarantee long-term access
  2. Mechanisms for formal data citation
  3. Sufficient information to understand and reuse the data (e.g. metadata, documentation, code)

It is interesting to consider the extent to which current tools and practices support these pillars: there have been many recent efforts focused on trusted repositories and data citation, and many of the projects described by Dr. Crosas promise to advance the state of the practice in these areas.
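As a small illustration of how the first two pillars are exposed to machines today, the sketch below resolves a dataset’s persistent identifier to citation-level metadata through the Dataverse native API. The API path and JSON layout reflect my reading of the Dataverse API guide, and the DOI is a placeholder, so treat this as an assumption-laden sketch rather than a recipe.

```python
# Sketch only: API path and JSON layout are assumptions to check against the
# Dataverse API guide; the DOI below is a placeholder, not a real dataset.
import json
import urllib.request

SERVER = "https://dataverse.harvard.edu"   # a Dataverse installation
DOI = "doi:10.5072/FK2/EXAMPLE"            # placeholder persistent identifier
url = f"{SERVER}/api/datasets/:persistentId/?persistentId={DOI}"

with urllib.request.urlopen(url) as response:
    dataset = json.load(response)["data"]

# Collect the citation-block metadata fields of the latest version.
version = dataset["latestVersion"]
citation_fields = {field["typeName"]: field["value"]
                   for field in version["metadataBlocks"]["citation"]["fields"]}
print(citation_fields.get("title"))
print(dataset.get("persistentUrl"))
```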

Determining what constitutes sufficient information for understanding and reuse, and making that information easy to extract from the research process, to record in structured ways, and to expose and present to different audiences, seems to be an area with many opportunities and few broad solutions. It is also interesting to consider the extent to which the workflows for publishing data integrate with the workflows for managing data during research, prior to publication; I discuss some of these tools and integration points in a previous post.


May 01, 5:10pm

More and more frequently, research and scholarly publications involve collaboration, and co-authorship in most scholarly fields is rapidly increasing in frequency and complexity. Yet co-authorship patterns vary across disciplines; the interpretation of author ordering is fraught with ambiguity; and richer textual descriptions of authorship are uncommon, and difficult to interpret, aggregate, and evaluate at scale.

There is a growing interest across stakeholders in scholarship in increasing the transparency of research contributions and in identifying the specific roles that contributors play in creating scholarly works. Understanding and accommodating emerging forms of co-authorship is critical for managing intellectual property, publication ethics, and effective evaluation.

The slides below provide an overview of the research questions in this area, and were originally presented as part of the Program on Information Science Brown Bag Series. Our upcoming panel at the Society for Scholarly Publishing will explore these issues further.

 


Apr 17, 11:26am

Whose articles cite a body of work? Is this a high-impact journal? How might others assess my scholarly impact? Citation analysis is one of the primary methods used to answer these questions. Academics, publishers, and funders often study the patterns of citations in the academic literature in order to explore the relationships among researchers, topics, and publications, and to measure the impact of articles,  journals, and individuals.

MIT has a wonderful tradition of offering a variety of short courses during the Winter hiatus, known as IAP. The MIT Libraries generally offers dozens of courses on data and information management (among other topics) during this time. Moreover, the Libraries also hold bonus IAP sessions in April and July.

For IAPril, the Program on Information Science has introduced a new course on citation and bibliometric analysis. The course provides a review of citation and altmetric data; best-of-class free/open tools for analysis; and classes of measures.
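As a taste of the “classes of measures” covered, here is a minimal sketch (my own illustration, not course material) of one of the simplest bibliometric measures, the h-index, computed from a list of invented citation counts.

```python
# Compute an h-index: the largest h such that h works each have >= h citations.
def h_index(citation_counts):
    counts = sorted(citation_counts, reverse=True)
    h = 0
    for rank, cites in enumerate(counts, start=1):
        if cites >= rank:
            h = rank          # the top `rank` works each have >= rank citations
        else:
            break
    return h

# Example: citation counts for nine hypothetical papers.
print(h_index([45, 22, 17, 9, 8, 6, 3, 1, 0]))  # -> 6
```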

The full (and updated) slides are below:

 

In addition, a short summary of MIT resources, as presented to library liaisons, is available here:

Registration is available on the class site:

 informatics.mit.edu/classes/overview-citation-analysis


Apr 09, 3:53pm

I was honored to present some of our research at a panel on election reform at this year’s Harvard Law and Policy Review symposium.

As a summary of this, the Harvard Law and Policy Review’s Notice and Comment Blog has published a concise summary of recommendations for redistricting reform by my collaborator Michael McDonald and me, entitled:

Create Real Redistricting Reform through Internet-Scale Independent Commissions

To quote from the HLPR summary:

Twenty-first century redistricting should incorporate  transparency at internet speed and scale  – open source, open data, open process (see here for in-depth recommendations) — and twenty-first century redistricting should incorporate internet technology for twenty-first century participation: direct access to the redistricting process; access to legal-strength mapping tools; and the integration of  crowd-sourcing to create maps, identify communities and neighborhoods, collect and correct data, and gather and analyze public commentary.

There are few policy arenas where the public can fashion legitimate proposals that rival what their elected officials enact. Redistricting is among them, so why not enable greater public participation in this critical democratic process?

Read the rest of these recommendations on the Harvard Law and Policy Review — Notice and Comment site



(And on a related topic, this previous post summarizes some of our research on crowd-sourced mapping for open government.)


Mar 31, 10:56am

Big data has huge implications for privacy, as summarized in our commentary below:

Both the government and third parties have the potential to collect extensive (sometimes exhaustive), fine grained, continuous, and identifiable records of a person’s location, movement history, associations and interactions with others, behavior, speech, communications, physical and medical conditions, commercial transactions, etc. Such “big data” has the ability to be used in a wide variety of ways, both positive and negative. Examples of potential applications include improving government and organizational transparency and accountability, advancing research and scientific knowledge, enabling businesses to better serve their customers, allowing systematic commercial and non-commercial manipulation, fostering pervasive discrimination, and surveilling public and private spheres.

On January 23, 2014, President Obama asked John Podesta to develop, in 90 days, a ‘comprehensive review’ on big data and privacy.

This led to a series of workshops: on big data and technology at MIT, on social, cultural & ethical dimensions at NYU, and a third planned to discuss legal issues at Berkeley. A number of colleagues from our Privacy Tools for Research project and from the BigData@CSAIL projects have contributed to these workshops and raised many thoughtful issues (and the workshop sessions are online and well worth watching).

EPIC, ARL, and 22 other privacy organizations requested an opportunity to comment on the report, and OSTP later allowed for a 27-day commentary period during which brief comments would be accepted by e-mail. (Note that the original RFI provided by OSTP is, at the time of this writing, a broken link, so we have posted a copy.) They requested that commenters provide specific answers to five extraordinarily broad questions:

  1. What are the public policy implications of the collection, storage, analysis, and use of big data? For example, do the current U.S. policy framework and privacy proposals for protecting consumer privacy and government use of data adequately address issues raised by big data analytics? 
  2. What types of uses of big data [are most important]… could measurably improve outcomes or productivity with further government action, funding, or research? What types of uses of big data raise the most public policy concerns? Are there specific sectors or types of uses that should receive more government and/or public attention?
  3. What technological trends or key technologies will affect the collection, storage, analysis and use of big data? Are there particularly promising technologies or new practices for safeguarding privacy while enabling effective uses of big data?
  4. How should the policy frameworks or regulations for handling big data differ between the government and the private sector? Please be specific as to the type of entity and type of use (e.g., law enforcement, government services, commercial, academic research, etc.). 
  5. What issues are raised by the use of big data across jurisdictions, such as the adequacy of current international laws, regulations, or norms? 

My colleagues at the Berkman Center, David O’Brien, Alexandra Woods, and Salil Vadhan, and I have submitted responses to these questions that outline a broad, comprehensive, and systematic framework for analyzing these types of questions and taxonomize a variety of modern technological, statistical, and cryptographic approaches to simultaneously providing privacy and utility. This comment is made on behalf of the Privacy Tools for Research Project, of which we are a part, and has benefitted from extensive commentary by the other project collaborators.

Much can be improved in how big data is currently treated. To summarize (quoting from the conclusions of the comment):

Addressing privacy risks requires a sophisticated approach, and the privacy protections currently used for big data do not take advantage of advances in data privacy research or the nuances these provide in dealing with different kinds of data and closely matching sensitivity to risk.

I invite you to read the full comment.


Mar 27, 5:45pm

Scholarly publishers, research funders, universities, and the media, are increasingly scrutinizing research outputs. Of major concern is the integrity, reliability, and extensibility of the evidence on which published findings are based. A flood of new funder mandates, journal policies, university efforts, and professional society initiatives aim to make this data verifiable, reliable, and reusable: If “data is the new oil”, we need data management to prevent ‘fires’, ensure ‘high-octane’, and enable ‘recycling’.

In March, I had the pleasure of being the inaugural speaker in a new lecture series (http://library.wustl.edu/research-data-testing/dss_speaker/dss_altman.html) initiated by the Libraries at Washington University in St. Louis — dedicated to the topics of data reproducibility, citation, sharing, privacy, and management.

In the presentation embedded below, I provide an overview of the major categories of new initiatives to promote research reproducibility, reliability, and reuse, and the related state of the art in informatics methods for managing data.

This blog post provides some wider background for the presentation, and a recap of its recommendations. The approaches can be roughly divided into three categories. The first approach focuses on tools for reproducible computation, ranging from “statistical documents” (incorporating Knuth’s [1992] concept of literate programming) to workflow systems and reproducible computing environments [for example, Buckheit & Donoho 1995; Schwab et al. 2000; Leisch & Rossini 2003; Deelman & Gil 2006; Gentleman & Temple-Lang 2007]. With few exceptions [notably, Freire et al. 2006] this work focuses primarily on “simple replication” or “reproduction” – replicating exactly a precise set of results from an exact copy of the original data made at the time of research.
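To give a flavor of the “simple replication” idea, the sketch below (my own illustration, not one of the systems cited above) records just enough provenance (a fingerprint of the input data, the random seed, and the software environment) that a result can be re-derived exactly. The data values and the analysis are invented for the example.

```python
# Sketch of "simple replication": record enough provenance (data fingerprint,
# random seed, environment) that a result can be re-derived exactly.
import hashlib
import json
import platform
import random
import statistics

# Stand-in for the study's input data (in practice, read from an archived file).
data = [2.3, 4.1, 3.7, 5.0, 4.4, 2.9, 3.3]

def fingerprint(values):
    # Hash a canonical serialization of the data so any change is detectable.
    return hashlib.sha256(json.dumps(values).encode("utf-8")).hexdigest()

def analysis(values, seed=20140327):
    random.seed(seed)                      # fix all stochastic steps
    resamples = [statistics.mean(random.choices(values, k=len(values)))
                 for _ in range(1000)]     # toy bootstrap of the mean
    return statistics.mean(resamples)

replication_record = {
    "result": analysis(data),
    "seed": 20140327,
    "data_sha256": fingerprint(data),
    "python_version": platform.python_version(),
}
# Archiving this record with the code and data permits exact re-execution,
# though, as noted above, it does not capture the context needed for reuse.
print(json.dumps(replication_record, indent=2))
```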

Current leading examples of tools that support reproducible computation include:

 

The second approach focuses on data sharing methods and tools [see, for example, Altman et al. 2001; King 2007; Anderson et al. 2007; Crosas 2011]. [1] This approach focuses more generally on helping researchers to share — both for replication and for broader reuse – including secondary uses and use in teaching. Increasingly, work in this area [e.g., Gutmann 2009; Altman & King 2007] focuses on issues of enabling long-term and interdisciplinary access to data; this requires that the researchers’ tacit knowledge about data formats, measurement, structure, and provenance be more explicitly documented.

Current leading examples of informatics tools that support data sharing include:

 

The third approach focuses on the norms, practices, and licensing associated with data sharing, archiving, and replication, and the related incentives embedded in scholarly communication [Pienta 2007; Hamermesh 2007; Altman & King 2007; King 2007; Hedstrom et al. 2008; McCullough 2009; Stodden 2009]. This approach seeks to create the necessary conditions for data sharing and reuse, and to examine and align incentives around citation, data sharing, and peer review so as to encourage replicability and reusability.

Current leading examples of informatics tools that support richer citation, evaluation, open science, and review include:

 

Many Tools, Few Solutions

In this area, there are many useful tools, but few complete solutions – even for a specialized community of practice. All three approaches are useful, and several general observations can be made about them. First, tools for replicable research such as VisTrails, MyExperiment, Wings, and StatDocs are characterized by their use of a specific, controlled software framework and their ability to facilitate near-automatic replication. The complexity of these tools, and their small user and maintenance base, means that we cannot rely on them to exist and function in five to ten years – they cannot ensure long-term access. Because they focus only on results, and not on capturing practices, descriptive metadata, and documentation, they allow exact replication without providing the contextual information necessary for broader reuse. Finally, these tools are heterogeneous across subdisciplines and largely incompatible, so they do not as yet offer a broadly scalable solution.

Second, tools and practices for data management have the potential to broadly increase data sharing and the impact of related publications. However, although these tools are becoming easier to use, they still require extra effort from the researcher. Moreover, since this additional effort often comes near (or past) the conclusion of the main research project (and only after acceptance of an article and preparation for final publication), it is perceived as a burden, and often honored in the breach.

Third, incentives for replication have been weak in many disciplines – and journals are a key factor. The reluctance of journal editors to publish articles reporting either confirming or non-confirming replications weakens authors’ incentives to create replicable work. Lack of formal provenance and attribution practices for data also weakens accountability, raises barriers to conducting replication and reuse, reduces incentives to disseminate data for reuse, and increases the ambiguity of replication studies, making them difficult to study.

Furthermore, new forms of evidence complicate replication and reuse. In most scientific disciplines, the amount of data potentially available for research is increasing non-linearly. In addition, changes in technology and society are greatly affecting the types and quantities of potential data available for scientific analysis, especially in the social sciences. This presents substantial challenges to the future replicability and reusability of research. Traditional data archives currently consist almost entirely of numeric tabular data from noncommercial sources. New forms of data differ from tabular data in size, format, structure, and complexity. Left in its original form, this sort of data is difficult or impossible for scholars outside the project that generated it to interpret and use. This is a barrier to integrative and interdisciplinary research, but also a significant obstacle to providing long-term access, which becomes practically impossible as the tacit knowledge necessary to interpret the data is forgotten. Enabling broad use and securing long-term access requires more than simply storing the individual bits of information – it requires establishing and disseminating good data management practices. [Altman & King 2007]

How research libraries can jump-start the process.

Research libraries should consider at least three steps:

1. Create a dataverse hosted by the Harvard Dataverse Network (http://thedata.harvard.edu/dvn/faces/login/CreatorRequestInfoPage.xhtml). This provides free, permanent storage and dissemination, with bit-level preservation insured by Harvard’s endowment. The dataverse can be branded, curated, and controlled by the library – so it enables libraries to maintain relationships with their patrons and to provide curation services, with minimal effort. (And since DVN is open-source, a library can always move from the hosted service to one it runs itself.)

2. Link to DMPTool (https://dmp.cdlib.org/) from your library’s website. And consider joining DMPTool as an institution – especially if you use Shibboleth (Internet2) to authorize your users. You’ll be in good company — according to a recent ARL survey, 75% of ARL libraries are now at least linking to DMPTool. Increasing researchers’ use of DMPTool provides early opportunities for conversations with libraries around data, enables libraries to offer services at a time when they are salient to the researcher, and provides information that can be used to track and evaluate data management planning needs.

3. Publish a “libguide” focused on helping researchers get more credit for their work. This is a subject of intense interest, and the library can provide information about trends and tools in this area of which researchers (especially junior researchers) may not be aware. Some possible topics to include: data citation (e.g., http://www.force11.org/node/4769); researcher identifiers (e.g., http://orcid.org); and impact metrics (http://libraries.mit.edu/scholarly/publishing/impact).


References

 

Altman, M., L. Andreev, M.  Diggory, M. Krot, G. King, D. Kiskis, A. Sone, S. Verba,  A Digital Library for the Dissemination and Replication of Quantitative Social Science Research, Social Science Computer Review 19(4):458-71. 2001.

Altman, M. and G. King. “A Proposed Standard for the Scholarly Citation of Quantitative Data”, D-Lib Magazine 13(3/4).  2007.

Anderson, R., W. H. Greene, B. D. McCullough, and H. D. Vinod. “The Role of Data/Code Archives in the Future of Economic Research,” Journal of Economic Methodology. 2007.

Buckheit, J. and D.L. Donoho, WaveLab and Reproducible Research, in A. Antoniadis (ed.) Wavelets and Statistics, Springer-Verlag. 1995.

Crosas, M., The Dataverse Network®: An Open-Source Application for Sharing, Discovering and Preserving Data, D-lib Magazine 17(1/2). 2011.

Hamermesh, D.S., “Viewpoint: Replication in Economics,” Canadian Journal of Economics. 2007.

Deelman, E., and Y. Gil (Eds.). Final Report on Workshop on the Challenges of Scientific Workflows. 2006. <http://vtcpc.isi.edu/wiki/images/b/bf/NSFWorkflow-Final.pdf>

Freire, J., C. T. Silva, S. P. Callahan, E. Santos, C. E. Scheidegger, and H. T. Vo. Managing rapidly-evolving scientific workflows. In International Provenance and Annotation Workshop (IPAW), LNCS 4145, 10-18, 2006.

Gentleman R., R. Temple Lang. Statistical Analyses and Reproducible Research,  Journal of Computational and Graphical Statistics 16(1): 1-23. 2007.

Gutmann M., M. Abrahamson, M. Adams, M. Altman, C.  Arms,  K.  Bollen, M. Carlson, J. Crabtree, D. Donakowski, G. King, J. Lyle, M. Maynard, A. Pienta, R. Rockwell, L. Timms-Ferrara, C. Young,  “From Preserving the Past to Preserving the Future: The Data-PASS Project and the challenges of preserving digital social science data”, Library Trends 57(3):315-337. 2009.

Hedstrom, Margaret, Jinfang Niu, and Kaye Marz. “Incentives for Data Producers to Create ‘Archive-Ready’ Data: Implications for Archives and Records Management”, Proceedings of the Society of American Archivists Research Forum. 2008.

King, G. “An Introduction to the Dataverse Network as an Infrastructure for Data Sharing.” Sociological Methods and Research, 32(2), 173–199. 2007.

Knuth, D.E., Literate Programming, CSLI Lecture Notes 27. Center for the Study of Language and Information, Stanford, CA. 1992.

Leisch F., and A.J. Rossini, Reproducible Statistical Research, Chance 16(2): 46-50. 2003.

McCullough, B.D., Open Access Economics Journals and the Market for Reproducible Economic Research, Economic Analysis & Policy 39(1). 2009.

Pienta, A., LEADS Database Identifies At-Risk Legacy Studies, ICPSR Bulletin 27(1) 2006.

Schwab, M., M. Karrenbach, and J. Claerbout, Making Scientific Computations Reproducible, Computing in Science and Engineering 2: 61-67. 2000.

Stodden, V., The Legal Framework for Reproducible Scientific Research: Licensing and Copyright, Computing in Science and Engineering 11(1):35-40. 2009.


[1] Also see for example the CRAN reproducible research task view: http://cran.r-project.org/web/views/ReproducibleResearch.html;  and the Reproducible Research tools page: http://reproducibleresearch.net/index.php/RR_links#Tools

