Aug 05, 8:21pm

Margaret Purdy is a Graduate Research Intern in the Program on Information Science, researching the area of library privacy.


Building Trust: A Primer on Privacy for Librarians

Privacy Protections Build Mutual Trust Between Patrons and Librarians

Librarians have accepted privacy as a central tenet of their professional ethics and responsibilities for nearly eight decades. As of 2017, privacy as a human right has been simultaneously strengthened and reaffirmed, defended and rebuffed, yet we as librarians rarely take the time to step back and ask why privacy truly matters and what we can do to protect it.

The American Library Association and the International Federation of Library Associations have both asserted that patrons have the right to privacy while seeking information.1 The ALA in particular ties privacy to intellectual freedom: the ability of patrons to consume information knowing they will not face repercussions such as punishment or judgment based on what they read. Librarians are in the business of disseminating information in order to stimulate knowledge growth. One major stimulus for such growth is the mutual trust between the library and the patron: trust that the patron will not use the knowledge in a destructive way, and trust that the library will not judge the patron for information interests. Ensuring patron privacy is one way for the library to prove that trust. Similarly, the IFLA2 emphasizes the right to privacy in its ethics documentation. In addition to the rights of patron privacy that the ALA ensures, the IFLA also calls for as much transparency as possible into “public bodies, private sector companies and all other institutions whose activities effect [sic] the lives of individuals and society as a whole.” This is yet another way to establish trust between the library and its patrons, ultimately ensuring intellectual freedom and growth of knowledge.

Globally, internet privacy and surveillance are also receiving much more notice and debate, and government regulations, such as the EU General Data Protection Regulation (GDPR)3, are working to strengthen individuals’ abilities to control their own data and ensure it does not end up being used against them. The GDPR is slated to go into effect in 2018 and will broadly protect the data privacy rights of EU citizens. It will certainly be a policy to watch, especially as a litmus test for how effective major legislation can be in asserting privacy protections. Of more practical importance, the GDPR protects EU citizens even if the party collecting the data is outside the EU. This will potentially affect many libraries across the United States and the world at large, as an added level of awareness is required to ensure that any collaboration with or service to EU citizens is properly protected.

Libraries Face a Double-Barreled Threat from Government Surveillance and Corporate Tracking

In addition to the ALA and IFLA codes of ethics, which commit librarians to protecting patrons’ rights to privacy, multiple governmental codes deal with the right to information privacy. In the United States, the Fourth Amendment protects the right to remain free from unreasonable searches and seizures, and has often been cited as a protection of privacy. Similarly, federal legislation such as FERPA, which protects the privacy rights of students, and HIPAA, which protects medical records, has reasserted that privacy is a vital right. Essentially every US state also has some provisions about privacy, many of which directly relate to the right to privacy in library records.4

However, in recent years, many of the federal government’s protections have begun to slip away. Immediately after 9/11, the USA PATRIOT Act passed, allowing the government much broader abilities to track patron library records. More recently, as digital information became easier to track, programs such as PRISM and other forms of governmental tracking arose. Both directly threaten the ability of library patrons to conduct research, seek information, and more in privacy.

Businesses have also learned ways of tracking their users’ behaviors online, and using that data for practices such as targeted advertising. While the vast majority of this data is encrypted and could not be easily brought back to personally-identifiable information, it is still personal data that is not necessarily kept in the most secure way possible. And while breaches do happen, even without them, it is not out of the question for an experienced party to be able to reconstruct an individual from the data collected, and to know not only that individual’s browsing history and location, but also potentially information such as health conditions, bank details, or other sensitive information.

While this information is often used for simple outreach, including Customer Relationship Marketing, where a company will recommend new products based on previous purchases, it can also be used in more invasive ways. In 2012, Target sent out a promotional mailing containing deals on baby products to a teenage girl.5 Based on the data it had tracked about her purchases, its algorithm had determined, correctly, that she was highly likely to be pregnant. While this story received extensive media attention, businesses of all types, including retailers, hotels, and even healthcare systems, participate in similar practices, using data to personalize the experience. However, when stored irresponsibly, this data can lead to unintentional and unwanted sharing of information, potentially including embarrassing web browsing or shopping habits, dates when homes will be empty (useful to thieves), medical conditions that could increase insurance rates, and more.

Growing Public Concern

One of the most pressing risks to privacy protections currently is user behavior and expectations. With the information industry becoming much more digital, information is becoming easier to access, spread, and consume. However, the tradeoff is that users, and the information they view, are much easier to track, by both corporate and government entities, friendly or malicious. Plus, because much of the tracking and surrendering of privacy, including saved passwords, CRM, targeted algorithms, and more, makes it more convenient to browse the internet, many patrons willingly give up the right to privacy in favor of convenience.

A recent poll6 showed that between 70% and 80% of internet users are aware that practices such as saving passwords, agreeing to privacy policies and terms of use without reading them, and accepting free information in exchange for advertising or data surrendering are risks to privacy. However, a large majority of users still participate in those practices. There are several theories as to what causes users to agree to forgo privacy, including the idea that accepting the risks makes browsing the internet much more convenient, and users are hesitant to give up that convenience. Another theory is that there really is no alternative to accepting the risks. Many sites will not allow use without acceptance of the terms of use and/or privacy policy. A 2008 study7 calculated how much time users would spend reading privacy policies were they to actually read all of them, and found that, on average, a user would spend nearly two weeks a year just reading policies, not to mention the time taken to fully understand the legalese and complicated implications.

Another similar poll8 shows that more than half of Americans are concerned about privacy risks, and over 80% have taken some precautionary action. However, most of that 80% are unaware of what more they can do to protect themselves. This is true for both government surveillance and corporate tracking. The public has similar levels of awareness and concern about both, but is unaware of how to better protect itself, and thus is more likely to allow it to happen.

Best Practices for Librarians


Given the increasing public concern and awareness, as well as the longstanding history of librarians’ focus on privacy, librarians have a perfect opportunity to intervene and re-establish the trust from users that their information will not be shared and to meet the professional ethical model of always protecting privacy. There are nearly endless resources that can outline in great detail what librarians should do to defend their patrons against attacks on privacy, whether that comes from government surveillance or corporate tracking. Some of these involve systematic evaluations of all touchpoints in the library and recommendations for implementing best practices. These exist even for areas that do not seem like obvious ways for privacy to be violated, such as anti-theft surveillance on surrounding buildings, or through third-party content vendors.

By dedicating library resources to systematically check for privacy practices, librarians can take some of the burden of inconvenience off of the individual patron. Many of these best practices involve taking the time to change computer settings, read and understand privacy policies, and negotiate with vendors, which few, if any, individuals would do on their own. With the muscle of the library working on it, though, the patrons will still benefit, without needing to dedicate the same amount of time. This serves a dual function as well, as in addition to actual steps to protect patrons, librarians can also serve as an educational resource to help patrons learn simple steps to take to protect their personal systems.

Some examples of protective moves are to create policies on library computers that ensure that as little information from user sessions as possible is saved. There are several incredibly simple steps that, while they reduce convenience slightly, ensure users a safe and private experience. These include settings that clear cookies, the cache, and user details after each session (similar to a browser’s “incognito mode”), and the clearing of patron checkout records once the book is returned.
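To make the last of these steps concrete, here is a minimal sketch in Python of what "clearing the checkout record on return" could look like. It assumes a hypothetical in-memory ledger; the class name, method names, and identifiers are invented for illustration and are not the interface of any actual library system. The point is only that the link between patron and item exists while the loan is active and disappears at check-in, leaving an anonymous count behind.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class CheckoutLedger:
    """Hypothetical in-memory ledger, not the API of any real library system."""

    # item_id -> patron_id, held only while the item is out on loan
    active_loans: Dict[str, str] = field(default_factory=dict)
    # item_id -> completed loan count, with no patron identity attached
    circulation_counts: Dict[str, int] = field(default_factory=dict)

    def check_out(self, item_id: str, patron_id: str) -> None:
        if item_id in self.active_loans:
            raise ValueError(f"{item_id} is already on loan")
        self.active_loans[item_id] = patron_id

    def check_in(self, item_id: str) -> None:
        if item_id not in self.active_loans:
            raise ValueError(f"{item_id} is not on loan")
        # Drop the patron-to-item link as soon as the item is returned.
        del self.active_loans[item_id]
        # Keep only an anonymous count for collection statistics.
        self.circulation_counts[item_id] = self.circulation_counts.get(item_id, 0) + 1

    def items_held_by(self, patron_id: str) -> List[str]:
        # Only current loans can be listed; past loans are not recoverable.
        return sorted(i for i, p in self.active_loans.items() if p == patron_id)


ledger = CheckoutLedger()
ledger.check_out("book-001", "patron-42")
ledger.check_in("book-001")
```

After the return, the ledger can still report that the book circulated once, but it can no longer say who borrowed it. A real deployment would presumably also need to consider copies of the same link held in logs, backups, and vendor systems, which this sketch does not address.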

In addition to those tweaks, the ALA and LITA offer checklists of privacy best practices to systematically implement in libraries. These cover everything from data exchanges, OPACs and patron borrowing records, protection for children, and more in great detail. NISO also provides overarching design principles for approaching library privacy in a digital age. Additionally, there are recommended security audits, many of which Bruce Shuman mentions in his book, Library Security and Safety Handbook: Prevention, Policies, and Procedures.

Additionally, the library, already known for educational programs and community-oriented programming, could serve as a location to educate the public about the real risks of tracking and surveillance. There is a definite gap between the public’s awareness of the risks and the public’s action to mitigate those risks. While librarians cannot force behavior, and most would not want to, offering patrons trustworthy information about the risks and how to avoid them in their personal browsing experiences helps re-establish privacy as a core value and gives patrons a reason to trust the library. A recent post from Nate Lord at Digital Guardian offers both simple and more in-depth steps that patrons can take to ensure their digital information is secure. If a library offered some of these in a training course or as a takeaway, it could serve as a valuable resource in narrowing the gap between patron awareness and activity.

Ultimately, privacy is one of those words that many people pay lip service to, but without a full understanding of the risks and consequences, the motivation to give up convenience in order to protect privacy is not always there. However, we as librarians, who value privacy as one of the profession’s core tenets, have a real opportunity to help protect patrons’ data against these threats. Resources, such as the aforementioned privacy checklists and audit guides, exist to help librarians ensure their library is in compliance with the current best practices. The threats against privacy are growing, and librarians are well-suited to intervene and ensure patron protection.

Recommended Resources



1. ALA Code of Ethics. (1939).

2. IFLA Code of Ethics.

3. GDPR Portal (2016).

4. Adams, H. et al. (2005). Privacy in the 21st century. Westport, Conn.: Libraries Unlimited.

5. Hill, K. (2012). How Target Figured Out A Teen Girl Was Pregnant Before Her Father Did.

6. Ayala, D. (2017). Security and Privacy for Libraries in 2017. Online Searcher, 41(3).

7. Cranor, L. (2008). The Cost of Reading Privacy Policies. I/S: A Journal Of Law And Policy For The Information Society.

8. Rainie, L. (2017). The state of privacy in post-Snowden America. Pew Research Center.

Jul 17, 10:16pm

Alex Chassanoff is a Postdoctoral Fellow in the Program on Information Science and continues a series of posts on software curation.

As I described in my first post, an initial challenge at MIT Libraries was to align our research questions with the long-term collecting goals of the institution. As it happens, MIT Libraries had spent the last year working on a task force report to begin to formulate answers to just these sorts of questions. In short, the task force envisions MIT Libraries as a global platform for scholarly knowledge discovery, acquisition, and use. Such goals may at first appear lofty. However, the acquisition of knowledge through public access to resources has been a central organizing principle of libraries since their inception. In his opening statement at the first national conference of librarians in 1853, Charles Coffin Jewett proclaimed, “We meet to provide for the diffusion of a knowledge of good books and for enlarging the means of public access to them.” [1]

Archivists and professionals working in special collections have long been focused on providing access to, and preservation of, local resources at their institutions. What is perhaps most distinctive about the past decade is the broadened institutional focus on locally-created content. This shift in perspective towards looking inwards is a trend noted by Lorcan Dempsey, who describes it as follows:

In the inside-out model, by contrast, the university, and the library, supports resources which may be unique to an institution, and the audience is both local and external. The institution’s unique intellectual products include archives and special collections, or newly generated research and learning materials (e-prints, research data, courseware, digital scholarly resources, etc.), or such things as expertise or researcher profiles. Often, the goal is to share these materials with potential users outside the institution.[2]

Arguably, this shift in emphasis can be attributed to the affordances of the contemporary networked research environment, which has broadened access to both resources and tools. Archival collections previously considered “hidden” have been made more accessible for historical research through digitization. Scholars are also able to ask new kinds of historical questions using aggregate data, and answer historical questions in new kinds of ways.

This raises the question: what unique and/or interesting content does an institution with a rich history of technology and innovation already have in its possession?

Exploring Software in MIT Collections

As a research institution, MIT has played a fundamental role in the development and history of computing. Since the 1940s, the Institute has excelled in the creation and production of software and software-based artifacts. Project Whirlwind, Sketchpad, and Project MAC are just a few of the monumental research computing projects conducted here. As such, the Institute Archives & Special Collections has over time acquired a significant number of materials related to software developed at MIT.

In our quest to understand how software may be used (and made useful) as an institutional asset, we engaged in a two-pronged approach. First, we aimed to identify the types of software that MIT might provide access to. Second, we aimed to understand more about the active practices of researchers creating, using, and/or reusing software. What function or purpose was software being created, used, and/or reused for? We thought that framing our research in this way might help us develop a robust understanding of both existing practices and potential user needs. At the same time, we also recognized that identifying and exposing potential "pain points" could guide and inform future curation strategies. After an initial period of exploratory work, we identified representative software cases found in various pockets across the MIT campus.

Collection #1: The J.C.R. Licklider Papers and the GRAPPLE software

Materials in the J.C.R. Licklider Papers were first acquired by the Institute Archives and Special Collections in 1996. Licklider was a psychologist and renowned computer scientist who came to MIT in 1950. He is widely hailed as an influential figure for his visionary ideas around personal computing and human-computer interaction.

In my exploration of archival materials, I looked specifically at boxes 13-18 in the collection, which contained documentation about GRAPPLE, a dynamic graphical programming system developed while Licklider was at the MIT Laboratory for Computer Science. According to the user manual, the focus of GRAPPLE was on “the development of a graphical form of a language that already exists as a symbolic programming language.” [3] Programs could be written using computer-generated icons and then monitored by an interpreter.


Figure 1. Folder view, box 16, J.C.R. Licklider Papers, 1938-1995 (MC 499), Institute Archives and Special Collections, MIT Libraries, Cambridge, Massachusetts.


Materials in the collection related to GRAPPLE include:

  • Printouts of GRAPPLE source code
  • GRAPPLE program description
  • GRAPPLE interim user manual
  • GRAPPLE user manual
  • GRAPPLE final technical report
  • Undated and unidentified computer tapes
  • Assorted correspondence between Licklider and the Department of Defense

Each of the documents has multiple versions included in the collection, typically distinguished by date and filename (where visible). The printouts of GRAPPLE source code totaled around forty pages. The computer tapes have not yet been formatted for access.

While the software may be cumbersome to access on existing media, the materials in the collection contain substantial amounts of useful information about the function and nature of software in the early 1980s. Considering the documentation related to GRAPPLE in different social contexts helped to illuminate the value of the collection in relationship to the history of early personal computing.

Historians of programming languages would likely be interested in studying the evolution of the coding syntax contained in the collection. The GRAPPLE team used the now-defunct programming language MDL (which stands for “More Datatypes than Lisp”); the extensive documentation provides examples of MDL “in action” through printouts of code packages.


Figure 2. Computer file printout, “eraser.mud.1”, 31 May 1983, box 14, J.C.R. Licklider Papers, 1938-1995 (MC 499), Institute Archives and Special Collections, MIT Libraries, Cambridge, Massachusetts.


The challenges facing the GRAPPLE team at the time of coding and development would be interesting to revisit today. One obstacle to successful implementation noted by the team was the limitations of existing graphical display environments. In their final technical report on the project from 1984, the GRAPPLE team describes the potential of desktop icons for identifying objects and their representational qualities.

Our conclusion is that icons have very significant potential advantages over symbols but that a large investment in learning is required of each person who would try to exploit the advantages fully. As a practical matter, symbols that people already know are going to win out in the short term over icons that people have to learn in applications that require more than a few hundred identifiers. Eventually, new generations of users will come along and learn iconic languages instead of or in addition to symbolic languages, and the intrinsic advantages of icons as identifiers (including even dynamic or kinematic icons) will be exploited. [4]

Some fundamental dynamics in the study of human-computer interaction remain relatively unchanged despite advances in technology; namely, the powerful relationship between representational symbols and the production of knowledge/knowledge structures. What might it look like to bring to life today software that was conceived in the early days of personal computing? Such aspirations are certainly possible. Consider the journey of the Apollo 11 source code, which was transcribed from digitized code printouts and then put onto GitHub. One can even simulate the Apollo missions using a virtual Apollo Guidance Computer (AGC).

Other collection materials also offer interesting documentation of early conceptions of personal computing while also providing clear evidence that computer scientists such as Licklider regarded abstraction as an essential part of successful computer design. A pamphlet entitled “User Friendliness–And All That” notes the “problem” of mediating between “immediate end users” and “professional computer people” to successfully aid in a “reductionist understanding of computers.”

Figure 3. Pamphlet, “User friendliness-And All That”, undated, box 16, J.C.R. Licklider Papers, 1938-1995 (MC 499), Institute Archives and Special Collections, MIT Libraries, Cambridge, Massachusetts.

These descriptions are useful for illuminating how software was conceived and designed to be a functional abstraction. Such revelations may be particularly relevant in the current climate, where debates over algorithmic decision making are rampant. As the new media scholar Wendy Chun asks, “What is software if not the very effort of making something intangible visible, while at the same time rendering the visible (such as the machine) invisible?” [5]



Building capacity for collecting software as an institutional asset is difficult work. Expanding collecting strategies presents conceptual, social, and technical challenges that crystallize once scenarios for access and use are envisioned. For example, when is software considered an artifact ready to be “archived and made preservable”? What about research software developed and continually modified over the years in the course of ongoing departmental work? Are printouts of source code software? How do code repositories like GitHub fit into the picture? Should software only be considered as such in its active state of execution? Interesting ontological questions surface when we consider the boundaries of software as a collection object.

Archivists and research libraries are poised to meet the challenges of collecting software. By exploring what makes software useful and meaningful in different contexts, we can more fully envision potential future access and use scenarios. Effectively characterizing software in its dual role as both artifact and active producer of artifacts remains an essential piece of understanding its complex value.



[1] “Opening Address of the President.” Norton’s Literary Register And Book Buyers Almanac, Volume 2. New York: Charles B. Norton, 1854.

[2] Dempsey, Lorcan. “Library Collections in the Life of the User: Two Directions.” LIBER Quarterly 26, no. 4 (2016): 338–359. doi:

[3]  GRAPPLE Interim User Manual, 11 October 1981, box 14, J.C.R. Licklider Papers, 1938-1995 (MC 499), Institute Archives and Special Collections, MIT Libraries, Cambridge, Massachusetts.

[4] Licklider, J.C.R. Graphical Programming and Monitoring Final Technical Report, U.S. Government Printing Office, 1988, 17.

[5] Chun, Wendy Hui Kyong. “On Software, or the Persistence of Visual Knowledge.” Grey Room 18 (Winter 2004): 26-51.

Jun 27, 10:11am

Matt Bernhardt is a web developer in the MIT Libraries and a collaborator in our program. He presented this talk, entitled Reality Bytes – Utilizing VR and AR in The Library Space, as part of the Program on Information Science Brown Bag Series.

In his talk, illustrated by the slides below, Bernhardt reviews technologies newly available to libraries that enhance the human-computer interface.

Bernhardt abstracted his talk as follows:

Terms like “virtual reality” and “augmented reality” have existed for a long time. In recent years, thanks to products like Google Cardboard and games like Pokemon Go, an increasing number of people have gained first-hand experience with these once-exotic technologies. The MIT Libraries are no exception to this trend. The Program on Information Science has conducted enough experimentation that we would like to share what we have learned, and solicit ideas for further investigation.

Several themes run through Matt’s talk:

  • VR should be thought of broadly as an engrossing representation of physically mediated space. Such a definition encompasses not only VR, AR and ‘mixed-’ reality — but also virtual worlds like Second Life, and a range of games from first-person-shooters (e.g. Halo) to textual games that simulate physical space (e.g. “Zork”).
  • A variety of new technologies are now available at a price-point that is accessible for libraries and experimentation — including tools for rich information visualization (e.g. stereoscopic headsets), physical interactions (e.g. body-in-space tracking), and environmental sensing/scanning (e.g. Sense).
  • To avoid getting lost in technical choices, consider the ways in which technologies have the potential to enhance the user-interface experience, and the circumstances in which the costs and barriers to use are justified by potential gains. For example, expensive, bulky VR platforms may be most useful to simulate experiences that would in real life be expensive, dangerous, rare, or impossible.

A substantial part of the research agenda of the Program on Information Science is focused on developing theory and practice to make information discovery and use more inclusive and accessible to all. From my perspective, the talk above naturally raises questions about how the affordances of these new technologies may be applied in libraries to increase inclusion and access: How could VR-induced immersion be used to increase engagement and attention by conveying the sense of place of being in an historic archive? How could realistic avatars be used to enhance social communication, and lower the barriers to those seeking library instruction and reference? How could physical mechanisms for navigating information spaces, such as eye tracking, support seamless interaction with library collections, and enhance discovery?

For those interested in these and other topics, you may wish to read some of the blog posts and reports we have published in these areas. Further, we welcome library staff and researchers who are interested in collaborating in research and practice. To support collaboration we offer access to fabrication, interface, and visualization technology through our lab.

Jun 06, 1:09pm

Catherine D’Ignazio is an Assistant Professor of Civic Media and Data Visualization at Emerson College, a principal investigator at the Engagement Lab, and a research affiliate at the MIT Media Lab/Center for Civic Media. She presented this talk, entitled Creative Data Literacy: Bridging the Gap Between Data-Haves and Have-Nots, as part of the Program on Information Science Brown Bag Series.

In her talk, illustrated by the slides below, D’Ignazio points to the gap between those people that collect and use data, and those people who are the subject of data collection.

D’Ignazio abstracted her talk as follows:

Communities, governments, libraries and organizations are swimming in data—demographic data, participation data, government data, social media data—but very few understand what to do with it. Though governments and foundations are creating open data portals and corporations are creating APIs, these rarely focus on use, usability, building community or creating impact. So although there is an explosion of data, there is a significant lag in data literacy at the scale of communities and citizens. This creates a situation of data-haves and have-nots which is troubling for an open data movement that seeks to empower people with data. But there are emerging technocultural practices that combine participation, creativity, and context to connect data to everyday life. These include data journalism, citizen science, emerging forms for documenting and publishing metadata, novel public engagement in government processes, and participatory data art. This talk surveys these practices both lovingly and critically, including their aspirations and the challenges they face in creating citizens that are truly empowered with data.

In her talk, D’Ignazio makes five recommendations on how to help people learn data literacy:

  • Many tutorials on data use abstract or standardized examples examining cars (or widgets) — this does not connect with most audiences. Ground your curriculum in community-centered problems and examples.
  • Frequently, people encounter data “in the wild” without the metadata or other context needed for constructing meaning with it. To address this, have learners create data biographies, which explain who collected the data, how it was collected and used, and its purposes, impacts and limitations.
  • Data is messy, and learners should not always be introduced to it through a clean, static data set but through encountering the complex process of collection.
  • Design tools that are learner-centric: focused, guided, inviting, and expandable.
  • People like monsters better than they like bar charts — so favor creative community-centered outputs over abstract purity.

Much more detail on these recommendations can be found in D’Ignazio’s professional writings.

D’Ignazio’s talk illustrated two more general tensions. One general tension is between a narrow conception of data literacy as coding, spreadsheets, and statistics, and a broader conception that is not yet crisply defined but is distinct from statistical, information, IT, media, and visual literacies. This resonates with work done by our program’s research intern Zach Lizee on Digital Literacy and Digital Citizenship in which he argues for a form of literacy that prepares learners to engage with the evolving role of information in the world, and to use that engagement to advocate policy and standards that enact their values.

D’Ignazio’s talk also highlights a broad general tension that currently exists between the aspiration of open data and data journalism to empower the broader public, and the structural inequalities in our systems of data collection, sharing, analysis, and meaning-making. This tension is very much in play with respect to libraries’ and universities’ approaches to open access.

Much of academia, and many policy-makers, have embraced the potential value of Open Access to content. The MIT Libraries’ vision also embraces the challenge of building an open source platform to enable global discovery and access to this content. Following the themes of D’Ignazio’s talk and based on our research, I conjecture that library open platforms could be of tremendous worth, but not for the reasons one usually expects.

The worth of software, and of information and communication technology systems and platforms generally, is typically measured by how much it is used, what functions it provides, and what content/data it enables one to use. However, the importance of library participation in the development of open information platforms goes beyond this. Libraries have not distinguished themselves from the Googles, Twitters and Facebooks of the world in making open content discoverable, or in the functionality that their platforms provide to create, annotate, share, and make meaning from this content. The commercial sector has both the capacity and the incentives to do this, as it is profitable.

The worth of a library open platform is in the core library values that it enacts: broad inclusion/participation, long-term (intergenerational) persistence, transparency, and privacy. These are not values that current commercial platforms support, because the commercial sector lacks the incentives to create them. To go beyond open access to equity in participation in the creation and understanding of knowledge, libraries, museums, archives, and others that share these values must lead in creating open platforms.

Reflecting the themes of D’Ignazio’s talk, the research we conduct here, in the Program on Information Science, engages with the expanding scope of information literacy, and with inequalities in access to information. For those interested in these and other projects, we have published blog posts and reports in these areas.

May 01, 11:09am

Professor Laura Hosman, who is Assistant Professor at Arizona State University (with a joint appointment in the School for the Future of Innovation in Society and in The Polytechnic School), gave the talk Becoming a Practitioner Scholar in Technology for Development as part of the Program on Information Science Brown Bag Series.

In her talk, illustrated by the slides below, Hosman argues that, for a large part of the world “the library of the future” will be based on cellphones, intranets, and digital-but-offline content.



Hosman abstracted her talk as follows:

Access to high-quality, relevant information is absolutely foundational for a quality education. Yet, so many schools across the developing world lack fundamental resources, like textbooks, libraries, electricity, and Internet connectivity. The SolarSPELL (Solar Powered Educational Learning Library) is designed specifically to address these infrastructural challenges, by bringing relevant, digital educational content to offline, off-grid locations. SolarSPELL is a portable, ruggedized, solar-powered digital library that broadcasts a webpage with open-access educational content over an offline WiFi hotspot, content that is curated for a particular audience in a specified locality—in this case, for schoolchildren and teachers in remote locations. It is a hands-on, iteratively developed project that has involved undergraduate students in all facets and at every stage of development. This talk will examine the design, development, and deployment of a for-the-field technology that looks simple but has a quite complex background.

In her talk, Hosman describes how the inspiration for her current line of research and practice started when she received a request to aid deployment of the One Laptop Per Child project in Haiti. The original project had allocated twenty-five million dollars to laptop purchasing, but failed to note that electric power was not available in many of the areas they needed to reach — so they asked for Professor Hosman’s help in finding an alternative power source. Over the course of her work, the focus of her interventions has shifted from solar power systems, to portable computer labs, to portable libraries — and she noted that every successful approach involved evolution and iteration.

Hosman observes that for much of the world’s populations electricity is a missing prerequisite to computing and to connectivity. She also notes that access to computing for most of the world comes through cell phones, not laptops. (And she recalls even finding that the inhabitants of remote islands occasionally had better cellphones than she carried.) Her talk notes that there are over seven billion cell phones in the world — which is over three times the number of computers worldwide, and many thousands of times the number of libraries.

Hosman originally titled her talk The Solar Powered Educational Learning Library – Experiential Learning And Iterative Development. The talk’s new title reflects one of three core themes that ran through the talk — the importance of people. Hosman argues that technology is never by itself sufficient (there is no “magic bullet”) — to improve people’s lives, we need to understand and engineer for people’s engagement with technology.

The SolarSPELL project has engaged with people in surprising ways. Not only is it designed around the needs of the target clients, but it has continuously involved Hosman’s engineering students in its design and improvement, and has further involved high-school students in construction. Under Hosman’s direction, university and high school students worked together to construct a hundred SolarSPELLs using mainly parts ordered from Amazon. Moreover, Peace Corps volunteers are a critical part of the project. The people in the Corps provide the grass-roots connections that spark people to initially try the SolarSPELL, and provide a persistent human connection that supports continuing engagement.

A second theme of the talk is the importance of open and curated content. Simply making a collection freely available online is not enough when we want most people in the world to be able to access it. For collections to be meaningfully accessible, they need to be available for bulk download; they need to be usable under an open license; they need to be selected for a community of use that does not have the option of seeking more content online; and they need to contain all of the context needed for that community to understand them.

A final theme that Hosman stresses is that any individual (scholar, practitioner, actor) will never have all the skills needed to address complex problems in the complex real world — solving real world problems requires a multidisciplinary approach. SolarSPELL demonstrates this through combining expertise in electrical engineering, content curation, libraries, software development, education, and in the sociology and politics of the region. Notably, the ASU libraries have been a valuable partner in the SolarSPELL project, and have even participated in fieldwork. Much more information about this work and its impact can be found in Hosman’s scholarly papers.

The MIT libraries have embraced a vision of serving a global community of scholars and learners. Hosman’s work demonstrates the existence of large communities of learners that would benefit from open educational and research materials — but whose technology needs are not met by most current information platforms (even open ones). Our aim is that future platforms not only enable research and educational content to reach such communities, but also that local communities worldwide can contribute their local knowledge, perspective, and commentary to the world’s library.

Surprisingly, the digital preservation research conducted at the libraries is of particular relevance to tackling these challenges. The goal of digital preservation can be thought of as communicating with the future — and in order to accomplish this, we need to be able to capture both content and context, steward it over time (managing provenance, versions, and authenticity), and prepare it to be accessed through communication systems and technologies that do not yet exist. A corollary is that properly curated content should be readily capable of being stored and delivered offline — which is currently a major challenge for access by the broader community.

Reflecting the themes of Hosman’s talk, the research we conduct here, in the Program on Information Science, is fundamentally interdisciplinary. For example, our research in information privacy has involved librarians, computer scientists, statisticians, legal scholars, and many others. Our Program also aims to bridge research and practice, and to support translational and applied research, which often requires sustained engagement with grassroots stakeholders. For example, the success of the DIY redistricting (a.k.a. “participative GIS”) efforts in which we’ve collaborated relied on sustained engagement with grassroots good-government organizations (such as Common Cause and League of Women Voters), students, and the media. For those interested in these and other projects, we have published reports and articles describing them.

Apr 21, 10:13am

Alex Chassanoff, who is a Postdoctoral Fellow in the Program on Information Science, continues a series of posts on software curation.

“Curation as Context: Software in the Stacks”

As scholarly landscapes shift, differing definitions for similar activities may emerge from different communities of practice.   As I mentioned in my previous blog post, there are many distinct terms for (and perspectives on) curating digital content depending on the setting and whom you ask [1].  Documenting and discussing these semantic differences can play an important role in crystallizing shared, meaningful understandings.  

In the academic research library world, the so-called data deluge has presented library and information professionals with an opportunity to assist scholars in the active management of their digital content [2].  Curating research output as institutional content is a relatively young, though growing, phenomenon.  Research data management (RDM) groups and services are increasingly common in research libraries, partially fueled by changes in federal grant application requirements that encourage data management planning.  In fact, according to a recent content analysis of academic library websites, 185 libraries are now offering RDM services [3].  The charge for RDM groups can vary widely; tasks can range from advising faculty on issues related to privacy and confidentiality, to instructing students on potential avenues for publishing open-access research data.

As these types of services increase, many research libraries are looking to life cycle models as foundations for crafting curation strategies for digital content [4].  On the one hand, life cycle models recognize the importance of continuous care and necessary interventions that managing such content requires.  Life cycle models also provide a simplified view of essential stages and practices, focusing attention on how data flows through a continuum.  At the same time, the data flow perspective can obscure both the messiness of the research process and the complexities of managing dynamic digital content [5,6].  What strategies for curation can best address scenarios where digital content is touched at multiple times by multiple entities for multiple purposes?  

Christine Borgman notes the multifaceted role that data can play in the digital scholarship ecosystem, serving a variety of functions and purposes for different audiences.  Describing the most salient characteristics of that data may or may not serve the needs of future use and/or reuse. She writes:

These technical descriptions of “data” obscure the social context in which data exist, however. Observations that are research findings for one scientist may be background context to another. Data that are adequate evidence for one purpose (e.g., determining whether water quality is safe for surfing) are inadequate for others (e.g., government standards for testing drinking water). Similarly, data that are synthesized for one purpose may be “raw” for another. [7]

Particular data sets may be used and then reused for entirely different intentions.  In fact, enabling reuse is a hallmark objective for many current initiatives in libraries/archives.  While forecasting future use is beyond our scope, understanding more about how digital content is created and used in the wider scholarly ecosystem can prove useful for anticipating future needs.  As Henry Lowood argues, “How researchers will actually put their hands and eyes on historical software and data collections generally has been bracketed out of data curation models focused on preservation”[8].  

As an example, consider the research practices and output of faculty member Alice, who produces research tools and methodologies for data analysis. If we were to document the components used and/or created by Alice for this particular research project, it might include the following:


  • Software program(s) for computing published results
  • Dependencies for software program(s) for replicating published results
  • Primary data collected and used in analysis
  • Secondary data collected and used in analysis
  • Data result(s) produced by analysis
  • Published journal article


We can envision at least two uses of this particular instantiation of scholarly output. First, the statistical results of the data can be verified by replicating the conditions of the analysis.  Second, the statistical approach executed by the software program can be applied to a new input data set.  In this way, software can simultaneously serve as both an outcome to be preserved and as a methodological means to a (new) end.

There are certain affordances in thinking about strategies for curation-as-context, outside the life cycle perspective.  Rather than emphasizing content as an outcome to be made accessible and preserved through a particular workflow, curation could instead aim to encompass the characterization of well-formed research objects, with an emphasis on understanding the conditions of their creation, production, use, and reuse.   Recalling our description of Alice above, we can see how each component of the process can be brought together to represent an instantiation of a contextually-rich research object.
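One way to make the idea of a contextually-rich research object concrete is to sketch it as a small data structure. The Python below is only an illustration: the class names, the role labels, and above all the mapping from each use to the components it needs are assumptions made for this sketch, not part of any model described in this post.

```python
from dataclasses import dataclass, field

# Component roles, taken from the list of Alice's research outputs above.
ROLES = {"software", "dependencies", "primary data",
         "secondary data", "results", "article"}

# Hypothetical mapping from a use to the roles it needs. This is an
# assumption made for illustration, not a rule stated in the post.
REQUIRED_ROLES = {
    "verify": {"software", "dependencies", "primary data",
               "secondary data", "results"},
    "reapply": {"software", "dependencies"},
}


@dataclass
class Component:
    role: str          # one of ROLES
    label: str         # human-readable name of the component
    context: str = ""  # note on how it was created or used


@dataclass
class ResearchObject:
    title: str
    components: list = field(default_factory=list)

    def add(self, role, label, context=""):
        """Attach a component, rejecting roles outside ROLES."""
        if role not in ROLES:
            raise ValueError(f"unknown role: {role}")
        self.components.append(Component(role, label, context))

    def missing_for(self, use):
        """Return the roles still needed to support the given use."""
        present = {c.role for c in self.components}
        return REQUIRED_ROLES[use] - present


alice = ResearchObject("Alice's analysis project")
alice.add("software", "analysis program", "computes the published results")
alice.add("dependencies", "libraries the program needs")
alice.add("article", "published journal article")

print(sorted(alice.missing_for("reapply")))  # []
print(sorted(alice.missing_for("verify")))   # ['primary data', 'results', 'secondary data']
```

In this sketch the same object answers both questions raised above: whether the published results could be verified, and whether the method could be run on new data. Recording a context note on each component is a stand-in for the much richer documentation of creation, production, use, and reuse that curation-as-context would call for.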

Curation-as-context approaches can help us map the always-already in flux terrain of dynamic digital content.  In thinking about curating software as a complex object for access, use, and future use, we can imagine how mapping the existing functions, purposes, relationships, and content flows of software within the larger digital scholarship ecosystem may help us anticipate future use, while documenting contemporary use.  As Cal Lee writes:

Relationships to other digital objects can dramatically affect the ways in which digital objects have been perceived and experienced. In order for a future user to make sense of a digital object, it could be useful for that user to know precisely what set of surrogate representations – e.g. titles, tags, captions, annotations, image thumbnails, video keyframes – were associated with a digital object at a given point in time. It can also be important for a future user to know the constraints and requirements for creation of such surrogates within a given system (e.g. whether tagging was required, allowed, or unsupported; how thumbnails and keyframes were generated), in order to understand the expression, use and perception of an object at a given point in time [9].

Going back to our previous blog post, we can see how questions like “How are researchers creating and managing their digital content?” are essential counterparts to questions like “What do individuals served by the MIT Libraries need to be able to reuse software?” Our project aims to produce software curation strategies at MIT Libraries that embrace Reagan Moore’s theoretical view of digital preservation, whereby “information generated in the past is sent into the future” [10].  In other words, what can we learn about software today that makes an essential contribution to meaningful access and use tomorrow?

Works Cited

[1] Palmer, C., Weber, N., Muñoz, T., and Renar, A. (2013), “Foundations of data curation: The pedagogy and practice of ‘purposeful work’ with research data”, Archives Journal, Vol 3.

[2] Hey, T. and Trefethen, A. (2008), “E-science, cyberinfrastructure, and scholarly communication”, in Olson, G.M., Zimmerman, A., and Bos, N. (Eds), Scientific Collaboration on the Internet, MIT Press, Cambridge, MA.

[3] Yoon, A. and Schultz, T. (2017), “Research data management services in academic libraries in the US: A content analysis of libraries’ websites” (in press). College and Research Libraries.

[4] Ray, J. (2014), Research Data Management: Practical Strategies for Information Professionals, Purdue University Press, West Lafayette, IN.

[5] Carlson, J. (2014), “The use of lifecycle models in developing and supporting data services”, in Ray, J. (Ed),  Research Data Management: Practical Strategies for Information Professionals, Purdue University Press, West Lafayette, IN.

[6] Ball, A. (2010), “Review of the state of the art of the digital curation of research data”, University of Bath.

[7] Borgman, C., Wallis, J. and Enyedy, N. (2006), “Little science confronts the data deluge: Habitat ecology, embedded sensor networks, and digital libraries”, Center for Embedded Network Sensing, 7(1–2), 17–30. doi: 10.1007/s00799-007-0022-9. UCLA: Center for Embedded Network Sensing.

[8] Lowood, H. (2013), “The lures of software preservation”, Preserving.exe: Towards a national strategy for software preservation, National Digital Information Infrastructure and Preservation Program of the Library of Congress.

[9] Lee, C. (2011), “A framework for contextual information in digital collections”, Journal of Documentation 67(1).

[10] Moore, R. (2008), “Towards a theory of digital preservation”, International Journal of Digital Curation 3(1).


Mar 02, 1:13am

Alex Chassanoff, who is a Postdoctoral Fellow in the Program on Information Science, contributes to this detailed wrap-up of the recent Data Rescue Boston event that she helped organize.


Data Rescue Boston@MIT Wrap up

Written by event organizers:

Alexandra Chassanoff

Jeffrey Liu

Helen Bailey

Renee Ball

Chi Feng


On Saturday, February 18th, the MIT Libraries and the Association of Computational Science and Engineering co-hosted a day-long Data Rescue Boston hackathon at Morss Hall in the Walker Memorial Building.  Jeffrey Liu, a Civil and Environmental Engineering graduate student at MIT, organized the event as part of an emerging North American movement to engage communities locally in the safeguarding of potentially vulnerable federal research information.  Since January, Data Rescue events have been springing up at libraries across the country, largely through the combined organizing efforts of Data Refuge and the Environmental Data and Governance Initiative.


The event was sponsored by MIT Center for Computational Engineering, MIT Department of Civil and Environmental Engineering, MIT Environmental Solutions Initiative, MIT Libraries, MIT Graduate Student Council Initiatives Fund, and the Environmental Data and Governance Initiative.

Here are some snapshot metrics from our event:

# of Organizers: 8
# of Volunteers: ~15
# of Guides: 9
# of Participants: ~130
# URLs researched: 200
# URLs harvested: 53
# GiB harvested: 35
# URLs seeded: 3300 at event (~76000 from attendees finishing after event)
# Agency Primers started: 19
# Cups of Coffee: 300
# Burritos: 120
# Bagels: 450
# Pizzas: 105

Goal 1. Process data

MIT’s data rescuers processed an amount of data through the seeding and harvesting phases of data rescue comparable to that of other similarly sized events.  For reference, Data Rescue San Francisco researched 101 URLs and harvested 25 GB of data at their event.  Data Rescue DC, a two-day event that also included a bagging/describing track (which we did not have), harvested 20 GB of data, seeded 4776 URLs, bagged 15 data sets, and described 40 data sets.
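Because our harvest total above is reported in GiB while the other events' totals are given in GB, a short conversion helps put the figures on one scale. The sketch below assumes that the San Francisco and DC totals are decimal gigabytes, which the event reports do not confirm; the function name is ours, chosen for illustration.

```python
GIB = 2**30  # bytes in one gibibyte (binary unit)
GB = 10**9   # bytes in one gigabyte (decimal unit)


def gib_to_gb(gib):
    """Convert a size in GiB to decimal GB."""
    return gib * GIB / GB


# Harvest totals as reported in this post. Treating the non-MIT figures
# as decimal GB is an assumption.
harvested_gb = {
    "Boston@MIT": gib_to_gb(35),
    "San Francisco": 25.0,
    "DC": 20.0,
}

for event, size in harvested_gb.items():
    print(f"{event}: {size:.1f} GB")
```

Under that assumption, the 35 GiB harvested at MIT works out to roughly 37.6 GB.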

Goal 2. Expand scope

Another goal of our event was to explore creating new workflows for expanding efforts beyond an existing focus on federal agency environmental and climate data.  Toward that end, we decided to pilot a new track called Surveying which we used to identify and describe programs, datasets and documents at federal agencies still in need of agency primers.  We were lucky enough to have particular domain experts on hand who assisted us with our efforts.  In total, we were able to begin expansion efforts for agencies and departments at the Department of Justice, Department of Labor, Health and Human Services, and the Federal Communications Commission.

Goal 3. Engage and build community

Attendees at our event spanned age groups, occupations, and technical abilities.  Participants included research librarians, concerned scientists, and expert undergraduate hackers; according to national developers for the Data Rescue archiving application, MIT had a larger number of “tech-tel” than any other event thus far.  As part of the Storytelling aspect of Data Rescue events, we captured profiles for twenty-seven of our attendees.  Additionally, we created Data Use Stories that describe how some researchers use specific data sets from the National Water Information System (USGS), the Alternative Fuels Data Center (DOE), and the Global Historical Climate Network (NOAA).  These stories let us communicate how these data sets are used to better understand our world, as well as to make decisions that impact our everyday lives.

The hackathon at MIT was the second event hosted by Data Rescue Boston, which has begun hosting weekly working groups every Thursday at MIT  for continuing engagement on compiling tools and documentation to improve workflow, identify vulnerable data sets, and create resources to help further efforts.   

Future Work

Data rescue events continue to gather steam, with eight major national events planned over the next month.  The next DataRescue Boston event will be held at Northeastern on March 24th. A dozen volunteers and attendees from the MIT event have already signed up to help organize workshops and efforts at the Northeastern event.


Jan 21, 9:42am

Alex Chassanoff, who is a Postdoctoral Fellow in the Program on Information Science, introduces a series of posts on software curation.

Building A Model for Software Curation: An Introductory Post


In October 2016, I began working at the MIT Libraries as a CLIR/DLF Postdoctoral Fellow in Software Curation. CLIR began offering postdoctoral fellowships in data curation in 2012; however, three others and I were part of the first cohort conducting research in the area of Software Curation.  At our fellowship seminar and training this summer, the four of us joked about not having any idea what we would be doing (and Google wasn’t much help). Indeed, despite years of involvement in digital curation, I was unsure of what it might mean to curate software. As has been well-documented in the library/archival science community, curation of data means many different things to many different people.  Add in the term “software” and you increase the complexities.

At MIT Libraries, I have had the good fortune of working with two distinguished experts in library research: Nancy McGovern, the Director of the Digital Preservation Program, and Micah Altman, the Director of Research.  This blog post describes the first phase of our work together in defining a research agenda for software curation as an institutional asset.

Defining Scope

As we began to suss out possible research objectives and assorted activities, we found ourselves circling back to four central questions – which themselves split into associated sub-questions.

  • What is software? What is the purpose and function of software? What does it mean to curate software? How do these practices differ from preservation?
  • When do we curate software? Is it at the time of creation? Or when it becomes acquired by an institution?
  • Why do institutions and researchers curate software?
  • Who is institutionally responsible for curating software and for whom are we curating software?

Developing Focus and Purpose

We also began to outline the types of exploratory research questions we might ask depending on the specific purpose and entities we were creating a model for (see Table 1 below). Of course, these are only some of the entities that we could focus on; we could also broaden our scope to include research questions of interest to software publishers, software journals, or funders interested in software curation.


Entity | Purpose: Libraries/Archives | Purpose: MIT Specific
Research library | What does a library need to safeguard + preserve software as an asset? How are other institutions handling this? How are funding agencies considering research on software curation? | What are the MIT libraries’ existing and future needs related to software curation?
Software creator | What are the best practices software creators should adopt when creating software? How are software creators depositing their software and how are journals recommending they do this? | What are the individual needs and existing practices of software creators served by the MIT Libraries?
Software user | What are the different kinds of reasons why people may use software? What are the conditions for use? What are the specific curation practices we should implement to make software usable for this community? | What do individuals served by the MIT Libraries need to be able to reuse software?

Table 1: Potential purpose(s) of research by entity

Importantly, we wanted to adopt an agile research approach that considered software as an artifact, rather than (simply) as an outcome to be preserved and made accessible.  Curation in this sense might seek to answer ontological questions about software as an entity with significant characteristics at different levels of representation.   Certainly, digital object management approaches that emphasize documentation of significant properties or characteristics are long-standing in the literature.  At the same time, we wanted our approach to address essential curatorial activities (what Abby Smith termed “interventions”) that help ensure digital files remain accessible and usable. [1]  We returned to our shared research vision: to devise a model for software curation strategies to assist research outcomes that rely on the creation, use, reuse, and study of software.

Statement of Research Objectives and Working Definitions

Given the preponderance of definitions for curation and the wide-ranging implications of curating for different purposes and audiences, we thought it would be essential for us to identify and make clear our particular interests.  We developed the following statement to best describe our goals and objectives:

Libraries and archives are increasingly tasked with responsibilities related to the effective long-term preservation and curation of software.  The purpose of our work is to investigate and make recommendations for strategies that institutions can adopt for managing software as complex digital objects across generations of technology.

We also developed the following working definition of software curation for use in our research:

“Software curation encompasses the active practices related to the creation, acquisition, appraisal and selection, description, transformation, preservation, storage, and dissemination/access/reuse of software over short and long periods of time.”

What’s Next

The next phase of our research involves formalizing our research approach through the evaluation, selection, and application of relevant models (such as the OAIS Reference Model) and ontologies (such as the SWO). We are also developing different scenarios to establish the roles and responsibilities bound up in software creation, use, and reuse. In addition to reporting on the status of our project, you can expect to read blog posts about both the philosophical and practical implications of curating software in an academic research library setting.


[1] In the seminal collection Authenticity in a digital environment, Abby Smith noted that “We have to intervene continually to keep digital files alive. We cannot put a digital file on a shelf and decide later about preservation intervention. Storage means active intervention.” See: Abby Smith (2000), “Authenticity in Perspective”, in Authenticity in a digital environment, Washington, D.C.: Council on Library and Information Resources.

Dec 13, 1:54pm

Zachary Lizee, who is a Graduate Research Intern in the Program on Information Science, reflects on his investigations into information standards, and suggests how libraries can reach beyond local instruction on digital literacy to scalable education to catalyze information citizenship.

21st century Libraries, Standards Education and Socially Responsible Information Seeking Behavior

Standards and standards development frame, guide, and normalize almost all areas of our lives.  Standards in IT govern interoperability between a variety of devices and platforms, standardized production of various machine parts allows uniform repair and reproduction, and standardization in fields like accounting, health care, or agriculture promotes best industry practices that emphasize safety and quality control.  Informational standards like OpenDocument allow storage and processing of digital information to be accessible by most types of software, ensuring that the data is recoverable in the future.[1]  Standards reflect the shared values, aspirations, and responsibilities we as a society project upon each other and our world.

Engineering and other innovative entrepreneurial fields need to have awareness about information standards and standards development to ensure that the results of research, design, and development in these areas have the most positive net outcome for our world at large, as illustrated by the analysis of healthcare information standards by HIMSS, a professional organization that works to affect informational standards in the healthcare IT field:

In healthcare, standards provide a common language and set of expectations that enable interoperability between systems and/or devices. Ideally, data exchange schema and standards should permit data to be shared between clinician, lab, hospital, pharmacy, and patient regardless of application or application vendor in order to improve healthcare delivery. [2]

As critical issues regarding information privacy multiply, standards development organizations and interested stakeholders are taking an active interest in creating and maintaining standards to regulate how personal data is stored, transferred, and used, with both the public interest and regulation by legal frameworks in mind.[3]

Libraries have traditionally been centers of expertise in, and access to, information collection, curation, dissemination, and instruction.  The standards around how digital information is produced, used, governed, and transmitted are rapidly evolving with new technologies.[4]  Libraries are participating in the processes of generating information standards to ensure that patrons can freely and safely access information.  For instance, the National Information Standards Organization is developing informational standards to address patron privacy issues in library data management systems:

The NISO Privacy Principles, available at privacy/, set forth a core set of guidelines by which libraries, systems providers and publishers can foster respect for patron privacy throughout their operations.  The Preamble of the Principles notes that, ‘Certain personal data are often required in order for digital systems to deliver information, particularly subscribed content.’ Additionally, user activity data can provide useful insights on how to improve collections and services. However, the gathering, storage, and use of these data must respect the trust users place in libraries and their partners. There are ways to address these operational needs while also respecting the user’s rights and expectations of privacy.[5]

This effort by NISO (which has librarians on the steering committee) illustrates how libraries engage in outreach and advocacy that is also in concert with the ALA’s Code of Ethics, which states that libraries have the duty to protect patrons’ rights to privacy and confidentiality regarding information seeking behavior.  Libraries and librarians have a long tradition of engaging in social responsibility for their patrons and community at large.

Although libraries are sometimes involved, most information standards are created by engineers working in corporate settings, or are considerably influenced by the development of products that become the model.  Most students leave the university without understanding what standards are, how they are developed, and what potential social and political ramifications advancements in the engineering field can have on our world.[6]

There is a trend in the academic and professional communities to foster greater understanding about what standards are, why they are important, and how they relate to influencing and shaping our world.[7]  Understanding the relevance of standards will be an asset that employers in the engineering fields will value and look for.  Keeping informed about the most current standards can drive innovation and increase the market value of an engineer’s research and design efforts.[8]

As informational hubs, libraries have a unique opportunity to participate in developing information literacy regarding standards and standards development.  By infusing philosophies regarding socially responsible research and innovation, using standards instruction as a vehicle, librarians can emphasize the net positive effect of standards and ethics awareness for the individual student and the world at large.

The emergence of MOOCs creates an opportunity for librarians to reach a large audience to instruct patrons in information literacy in a variety of subjects. MOOCs can have a number of advantages when it comes to being able to inform and instruct a large number of people from a variety of geographic locations and across a range of subject areas.[9]

For example, a subject-specific librarian for an engineering department at a university could work with engineering faculty to develop a MOOC that outlines the relevant issues, facts, and procedures surrounding standards and standards development, helping the faculty teach standards.  Together, librarians and subject experts could develop instruction on the roles that standards and socially responsible behavior play in the field of engineering.

Students who learn early in their careers why standards are an integral element of engineering and related fields have the potential to produce influential ideas, products, and programs that could have positive and constructive effects for society.  Engineering endeavors to design products, methodologies, and other technologies that can have a positive impact on our world.  Standards education in engineering fields can produce students who have a keen awareness of human dignity, human justice, and overall human welfare, and a sense of global responsibility.

Our world has a number of challenges: poverty, oppression, political and economic strife, environmental issues, and a host of other dilemmas that socially responsible engineers and innovators could address.  Educating engineers and innovators about standards and socially responsible behavior can affect future corporate responsibility, ethical and humanitarian behavior, and altruistic technical research and development, which in turn yields a net positive result for the individual, society, and the world.

Recommended Resources:


[1] OASIS, “OASIS Open Document Format for Office Applications TC,” < committees/tc_home.php?wg_abbrev=office>

[2] HIMSS, “Why do we need standards?”

[3] Murphy, Craig N. and JoAnne Yates, The International Organization for Standardization (ISO): Global governance through voluntary consensus, London and New York: Routledge, 2009.

[4] See Opening Standards: The Global Politics of Interoperability, edited by Laura DeNardis, Cambridge, Massachusetts: MIT Press, 2011.

[5] “NISO Releases a Set of Principles to Address Privacy of User Data in Library, Content-Provider, and Software-Supplier Systems,” NISO.

[6] “IEEE Position Paper on the Role of Technical Standards in the Curriculum of Academic Programs in Engineering, Technology and Computing,” IEEE.

[7] Northwestern Strategic Standards Management.

[8] “Education about standards,” ISO.

[9] “MOOC Design and Delivery: Opportunities and Challenges,” Current Issues in Emerging ELearning, Vol. 3, Issue 1 (2016).

Dec 08, 11:51am

Dr. Anthony Scriffignano, SVP/Chief Data Scientist at Dun and Bradstreet, gave this talk, Making Decisions in a World Awash in Data: We’re going to need a different boat, as part of the Program on Information Science Brown Bag Series.

In the talk, illustrated by the slides below, Scriffignano argues that the massive collection of ‘unstructured’ data enables a wide set of potential inferences about complex, changing relationships.  At the same time, his talk notes that it is increasingly easy to gather enough information to take action while still lacking enough information to form good judgement. A deeper understanding of the context in which data is collected and flows is essential to developing such judgement.

Scriffignano summarizes his talk in the following abstract:

I’ll explore some of the ways in which the massive availability of data is changing and the types of questions we must ask in the context of making business decisions.  Truth be told, nearly all organizations struggle to make sense out of the mounting data already within the enterprise.  At the same time, businesses, individuals, and governments continue to try to outpace one another, often in ways that are informed by newly-available data and technology, but just as often using that data and technology in alarmingly inappropriate or incomplete ways.  Multiple “solutions” exist to take data that is poorly understood, promising to derive meaning that is often transient at best.  A tremendous amount of “dark” innovation continues in the space of fraud and other bad behavior (e.g. cyber crime, cyber terrorism), highlighting that there are very real risks to taking a fast-follower strategy in making sense out of the ever-increasing amount of data available.  Tools and technologies can be very helpful or, as Scriffignano puts it, “they can accelerate the speed with which we hit the wall.”  Drawing on unstructured, highly dynamic sources of data, fascinating inference can be derived if we ask the right questions (and maybe use a bit of different math!).  This session will cover three main themes: The new normal (how the data around us continues to change), how are we reacting (bringing data science into the room), and the path ahead (creating a mindset in the organization that evolves).  Ultimately, what we learn is governed as much by the data available as by the questions we ask.  This talk, both relevant and occasionally irreverent, will explore some of the new ways data is being used to expose risk and opportunity and the skills we need to take advantage of a world awash in data.

This covers a broad scope, and Dr. Scriffignano expands extensively on these and other issues in his blog, which is well worth reading.

Dr. Scriffignano’s talk raised a number of interesting provocations. The talk claims, for example, that:

On data.

  • No data is real-time — there are always latencies in measurement, transmission, or analysis.
  • Most data is worthless — but there remains a tremendous number of useful signals in data that we don’t understand.
  • Eighty-five percent of data collected today is ‘unstructured’. And ‘unstructured’ data is really data that has structure that we do not yet understand.

On using data.

  • Unstructured data has the potential to support many unanticipated inferences. An example (which Scriffignano calls a “data-bubble”) is a set of crowd-sourced photographs of recurring events: one can find photos that are taken at different times but which show the same location from the same perspective. Despite being convenience samples, they permit new longitudinal comparisons from which one could extract signals of fashion, attention, technology use, attitude, etc. In this way, big data collection has created qualitatively new opportunities for inference.
  • When collecting and curating data we need to pay close attention to decision-elasticity — how different would our information have to be to change our optimal action?  In designing a data curation strategy, one needs to weigh the opportunity costs of obtaining data and curating data, against the potential to affect decisions.
  • Increasingly, big data analysis raises ethical questions.  Some of these questions arise directly: what are the ethical expectations on use of ‘new’ signals that we discover can be extracted from unstructured data?  Others arise through the algorithms we choose: how do they introduce biases, and how do we even understand what algorithms do, especially as use of artificial intelligence grows? Scriffignano’s talk gives as an example recent AI research in which two algorithms develop their own private encryption scheme.
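
The decision-elasticity point above can be made concrete with a small sketch. This is an illustration under stated assumptions, not something presented in the talk: the two actions, their payoffs, and the function names are all hypothetical, and the decision is reduced to a single uncertain event with probability p. The code finds the probability at which the better action switches, then asks whether new data could plausibly move our estimate across that point.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    """A hypothetical action with a payoff for each outcome of one uncertain event."""
    name: str
    payoff_if_true: float   # payoff when the event occurs
    payoff_if_false: float  # payoff when it does not


def expected_payoff(action, p):
    """Expected payoff when the event occurs with probability p."""
    return p * action.payoff_if_true + (1 - p) * action.payoff_if_false


def best_action(actions, p):
    """The action with the highest expected payoff at probability p."""
    return max(actions, key=lambda a: expected_payoff(a, p))


def switch_point(a, b):
    """Probability at which a and b are equally good, or None if there is none in [0, 1]."""
    slope_gap = (a.payoff_if_true - a.payoff_if_false) - (b.payoff_if_true - b.payoff_if_false)
    if slope_gap == 0:
        return None  # the payoff lines are parallel and never cross
    p = (b.payoff_if_false - a.payoff_if_false) / slope_gap
    return p if 0 <= p <= 1 else None


def could_change_decision(p, plausible_shift, a, b):
    """True if new data shifting p by at most plausible_shift could flip the best action."""
    threshold = switch_point(a, b)
    return threshold is not None and abs(p - threshold) <= plausible_shift


# Hypothetical example: the uncertain event is "the customer is reliable".
extend = Action("extend credit", payoff_if_true=10, payoff_if_false=-40)
decline = Action("decline", payoff_if_true=0, payoff_if_false=0)

current_estimate = 0.9
print(best_action([extend, decline], current_estimate).name)
print(switch_point(extend, decline))
print(could_change_decision(current_estimate, 0.05, extend, decline))
print(could_change_decision(current_estimate, 0.15, extend, decline))
```

In this toy setting, data that can only nudge the estimate slightly is not worth much curation effort, because it cannot change the action taken. Data that could move the estimate past the switch point is where the cost of obtaining and curating it is most likely to pay off.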

This is directly relevant to the future of research, and the future of research libraries.  Research will increasingly rely on evidence sources of these types, and there will be an increasing need to access, discover, and curate this evidence.  Our society will increasingly be shaped by this information, and by how we choose to engineer and govern its collection and use.  The private sector is pushing ahead fast in this area, and will no doubt generate many innovative data collections and algorithms.  Engagement from university scholars, researchers, and librarians is vital to ensure that society understands these new creations; is able to evaluate their reliability and bias; and has durable and equitable access to them to provide accountability and to support important discoveries that are not easily monetized.  For those interested in this topic, the Program on Information Science has published reports and articles on big data inference and ethics.