Alex Chassanoff is a CLIR/DLF Postdoctoral Fellow in the Program on Information Science and continues a series of posts on software curation.
In this blog post, I reflect on potential strategies that institutions can adopt for making legacy software curation-ready. The notion of “curation-ready” software was first articulated by the Curation Ready Working Group, which formed in 2016 within the newly emerging Software Preservation Network (SPN). The goal of the group was to “articulate a set of characteristics of curation-ready software, as well as activities and responsibilities of various stakeholders in addressing those characteristics, across a variety of different scenarios”. Drawing on inventories at our own institutions, the working group explored different strategies and criteria that would make software “curation-ready” for representative use cases. In my use case, I looked specifically at the GRAPPLE software program and wrote about particular uses and users for the materials.
This work complements the ongoing research I’ve been doing as a Software Curation Fellow at MIT Libraries to envision curation strategies for software. Over the past six months, I have conducted an informal assessment of representative types of software in an effort to identify baseline characteristics of materials, including functions and uses.
Below, I briefly characterize the state of legacy software at MIT.
Legacy software often exists among hybrid collections of materials, and can be spread across different domains.
Different components (e.g., software dependencies, hardware) may or may not be co-located.
Legacy software may or may not be accessible on original media. Materials are stored in various locations, ranging from climate-controlled storage to departmental closets.
Legacy software may exist in multiple states with multiple contributors over multiple years.
Different entities (e.g., MIT Museum, Computer Science and Artificial Intelligence Laboratory, Institute Archives & Special Collections) may have administrative purview over legacy software with no centralized inventory available.
Collected materials may contain multiple versions of source code housed in different formats (e.g., paper print outs, on multiple diskettes) and may or may not consist of user manuals, requirements definitions, data dictionaries, etc.
Legacy software has a wide range of possible scholarly uses and users. These may include the following: research on institutional histories (e.g., government-funded academic computing research programs), biographies (e.g., notable developers and/or contributors of software), socio-technical inquiries (e.g., extinct programming languages, implementation of novel algorithms), and educational endeavors (e.g., reconstruction of software).
We define curation-ready legacy software as having the following characteristics: being discoverable, usable/reusable, interpretable, citable, and accessible. Our approach views curation as an active, nonlinear, iterative process undertaken throughout the life (and lives) of a software artifact.
Steps to increase curation-readiness for legacy software
Below, I briefly describe some of the strategies we are exploring as potential steps in making legacy software curation-ready. Each of these strategies should be treated as suggestive rather than prescriptive at this stage in our exploration.
Identify appraisal criteria. Establishing appraisal criteria is an important first step that can be used to guide decisions about selection of relevant materials for long-term access and retention. As David Bearman writes, “Framing a software collecting policy begins with the definition of a schema which adequately depicts the universe of software in which the collection is to be a subset.” It is important to note that for legacy software, determining appraisal criteria will necessarily involve making decisions about both the level of access and preservation desired. Decision-making should be guided by an institutional understanding of what constitutes a fully-formed collection object. In other words, what components of software should be made accessible? What will be preserved? Does the software need to be executable? What levels of risk assessment should be conducted throughout the lifecycle? Making these decisions institutionally will in turn help guide the identification of appropriate preservation strategies (e.g., emulation, migration) based on desired outcomes.
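One way to make such appraisal decisions concrete is to record them in a structured form that can travel with the collection. The Python sketch below is purely illustrative; the schema and field names are hypothetical assumptions, not an established standard, and would be adapted to institutional policy.

```python
from dataclasses import dataclass, asdict
from typing import List

# A minimal sketch of an appraisal decision record; the schema and
# field names are hypothetical, not an established standard.
@dataclass
class AppraisalDecision:
    collection_id: str
    access_level: str                # e.g., "metadata-only", "read-only", "executable"
    preserved_components: List[str]  # which parts of the software are retained
    must_be_executable: bool
    preservation_strategy: str       # e.g., "emulation", "migration"
    risk_review_interval_years: int  # how often risk assessment recurs

decision = AppraisalDecision(
    collection_id="example-legacy-program",  # placeholder identifier
    access_level="read-only",
    preserved_components=["source code printouts", "user manual"],
    must_be_executable=False,
    preservation_strategy="migration",
    risk_review_interval_years=5,
)
```

Capturing the answers to the questions above in one record per collection makes the institution’s appraisal stance explicit and reviewable over time.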
Identify, assemble, and document relevant materials. A significant challenge with legacy software lies in assembling the relevant materials that provide necessary context for meaningful access and use. Locating and inventorying related materials (e.g., memos, technical requirements, user manuals) is an initial starting point. In some cases, meaningful materials may be spread across different locations on the web. While it remains a controversial method in archival practice, documentation strategy may provide useful framing guidance on principles of documentation.
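Once materials have been gathered in one place, an initial inventory can be generated automatically. The following sketch, assuming digitized or born-digital files collected under a single directory, records the path, size, and SHA-256 checksum of each file; the function name and output format are illustrative.

```python
import hashlib
import json
import os

def inventory_directory(root):
    """Walk a directory of assembled materials and record path, size,
    and SHA-256 checksum for each file -- a starting point for an
    inventory of related software materials."""
    records = []
    for dirpath, _, filenames in os.walk(root):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            with open(path, "rb") as f:
                digest = hashlib.sha256(f.read()).hexdigest()
            records.append({
                "path": os.path.relpath(path, root),
                "bytes": os.path.getsize(path),
                "sha256": digest,
            })
    return records

# Usage: write the inventory alongside the collection, e.g.:
# with open("inventory.json", "w") as out:
#     json.dump(inventory_directory("collection_materials"), out, indent=2)
```

The checksums double as fixity evidence, so the same inventory supports both discovery and later integrity checks.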
Identify stakeholders. Identifying the various stakeholders, either inside or outside of the institution, can help ensure proper transfer and long-term care of materials, along with managing potential rights issues where applicable. Here we draw on Carlson’s work developing the Data Curation Profile Toolkit and define stakeholders as any groups, organizations, individuals, or others having an investment in the software whom you would feel the need to consult regarding access, care, use, and reuse of the software.
Describe and catalog materials. Curation-readiness can be increased by thoroughly describing and cataloging select materials, with an emphasis on preserving relationships among entities. In some cases, this may consist of describing aspects of the computing environment and relationships between hardware, software, dependencies, and/or versions. Although the software itself may not be accessible, adequately describing related materials (i.e., printouts of source code, technical requirements documentation) can provide important points of access. It may be useful to consider the different conceptual models of software that have been developed in the digital preservation literature and decide which perspective aligns best with your institutional needs.
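The emphasis on relationships can be made concrete with a small typed-relationship structure. The sketch below uses entirely placeholder values and a hypothetical, ad hoc schema, simply to illustrate recording a software object’s links to hardware, dependencies, and versions alongside its physical materials.

```python
# A minimal catalog-record sketch emphasizing relationships among
# entities; all values are placeholders and the schema is hypothetical.
record = {
    "title": "Example Legacy Program",
    "entity_type": "software",
    "materials": [
        {"form": "paper printout", "content": "source code"},
        {"form": "paper", "content": "technical requirements"},
    ],
    "relationships": [
        {"type": "ranOn", "target": "example mainframe (hardware)"},
        {"type": "hasDependency", "target": "example assembler (software)"},
        {"type": "hasVersion", "target": "v2 source listing, undated"},
    ],
}

def targets(rec, rel_type):
    """Return the related entities of a given relationship type."""
    return [r["target"] for r in rec["relationships"] if r["type"] == rel_type]
```

Keeping relationships as typed pairs rather than free text makes it possible to traverse the catalog later (e.g., find every software object that ran on a given machine), even when the software itself is inaccessible.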
Digitize and OCR paper materials. Paper printouts of source code and related documentation can be digitized according to established best-practice workflows. The use of optical character recognition (OCR) programs produces machine-readable output, enabling indexing of content to enhance discoverability and the creation of textual transcriptions. The latter can make historical source code more portable for use in simulations or reconstructions of software.
Migrate media. Legacy software often resides on unstable media such as floppy disks or magnetic tape. In cases where access to the software itself is desirable, migrating and/or extracting media contents (where possible) to a more stable medium is recommended.
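Whatever extraction tooling is used, the copy step should be paired with fixity verification. The sketch below, with an illustrative function name, copies a file extracted from legacy media to stable storage and confirms the copy bit-for-bit with SHA-256 checksums.

```python
import hashlib
import os
import shutil

def migrate_with_checksum(src, dest_dir):
    """Copy a file extracted from legacy media into stable storage and
    verify the copy bit-for-bit using SHA-256 checksums."""
    def sha256(path):
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(65536), b""):
                h.update(chunk)
        return h.hexdigest()

    os.makedirs(dest_dir, exist_ok=True)
    dest = os.path.join(dest_dir, os.path.basename(src))
    shutil.copy2(src, dest)  # copy2 preserves file timestamps
    original, copied = sha256(src), sha256(dest)
    if original != copied:
        raise IOError("checksum mismatch after migration: %s" % src)
    return dest, copied
```

Recording the returned checksum in the collection inventory lets future custodians confirm that the migrated copy has not silently degraded.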
As an active practice, software curation means anticipating future use and uses of resources from the past. Recalling an earlier blog post, our research aims to produce software curation strategies that embrace Reagan Moore’s theoretical view of digital preservation, whereby “information generated in the past is sent into the future”. As the born-digital record increases in scope and volume, libraries will necessarily have to address significant changes in the ways in which we access and use new kinds of resources. Technological quandaries of storage and access will likely prove less burdensome than the social, cultural, and organizational challenges of adapting to new forms of knowledge-making. Legacy software represents this problem space for libraries/archives today. Devising curation strategies for software helps us to learn more about how knowledge-embedded practices are changing and gives us new opportunities for building healthy infrastructures.
 These are some of the open research questions being addressed by the initial cohort of CLIR/DLF Software Curation Fellows in different institutions across the country.
 Bearman, D. (1985). Collecting software: a new challenge for archives & museums. Archives & Museum Informatics, Pittsburgh, PA.
 Documentation strategy approaches archival practice as a collaborative work among record creators, archivists, and users. It often traverses institutions and represents an alternative approach by prompting extensive documentation organized around an “ongoing issue or activity or geographic area.” See: Samuels, H. (1991). “Improving our disposition: Documentation strategy,” Archivaria 33, http://web.utk.edu/~lbronsta/Samuels.pdf.
The results of two applied research projects provide examples from the digital preservation literature. In 2002, the Agency to Research Project at the National Archives of Australia developed a conceptual model based on software performance as a measure of the effectiveness of digital preservation strategies. See: Heslop, H., Davis, S., Wilson, A. (2002). “An approach to the preservation of digital records,” National Archives of Australia. In a 2008 JISC report, Matthews et al. proposed a composite view of software with the following four entities: package, version, variant, and download. See: Matthews, B., McIlwrath, B., Giaretta, D., Conway, E. (2008). “The significant properties of software: A study,” https://epubs.stfc.ac.uk/manifestation/9506.
 Technical guidelines for digitizing archival materials for electronic access: Creation of production master files–raster images. (2005). Washington, D.C.: Digital Library Federation, https://lccn.loc.gov/2005015382/
 Moore, R. (2008). “Towards a theory of digital preservation”, International Journal of Digital Curation 3(1).
 Thinking about software as infrastructure provides a useful framing for envisioning strategies for curation. Infrastructure perspectives advocate “adopting a long term rather than immediate timeframe and thinking about infrastructure not only in terms of human versus technological components but in terms of a set of interrelated social, organizational, and technical components or systems (whether the data will be shared, systems interoperable, standards proprietary, or maintenance and redesign factored in).” See: Bowker, G.C., Baker, K., Millerand, F. & Ribes, D. (2010). “Toward information infrastructure studies: Ways of knowing in a networked environment.” In J. Hunsinger, L. Klastrup, & M. Allen (Eds.), International handbook of Internet research. Dordrecht: Springer, 97-117.