Thoughts on Mitigating Threats to Data Quality Throughout the Curation Lifecycle

The workshop report from the UNC Curating for Data Quality workshop, in which I was delighted to participate, is now being made available. It contains many perspectives addressing a number of questions:

Data Quality Criteria and Contexts. What are the characteristics of data quality? What threats to data quality arise at different stages of the data life cycle? What kinds of work processes affect data quality? What elements of the curatiorial process most strongly affect data quality over time? How do data types and contexts influence data quality parameters?

Human and Institutional Factors. What are the costs associated with different levels of data quality? What kinds of incentives and constraints influence efforts of different stakeholders? How does one estimate the continuum from critical to tolerable errors? How often does one need to validate data?

Tools for Effective and Painless Curation. What kinds of tools and techniques exist or are required to insure that creators and curators address data quality?

Metrics. What are or should be the measures of data quality? How does one identify errors? How does one correct errors or mitigate their effects?

My current perspective, after reflecting on seven ‘quality’ frameworks from different disciplines that differ in complex and deep ways, is that the data quality criteria implied by the candidate frameworks are neither easily harmonized, nor readily quantified. Thus, a generalized systematic approach to evaluating data quality seems unlikely to emerge soon. Fortunately, developing an effective approach to digital curation that respects data quality does not require a comprehensive definition of data quality. Instead, we can appropriately address “data quality” in curation by limiting our consideration to a narrower applied questions:

Which aspects of data quality are (potentially) affected by (each stage of) digital curation activity? And how do we keep invariant data quality properties at each curation stage?

A number of approaches suggest seem particularly likely to bear fruit:

  1. Incorporate portfolio diversification in selection and appraisal.
  2. Support validation of preservation quality attributes such as authenticity, integrity, organization, and chain of custody throughout long-term preservation and use — from ingest through delivery and creation of derivative works.
  3. Apply semantic fingerprints for quality evaluation during ingest, format migration and delivery.

These approaches have the advantage of being independent of the content subject area, of the domain of measure, and of the particular semantics content of objects and collections. Therefore, they are broadly applicable. By mitigating these broad-spectrum threats to quality, we can improve the overall quality of curated collections, and their expected value to target communities.

My extended thoughts are here:

You may also be interested in the other presentations from the workshop, which are posted on the Conference Site.

See also: drmaltman