Why Search Is Not a Solved (by Google) Problem, and Why Universities Should Care: Ophir Frieder’s Talk

Ophir Frieder, who holds the Robert L. McDevitt, K.S.G., K.C.H.S. and Catherine H. McDevitt L.C.H.S. Chair in Computer Science and Information Processing at Georgetown University and is Professor of Biostatistics, Bioinformatics, and Biomathematics at the Georgetown University Medical Center, gave this talk on Searching in Harsh Environments as part of the Program on Information Science Brown Bag Series.

In the talk, illustrated by the slides below, Ophir rebuts the myth that “Google has solved search”, and discusses the challenges of searching for complex objects, through hidden collections, and in harsh environments.

In his abstract, Ophir summarizes as follows:

Many consider “searching” a solved problem, and for digital text processing, this belief is factually based.  The problem is that many “real world” search applications involve “complex documents”, and such applications are far from solved.  Complex documents, or less formally, “real world documents”, comprise of a mixture of images, text, signatures, tables, etc., and are often available only in scanned hardcopy formats.   Some of these documents are corrupted.  Some of these documents, particularly of historical nature, contain multiple languages.  Accurate search systems for such document collections are currently unavailable.

The talk discussed three projects. The first project involved developing methods to search collections of complex digitized documents which varied in format, length, genre, and digitization quality; contained diverse fonts, graphical elements, and handwritten annotations; and were subject to errors due to document deterioration and from the digitization process. A second project involved developing methods to enable searchers who arrive with sparse, fragmentary, error-ridden clues about places and people to successfully find relevant connected information in the Archives Section of the United States Holocaust Memorial Museum. A third project involved monitoring Twitter for public health events without relying on a prespecified hypothesis.

Across these projects, Frieder raised a number of themes:

  • Searching on complex objects is very different from searching the web. Substantial portions of complex objects are invisible to current search, and current search engines do not understand the semantics of relationships within and among objects, which makes the right answers hard to find.
  • Searching across most online content now depends on proprietary algorithms, indices, and logs.
  • Researchers need to be able to search collections of content that may never be made available publicly online by Google or other companies.

Despite the increasing amount of born-digital material, I speculate that these issues will become more salient to research, and that libraries have a role to play in addressing them.

While much of the “scholarly record” is currently being produced in the form of PDFs, which are amenable to the Google searching approach, much web-based content is dynamically generated and customized, and scholarly publications are increasingly incorporating dynamic and interactive features. Searching these effectively will require engaging with scientific output as complex objects.

Further, some areas of science, such as the social sciences, increasingly rely on proprietary collections of big data from commercial sources. Much of this growing evidence base is currently accessible only through proprietary APIs. To meet the heightened requirements for transparency and reproducibility, stewards are needed for these data who can ensure nondiscriminatory long-term research access.

More generally, it is increasingly well recognized that the evidence base of science includes not only published articles and community datasets (and benchmarks), but may also extend to scientific software, replication data, workflows, and even electronic lab notebooks. The article produced at the end is simply a summary description of one pathway through the evidence reflected in these scientific objects. Validating, reproducing, and building on science may increasingly require access to, search over, and understanding of this entire complex set.

See also: drmaltman