A Libraries & Big Data Brown Bag brought to you by the Information Science program at the MIT Libraries.
Many consider “searching” a solved problem, and for digital text processing, this belief is well founded. The problem is that many "real world" search applications involve "complex documents", and such applications are far from solved. Complex documents, or less formally, "real world documents", comprise a mixture of images, text, signatures, tables, etc., and are often available only in scanned hardcopy formats. Some of these documents are corrupted. Some, particularly those of a historical nature, contain multiple languages. Accurate search systems for such document collections are currently unavailable.
This class describes efforts at building a complex document information-processing prototype. This prototype integrates "point solution" (mature) technologies, including document readability enhancement, OCR, signature matching, handwritten word spotting, and search and mining approaches, among others, to yield a system capable of searching "real world documents". The described prototype demonstrates the adage that "the whole is greater than the sum of its parts". Previous complex document benchmark development efforts are likewise presented.
Having described “real world” search issues, this class focuses on spelling correction in adverse environments. Two environments will be discussed: foreign name search and medical term search. In support of the Yizkor Books project of the Archives Section of the United States Holocaust Memorial Museum, we developed novel foreign name search approaches that compare favorably against the state of the art. By segmenting names, fusing individual results, and filtering via a threshold, our approach yields a statistically significant improvement over the traditional Soundex and n-gram-based techniques used to search such texts. Thus, previously unsuccessful searches are now supported. Using a similar approach within the medical domain, automated term corrections are made to reduce transcription errors.
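For readers unfamiliar with the building blocks named above, the following is a minimal sketch of how segmenting, fusing, and thresholding could fit together. It is not the method presented in the talk. The segmentation rule (splitting on spaces and hyphens), the equal weighting of Soundex and bigram scores, the averaging used for fusion, the 0.6 threshold, and all function names are assumptions made purely for illustration.

```python
import re

# Standard American Soundex digit groups.
_CODES = {ch: digit
          for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                                 ("l", "4"), ("mn", "5"), ("r", "6"))
          for ch in letters}


def soundex(name: str) -> str:
    """Return the classic four-character Soundex code (letter + 3 digits)."""
    name = "".join(ch for ch in name.lower() if ch.isalpha())
    if not name:
        return ""
    result = name[0].upper()
    prev = _CODES.get(name[0], "")
    for ch in name[1:]:
        digit = _CODES.get(ch, "")
        if digit and digit != prev:
            result += digit
        if ch not in "hw":  # h and w do not separate letters with equal codes
            prev = digit
    return (result + "000")[:4]


def bigrams(text: str) -> set:
    """Character bigrams of the padded, lowercased string."""
    padded = f"#{text.lower()}#"
    return {padded[i:i + 2] for i in range(len(padded) - 1)}


def ngram_similarity(a: str, b: str) -> float:
    """Dice coefficient over character bigrams, in [0, 1]."""
    x, y = bigrams(a), bigrams(b)
    return 2 * len(x & y) / (len(x) + len(y))


def segments(name: str) -> list:
    """Assumed segmentation: split a full name on whitespace and hyphens."""
    return [part for part in re.split(r"[\s\-]+", name.strip()) if part]


def segment_score(query_part: str, candidate_part: str) -> float:
    """Assumed fusion of the two matchers: equal-weight average."""
    phonetic = 1.0 if soundex(query_part) == soundex(candidate_part) else 0.0
    return 0.5 * phonetic + 0.5 * ngram_similarity(query_part, candidate_part)


def name_score(query: str, candidate: str) -> float:
    """Fuse per-segment results: best match per query segment, then the mean."""
    q_parts, c_parts = segments(query), segments(candidate)
    if not q_parts or not c_parts:
        return 0.0
    best = [max(segment_score(q, c) for c in c_parts) for q in q_parts]
    return sum(best) / len(best)


def search(query: str, names: list, threshold: float = 0.6) -> list:
    """Return candidate names scoring at or above the threshold, best first."""
    scored = sorted(((name_score(query, n), n) for n in names), reverse=True)
    return [name for score, name in scored if score >= threshold]


if __name__ == "__main__":
    index = ["Rabinowicz Moshe", "David Goldberg"]
    print(search("Moshe Rabinowitz", index))
```

Because each query segment is matched against every candidate segment, reordered names such as "Rabinowicz Moshe" can still be retrieved, while unrelated names fall below the threshold. Different weights, fusion rules, or thresholds would change which borderline spellings are returned.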
Finally, we focus on analyzing social media, an additional, non-traditional search environment. By searching and mining such data, unknown or unexpected trends can be detected. We explore and demonstrate the validity of the approach in the healthcare space.
About the discussant:
Ophir Frieder holds the Robert L. McDevitt, K.S.G., K.C.H.S. and Catherine H. McDevitt L.C.H.S. Chair in Computer Science and Information Processing and previously served as the Chair of the Department of Computer Science at Georgetown University. He is also Professor of Biostatistics, Bioinformatics and Biomathematics in the Georgetown University Medical Center. In addition to his academic positions, he is the Chief Scientific Officer for UMBRA Health Corp. (UHC). He is a Fellow of the AAAS, ACM, IEEE, and NAI.
Information Science Brown Bag talks, hosted by the Program on Information Science, consist of regular discussions and brainstorming sessions on all aspects of information science and uses of information science and technology to assess and solve institutional, social and research problems. These are informal talks. Discussions are often inspired by real-world problems being faced by the lead discussant. The Information Science Program will provide lunch; please bring your favorite beverage and plenty of questions.