Information Retrieval


Information Retrieval (IR) is the discipline concerned with optimizing the storage, searching, and retrieval of documents, particularly textual documents from databases and the Web. Historically it has been an interdisciplinary blend of computer science, library science, linguistics, and mathematics. More recently there has been an emphasis on its use in biology, health care, and medicine, and it is considered a key subdomain of Biomedical Informatics.


Work in the 1960s developed research test systems and evaluation methodologies, though large collections or databases were lacking. A pioneering effort, led by Gerard Salton at Cornell University, was SMART (System for the Mechanical Analysis and Retrieval of Text). The National Library of Medicine (NLM) began its ongoing support of medical IR with the Medical Literature Analysis and Retrieval System (MEDLARS), one of the first electronic retrieval systems.

The exponential growth of computing power in the 1970s and 1980s supported research and large-scale testing of new techniques for content indexing and query optimization. The NLM continued its work with its MEDLINE system. Access to this medical database was largely mediated by trained librarians or other intermediaries. These databases were primarily “bibliographic”; that is, listings of articles, books, or other materials held in a library. The output of a query was a list of relevant materials to be reviewed at a local library.

The 1990s explosion in network technologies allowed more direct access to databases, and research intensified in interface and query optimization along with work using large “textual” databases. The NLM released its PubMed interface to the MEDLINE system. The Text REtrieval Conference (TREC), an annual conference and series of workshops, began in 1992 under the sponsorship of the National Institute of Standards and Technology and the Disruptive Technology Office of the U.S. Department of Defense. The annual conference provides a forum for “challenge evaluations” of new techniques against a standardized task and/or data collection. The emergence of the Web brought information querying and retrieval to the masses.


This discipline is at the forefront of modern health care research. Its techniques are central to the emerging sciences of genomics and proteomics, which require huge databases and the ability to query and extract information and relationships. The Web has provided a new, almost infinitely large test bed for research, and the development and rollout of electronic health records will depend on advances in this area.


Indexing is the process of associating terms with a document so that they can be matched against a query. There are two methods of indexing: human and automated. In the human method, a trained indexer assigns terms to a document from a controlled vocabulary. The MEDLINE system uses human indexing from the MeSH vocabulary [1].

Automated indexing is performed by computers. In its simplest form, every word in a document is indexed, though some common words may be filtered out. Terms may also be weighted by a variety of methods.
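A minimal sketch of this kind of automated indexing: tokenize a document, filter out common “stop words,” and weight each remaining term by its frequency. The stop-word list and function name here are illustrative assumptions, not part of any particular system.

```python
# Illustrative sketch of simple automated indexing with
# stop-word filtering and term-frequency weighting.
from collections import Counter

STOP_WORDS = {"the", "a", "an", "of", "and", "in", "is", "to"}

def index_document(text):
    """Return a dict mapping each indexed term to a frequency weight."""
    tokens = [w.lower().strip(".,;:") for w in text.split()]
    terms = [t for t in tokens if t and t not in STOP_WORDS]
    return dict(Counter(terms))

weights = index_document("The analysis and retrieval of text")
# Common words like "the", "and", "of" are filtered out;
# the remaining terms carry frequency-based weights.
```

More sophisticated schemes replace raw frequency with weights that also account for how rare a term is across the whole collection.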


Retrieval likewise uses two primary approaches, Boolean and natural language, either of which can be combined with either indexing method. With the Boolean approach the user queries with terms joined by AND or OR; the NOT operator may also be used to exclude terms.
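Boolean retrieval can be sketched with set operations over an inverted index, which maps each term to the documents containing it. The document IDs and contents below are made-up examples.

```python
# Hedged sketch of Boolean retrieval over an inverted index.
# Documents and their index terms (illustrative only).
docs = {
    1: {"heart", "failure", "treatment"},
    2: {"heart", "surgery"},
    3: {"cancer", "treatment"},
}

# Build the inverted index: term -> set of document IDs.
index = {}
for doc_id, terms in docs.items():
    for term in terms:
        index.setdefault(term, set()).add(doc_id)

# "heart AND treatment" -> set intersection
and_result = index["heart"] & index["treatment"]   # {1}
# "heart OR cancer" -> set union
or_result = index["heart"] | index["cancer"]       # {1, 2, 3}
# "treatment NOT heart" -> set difference
not_result = index["treatment"] - index["heart"]   # {3}
```

AND narrows a result set, OR broadens it, and NOT excludes documents, which is why Boolean querying rewards careful query construction.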

With the natural language approach the user queries with only words, and the system attempts to find matches. If the documents were indexed automatically, the weighting scheme may be used to rank documents by “relevance” [2].
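One common weighting scheme for this kind of ranked retrieval is tf-idf (term frequency times inverse document frequency); the sketch below uses it to score documents against a free-text query. The tiny document collection and function names are assumptions for illustration.

```python
# Illustrative sketch of ranked (natural language) retrieval
# using tf-idf weighting. Documents are made up.
import math

docs = {
    "d1": "heart failure treatment options",
    "d2": "surgical treatment of heart disease",
    "d3": "cancer screening guidelines",
}

def tf_idf_vectors(docs):
    """Weight each term by tf * idf (one common scheme)."""
    n = len(docs)
    tokenized = {d: text.lower().split() for d, text in docs.items()}
    df = {}  # document frequency of each term
    for words in tokenized.values():
        for term in set(words):
            df[term] = df.get(term, 0) + 1
    return {
        d: {t: words.count(t) * math.log(n / df[t]) for t in set(words)}
        for d, words in tokenized.items()
    }

def rank(query, vectors):
    """Score each document by the summed weights of matching query words."""
    q = query.lower().split()
    scores = {d: sum(v.get(t, 0.0) for t in q) for d, v in vectors.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

results = rank("heart treatment", tf_idf_vectors(docs))
# Documents mentioning the query words rank above those that do not.
```

Unlike Boolean retrieval, every document gets a score, so the user sees a ranked list rather than an unordered matching set.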


Researchers are constantly striving to improve IR systems and use measurements of effectiveness called evaluations. Again there are two general approaches: the historical method is system-oriented evaluation, and a newer approach is user evaluation.

System-oriented evaluations use mathematical or statistical approaches to evaluate IR systems against known test collections. The most popular measures are recall and precision.

  • Recall is the fraction of all relevant documents that were retrieved.
  • Precision is the fraction of retrieved documents that are relevant.

These measures can be computed automatically from known collections and used to assess IR developments.
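The two measures can be computed directly from a test collection's relevance judgments. The document IDs below are assumed for illustration.

```python
# Sketch: recall and precision against a known test collection.
relevant = {"d1", "d2", "d5", "d7"}    # judged relevant (illustrative)
retrieved = {"d1", "d2", "d3", "d4"}   # returned by the system

hits = relevant & retrieved            # relevant documents retrieved

recall = len(hits) / len(relevant)     # 2 of 4 relevant found -> 0.5
precision = len(hits) / len(retrieved) # 2 of 4 retrieved relevant -> 0.5
```

Note the tension between the two: retrieving more documents tends to raise recall but lower precision, and vice versa.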

In the user-evaluation approach, users are given a set of questions or tasks to complete. Their success at retrieving relevant documents is assessed across both system and user demographic characteristics.

Some recent studies have questioned the correlation between improvements on system-oriented measures and users' actual success at searching [3].


  1. Lowe, H. and Barnett, G. (1994) Understanding and using the Medical Subject Headings (MeSH) vocabulary to perform literature searches. Journal of the American Medical Association, 271: 1103-1108.
  2. Salton, G. (1991) Developments in automatic text retrieval. Science, 253: 974-980.
  3. Turpin, A. and Hersh, W. (2001) Why batch and user evaluations do not give the same results. Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, New Orleans, LA, ACM Press, 225-231.
  4. Hersh, W. (2009) Information Retrieval: A Health and Biomedical Perspective, 3rd edition. Springer.
  5. High speed clinical data retrieval system with event time sequence feature: with 10 years of clinical data of Hamamatsu University Hospital CPOE.

Submitted by Jeff Emch