Biomedical text mining

From Clinfowiki


One of my daily tasks as a librarian is searching the published literature for articles on a specific topic. For example, a pediatrician needs the current practice guideline on immunizations. I find the needed article by searching a biomedical bibliographic database called Medline. My search strategies include MeSH (Medical Subject Headings) terms, Boolean operators, and search limits. I am part of the labor force of manual text mining and didn’t know it.
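As a rough illustration of such a search strategy, the sketch below assembles a PubMed-style query string from MeSH headings, a Boolean operator, and a publication-type limit. The helper name `build_query` is hypothetical, and the headings chosen for the pediatrician example are my assumption of a reasonable starting point, not a validated search strategy.

```python
def build_query(mesh_terms, publication_type=None):
    """Join MeSH headings with AND and optionally limit by publication type.

    Hypothetical helper for illustration: it only builds the query text
    and does not contact any database.
    """
    clauses = [f'"{term}"[MeSH Terms]' for term in mesh_terms]
    query = " AND ".join(clauses)
    if publication_type:
        # A limit narrows the result set to one kind of record.
        query += f' AND "{publication_type}"[Publication Type]'
    return query


# Assumed headings for the immunization guideline example.
query = build_query(["Immunization", "Child"], publication_type="Practice Guideline")
print(query)
```

The resulting string can be pasted into the search box of a Medline interface that accepts field tags; a librarian would normally refine it further by checking the MeSH hierarchy and adding date or language limits.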

What is text mining? “Text Mining is the discovery by computer of new, previously unknown information, by automatically extracting information from different written resources. A key element is the linking together of the extracted information together to form new facts or new hypotheses to be explored further by more conventional means of experimentation.” [1]

As information seekers, we find medical information from two types of sources: structured (e.g. blood pressure, gender, age found in medical records) and unstructured (e.g. text in documents, web pages, manuals, reports, email, faxes, presentations, and published literature). Text mining applies to both types, but text mining tools are designed to target unstructured data, such as published literature.

In biomedicine, the volume of published research has resulted in an exponentially growing knowledge base. Medline 2007, for example, contains 15 million records, a 2.5 million increase from 2004. [1] This knowledge base can help researchers in the diverse subfields of biomedicine discover new knowledge, and biomedical text mining is the means of doing so. Using this technology, researchers can identify needed information more efficiently and discover relationships obscured by the sheer volume of available information. Text mining tools such as SAS and Oracle Text help carry the burden of information overload. These tools employ algorithmic and statistical methods, context indexing, decision trees, filtering, and related techniques.


In their 2005 article, “A survey of current work in biomedical text mining,” Cohen and Hersh described five themes of active research in text mining. [3]

  • NAMED ENTITY RECOGNITION – identify all instances of a name for a specific thing (e.g. all of the gene names and symbols within a collection of articles) to extract key concepts of interest and allow those concepts to be represented in a consistent form.
  • TEXT CLASSIFICATION – determine whether a document has certain characteristics of interest. Database curators found that this technique reduced the number of abstracts they had to read by two thirds.
  • SYNONYM AND ABBREVIATION EXTRACTION – automate the collection and mapping of synonyms and abbreviations of biomedical entities.
  • RELATIONSHIP EXTRACTION – detect a specific type of relationship between two entities, e.g. biochemical association.
  • HYPOTHESIS GENERATION – identify unrecognized relationships worthy of further investigation that could lead to promising hypotheses.
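Three of the themes above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the methods surveyed by Cohen and Hersh: the lexicon is a hypothetical three-entry dictionary, the abbreviation matcher only handles the "long form (ABBR)" pattern with matching initials, and sentence-level co-occurrence is used as a crude stand-in for real relationship extraction. All function names are hypothetical.

```python
import re
from collections import Counter
from itertools import combinations

# Hypothetical mini-lexicon; a real system would load a curated vocabulary.
LEXICON = {
    "BRCA1": "GENE",
    "TP53": "GENE",
    "breast cancer": "DISEASE",
}


def find_entities(sentence):
    """Dictionary-based named entity recognition: sorted (term, type) pairs."""
    found = set()
    for term, label in LEXICON.items():
        pattern = r"\b" + re.escape(term) + r"\b"
        if re.search(pattern, sentence, flags=re.IGNORECASE):
            found.add((term, label))
    return sorted(found)


def find_abbreviations(sentence):
    """Pair 'long form (ABBR)' when ABBR equals the initials of the preceding words."""
    pairs = []
    for match in re.finditer(r"\(([A-Z]{2,})\)", sentence):
        short = match.group(1)
        words = sentence[:match.start()].split()
        candidate = words[-len(short):]  # as many words as letters in ABBR
        initials = "".join(word[0].upper() for word in candidate)
        if initials == short:
            pairs.append((short, " ".join(candidate)))
    return pairs


def cooccurring_pairs(sentences):
    """Count entity pairs that appear in the same sentence (a crude relation signal)."""
    counts = Counter()
    for sentence in sentences:
        names = [name for name, _ in find_entities(sentence)]
        for pair in combinations(names, 2):
            counts[pair] += 1
    return counts


sentences = [
    "BRCA1 mutations raise the risk of breast cancer.",
    "TP53 is often mutated in breast cancer.",
    "Patients with chronic obstructive pulmonary disease (COPD) were excluded.",
]
print(find_entities(sentences[0]))
print(find_abbreviations(sentences[2]))
print(cooccurring_pairs(sentences))
```

Pairs that co-occur often are only candidates for a relationship; deciding what kind of relationship holds, or whether two entities that never co-occur are linked through a shared third entity, is where the relationship extraction and hypothesis generation research described above goes well beyond this sketch.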


Biomedical text mining increases the usability and quality of text data, which enables researchers to use this data in the clinical decision process. However, text-mining researchers need to coordinate and cooperate across disciplines to develop tools based on real-world needs. [3]


  1. accessed May 25, 2007.
  2. accessed May 28, 2007.
  3. Cohen AM, Hersh WR. A survey of current work in biomedical text mining. Brief Bioinform. 2005 Mar; 6(1):57-71.

blp001/Beshia Popescu