Chemical structures on the web and in the scholarly literature

For chemical structure information on the web and in journal articles to be useful, it has to be:
  • Machine readable - i.e. the structure should be in a form that can be read and understood by a computer
  • Discoverable - i.e. you can search and find documents that contain chemical structures
  • Accessible - i.e. the content of the articles should be accessible by a computer
  • Contextualized - i.e. the structure can be related to the surrounding text in which it is referenced

In an ideal world, publishers of material on the web and in journal articles would include canonical representations of chemical structures with the article (either tagged in the article or as supplemental material) along with tagging or annotation of related material (genes, biological actions, reactions, and so on). We currently have a number of barriers to achieving this, so we will look at what these barriers are and how they can be overcome.

Making structures in documents machine readable


Most documents are not designed with machine processing in mind: they are designed to be read by humans. Humans are very good at pattern recognition and language processing, and this is reflected in how compounds are represented. Here are some ways that ibuprofen might be referenced in a paper, for instance:

IBUPROFEN, MOTRIN, ANDRAN, BRUFEN, LIPTAN, ADVIL, BUTYLENIN, IBUPROCEN, ANFLAGEN, BUBURONE
2-[4-(2-methylpropyl)phenyl]propanoic acid
All over-the-counter NSAIDs
[Images: ipuprofen4.gif, ibuprofen.gif, ibuprofen2.jpg, ibuprofen3.jpg, ibuprofen_syn1.gif]

By far the best way to address this is, of course, for authors to be mindful of the need for structural information to be machine readable, and to supply machine readable chemical structures in one of the standard formats (SMILES, InChI, etc.) along with an article - either with a link or reference to the representation within the article, or as supplemental material. Structures in supplemental material can be harder to contextualize than those linked directly in the document. Once this information is available, it is easy to link it using standardized techniques. For example, in an HTML document we might simply have a link to a structure file, e.g.
... we found that long term use of <a href="ibuprofen.sdf">Ibuprofen</a> is associated
  with an elevated risk of stroke ...
or a link to a database record, such as one in PubChem:
... we found that long term use of <a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=3672">
 Ibuprofen</a> is associated with an elevated risk of stroke ...
or, even better, tag the compound with XML, although this raises the issue of standardizing the tag and attribute names:
... we found that long term use of <COMPOUND SMILES="CC(C)CC1=CC=C(C=C1)C(C)C(=O)O">
 Ibuprofen</COMPOUND> is associated with an elevated risk of stroke ...
Or we could have an InChI instead of the SMILES.
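
Once compounds are tagged inline like this, pulling them back out is straightforward. The following is a minimal sketch, assuming the hypothetical COMPOUND element and SMILES attribute from the example above (not an established schema). The function name is our own, and a regular expression is used only for brevity; a real pipeline would more likely use a proper XML parser.

```python
import re

# Sample text using the hypothetical <COMPOUND SMILES="..."> markup shown above.
TEXT = (
    '... we found that long term use of '
    '<COMPOUND SMILES="CC(C)CC1=CC=C(C=C1)C(C)C(=O)O">\n Ibuprofen</COMPOUND> '
    'is associated with an elevated risk of stroke ...'
)

# Capture the SMILES attribute value and the tagged name.
COMPOUND_TAG = re.compile(
    r'<COMPOUND\s+SMILES="([^"]+)"\s*>(.*?)</COMPOUND>',
    re.DOTALL,
)


def extract_compounds(text):
    """Return a list of (name, smiles) pairs for every tagged compound."""
    return [(name.strip(), smiles) for smiles, name in COMPOUND_TAG.findall(text)]


compounds = extract_compounds(TEXT)
print(compounds)
```

The same approach would work for an InChI attribute by changing the attribute name in the pattern.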

The problem would be solved if structural representations were provided for all documents. But right now they are not: hardly any web or journal documents contain this information. So what can we do?

Well, we have to try to recreate machine-readable forms from human readable forms. There are a few ways of doing this:

  • Name ontologies (really just synonym lookups)
  • Name to structure conversion programs (e.g. OSCAR3, Lexichem)
  • Image to structure conversion programs (e.g. ChemReader, CLiDE)

Natural language processing can help to differentiate structures based on syntactic context. These methods are far from perfect, but on the whole they work quite well. However, there are many challenges: for example, how to handle references to groups of compounds or generic compounds (NSAIDs, COX-2 inhibitors, etc.), and how to handle complex diagrams.
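
The first of these approaches, the synonym lookup, can be sketched in a few lines. This is only an illustration under stated assumptions: the tiny synonym table below is hand-built from the ibuprofen names listed earlier, the function and variable names are hypothetical, and a real system would load a curated dictionary and handle multi-word names.

```python
import re

IBUPROFEN_SMILES = "CC(C)CC1=CC=C(C=C1)C(C)C(=O)O"

# Illustrative synonym table (assumption: a real one comes from a curated source).
SYNONYM_TO_NAME = {
    "ibuprofen": "ibuprofen",
    "motrin": "ibuprofen",
    "advil": "ibuprofen",
    "brufen": "ibuprofen",
}
NAME_TO_SMILES = {"ibuprofen": IBUPROFEN_SMILES}


def find_compounds(text):
    """Map each recognized compound in the text to its SMILES string."""
    found = {}
    # Split into simple word tokens and look each one up case-insensitively.
    for word in re.findall(r"[A-Za-z][A-Za-z0-9-]*", text):
        name = SYNONYM_TO_NAME.get(word.lower())
        if name is not None:
            found[name] = NAME_TO_SMILES[name]
    return found


hits = find_compounds("Patients took Advil or MOTRIN daily.")
print(hits)
```

Note that a lookup like this cannot resolve group references such as "NSAIDs", which is one of the challenges mentioned above.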

Currently the easiest way of using this information is to create databases that link paper or document IDs (e.g. a PubMed ID or a URL) with SMILES, InChIs, or some other compound ID (e.g. a PubChem CID).
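
Such a linking database can be very simple. The sketch below uses an in-memory SQLite table; the table name, column names, and document IDs are made up for illustration, and only the PubChem CID (3672, from the ibuprofen link above) comes from this page.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# One row per (document, compound) mention.
conn.execute(
    "CREATE TABLE doc_compound ("
    " doc_id TEXT NOT NULL,"
    " pubchem_cid INTEGER NOT NULL,"
    " PRIMARY KEY (doc_id, pubchem_cid))"
)

# Placeholder document IDs (not real articles); CID 3672 is ibuprofen.
rows = [
    ("PMID:0000001", 3672),
    ("https://example.org/paper-2", 3672),
]
conn.executemany("INSERT INTO doc_compound VALUES (?, ?)", rows)


def documents_mentioning(cid):
    """Return the IDs of all documents linked to the given PubChem CID."""
    cur = conn.execute(
        "SELECT doc_id FROM doc_compound WHERE pubchem_cid = ? ORDER BY doc_id",
        (cid,),
    )
    return [row[0] for row in cur]


docs = documents_mentioning(3672)
print(docs)
```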

Note that most document formats are currently very unfriendly for computers: for example, PDFs look good to human readers but "destroy" data (e.g. a table becomes an image or just text). An alternative has been dubbed the "datument".

Discoverability


In a web context, SMILES and InChIs are problematic for searching because of their length and their use of punctuation symbols. The InChIKey, a fixed-length hashed version of the InChI, was developed to address this. Here is the InChIKey for ibuprofen: HEFNNWSXXWATRW-UHFFFAOYSA-N. You can try doing a Google search with this key.
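
Because an InChIKey has a fixed layout (blocks of 14, 10, and 1 uppercase letters separated by hyphens), keys are also easy to spot in free text. Here is a minimal sketch of a pattern that finds candidate keys; the function name is our own, and the pattern checks only the layout, not whether a match corresponds to a real compound.

```python
import re

# Layout of a standard InChIKey: 14 letters, hyphen, 10 letters, hyphen, 1 letter.
INCHIKEY = re.compile(r"\b[A-Z]{14}-[A-Z]{10}-[A-Z]\b")


def find_inchikeys(text):
    """Return every substring of the text that has the InChIKey layout."""
    return INCHIKEY.findall(text)


keys = find_inchikeys(
    "Ibuprofen (HEFNNWSXXWATRW-UHFFFAOYSA-N) is a widely used NSAID."
)
print(keys)
```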

In addition to using an existing search engine, we could create a web crawler specifically for chemical information, and index it in a database.

Contextualization


If a structure is marked up in a document, or is in supplemental material but tagged with a location in the document, then we can attempt to contextualize the molecule by looking at the words in the sentence, paragraph, and document in which the structure is contained. There are two main ways of doing this: statistical analysis (where we measure how often the compound co-occurs with other terms of interest in the document), and natural language processing (where we analyze the text to understand the syntax).

The simplest statistical analysis is to look for co-location of terms in the text. For example, we can look not just at individual compounds in the text, but at all the compounds and how they relate to each other (are they similar? Do they fall into certain groups?). We can also look at co-occurrence of compounds with ontological terms from a domain, for example Gene Ontology terms. Previous work on text analysis for biology and other fields has looked at questions such as how to weight the abstract against the full text of a journal article, or how to weight co-location in a sentence, paragraph, or document. See, for example, BMC Bioinformatics 2009, 10:311 and BMC Bioinformatics 2009, 10:46. Very little work has been done in chemistry or with chemical structures.
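
As an illustration of the simplest case, the sketch below counts how often a compound and a term of interest appear in the same sentence. The word lists, sample text, and function name are all made up for the example, the sentence splitter is deliberately naive, and no weighting between sentence, paragraph, and document level is attempted.

```python
import re
from collections import Counter
from itertools import product

# Illustrative vocabularies (assumption: real ones come from dictionaries or ontologies).
COMPOUNDS = {"ibuprofen", "aspirin"}
TERMS = {"stroke", "inflammation"}


def cooccurrence(text):
    """Count sentence-level co-occurrences of (compound, term) pairs."""
    counts = Counter()
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        words = set(re.findall(r"[a-z]+", sentence.lower()))
        for compound, term in product(COMPOUNDS & words, TERMS & words):
            counts[(compound, term)] += 1
    return counts


SAMPLE = (
    "Ibuprofen use was associated with stroke. "
    "Aspirin reduced inflammation. "
    "Ibuprofen also reduced inflammation, and stroke risk was monitored."
)
counts = cooccurrence(SAMPLE)
print(counts)
```

Raw counts like these would normally be compared against what chance alone would produce before drawing any conclusion.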

Natural language processing takes this one step further by identifying what kinds of words are present (nouns, verbs, prepositions, etc.), which allows real relationships to be established (for example, compound x inhibits protein y). Some initial work on this has been done at Indiana (see Journal of Chemical Information and Modeling, 2009; 49(2), pp 263-269).
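
To show the kind of output such an analysis aims for, here is a crude pattern-based stand-in. It is not real natural language processing (there is no part-of-speech tagging or parsing, and it is not the method of the paper cited above); it only matches "subject verb object" with a small, assumed list of relation verbs. All names are hypothetical.

```python
import re

# Assumed relation verbs; a real system would derive relations from a parse.
RELATION = re.compile(
    r"\b(\w[\w-]*)\s+(inhibits|activates|binds)\s+(\w[\w-]*)",
    re.IGNORECASE,
)


def extract_relations(text):
    """Return (subject, verb, object) triples matched by the simple pattern."""
    return [(subj, verb.lower(), obj) for subj, verb, obj in RELATION.findall(text)]


relations = extract_relations(
    "Ibuprofen inhibits COX-2. Celecoxib binds COX-2 selectively."
)
print(relations)
```

A pattern like this misses passive or negated sentences ("COX-2 is not inhibited by ..."), which is exactly where syntactic analysis earns its keep.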

Once we establish relationships between compounds and other entities (e.g. protein targets) we can plug into all kinds of information networks. Semantic Web tools are especially useful for this, particularly OWL and RDF - see for example Bio2RDF and LODD.

Accessibility


There are three relevant levels of accessibility: access to the chemical structure information, access to related information for contextualization, and access to the full text of the article. Currently we have a real "mixed bag" of accessibility. No journals currently give direct access to chemical structure and ontology information except the very limited RSC Project Prospect. Open Access journals give free access to the full text of the article, but the number in chemistry is very limited (e.g. Chemistry Central Journal). A major development is the mandate that all government funded research publications be made freely available after a year, and the resultant PubMed Central archive. Some content is still conspicuously missing (e.g. articles from ACS journals). Even if an article is available, it may not be in a format that is suited to machine reading (e.g. PDF).