Representing 2D chemical structures on a computer


Historic ways of representing chemicals


  • Trivial name, e.g. baking soda, aspirin, citric acid. Identifies the compound, but gives little or no information about its composition
  • Chemical formula, e.g. C6H12O6. Specifies the type and quantity of the atoms in the compound, but not its structure (i.e. how the atoms are connected by bonds)
  • Systematic name, e.g. 1,2-dibromo-3-chloropropane. Identifies the atoms present and how they are connected by bonds.
  • 2D chemical structure diagram
  • 3D chemical structure diagram

Computer representations


Early pioneers in using computers in chemistry considered two questions: how do we communicate structural information between humans and (text-only) computers, and how do we represent the atoms and bonds in a molecule once they are stored internally on a computer? The answer to the former question was line notations: compact ways of representing 2D structures as a text string. The earliest example was Wiswesser Line Notation, followed by Beilstein's ROSDAL (which is still used in a limited fashion today). Early work was also done on using linear notations for indexing structures, including the Lawson Number.

Today, linear representations are still extremely useful, not because computers can only work in text, but because a short text string remains a compact and convenient way of storing and communicating structures. The most popular current linear representations are SMILES and InChI, although some others are in use, such as SLN. Here is an example of the SMILES for a common drug:

tylenol2.gif
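
To make the notation concrete, here is a minimal Python sketch that tokenizes a SMILES string and counts its heavy (non-hydrogen) atoms. It assumes the commonly quoted SMILES for acetaminophen, CC(=O)Nc1ccc(O)cc1, which may be written differently in the figure. The function name count_heavy_atoms is hypothetical, and the regular expression only understands unbracketed "organic subset" atoms; a real application would use a cheminformatics toolkit instead.

```python
import re
from collections import Counter

# Assumed SMILES for acetaminophen (paracetamol); not copied from the figure.
ACETAMINOPHEN = "CC(=O)Nc1ccc(O)cc1"

# Unbracketed "organic subset" atoms only; two-letter symbols must come first.
ATOM_PATTERN = re.compile(r"Cl|Br|[BCNOPSFI]|[bcnops]")

def count_heavy_atoms(smiles):
    """Count non-hydrogen atoms by element, treating aromatic 'c' as 'C'."""
    counts = Counter()
    for symbol in ATOM_PATTERN.findall(smiles):
        counts[symbol.capitalize()] += 1
    return dict(counts)

print(count_heavy_atoms(ACETAMINOPHEN))
```

The counts (8 carbons, 1 nitrogen, 2 oxygens) agree with the heavy atoms in the molecular formula C8H9NO2, but unlike the formula, the string also records how those atoms are connected.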

Good tools for trying out SMILES are the Daylight Depict tool and the Molinspiration JME editor. Linear notations are not the only way of communicating structure: also popular are file-based formats such as MDL's MOL/SDF and CML (an XML-based format). These have the advantage of flexibility, although they are much more verbose.

Internally, a 2D structure is represented in the same way as a mathematical graph (which is useful - see later!). The atom lookup table assigns a unique number to each atom and lists other properties such as its element; the connection table is an adjacency matrix that shows which atoms are bonded to which other atoms, with the bond order given by the number in each cell (i.e. 1=single bond, 2=double bond, 3=triple bond). By convention, a 4 can be used for an "aromatic" bond. Here is an example atom lookup table and connection table for Acetaminophen (Tylenol, Paracetamol):

ctable.gif
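
As a rough sketch of the same idea in code, the Python below builds an atom lookup table and a connection table for the heavy atoms of acetaminophen. The atom numbering is an arbitrary assumption made for this example and need not match the figure, and the names ATOMS, BONDS and connection_table are hypothetical.

```python
# Atom lookup table: index -> element symbol (hydrogens left implicit).
# The numbering is an arbitrary choice for illustration.
ATOMS = ["C", "C", "O", "N", "C", "C", "C", "C", "O", "C", "C"]

# Bonds as (atom_i, atom_j, order); order 4 marks an aromatic bond.
BONDS = [
    (0, 1, 1), (1, 2, 2), (1, 3, 1), (3, 4, 1),   # CH3-C(=O)-N-ring
    (4, 5, 4), (5, 6, 4), (6, 7, 4),              # aromatic ring
    (7, 9, 4), (9, 10, 4), (10, 4, 4),
    (7, 8, 1),                                    # ring carbon to hydroxyl O
]

def connection_table(n_atoms, bonds):
    """Build a symmetric adjacency matrix whose cells hold bond orders."""
    table = [[0] * n_atoms for _ in range(n_atoms)]
    for i, j, order in bonds:
        table[i][j] = order
        table[j][i] = order
    return table

TABLE = connection_table(len(ATOMS), BONDS)
for symbol, row in zip(ATOMS, TABLE):
    print(symbol, row)
```

Because most cells in the matrix are zero, practical systems often store only the bond list, but the matrix form makes the link to graph theory explicit.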
Note that if we need to ensure that the same molecule is numbered the same way each time, we need an algorithm that numbers atoms consistently according to rules. Fortunately, this can be done with the Morgan Algorithm (see Leach & Gillet). In this algorithm, each atom is first given a "connectivity value" equal to the number of atoms it is connected to. Each value is then iteratively replaced by the sum of the connectivity values of the atom's neighbors, until the number of distinct values stops increasing. Atoms are then numbered in decreasing order of connectivity value. In the case of a tie, other properties are used (e.g. atomic number, bond order, etc). This numbering is an important basis for producing canonical representations, e.g. canonical SMILES.
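
The iteration described above can be sketched in a few lines of Python. This is a simplified illustration rather than a full canonical numbering: it computes the connectivity values only and leaves tie-breaking aside. The neighbor lists use an assumed, arbitrary numbering of acetaminophen's heavy atoms, and the names NEIGHBORS and morgan_values are hypothetical.

```python
# Neighbor lists for acetaminophen's heavy atoms (assumed, arbitrary numbering):
# 0 CH3, 1 C(=O), 2 carbonyl O, 3 N, 4-7 and 9-10 ring carbons, 8 hydroxyl O.
NEIGHBORS = {
    0: [1], 1: [0, 2, 3], 2: [1], 3: [1, 4], 4: [3, 5, 10], 5: [4, 6],
    6: [5, 7], 7: [6, 8, 9], 8: [7], 9: [7, 10], 10: [9, 4],
}

def morgan_values(neighbors):
    """Iterate connectivity values until the count of distinct values stops growing."""
    # Start with the number of neighbors of each atom.
    values = {atom: len(nbrs) for atom, nbrs in neighbors.items()}
    while True:
        # Replace each value by the sum of the neighbors' current values.
        updated = {atom: sum(values[n] for n in nbrs)
                   for atom, nbrs in neighbors.items()}
        if len(set(updated.values())) <= len(set(values.values())):
            return values  # the previous round already had the most distinct values
        values = updated

VALUES = morgan_values(NEIGHBORS)
# Highest value first; ties would need extra rules (element, bond order, ...).
RANKING = sorted(VALUES, key=lambda atom: -VALUES[atom])
print(VALUES)
print(RANKING)
```

In this example the ring carbon attached to nitrogen ends up with the highest value. The remaining ties fall between atoms that occupy equivalent positions in the graph (the two pairs of mirror-image ring carbons, and the methyl carbon versus the carbonyl oxygen); the last pair is exactly the kind of tie that is resolved by looking at other properties such as the element.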

Representation nuances


We now have some neat, simple ways of representing and communicating 2D chemical structures. However, there are some nuances of chemistry that complicate matters, in particular stereochemistry, aromaticity and tautomerism:

nuances.gif
Most representations don't inherently store stereochemical information, so we must decide whether we actually want to differentiate stereoisomers (in some instances, such as thalidomide, it makes a life or death difference!). This can be done at the representation level or at the computation level. Stereoisomerism is addressed in Isomeric SMILES and InChI.
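
As an illustration of making that choice at the computation level, the Python sketch below compares two isomeric SMILES strings with and without their stereo marks. The strings are assumed to be the two enantiomers of alanine, and strip_stereo is a hypothetical helper that works on the text only. Comparing strings like this is valid only when both were written with the same atom order; a real system would canonicalize the structures first.

```python
import re

# Assumed isomeric SMILES for the two enantiomers of alanine.
ALANINE_A = "N[C@@H](C)C(=O)O"
ALANINE_B = "N[C@H](C)C(=O)O"

def strip_stereo(smiles):
    """Remove chirality marks (@, @@) and double-bond direction marks from SMILES text."""
    return re.sub(r"[@/\\]", "", smiles)

# With stereo marks the two strings are different molecules...
print(ALANINE_A == ALANINE_B)
# ...without them they collapse onto the same 2D structure.
print(strip_stereo(ALANINE_A) == strip_stereo(ALANINE_B))
```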

For aromaticity, it is not always clear whether a ring should be considered "aromatic" or not, and even if it is, it may be represented either with alternating single and double bonds or in "aromatic" form. This can be addressed at the representation or computation level.

For tautomerism, the same functional group can be represented differently, either through different conventions or to indicate a particular state (usually at a particular pH). Tautomerism is addressed in InChI.

The usefulness of graph theory


Graph theory is the branch of mathematics that studies graphs: sets of objects (nodes) with links between them (edges).
How does this apply to chemical structures? Well, if we consider atoms as nodes and bonds as edges, we have access to a large number of graph theory algorithms. For example, comparing two chemical structures to see if they are the same becomes a graph isomorphism problem, and determining whether a chemical structure contains a given substructure becomes a subgraph isomorphism problem, solvable with the Ullmann algorithm (see Leach & Gillet).
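
To show what substructure searching looks like as a graph problem, here is a minimal Python sketch. It uses plain backtracking, not the Ullmann algorithm, which adds a refinement step to prune candidates earlier. The atom numbering for acetaminophen is an assumption made for this example, and find_substructure is a hypothetical name.

```python
# Target: acetaminophen heavy atoms and bonds (assumed numbering; 4 = aromatic).
TARGET_ATOMS = ["C", "C", "O", "N", "C", "C", "C", "C", "O", "C", "C"]
TARGET_BONDS = [(0, 1, 1), (1, 2, 2), (1, 3, 1), (3, 4, 1), (4, 5, 4), (5, 6, 4),
                (6, 7, 4), (7, 8, 1), (7, 9, 4), (9, 10, 4), (10, 4, 4)]

# Query: an amide fragment, C(=O)N.
AMIDE_ATOMS = ["C", "O", "N"]
AMIDE_BONDS = [(0, 1, 2), (0, 2, 1)]

def bond_lookup(bonds):
    """Map both (i, j) and (j, i) to the bond order."""
    table = {}
    for i, j, order in bonds:
        table[(i, j)] = order
        table[(j, i)] = order
    return table

def find_substructure(q_atoms, q_bonds, t_atoms, t_bonds):
    """Return a list mapping query atom index -> target atom index, or None."""
    qb, tb = bond_lookup(q_bonds), bond_lookup(t_bonds)

    def extend(mapping):
        q = len(mapping)  # index of the next query atom to place
        if q == len(q_atoms):
            return mapping
        for t in range(len(t_atoms)):
            if t in mapping or t_atoms[t] != q_atoms[q]:
                continue
            # Every query bond to an already placed atom must exist in the
            # target between the mapped atoms, with the same bond order.
            if all((q, p) not in qb or tb.get((t, mapping[p])) == qb[(q, p)]
                   for p in range(q)):
                found = extend(mapping + [t])
                if found is not None:
                    return found
        return None  # dead end: backtrack

    return extend([])

print(find_substructure(AMIDE_ATOMS, AMIDE_BONDS, TARGET_ATOMS, TARGET_BONDS))
```

With the numbering used here, the amide query maps onto target atoms 1, 2 and 3, while a carboxylic acid query (the same atoms and bonds with a second oxygen in place of the nitrogen) finds no match.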

Representing reactions


Structural representations of reactions need to identify only the reactants and products, and possibly which reactant atom maps to which product atom; other information such as stoichiometry and yield is generally stored separately. Reaction SMILES is a superset of SMILES that adds the ">" symbol to stand for the reaction arrow and to separate the reactants, agents and products, and that can optionally label atoms with map numbers. SMIRKS is a related language, combining features of SMILES and SMARTS, that uses mapped atoms to describe how a transformation changes a structure. Note that Reaction SMILES and SMIRKS are languages for representing transformations, which may or may not be valid reactions. For example, a common use for SMIRKS is representing generic reaction rules.
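
A short Python sketch shows how a Reaction SMILES string breaks down into its parts. It assumes the Daylight convention of "reactants>agents>products", with "." separating the molecules inside each part; the example reaction (acid-catalysed esterification of acetic acid with ethanol) and the name split_reaction are chosen for illustration only.

```python
def split_reaction(reaction_smiles):
    """Split 'reactants>agents>products' into three lists of component SMILES."""
    parts = reaction_smiles.split(">")
    if len(parts) != 3:
        raise ValueError("expected exactly two '>' separators")
    # An empty part (e.g. no agents in 'A>>B') becomes an empty list.
    return tuple(part.split(".") if part else [] for part in parts)

# Assumed example: acetic acid + ethanol, acid catalyst, ethyl acetate + water.
REACTANTS, AGENTS, PRODUCTS = split_reaction("CC(=O)O.OCC>[H+]>CC(=O)OCC.O")
print(REACTANTS, AGENTS, PRODUCTS)
```

Nothing in the string itself guarantees that the transformation is chemically sensible or balanced, which is why stoichiometry and yield are kept elsewhere.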

Representing generic (Markush) structures


Genericized forms of chemical structures are thought to have been first introduced by Eugene Markush in 1924 as part of a patent (prior to that, patents were for specific structures). Thus the term "Markush structures" came to be used for 2D representations that describe more than one actual structure (for example, by enumerating alternate groups at particular points of the molecule). Representing generic structures is difficult because a Markush structure can represent an unlimited number of compounds (e.g. when a substituent is defined only as an "aryl group"). However, this problem has been addressed with text-based languages for describing generic structures, such as GENSAL, and with extended connection table representations for internal use. These representations are widely used in patent searching systems. We will be looking at Markush structures in more detail in a later class.
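
For the simple case where every variable position has a finite list of alternatives, a Markush structure can be expanded by brute force. The Python sketch below does this with a SMILES template; the template, the R-group lists and the name enumerate_markush are all hypothetical and are not taken from GENSAL or any patent system.

```python
from itertools import product

# Hypothetical generic structure: a para-disubstituted benzene with two
# variable positions, written as a SMILES template.
TEMPLATE = "{R1}c1ccc({R2})cc1"
R_GROUPS = {
    "R1": ["C", "CC"],        # methyl, ethyl
    "R2": ["O", "N", "Cl"],   # hydroxyl, amino, chloro
}

def enumerate_markush(template, r_groups):
    """Yield one SMILES string per combination of R-group choices."""
    names = sorted(r_groups)
    for choice in product(*(r_groups[name] for name in names)):
        yield template.format(**dict(zip(names, choice)))

STRUCTURES = list(enumerate_markush(TEMPLATE, R_GROUPS))
print(len(STRUCTURES))
for smiles in STRUCTURES:
    print(smiles)
```

Two choices at one position and three at the other give six specific structures. Enumeration of this kind stops being possible as soon as a position is defined by an open-ended class such as "aryl group", which is why dedicated generic representations are needed.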