Characterizing 2D structures with descriptors and fingerprints

We can create "descriptors" that tell us something about a chemical structure (such as whether or not it contains a particular functional group) and thus characterize it, but don't identify it. These are useful for a variety of purposes, especially for predictive modeling and calculation of similarity between structures. Some of the descriptors we can compute from the 2D structure include:

Simple feature counts (such as number of rotatable bonds or molecular weight)

Fragmental descriptors which indicate the presence or absence (or count) of actual or genericized substructures

Physicochemical properties

Topological indices, such as the Branching Index and the Chi Molecular Connectivity Indices

For a larger list, take a look at Leach and Gillet (Chapter 3) or the Molconn-Z Methods Manual.

Fragmental descriptors

Fragmental descriptors describe 2D structural features that are larger than one atom. These will often describe a specific substructure (such as a nitro group or carboxylic acid). As well as simple substructures, these fragments can be more complex constructs. Here are some examples:

One of the most common ways of implementing fragmental descriptors is to create a dictionary of substructures or features of interest. Each dictionary entry is mapped to a descriptor whose value for a given molecule is 0 if the feature is not present and non-zero if it is (either a 1 to indicate presence, or a count of the number of times it occurs). Here is an example (very small) dictionary of fragment features:

Descriptor    Fragment
1             C(=O)OH
2             CCCCN
3             S(=O)(=O)N
4             O= ...4 ... HET

So for the example molecule below, all would be set to "1" except for #3 which would be "0".

Physicochemical properties

These are physical and chemical properties of a molecule that can be estimated from its 2D structure. One of the most common is LogP, the logarithm of the partition coefficient of a compound between octanol and water, which is a measure of the "oiliness" of a molecule (roughly the opposite of how polar it is). Molecules need to be within certain ranges of LogP to cross cellular membranes and to enter the central nervous system. There are a variety of methods for estimating LogP, mostly involving additive rules based on the fragments present in a compound. Solubility is another common property that can be estimated from the structure.

Topological indices

Topological indices are single-value descriptors that reflect something about the nature of the chemical structure graph. One of the simplest (and earliest) was the Wiener Index, which is the sum of the number of bonds on the shortest path between every pair of atoms (equivalently, 0.5 x the sum over all ordered pairs, since that counts each pair twice). Molecular Connectivity Indices, a later development of this idea, include the Randić Branching Index and the Kier and Hall Chi Molecular Connectivity Indices. More details of these can be found in Leach & Gillet and also in the Molconn-Z Methods Manual.
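As a minimal sketch of the Wiener Index calculation, the snippet below runs a breadth-first search from every atom of a hydrogen-suppressed graph and halves the total, since each pair is reached from both ends. The adjacency-dictionary representation and the function name `wiener_index` are assumptions made for illustration, not part of any particular toolkit.

```python
from collections import deque


def wiener_index(adjacency):
    """Sum of shortest-path bond counts over all unordered pairs of atoms."""
    total = 0
    for start in adjacency:
        # Breadth-first search gives the bond count from `start` to every atom.
        dist = {start: 0}
        queue = deque([start])
        while queue:
            atom = queue.popleft()
            for nbr in adjacency[atom]:
                if nbr not in dist:
                    dist[nbr] = dist[atom] + 1
                    queue.append(nbr)
        total += sum(dist.values())
    # Every pair was counted twice (once from each end), hence the factor 0.5.
    return total // 2


# Hydrogen-suppressed carbon skeletons: atom index -> bonded atom indices.
n_butane = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
isobutane = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}

print(wiener_index(n_butane))   # 10
print(wiener_index(isobutane))  # 9
```

The more branched isomer gets the smaller value, which is the sense in which the index reflects the shape of the structure graph.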

Assembling descriptors into fingerprints

Once we have a set of descriptors, it is easy to assemble them into a "string" of descriptors that characterizes a compound. These descriptors can be binary (1,0), numeric (integers, real numbers, etc.) or categorical. In the cheminformatics world, we call these descriptor strings "fingerprints". Binary descriptors are especially useful, as there are highly efficient algorithms for working with binary strings.

In the simplest case, there is a 1:1 relationship between descriptors and positions in a fingerprint. For instance, a common usage is to have a binary fingerprint of 2D fragmental descriptors where one bit position in the bit string is mapped to one dictionary item, and the bit value (1,0) determines presence or absence of that feature.

This kind of fingerprint is sometimes known as a structural key, and the most famous example in cheminformatics is the MDL 166-key structural key (sometimes known as the MACCS or ISIS keys), which defines 166 fragments that are considered important in medicinal chemistry.

An alternative strategy for fragmental descriptors is to have a set of rules about the generation of descriptors (versus a dictionary), with descriptors being generated on-the-fly for a molecule or set of molecules. Examples of common rules include:
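A structural key of this kind can be sketched in a few lines. The example below assumes that some substructure-matching tool (not shown) has already reported which dictionary fragments occur in the molecule; the names `FRAGMENT_DICTIONARY` and `structural_key` are hypothetical, and the four entries are copied from the small example dictionary given earlier.

```python
# Bit order follows the example dictionary: position 0 is descriptor 1, and so on.
FRAGMENT_DICTIONARY = ["C(=O)OH", "CCCCN", "S(=O)(=O)N", "O= ...4 ... HET"]


def structural_key(fragments_present, dictionary=FRAGMENT_DICTIONARY):
    """Return a bit string with exactly one position per dictionary entry."""
    return "".join("1" if frag in fragments_present else "0" for frag in dictionary)


# A molecule reported to contain every fragment except descriptor 3.
found = {"C(=O)OH", "CCCCN", "O= ...4 ... HET"}
key = structural_key(found)
print(key)          # 1101
print(int(key, 2))  # 13, the same key packed into an integer
```

Packing the key into an integer is one possible design choice; it lets bitwise operators be used later when fingerprints are compared.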

All atom sequences from 2-7 atoms

All augmented atoms

Circular substructures

When there is no dictionary, there is no obvious way to map these descriptors consistently to fingerprint bits. Further, the number of fragments generated can be huge (100,000 just for the 2-7 atom sequences for C,N,S,O,P, not considering bond types or generalizations). If we created a bit position for every possible descriptor, the fingerprints would be impossibly big, and extremely sparse. Therefore, we generally use a hashing algorithm to map these descriptors onto a fixed number of bits (e.g. 1,024), and these are called hashed fingerprints.
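The hashing step can be illustrated with a deliberately simplified sketch. It enumerates linear atom paths of 2 to 7 atoms in a hydrogen-suppressed graph, ignores bond types, and folds each path into one of 1,024 bit positions. The use of CRC32 as the hash, the path string format, and the function names are all assumptions for illustration; real fingerprint implementations differ in detail.

```python
import zlib


def linear_paths(atoms, adjacency, min_atoms=2, max_atoms=7):
    """Collect every simple atom path as a string of element symbols."""
    paths = set()

    def extend(path):
        if len(path) >= min_atoms:
            symbols = [atoms[i] for i in path]
            # A path and its reverse are the same fragment; keep one canonical form.
            paths.add(min("-".join(symbols), "-".join(reversed(symbols))))
        if len(path) == max_atoms:
            return
        for nbr in adjacency[path[-1]]:
            if nbr not in path:  # never revisit an atom within one path
                extend(path + [nbr])

    for start in range(len(atoms)):
        extend([start])
    return paths


def hashed_fingerprint(atoms, adjacency, n_bits=1024):
    """Fold every path onto a fixed-width bit string held in a Python int."""
    bits = 0
    for path in linear_paths(atoms, adjacency):
        position = zlib.crc32(path.encode("utf-8")) % n_bits
        bits |= 1 << position
    return bits


# Heavy atoms of glycine, N-C-C(=O)-O, with bond orders ignored.
atoms = ["N", "C", "C", "O", "O"]
adjacency = {0: [1], 1: [0, 2], 2: [1, 3, 4], 3: [2], 4: [2]}

fp = hashed_fingerprint(atoms, adjacency)
print(len(linear_paths(atoms, adjacency)))  # 7 distinct paths
print(bin(fp).count("1"))                   # number of bits set (at most 7)
```

Because many different paths can land on the same position, a set bit in a hashed fingerprint no longer identifies a single fragment, which is the price paid for a fixed length.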

Commonly used binary fingerprints include MDL 166-keys, SciTegic ECFPs, Daylight, BCI, CDK and ChemAxon fingerprints.

Measuring similarity between fingerprints

The most common way of measuring similarity between two fingerprints is the Tanimoto Coefficient. In the case of a binary fingerprint, Tanimoto is identical to the better-known Jaccard Index. Generally, this is defined as the size of the intersection of two sets divided by the size of their union, and so has a value between 0 and 1. The binary variant is the most common, and is defined as C / (A+B-C), where C is the number of set bits the two fingerprints have in common, A is the number of set bits in fingerprint A, and B is the number of set bits in fingerprint B. For most fingerprints, a similarity above 0.7 or 0.8 is usually taken to indicate that the molecules are similar enough to be likely to share biological properties. The measure loses any real meaning below 0.3 or so.
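The binary formula C / (A+B-C) translates directly into bit operations. The sketch below assumes fingerprints are stored as Python integers; the function name `tanimoto` and the choice to return 1.0 for two empty fingerprints are conventions assumed here, not a standard.

```python
def tanimoto(fp_a, fp_b):
    """Binary Tanimoto (Jaccard) similarity for fingerprints stored as ints."""
    a = bin(fp_a).count("1")         # set bits in fingerprint A
    b = bin(fp_b).count("1")         # set bits in fingerprint B
    c = bin(fp_a & fp_b).count("1")  # set bits the two have in common
    if a + b - c == 0:
        return 1.0  # assumed convention: two empty fingerprints count as identical
    return c / (a + b - c)


fp_a = 0b1101  # A = 3
fp_b = 0b1001  # B = 2, and C = 2
print(tanimoto(fp_a, fp_b))  # 2 / (3 + 2 - 2), about 0.667
```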

More generally, Tanimoto is closely related to the Cosine Coefficient, which measures the angle between two vectors. For the non-binary case (i.e. using non-binary descriptors), Tanimoto is the dot product of the two vectors (fingerprints) divided by the squared magnitude of fingerprint A + the squared magnitude of fingerprint B - the dot product. This collapses to Jaccard for binary fingerprints, because the squared magnitude of a binary vector is simply its number of set bits.
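A short sketch of the non-binary form follows. Note that it uses the squared magnitudes of the two vectors in the denominator, which is what makes the expression reduce to C / (A+B-C) when every element is 0 or 1. The function name and the handling of two all-zero vectors are assumptions for illustration.

```python
def tanimoto_continuous(x, y):
    """Tanimoto coefficient for equal-length numeric descriptor vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x_sq = sum(a * a for a in x)  # squared magnitude of x
    norm_y_sq = sum(b * b for b in y)  # squared magnitude of y
    denominator = norm_x_sq + norm_y_sq - dot
    if denominator == 0:
        return 1.0  # assumed convention for two all-zero vectors
    return dot / denominator


# Binary vectors: dot = C = 2, squared magnitudes = A = 3 and B = 2.
print(tanimoto_continuous([1, 1, 0, 1], [1, 0, 0, 1]))  # 2 / (3 + 2 - 2), about 0.667

# Count-based descriptors: dot = 3, squared magnitudes 5 and 3.
print(tanimoto_continuous([2, 0, 1], [1, 1, 1]))  # 3 / (5 + 3 - 3) = 0.6
```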

The second most common measure is Euclidean Distance, which is technically a measure of distance, not similarity. This is especially useful when the measure has to obey the triangle inequality (i.e. it is a metric), although the Soergel Distance (1-Tanimoto) has been recently proven to obey the triangle inequality for positive descriptors. Note that for binary fingerprints, the Euclidean distance is the square root of the Hamming Distance.
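The relationship between these distances can be checked numerically. This is a small sketch, again assuming fingerprints held as Python integers with a known bit width; the function names are illustrative only.

```python
import math


def hamming(fp_a, fp_b):
    """Number of bit positions at which the two fingerprints differ."""
    return bin(fp_a ^ fp_b).count("1")


def euclidean(fp_a, fp_b, n_bits):
    """Euclidean distance, treating each bit as a 0/1 coordinate."""
    squared = sum(
        (((fp_a >> i) & 1) - ((fp_b >> i) & 1)) ** 2 for i in range(n_bits)
    )
    return math.sqrt(squared)


def soergel(fp_a, fp_b):
    """Soergel distance, i.e. 1 - Tanimoto, for binary fingerprints."""
    union = bin(fp_a | fp_b).count("1")
    if union == 0:
        return 0.0  # assumed convention: two empty fingerprints are at distance 0
    return 1.0 - bin(fp_a & fp_b).count("1") / union


fp_a, fp_b = 0b1100, 0b1011
print(hamming(fp_a, fp_b))       # 3
print(euclidean(fp_a, fp_b, 4))  # sqrt(3), about 1.732
print(soergel(fp_a, fp_b))       # 0.75
```

Each differing bit contributes exactly 1 to the sum of squared differences, which is why the Euclidean distance comes out as the square root of the Hamming distance.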

For an overview of similarity measures, see Chemical Similarity Searching.

For a detailed analysis of fingerprint statistics, see Ties and Proximity and Clustering Compounds.