Now we know how to represent 2D chemical structures and characterize them with descriptors, we have most of the pieces we need to build databases of 2D chemical structures and information about them. However, there are still some questions we need to address:

  • What kind of database technologies do we use?
  • What kinds of specialized searching are required, and how do we implement them?
  • What kinds of interfaces do we need?

Database technologies


Since we can represent a chemical structure as a text string (SMILES, InChI), we can at a basic level use just about any database software to store structures. However, unless we have the ability to do specialized searching (see below) the database won't be much use. There are two approaches: using specialized database management systems specifically written for handling chemical structures, and using generic database management systems customized to allow specialized searching and processing.

Traditionally, the only available databases were entirely specialized. One of the first was MDL MACCS (1979) which offered storage and searching of databases of 2D structures, but little else. It became more useful in 1985, when it was integrated into ISIS which allowed chemical structure information to be stored in one database, and non-structural information in a separate Oracle database, with a system for integrated searching between the two. Interfaces for computers were developed that allowed structure drawing and display, and querying of these databases. The general architecture of this was:

ISIS-Architecture.jpg

This is known as a client/server system, where we have a server receiving queries from and sending results to applications running on client computers. Note that this requires installation of specialized software on the client machines as well as on the server. Several other client/server database management systems emerged, including Daylight Merlin/Thor and Tripos Unity. Separately, client-only systems were developed for handling lists of structures locally on machines, such as Accord for Excel. Also, chemistry toolkits such as OEChem and CDK allow some degree of back-end handling of databases.

A major shift started when Oracle released version 8i of its SQL-based database management system, which allowed "cartridges" to be written that extended the functionality of the database in a very flexible way. Quickly, it was realized that the kinds of specialized storage and searching required for 2D structure storage and retrieval could be implemented in cartridges , thus enabling a generic Oracle SQL database to be used for chemical structure searching. A good example of how Daylight implemented such a cartridge is given in Jack Delaney's MUG2000 talk . A similar system (called the datablade) was developed for IBM's Informix database, but this has not been widely used in cheminformatics. However, a similar kind of "plug-in" for the free database PostgreSQL has been exploited, for example with gNova 's CHORD. There are currently many Oracle or PostgreSQL cartridges available, including:


Note that cartridges don't offer any interfaces per se, so nowadays these products are often packaged with interface tools and client programs.

Both Oracle and PostgreSQL are relational databases that use the SQL language for querying.

Generally, all of these tools cost money, but see some alternatives, for example Creating a Web-based, Searchable Molecular Structure Database Using Free Software ,

Specialized searching


There are three commonly used types of searching that need to be implemented in 2D chemical structure databases:

  • Structure searching, i.e. answering the question "is this structure in the database?"
  • Substructure searching, i.e. "find me all of the structures that contain this substructure"
  • Similarity searching, i.e. "find me the structures which are similar to this one"

Note that these might often be combined with regular text or numeric searches, e.g. "find me all of the compounds containing this substructure that are active in this assay and have a LogP < 5".

If we are using canonical SMILES (and make sure the query is specified using the same canonicalization algorithm), then structure searching can be as simple as doing a text search for the SMILES in the database (although even with canonicalization minor variants would be missed). A more reliable substructure search can be carried out using a graph isomorphism algorithm.

Substructure searching requires implementation of a subgraph isomorphism algorithm such as Ullman.

Similarity searching is effected through the use of similarity or distance coefficients and descriptors as described previously.

Additionally, one can speed up structure and similarity searching by pre-search screening using fragmental descriptors (indeed this is what they were originally designed for) to exclude compounds that don't include features in the query (substructure search) and those whose maximum similarity (based on features present) to a query would be greater than a specified cutoff (similarity searching)

Representing substructure search queries - SMARTS


One issue is how we go about representing queries for substructure searching. For many substructures, we can simply use SMILES as if the substructure were a full structure. However, we often want to add features to substructures that we wouldn't have in a regular structure, such as specifying attachment points, bonds to undefined atoms, and ambiguity about atom and bond types. For example, we might want to search for a ring system that attaches to the rest of a molecule only at particular specified points.

Fortunately we have several ways to do this. For example, the MDL MOL/SD file is extendible to represent query features. Of particular note is SMARTS, a superset of SMILES which is designed for representing queries. A simple example of SMARTS would be *C(=O)O - a carboxyllic acid, in this case differing from the SMILES only by the presence of an asterisk indicating an attachment point. In fact, SMARTS includes a wide range of extra characters for representing queries. Useful resources on the Daylight website include a SMARTS tutorial , SMARTS examples , and SMARTS practice , using the DepictMatch tool .

Interfaces


At the client side, some kind of interface is required for searching databases. This could be a machine interface (e.g. JDBC , ODBC , SOAP service,REST service) or a human interface (HTTP or client-side application). Increasingly database access through a single human interface is an outdated method; service-oriented architectures allow much greater flexibility to search within a variety of applications and mashups.

Client side interfaces need a method of displaying and drawing 2D structures. This can be done with a variety of toolkits (e.g. CDK, OEChem) and applets and plug-ins (e.g. Chemdraw Plugin, JME). It can even be done with a REST service (see http://cheminfo.wikispaces.com/Web+service+infrastructure)

Example with PostgreSQL and CHORD


For this example, we are going to be working with a very small sample dataset of common drugs:

sample-dataset2.gif

Create a new database to store SMILES, name, logp and fingerprint
create table gnovatest (smiles VARCHAR(200), name VARCHAR(50), logp real, fkey BIT(166));
 
Check it is there
\d gnovatest
 
Insert our 8 molecules into the database
INSERT INTO gnovatest (smiles, name, logp) VALUES ( 'CC(=O)Nc1ccc(O)cc1', 'Acetaminophen', 0.27 );
INSERT INTO gnovatest (smiles, name, logp) VALUES ( 'CC(C)NCC(O)COc1ccccc1CC=C', 'Alprenolol', 2.81 );
INSERT INTO gnovatest (smiles, name, logp) VALUES ( 'CC(N)Cc1ccccc1', 'Amphetamine', 1.76 );
INSERT INTO gnovatest (smiles, name, logp) VALUES ( 'CC(CS)C(=O)N1CCCC1C(=O)O', 'Captopril', 0.84 );
INSERT INTO gnovatest (smiles, name, logp) VALUES ( 'CN(C)CCCN1c2ccccc2Sc3ccc(Cl)cc13', 'Chlorpromazine', 5.20 );
INSERT INTO gnovatest (smiles, name, logp) VALUES ( 'OC(=O)Cc1ccccc1Nc2c(Cl)cccc2Cl', 'Diclofenac', 4.02 );
INSERT INTO gnovatest (smiles, name, logp) VALUES ( 'NCC1(CC(=O)O)CCCCC1', 'Gabapentin', -1.37 );
INSERT INTO gnovatest (smiles, name, logp) VALUES ( 'COC(=O)c1ccccc1O', 'Salicylate', 2.60 );
 
Use the gNoval public166keys function to create fingerprints from the SMILES and put them in the fkey field
update gnovatest set fkey = public166keys(smiles);
 
Take a look at what we have
select smiles,name,logp from gnovatest;
select * from gnovatest;
 
Do a substructure search for structures containing a Carboxyllic Acid
select smiles,name,logp from gnovatest where matches(smiles, '*C(=O)O');
 
Refine search to only those with LogP > 1
select smiles,name,logp from gnovatest where (matches(smiles, '*C(=O)O') AND (logp>1.0));
 
Do a similarity search with Aspirin
select smiles,name,logp from gnovatest where tanimoto(fkey, public166keys('CC(=O)Oc1ccccc1C(=O)O')) > 0.6;
 
we can validate the substructure search by looking at the results compared to the structures:

subsearch.gif
and for the similarity search:

simsearch.gif

Publicly available databases


There are many online public databases of chemical structures, of particular note:

PubChem
Chemspider
eMolecules

Also of interest is this Journal of Cheminformatics article: Quantitative assessment of the expanding complementarity between public and commercial databases of bioactive compound