Integrative algorithms for cheminformatics & chemogenomics

David J. Wild, Indiana University School of Informatics and Computing
These materials are prepared for a tutorial at the Great Lakes Bioinformatics 2011 conference. A paper will also be submitted to the Journal of Cheminformatics.

Overview of cheminformatics methods and algorithms

This is a subset of topics covered on ICEP (for a full list, see the main page)

Chemogenomics: chemicals are not atomic!

The bioinformatics community has already applied many algorithms, tools and analyses that involve chemical structures or, by extension, drugs. These studies mostly treat chemical compounds as atomic entities (i.e. terminal nodes on a graph). However, the action of a compound on a biological system is determined by its constituent atoms and how they are arranged. Questions such as “what kinds of chemical structures are associated with breast cancer?” or “what structural features does a compound need to be a kinase inhibitor?” are therefore not answered by these studies, but can be when cheminformatics representations and algorithms are employed. Below we discuss several ways in which such techniques can be applied to improve and extend bioinformatic analyses.

Predicting ligand-binding to protein targets

Experimental methods such as high-throughput screening provide fast, automated ways to test the activity of chemical structures against purified protein targets or cellular/tissue samples. Recent years have seen a vast increase in the amount of publicly available data of this kind (for example in PubChem BioAssay and ChEMBL).

These data can be used to train predictive models (Random Forest, Bayes, SVM) using 2D or 3D chemical structure descriptors (classic QSAR scaled up; see for example Chen, B. and Wild, D.J. PubChem BioAssays as a data source for predictive models. Journal of Molecular Graphics and Modelling 2010, 28, 420-426). Molecular docking also provides a well-established way of predicting the binding of chemical compounds to protein targets, and thus of creating links between compounds and targets. Note that QSAR can be used without a 3D protein structure, whereas molecular docking requires one (e.g. from the PDB). Both can be used for "virtual screening".
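To make the QSAR idea concrete, here is a minimal sketch of a Bernoulli naive Bayes activity model written in plain Python. It assumes each compound has already been reduced to a binary fingerprint, represented as the set of "on" bit positions, and that activity labels come from an assay. The fingerprints, labels, bit count and function names are all hypothetical, chosen only for illustration; a real study would use a toolkit-generated descriptor and a tested machine learning library.

```python
import math


def train_bernoulli_nb(fingerprints, labels, n_bits):
    """Fit a Bernoulli naive Bayes model on binary fingerprints.

    fingerprints: list of sets holding the "on" bit indices of each compound
    labels: list of 0/1 activity labels (1 = active); both classes must occur
    n_bits: length of the fingerprint
    """
    model = {}
    for cls in (0, 1):
        rows = [fp for fp, y in zip(fingerprints, labels) if y == cls]
        # Laplace smoothing keeps every bit probability strictly in (0, 1).
        bit_prob = [
            (sum(1 for fp in rows if bit in fp) + 1) / (len(rows) + 2)
            for bit in range(n_bits)
        ]
        prior = len(rows) / len(labels)
        model[cls] = (prior, bit_prob)
    return model


def predict_active_probability(model, fingerprint, n_bits):
    """Return the posterior probability that a compound is active."""
    log_posterior = {}
    for cls, (prior, bit_prob) in model.items():
        total = math.log(prior)
        for bit in range(n_bits):
            p = bit_prob[bit]
            total += math.log(p if bit in fingerprint else 1.0 - p)
        log_posterior[cls] = total
    # Subtract the maximum before exponentiating for numerical stability.
    top = max(log_posterior.values())
    weights = {cls: math.exp(v - top) for cls, v in log_posterior.items()}
    return weights[1] / (weights[0] + weights[1])


# Hypothetical toy data: three "actives" sharing bit 0, three "inactives".
N_BITS = 8
train_fps = [{0, 1, 2}, {0, 1, 3}, {0, 2}, {4, 5}, {3, 4, 5}, {5, 6}]
train_labels = [1, 1, 1, 0, 0, 0]
qsar_model = train_bernoulli_nb(train_fps, train_labels, N_BITS)
```

Ranking an unscreened library by `predict_active_probability` is one simple form of ligand-based virtual screening; no protein structure is involved at any point.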

Polypharmacology and drug profiling

We can also use both experimental data and predictive models to "profile" chemical compounds against many targets, painting a picture of the effect of a compound or drug on an entire system rather than on a single target. The profile can be a full experimental data matrix, a partial experimental matrix with the missing values imputed by predictive models, or a purely predicted matrix. There are several examples of this; we will look at Takigawa et al.
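The "partial experimental matrix with imputed values" case can be sketched as follows. This is an illustration under stated assumptions, not the method of any particular paper: the compound and target names are hypothetical, and the predictor is a deliberately naive placeholder (the mean of the measured values for that target) standing in for a trained model.

```python
def column_mean_predictor(measured):
    """Build a placeholder predictor: the mean measured value for each target.

    measured: dict mapping (compound, target) -> activity value
    """
    by_target = {}
    for (_, target), value in measured.items():
        by_target.setdefault(target, []).append(value)

    def predict(compound, target):
        values = by_target.get(target)
        # With no data at all for a target, fall back to 0.0 (an assumption).
        return sum(values) / len(values) if values else 0.0

    return predict


def build_profile_matrix(compounds, targets, measured, predict):
    """Return {compound: {target: (value, source)}} with every cell filled.

    Measured cells are kept and tagged "experimental"; missing cells are
    filled by the predictor and tagged "predicted" so they stay traceable.
    """
    matrix = {}
    for compound in compounds:
        row = {}
        for target in targets:
            key = (compound, target)
            if key in measured:
                row[target] = (measured[key], "experimental")
            else:
                row[target] = (predict(compound, target), "predicted")
        matrix[compound] = row
    return matrix


# Hypothetical toy data: one of the four cells is unmeasured.
measured_activity = {
    ("cmpdA", "T1"): 1.0,
    ("cmpdB", "T1"): 0.0,
    ("cmpdA", "T2"): 1.0,
}
profile = build_profile_matrix(
    ["cmpdA", "cmpdB"],
    ["T1", "T2"],
    measured_activity,
    column_mean_predictor(measured_activity),
)
```

Keeping the source tag alongside each value makes it easy to switch between the three kinds of matrix described above, or to weight experimental and predicted cells differently in later analysis.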

Using chemical similarity to probe protein function

This has been examined most famously in the Similarity Ensemble Approach (SEA; see Nature 462, 175-181): the sets of ligands known for different protein targets are compared for chemical similarity, and the result is used as a measure of association between the targets.
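The core comparison can be sketched in a few lines: compute the Tanimoto coefficient between every pair of ligands drawn from two target sets, and sum the values that pass a cutoff. This is only the raw set-versus-set score. The published method goes on to correct that score statistically for set size, which is not attempted here. The fingerprints, the 0.5 threshold and the function names are assumptions made for illustration.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient between two fingerprints given as sets of on-bits."""
    union = fp_a | fp_b
    if not union:
        return 0.0  # two empty fingerprints: define similarity as 0
    return len(fp_a & fp_b) / len(union)


def ligand_set_raw_score(ligands_a, ligands_b, threshold=0.5):
    """Sum the pairwise Tanimoto values at or above a threshold.

    ligands_a, ligands_b: lists of fingerprints (sets of on-bits), one list
    per protein target. The threshold value is a hypothetical choice.
    """
    score = 0.0
    for fp_a in ligands_a:
        for fp_b in ligands_b:
            similarity = tanimoto(fp_a, fp_b)
            if similarity >= threshold:
                score += similarity
    return score


# Hypothetical ligand sets for two targets.
target_1_ligands = [{1, 2, 3}, {1, 2}]
target_2_ligands = [{2, 3, 4}, {1, 2, 3}]
raw_score = ligand_set_raw_score(target_1_ligands, target_2_ligands)
```

A high raw score suggests that the two targets bind chemically similar ligands, but because the score grows with the size of the sets, it should be normalised against a random background before any association is claimed.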

Integrated chemical and biological networks

Chem2Bio2RDF: connecting compounds into biological networks. To explore this, we will refer to Chen, B., Dong, X., Jiao, D., Wang, H., Zhu, Q., Ding, Y., Wild, D.J. Chem2Bio2RDF: a semantic framework for linking and data mining chemogenomic and systems chemical biology data. BMC Bioinformatics 2010, 11, 255.

Once we have these integrated networks, we can use graph theory to do things like association finding between compounds/drugs and genes, diseases, pathways, and so on. For example, see the association finding tool hosted at Indiana. We can also create new inferred relationships, and measure the strength of association between nodes based on various methods (see for example, ChemoHub).
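As a rough illustration of association finding, the sketch below runs a breadth-first search over a small undirected network to find a shortest chain of links between a compound and a disease. It is not the algorithm used by any of the tools mentioned above; the node names, the edge list and the function name are hypothetical, and a real network would carry typed, weighted edges.

```python
from collections import deque


def shortest_association_path(edges, source, target):
    """Return one shortest path between two nodes, or None if unconnected.

    edges: iterable of (node_a, node_b) pairs, treated as undirected links
    """
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    if source not in graph or target not in graph:
        return None

    previous = {source: None}  # also serves as the visited set
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            # Walk the predecessor links back to the source.
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbour in sorted(graph[node]):  # sorted for reproducibility
            if neighbour not in previous:
                previous[neighbour] = node
                queue.append(neighbour)
    return None


# Hypothetical heterogeneous network: compound, proteins, pathway, disease.
toy_edges = [
    ("compound:C1", "protein:P1"),
    ("compound:C1", "protein:P2"),
    ("protein:P1", "pathway:W1"),
    ("pathway:W1", "disease:D1"),
]
association = shortest_association_path(toy_edges, "compound:C1", "disease:D1")
```

The returned path is itself a hypothesis: the compound binds a protein that sits in a pathway implicated in the disease. Path length, or a count of distinct paths, can then serve as a crude measure of association strength.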

We can also build networks that include the literature; see for example Wang, H., Ding, Y., Tang, J., Dong, X., He, B., Qiu, J., Wild, D.J. Finding complex biological relationships in recent PubMed articles using Bio-LDA. PLoS One, 6 (3), e17243.

Tools you can use

Cheminformatics software development toolkits

Searching and querying

Data mining & association finding

For more information, contact David Wild, or visit