We have previously discussed the use of 2D and 3D descriptors to characterize compounds, and how these can be used in similarity calculation, clustering, diversity, and so on. In this section we are concerned with ways of correlating these descriptors with outcomes, such as biological activities, properties, and toxicity, and the building of predictive models based on these descriptors and correlations.

The establishment of structure-activity relationships (SAR) in medicinal chemistry predates the use of computers in chemistry, and relies on correlating structural features with experimental results for multiple compounds, usually in the same series. It is common in medicinal chemistry to use synthesis techniques to create several related compounds (e.g. methyl-, ethyl-, butyl- forms), and then to investigate the effect of these synthetic changes on a particular property or biological activity (so we might find, for instance, that extending the alkyl chain reduces a particular activity). The relationship between structure and activity may or may not be quantified.

Quantitative Structure-Activity Relationships (QSAR) were originally designed as an attempt to add some mathematical basis to this process, particularly to define the activity as some function of descriptors (note that when the activity is a property or a toxicity, this is sometimes referred to as QSPR and QSTR respectively). If we develop a function that relates descriptors to a particular activity, we can then use the function predictively for compounds where the activity is unknown but the descriptors can be calculated.

The earliest examples of QSAR were Hansch analysis and Free-Wilson analysis, which are actually applications of linear regression. Hansch analysis pertained to property descriptors, and Free-Wilson, which we shall discuss here, to structural descriptors. Free-Wilson defined a function that equates activity (defined as the logarithm of 1/C, where C is a concentration) with weighted descriptors, the weightings, or coefficients, being determined by linear regression. That is, we have the equation:

Log (1/C) = a1x1 + a2x2 + a3x3 ...

where C is the concentration required for activity, x1, x2, x3, etc. are the descriptor values (usually 1 or 0 to represent the presence or absence of features), and a1, a2, a3, etc. are the coefficients derived from linear regression. Linear regression is a general technique that optimizes the coefficients applied to the independent variables so that the dependent variable (in this case Log (1/C)) most closely matches the observed value for each compound's set of descriptors. Thus one can think of a regression equation as being trained using data with known dependent values, and then being applied predictively to data with unknown dependent values. Linear regression works by minimizing the sum of the squared differences between the values predicted by the equation and the actual observations. This is nicely illustrated in a Java applet.
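As a rough illustration of the training and prediction steps, the sketch below fits the coefficients of a Free-Wilson-style equation by least squares using only the Python standard library. The function names (`fit_free_wilson`, `predict`), the six compounds, and their activity values are all made up for the example, and the equation has no constant term because the one given above has none.

```python
def fit_free_wilson(descriptors, activities):
    """Least-squares coefficients for activity = a1*x1 + a2*x2 + ... (no constant term)."""
    n_features = len(descriptors[0])
    # Build the normal equations (X^T X) a = X^T y as an augmented matrix.
    aug = []
    for j in range(n_features):
        row = [sum(r[j] * r[k] for r in descriptors) for k in range(n_features)]
        row.append(sum(r[j] * y for r, y in zip(descriptors, activities)))
        aug.append(row)
    # Solve by Gauss-Jordan elimination with partial pivoting.
    for col in range(n_features):
        pivot = max(range(col, n_features), key=lambda i: abs(aug[i][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pivot_value = aug[col][col]
        aug[col] = [v / pivot_value for v in aug[col]]
        for i in range(n_features):
            if i != col:
                factor = aug[i][col]
                aug[i] = [v - factor * p for v, p in zip(aug[i], aug[col])]
    return [aug[j][n_features] for j in range(n_features)]


def predict(coefficients, descriptor_row):
    """Apply the fitted equation to one compound's descriptor values."""
    return sum(a * x for a, x in zip(coefficients, descriptor_row))


# Hypothetical training data: each row marks the presence (1) or absence (0)
# of three substituents; each activity is a made-up Log(1/C) value.
training_descriptors = [
    [1, 0, 0], [0, 1, 0], [0, 0, 1],
    [1, 1, 0], [0, 1, 1], [1, 0, 1],
]
training_activities = [0.5, 1.0, -0.3, 1.5, 0.7, 0.2]

coefficients = fit_free_wilson(training_descriptors, training_activities)

# Predict the activity of an unseen compound carrying all three substituents.
predicted_activity = predict(coefficients, [1, 1, 1])
print(coefficients, predicted_activity)
```

The toy activities were generated without noise, so the fit reproduces them exactly; real data would leave residuals, and the fit assumes the descriptor columns are not redundant.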

If a regression equation is to be used predictively, then we need some way of gauging its accuracy. The simplest way to do this is with r-squared, which is the proportion of the variance in the dependent variable that is explained by the regression equation (i.e. if r-squared = 1.0, then all the actual points lie on the regression line; if r-squared = 0.0, then the variance around the regression line is as high as the overall variance of the dependent variable).
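The definition of r-squared can be written as one minus the ratio of the residual sum of squares to the total sum of squares. The helper below is a minimal sketch of that calculation; the name `r_squared` and the sample values are chosen only for illustration.

```python
def r_squared(observed, predicted):
    """Proportion of the variance in `observed` explained by `predicted`."""
    mean_observed = sum(observed) / len(observed)
    # Total variation of the observations around their mean.
    ss_total = sum((y - mean_observed) ** 2 for y in observed)
    # Variation left unexplained by the predictions.
    ss_residual = sum((y - p) ** 2 for y, p in zip(observed, predicted))
    return 1.0 - ss_residual / ss_total


observed = [1.0, 2.0, 3.0, 4.0]

# Every point lies exactly on the regression line.
perfect = r_squared(observed, observed)               # 1.0

# Predicting the mean everywhere explains none of the variance.
baseline = r_squared(observed, [2.5, 2.5, 2.5, 2.5])  # 0.0

print(perfect, baseline)
```

The two calls correspond to the limiting cases described above: 1.0 when the predictions match the observations exactly, and 0.0 when they do no better than the overall mean.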

There is a problem with r-squared, though: the same data that is used to build the equation is also used to evaluate it. This can be addressed using q-squared (sometimes called cross-validated r-squared). Here, we make n versions of the equation, each built leaving out one of the n original known values (it is thus an example of leave-one-out validation), and use each version to predict the value that was left out of it. q-squared is then calculated in the same way as r-squared, but from these left-out predictions rather than from the fitted values. For a least-squares fit, q-squared is therefore never higher than r-squared, and it can be much lower (even negative) when the equation does not generalize.
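The leave-one-out procedure can be sketched as follows. To keep the example short it assumes a single descriptor and a straight-line fit with an intercept, which is simpler than the multi-descriptor equation above; the function names and data values are hypothetical. Definitions of q-squared vary slightly between authors, and this sketch uses one minus the sum of squared leave-one-out errors over the total sum of squares.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def r_squared(observed, predicted):
    """One minus residual sum of squares over total sum of squares."""
    mean_observed = sum(observed) / len(observed)
    ss_total = sum((y - mean_observed) ** 2 for y in observed)
    ss_residual = sum((y - p) ** 2 for y, p in zip(observed, predicted))
    return 1.0 - ss_residual / ss_total


def q_squared_loo(xs, ys):
    """Leave-one-out q-squared: each value is predicted by a fit that never saw it."""
    left_out_predictions = []
    for i in range(len(xs)):
        # Refit the equation on every compound except compound i.
        slope, intercept = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        left_out_predictions.append(slope * xs[i] + intercept)
    return r_squared(ys, left_out_predictions)


# Hypothetical data: one descriptor value and one activity per compound.
descriptor = [1.0, 2.0, 3.0, 4.0, 5.0]
activity = [1.2, 1.9, 3.2, 3.8, 5.1]

slope, intercept = fit_line(descriptor, activity)
r2 = r_squared(activity, [slope * x + intercept for x in descriptor])
q2 = q_squared_loo(descriptor, activity)
print(r2, q2)  # q2 comes out lower than r2 for this data
```

Comparing the two printed values shows how much of the apparent fit quality survives when each compound is predicted by an equation that was built without it.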

## Nonlinear approaches to QSAR

The main drawback of these early approaches is that they assume that the activity varies linearly with the descriptor values that affect it. However, this is usually not the case. Nonlinear approaches still try to correlate descriptors and outcomes, but do not make this assumption. They are thus at least theoretically more useful, although there is usually some trade-off (such as speed, scalability or interpretability). Nonlinear approaches are generally examples of machine learning, particularly supervised learning (as opposed to unsupervised methods such as clustering, although unsupervised methods such as self-organizing maps may also be employed). The method used will also sometimes depend on the kind of QSAR that is to be determined. In particular, there is a difference between classification problems (such as predicting whether compounds are active or inactive) and quantitative prediction problems (where we want to predict an activity value). Frequently used nonlinear methods for QSAR include neural networks and decision trees.

Different methods have different strengths and weaknesses: for example, neural nets are a "black box" approach and thus are not useful if we want to know why a particular prediction was made. Decision trees are most often used for classification problems, although regression-tree variants can also predict quantitative values.

Regardless of the method used, building a model will generally be done in three phases: training (presenting known data to build the model); validation (testing the model with known data that has not been presented to build the model, such as a validation set); and prediction (using the model for truly unknown data). This is illustrated below.
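One common way to keep the training and validation data separate is a random hold-out split made before any model is built. The sketch below shows only that split; the function name `split_train_validation`, the 20% validation fraction, and the fixed seed are assumptions chosen for the example, not a prescribed procedure.

```python
import random


def split_train_validation(records, validation_fraction=0.2, seed=0):
    """Randomly hold out a fraction of the known data for validation."""
    shuffled = list(records)
    # A seeded generator makes the split reproducible.
    random.Random(seed).shuffle(shuffled)
    n_validation = max(1, round(len(shuffled) * validation_fraction))
    validation = shuffled[:n_validation]
    training = shuffled[n_validation:]
    return training, validation


# Hypothetical compound identifiers with known activities.
known_compounds = list(range(10))
training_set, validation_set = split_train_validation(known_compounds)

# Phase 1 (training): build the model from training_set only.
# Phase 2 (validation): score the model on validation_set, which it has not seen.
# Phase 3 (prediction): apply the model to compounds with unknown activity.
print(len(training_set), len(validation_set))
```

Any statistics reported for the model would then come from the held-out compounds rather than from the compounds used to build it.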

## Effective evaluation of models

If predictive models are to be properly evaluated there are a few basic principles that should be adhered to:

- For publication, public datasets should be used, and the method and descriptors used should be made freely available or be described well enough that a reader could replicate the experiment.

- A validation set should always be used, and any success statistics should be based on the validation set, not the training set.

- For classification problems, always create a confusion matrix. From this, you can derive measures like sensitivity and specificity, or precision and recall. For large sets, particularly for virtual screening applications, it is appropriate to show a ROC curve (one can also calculate the AUC, or area under the curve).
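The classification measures mentioned above can be computed directly from the four cells of a confusion matrix. The following is a minimal sketch with made-up labels and scores; the function names and the 0.5 score threshold are assumptions, and the AUC is computed here through its rank interpretation (the chance that a randomly chosen active scores higher than a randomly chosen inactive) rather than by drawing the curve.

```python
def confusion_matrix(actual, predicted):
    """Count true/false positives and negatives for a two-class problem."""
    pairs = list(zip(actual, predicted))
    return {
        "tp": sum(1 for a, p in pairs if a and p),
        "fn": sum(1 for a, p in pairs if a and not p),
        "fp": sum(1 for a, p in pairs if not a and p),
        "tn": sum(1 for a, p in pairs if not a and not p),
    }


def classification_metrics(cm):
    """Derive the usual measures; assumes no denominator is zero."""
    return {
        "sensitivity": cm["tp"] / (cm["tp"] + cm["fn"]),  # same quantity as recall
        "specificity": cm["tn"] / (cm["tn"] + cm["fp"]),
        "precision": cm["tp"] / (cm["tp"] + cm["fp"]),
    }


def roc_auc(actual, scores):
    """Area under the ROC curve via pairwise comparison; ties count as half."""
    actives = [s for a, s in zip(actual, scores) if a]
    inactives = [s for a, s in zip(actual, scores) if not a]
    wins = sum(
        1.0 if pos > neg else 0.5 if pos == neg else 0.0
        for pos in actives
        for neg in inactives
    )
    return wins / (len(actives) * len(inactives))


# Hypothetical validation-set results: true class and model score per compound.
actual = [True, True, True, True, False, False, False, False]
scores = [0.9, 0.8, 0.6, 0.3, 0.7, 0.4, 0.2, 0.1]
predicted = [s >= 0.5 for s in scores]  # assumed activity threshold

cm = confusion_matrix(actual, predicted)
metrics = classification_metrics(cm)
auc = roc_auc(actual, scores)
print(cm, metrics, auc)
```

Unlike sensitivity, specificity, and precision, the AUC does not depend on the threshold chosen, which is one reason it suits virtual screening, where compounds are ranked rather than simply labelled.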
