Any machine learning algorithm can make predictions about a quantity of interest, whether it be bioactivity, lipophilicity or solubility. But how much confidence should you have in the model?
Several quality indicators for predictive models exist; for example, how well the model performs in cross validation. Cross validation is one of the key techniques used to identify whether a machine learning algorithm has “overfit” the data. Overfitting can give a misleadingly good impression of model quality that will not be borne out when the model is used in a real project setting. While cross-validation may appear to be a straightforward concept, it has pitfalls in application.
Unravelling cross validation
Cross validation is an approach to estimating the quality of a model built using machine learning. The method divides the data into a training set – which the model is directly trained on – and a small subset of the data: the validation set. The validation set is hidden from the machine learning algorithm during training, so predictions made on it give a less biased estimate of the quality of the model.
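As a concrete illustration, here is a minimal sketch of a hold-out split and a k-fold cross validation in Python, assuming scikit-learn is available; the descriptor matrix, activity values and choice of model are placeholders rather than anything prescribed in this article.

```python
# A minimal sketch of a hold-out validation split and k-fold cross validation.
# X (descriptors/fingerprints) and y (e.g. activity values) are placeholders.
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))             # stand-in descriptor matrix
y = X[:, 0] + 0.1 * rng.normal(size=200)   # stand-in activity values

# Hold back 20% of the data as a validation set the model never sees in training.
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

model = RandomForestRegressor(random_state=0).fit(X_train, y_train)
print("training R^2:  ", model.score(X_train, y_train))
print("validation R^2:", model.score(X_val, y_val))

# k-fold cross validation repeats the split so every point is held out exactly
# once, giving a less noisy estimate of model quality.
print("5-fold CV R^2: ", cross_val_score(RandomForestRegressor(random_state=0), X, y, cv=5))
```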
Why is this important?
Take a look at figure one (below). Here you can see some data points: blue ones in the training set and a red one in the hidden validation set.
- The blue points are reasonably well fitted by a simple straight line (blue).
- The algorithm could fit a more complex model (the green polynomial) that fits the training data better (with residuals of zero).
- Note that the prediction from the green model is much worse than the blue one when tested against the hidden data (the red point). This indicates that the algorithm has fitted noise (e.g. experimental error) in the data during training. A good model should have similar residuals in the training and validation sets; divergence indicates a problem (a toy numerical version of this is sketched below).
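The toy sketch below makes the same point numerically; the data, the deliberately alternating “experimental error” and the degree-five polynomial are illustrative assumptions, not the actual figure.

```python
# A toy version of the figure: a straight line vs an exactly-interpolating
# polynomial, both fitted to the same noisy training points and then judged
# on one hidden point. The data are invented for illustration.
import numpy as np

x_train = np.arange(6.0)                       # training x values: 0..5
noise = 0.4 * np.array([1, -1, 1, -1, 1, -1])  # deliberately alternating "experimental error"
y_train = 2.0 * x_train + noise                # underlying trend is y = 2x

x_val, y_val = 4.5, 2.0 * 4.5                  # the hidden "red" point

line = np.polyfit(x_train, y_train, deg=1)     # simple model (blue line)
poly = np.polyfit(x_train, y_train, deg=5)     # complex model: passes through every training point

for name, coeffs in (("straight line", line), ("degree-5 polynomial", poly)):
    train_resid = np.mean(np.abs(np.polyval(coeffs, x_train) - y_train))
    val_resid = abs(np.polyval(coeffs, x_val) - y_val)
    print(f"{name}: mean training residual {train_resid:.2f}, validation residual {val_resid:.2f}")

# The polynomial's training residual is essentially zero while its validation
# residual is much larger than the line's -- the signature of overfitting.
```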
However, not all validation sets are equal, and care must be taken to make sure they are fit for purpose. Could we simply take a random subset of points from our data set as the validation set and call the job done? Well, maybe not. Unfortunately, problems can arise from the distribution of the underlying data – which may be far from random.
Taking a closer look
Consider a set of molecules with associated Ki values extracted from a database. The set may contain very similar analogues, and if one is assigned to the training set and the other to the validation set, the independence of the validation molecule is compromised by its twin in the training set. It is analogous to fitting a curve where one point is extremely close to another on the explanatory axis.
No matter what model you use, the training set twin will closely constrain the predicted value for the validation twin and give a flattering result. This problem can be reduced by setting a maximum allowed similarity between compounds in the two sets.
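One way this might look in code is sketched below, assuming RDKit is available; the Morgan fingerprint settings and the 0.7 Tanimoto cut-off are illustrative choices, not recommendations.

```python
# A rough sketch of keeping near-twins out of the validation set, assuming
# RDKit. The fingerprint settings and the 0.7 Tanimoto cut-off are
# illustrative assumptions, not values from this article.
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def fingerprint(smiles):
    """Morgan (ECFP4-like) bit-vector fingerprint for one molecule."""
    return AllChem.GetMorganFingerprintAsBitVect(Chem.MolFromSmiles(smiles), 2, nBits=2048)

def similarity_filtered_split(smiles_list, n_val, cutoff=0.7):
    """Propose the first n_val compounds as the validation set, but push any
    compound with a close analogue in the training set back into training."""
    fps = [fingerprint(s) for s in smiles_list]
    train_idx = list(range(n_val, len(smiles_list)))
    val_idx = []
    train_fps = [fps[i] for i in train_idx]
    for i in range(n_val):
        if any(DataStructs.TanimotoSimilarity(fps[i], fp) >= cutoff for fp in train_fps):
            train_idx.append(i)   # has a twin in the training set: not independent
        else:
            val_idx.append(i)     # sufficiently dissimilar: safe to hold out
    return train_idx, val_idx
```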
A well-known issue with machine learning models is that they often work well on retrospective data, but the quality of their predictions deteriorates when they are challenged with new molecules. The problem here can be time travel.
Suppose we are working on a project that has run for 18 months, with hundreds of compounds made and tested, and we build a naïve Bayesian model with a 10% random validation set. Do the quality metrics reflect reality? Not really.
The training set will include molecules synthesised right at the end of the project, and we are using that information to train a model that makes predictions about molecules synthesised earlier.
Imagine that, a year into the project, the team found that a key amide could be replaced with a sulfonamide with an increase in potency (miracles can happen). A Bayesian model built at the time, without this knowledge, would not have predicted that this substitution would be active; the sulfonamide bits would add little to the score. From our position of extreme hindsight, though, there will be plenty of sulfonamides in the training set, meaning we can make a great prediction on a compound made six months ago.
The problem here is that you are not only including information that would not have been available at the time, but also sampling a wider chemical space than you realistically could. In a real project setting, you are typically extrapolating into new chemical space to make predictions, whereas a retrospective model – which often merely interpolates between the values removed to make the validation set – will often appear more accurate. Time-dependent splits of the data can reduce this problem.
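A time-dependent split might look something like the sketch below, assuming each compound record carries a synthesis or assay date; the record layout and the 90/10 proportion are illustrative assumptions.

```python
# A minimal sketch of a time-dependent split: train on the earlier compounds,
# validate on the most recent ones. The data are placeholders.
from datetime import date

records = [
    # (compound_id, assay_date, descriptor_vector, activity) -- placeholder data
    ("CPD-001", date(2020, 1, 15), [0.1, 0.2], 6.3),
    ("CPD-002", date(2020, 6, 2),  [0.4, 0.1], 7.1),
    ("CPD-003", date(2021, 3, 20), [0.3, 0.5], 8.0),
]

# Sort by date and hold out the most recent 10% of compounds, so the model is
# always asked to predict "forwards" in time, as it would be in a live project.
records.sort(key=lambda r: r[1])
cut = int(0.9 * len(records))
train, validation = records[:cut], records[cut:]
```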
Another problem is inadvertent fitting of the validation set, which reduces its effectiveness as an estimator of model quality.
You may test many machine learning models in parallel with different hyperparameters and use validation performance to compare them and select the best. In doing this, you are indirectly fitting your model to the validation set. Training multiple models will generate fluctuations in the quality of validation set predictions, and by removing the worst models you are using the information within the validation set to drive model selection towards the best agreement with it. It is natural selection in action, where the selective pressure is validation set performance.
This can be addressed by setting aside a third set of hidden data (the test set) that is only brought out at the end of the modelling phase to estimate true model performance. This is an important step whenever many models are being tested.
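A rough sketch of that workflow is shown below, assuming scikit-learn; the data, the candidate hyperparameters and the split proportions are arbitrary placeholders.

```python
# A minimal sketch of a train/validation/test workflow: the validation set
# drives model selection, the test set is only used once at the very end.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10))             # placeholder descriptors
y = X[:, 0] + 0.1 * rng.normal(size=300)   # placeholder activities

# Split once into train (60%), validation (20%) and test (20%).
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

# The validation set is used to pick between candidate hyperparameters...
candidates = {n: RandomForestRegressor(n_estimators=n, random_state=0).fit(X_train, y_train)
              for n in (10, 50, 200)}
best = max(candidates.values(), key=lambda m: m.score(X_val, y_val))

# ...so only the untouched test set gives an unbiased final estimate.
print("validation R^2 of chosen model:", best.score(X_val, y_val))
print("test R^2 of chosen model:      ", best.score(X_test, y_test))
```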
Cross validation is one of the most powerful methods we have to build and assess quality models. By looking at how the cross-validation was done, and at the difference between training set and validation/test set residuals, you can gain insight into the global quality of the model and add another piece to the picture of how much confidence you should place in its predictions.
About the author
Dr Andrew Pannifer is Lead Scientist in Cheminformatics at Medicines Discovery Catapult.
After a PhD in Molecular Biophysics at Oxford University, mapping the reaction mechanism of protein tyrosine phosphatases, he entered the pharmaceutical industry in 2002. Firstly at AstraZeneca and then at Pfizer, he performed structure-based drug design and crystallography, and in 2010 joined the CRUK Beatson Institute Drug Discovery Programme to start up Structural Biology and Computational Chemistry.
In 2013 he moved to the European Lead Factory as the Head of Medicinal Technologies to start up cheminformatics and modelling and also to work with external IT solutions providers to build the ELF’s Honest Data Broker system for triaging HTS output.