The power of Artificial Intelligence to drive cars, accurately diagnose diseases, recognise cats on the internet, and augment human faces with cartoon rabbit ears is driving a huge wave of interest in the field, including in how it can enable drug discovery. Investment inevitably follows, and with it the tendency to rebrand many techniques as “AI” and to use the sledgehammer of deep neural networks in situations where simpler, faster techniques can work equally well.
Precisely what constitutes AI sparks debate and quickly strays into philosophy, but how does the current wave of developments differ from previous advances in early-stage drug discovery at a practical level?
Machine learning and statistics have been used for decades in drug discovery. Z statistics run through many aspects of assay design and underlie tests for the power of an assay to identify active compounds. Use of Bayesian machine learning approaches, and others, to predict compound activities by learning from the measured activities of similar molecules is well established.
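As one illustration of the kind of statistic meant here, the sketch below computes the Z'-factor, a widely used measure of how well an assay separates its positive and negative controls. It assumes that this is among the "Z statistics" referred to above; the function name and the control values are hypothetical and chosen only for illustration.

```python
from statistics import mean, stdev


def z_prime(positive_controls, negative_controls):
    """Z'-factor: 1 - 3 * (sd_pos + sd_neg) / |mean_pos - mean_neg|."""
    separation = abs(mean(positive_controls) - mean(negative_controls))
    if separation == 0:
        raise ValueError("control means are identical; Z' is undefined")
    # Sample standard deviations of the two control populations.
    spread = 3 * (stdev(positive_controls) + stdev(negative_controls))
    return 1 - spread / separation


# Hypothetical plate-control signals (arbitrary units), not real data.
pos = [98.0, 100.0, 102.0]
neg = [9.0, 10.0, 11.0]
z = z_prime(pos, neg)
print(f"Z' = {z:.2f}")
```

A value close to 1 indicates well-separated controls with little scatter. By common convention, values above about 0.5 are taken to indicate an assay robust enough for screening.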
All these methods existed before the current AI boom took place, and before logistic regression and automatically applying a confidence threshold became “AI”. Is there really anything new?
Perhaps the distinguishing feature of recent AI developments in early-stage drug discovery, compared with the machine learning routinely applied in the 1990s and 2000s, is the ability to automatically extract from the data the features that are important for explaining the problem in hand. Traditionally, a user would heavily process the data to reduce it to defined features that were expected to describe the problem. Different algorithms might be run on images to detect edges or identify particular shapes, and the results analysed and passed to a machine learning model to classify the image. Molecules were reduced to sets of substructures that were expected to represent the important features and explain the data.
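The traditional workflow can be sketched in a few lines. The example below is a minimal Bernoulli naive Bayes classifier with Laplace smoothing, written from scratch, in which each molecule has already been reduced by hand to a set of substructures. The substructure names, activity labels and function names are all hypothetical, and this is one simple variant rather than the specific method used in any particular drug discovery package.

```python
import math


def train_naive_bayes(molecules, labels):
    """Fit a Bernoulli naive Bayes model on substructure sets.

    molecules: list of sets of substructure names (hand-defined features)
    labels: list of bools, True meaning 'active'
    """
    if not (any(labels) and not all(labels)):
        raise ValueError("need at least one active and one inactive molecule")
    vocabulary = sorted(set().union(*molecules))
    model = {"vocabulary": vocabulary, "log_prior": {}, "p_feature": {}}
    for cls in (True, False):
        members = [m for m, y in zip(molecules, labels) if y == cls]
        model["log_prior"][cls] = math.log(len(members) / len(molecules))
        # Laplace smoothing (+1 / +2) avoids zero probabilities.
        model["p_feature"][cls] = {
            f: (sum(f in m for m in members) + 1) / (len(members) + 2)
            for f in vocabulary
        }
    return model


def log_odds_active(model, molecule):
    """Log-odds that a molecule is active; positive favours 'active'."""
    score = {}
    for cls in (True, False):
        total = model["log_prior"][cls]
        for f in model["vocabulary"]:
            p = model["p_feature"][cls][f]
            total += math.log(p if f in molecule else 1 - p)
        score[cls] = total
    return score[True] - score[False]


# Hypothetical training set: substructures present and measured activity.
training = [{"amide", "phenyl"}, {"amide", "pyridine"},
            {"phenyl", "nitro"}, {"nitro", "ester"}]
activity = [True, True, False, False]
model = train_naive_bayes(training, activity)
print(log_odds_active(model, {"amide"}))  # positive: predicted active
print(log_odds_active(model, {"nitro"}))  # negative: predicted inactive
```

The model can only ever reason about the substructures a human chose to enumerate, which is exactly the bias that automatic feature extraction aims to reduce.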
Now, deep neural networks are able to perceive for themselves the key features that explain the data, with far less bias from human input. Raw representations of molecules, plain text and images can be fed directly to the AI, sometimes with astonishing results.
Once classifiers have been trained on 50-100 documents, large numbers of documents can be prioritised automatically by interest. The rules of chemistry can be learned by a neural network from 50,000 diverse molecules simply represented as plain text, and large numbers of novel molecules then generated to propose synthetic ideas. Binding sites on protein surfaces and interactions with ligands can be characterised without the traditional simplistic identification of hydrogen bonds and hydrophobic interactions.
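To make "molecules represented as plain text" concrete, the sketch below assumes the text format is SMILES, which the article does not state, and shows the only preparation such a network typically needs: splitting each string into tokens and mapping them to integers. The tokenisation pattern is a simplification and the function names are hypothetical; the neural network itself is not shown.

```python
import re

# Match two-letter halogens and bracket atoms first so that, for example,
# "Cl" is not split into "C" + "l"; any other single character is a token.
TOKEN_PATTERN = re.compile(r"Cl|Br|\[[^\]]+\]|.")


def tokenize(smiles):
    """Split a SMILES string into a list of tokens."""
    return TOKEN_PATTERN.findall(smiles)


def build_vocabulary(smiles_list):
    """Assign an integer index to every distinct token seen."""
    tokens = sorted({t for s in smiles_list for t in tokenize(s)})
    return {token: index for index, token in enumerate(tokens)}


def encode(smiles, vocabulary):
    """Convert a SMILES string to the integer sequence fed to a network."""
    return [vocabulary[t] for t in tokenize(smiles)]


# Ethanol, chlorobenzene and acetate, written as SMILES.
molecules = ["CCO", "c1ccccc1Cl", "CC(=O)[O-]"]
vocab = build_vocabulary(molecules)
encoded = [encode(s, vocab) for s in molecules]
print(tokenize("c1ccccc1Cl"))
print(encoded[0])
```

No substructures, descriptors or other chemical knowledge are supplied here. Whatever regularities the network learns about valence or ring closure must come from the token sequences alone.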
Examples where not just a single step but an entire process can be performed by an AI are now emerging. Entire multistep chemical syntheses can be designed with quality approaching that of a skilled human.
The new developments in AI are already finding practical uses in drug discovery and their application will grow, particularly where complex non-linear relationships exist in the data, and where data is difficult to reduce to defined features. Random Forests, XGBoost, Naïve Bayes and other approaches nevertheless remain highly competitive in many situations, and represent a high bar for the newer developments to beat.
About the author
Dr Andrew Pannifer is Lead Scientist in Cheminformatics at Medicines Discovery Catapult.
After a PhD in Molecular Biophysics at Oxford University, mapping the reaction mechanism of protein tyrosine phosphatases, he entered the pharmaceutical industry in 2002. Firstly at AstraZeneca and then at Pfizer, he performed structure-based drug design and crystallography, and in 2010 joined the CRUK Beatson Institute Drug Discovery Programme to start up Structural Biology and Computational Chemistry.
In 2013 he moved to the European Lead Factory as the Head of Medicinal Technologies to start up cheminformatics and modelling and also to work with external IT solutions providers to build the ELF’s Honest Data Broker system for triaging HTS output.