Pharmaceutical companies are using artificial intelligence to streamline the process of discovering new medicines. Machine-learning models can propose new molecules with specific properties that could fight certain diseases, completing in minutes what might take humans months to do by hand. But a major hurdle holds these systems back: the models often suggest new molecular structures that are difficult or impossible to produce in a laboratory. If chemists cannot actually make a molecule, they cannot test its ability to fight disease.
A [new method](https://arxiv.org/abs/2110.06389) from MIT researchers constrains a machine-learning model so that it only suggests molecular structures that can be synthesized. The method ensures that molecules are composed of materials that can be purchased and that the chemical reactions between those materials follow the laws of chemistry.
Compared with other methods, their model proposed molecular structures that scored as high as, and sometimes higher than, those of competing approaches on popular evaluations, while also being synthesizable. Their system takes less than one second to propose a synthetic pathway, while other methods that separately propose molecules and then evaluate their synthesizability can take several minutes. In a search space that can include billions of potential molecules, those time savings add up.
"This process reformulates how we ask these models to generate new molecular structures. Many of these models think about building new molecular structures atom by atom or bond by bond. Instead, we are building new molecules building block by building block and reaction by reaction," said Connor Coley, the Henri Slezynger Career Development Assistant Professor in the MIT departments of Chemical Engineering and Electrical Engineering and Computer Science, and senior author of the paper.
Coley wrote the paper with first author Wenhao Gao, a graduate student, and Rocío Mercado, a postdoc. The research was recently presented at the International Conference on Learning Representations.
To create a molecular structure, the model simulates the process of synthesizing the molecule, which ensures that it can be produced. The model is given a set of viable building blocks, which are chemicals that can be purchased, and a list of valid chemical reactions to work with. These reaction templates are hand-crafted by experts. Controlling these inputs, by allowing only certain chemicals or specific reactions, lets the researchers limit how large the search space for a new molecule can be.
The model uses these inputs to build a tree, selecting building blocks and linking them through chemical reactions, one at a time, to assemble the final molecule. At each step the molecule becomes more complex as additional chemicals and reactions are added.
The model outputs both the final molecular structure and the tree of chemicals and reactions that would synthesize it. "Instead of directly designing the product molecule itself, we design a sequence of actions to obtain that molecule. This allows us to ensure the quality of the structure," the researchers said.
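The build-up described above can be pictured with a small Python sketch. It is only an illustration under loud assumptions, not the researchers' model: the building-block names, the functional-group tags, and the two templates are all hypothetical, and molecules are reduced to sets of reactive groups instead of real chemical structures. What it shows is the general idea that a molecule assembled only from allowed blocks and allowed reactions comes with its own synthesis route.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional, Tuple


@dataclass(frozen=True)
class Node:
    """A molecule in the tree: a purchasable building block (leaf)
    or the product of a reaction applied to two child molecules."""
    name: str
    groups: FrozenSet[str]              # toy stand-in for reactive groups
    reaction: Optional[str] = None      # None for building blocks
    children: Tuple["Node", ...] = ()


# Hypothetical purchasable building blocks (assumption, not from the paper).
BUILDING_BLOCKS = {
    "acid_A": frozenset({"COOH"}),
    "amine_B": frozenset({"NH2", "Br"}),
    "boronic_C": frozenset({"B(OH)2"}),
}

# Hypothetical templates: (group needed on each reactant, group formed).
TEMPLATES = {
    "amide_coupling": (("COOH", "NH2"), "amide"),
    "suzuki_coupling": (("Br", "B(OH)2"), "biaryl"),
}


def leaf(name: str) -> Node:
    """Look up a building block; anything not in the catalog is rejected."""
    return Node(name, BUILDING_BLOCKS[name])


def react(template: str, left: Node, right: Node) -> Node:
    """Apply a template, refusing combinations the template does not allow."""
    (need_left, need_right), formed = TEMPLATES[template]
    if need_left not in left.groups or need_right not in right.groups:
        raise ValueError(f"{template} does not apply to {left.name} + {right.name}")
    groups = (left.groups - {need_left}) | (right.groups - {need_right}) | {formed}
    return Node(f"({left.name}+{right.name})", frozenset(groups), template, (left, right))


def route(node: Node, depth: int = 0) -> list:
    """Return the synthesis tree as indented lines, product first."""
    label = node.name if node.reaction is None else f"{node.name} via {node.reaction}"
    lines = ["  " * depth + label]
    for child in node.children:
        lines.extend(route(child, depth + 1))
    return lines


# Grow the molecule one reaction at a time.
intermediate = react("amide_coupling", leaf("acid_A"), leaf("amine_B"))
target = react("suzuki_coupling", intermediate, leaf("boronic_C"))
print("\n".join(route(target)))
```

Because every node is either a catalog entry or the result of a permitted template, printing `route(target)` yields both the final product and the steps that lead to it, which is the property the article describes.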
To train their model, the researchers input a complete molecular structure along with a set of building blocks and chemical reactions, and the model learns to create a tree that synthesizes the molecule. After seeing hundreds of thousands of examples, the model learns to come up with these synthetic pathways on its own.
Molecular optimization
The trained model can be used for optimization. Researchers define certain properties they want the final molecule to have. Then, given building blocks and chemical reaction templates, the model proposes a synthesizable molecular structure.
"Surprisingly, with such a small set of templates, you can actually reproduce a large number of molecules. You don't need that many building blocks to generate a large amount of available chemical space for the model to search," Mercado said.
They tested the model by evaluating how well it could reconstruct synthesizable molecules. It was able to reproduce 51 percent of these molecules, taking less than a second for each one. Their technique is faster than some other methods because the model does not search through every option at each step of the tree. Instead, the researchers explained, it has a defined set of chemicals and reactions to work with.
When they used their model to propose molecules with specific properties, their method suggested higher-quality molecular structures with stronger binding affinities than those from other methods. This means the molecules would be better able to attach to a protein and block a certain activity, such as stopping a virus from replicating.
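The idea of searching only among makeable candidates for a desired property can be illustrated with a deliberately tiny sketch. Everything in it is an assumption made for illustration: the fragment names, the template names, and the numeric "property" contributions are invented, and plain brute-force enumeration stands in for the researchers' learned model, which the article says avoids searching every option. Exhaustive search is workable here only because the toy space has six candidates.

```python
from itertools import combinations

# Hypothetical building blocks with made-up property contributions.
BLOCKS = {"frag_a": 1.0, "frag_b": 2.5, "frag_c": 0.5}

# Hypothetical reaction templates, each adding a made-up contribution.
TEMPLATES = {"couple_x": 0.2, "couple_y": -0.3}


def candidates():
    """Yield every one-step product reachable from the allowed inputs.

    Each candidate is (template, left_block, right_block), so every
    candidate carries the recipe that would produce it."""
    for left, right in combinations(BLOCKS, 2):
        for template in TEMPLATES:
            yield (template, left, right)


def toy_property(candidate):
    """Toy scoring function; a real pipeline would call a property predictor."""
    template, left, right = candidate
    return BLOCKS[left] + BLOCKS[right] + TEMPLATES[template]


def closest_to(target_value):
    """Pick the makeable candidate whose toy property is nearest the target."""
    return min(candidates(), key=lambda c: abs(toy_property(c) - target_value))


best = closest_to(2.7)
print(best, toy_property(best))
```

Restricting the inputs, for instance by deleting a block or a template from the dictionaries, shrinks the search space directly, while every result that comes back is still something that can be made from the allowed parts.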
When asked to propose molecules that could dock with SARS-CoV-2, for instance, their model suggested several molecular structures that may bind to viral proteins better than existing inhibitors. As the authors acknowledge, however, these are only computational predictions.
"There are so many diseases to tackle," the researchers said. "I hope our approach can accelerate this process so that we don't have to screen billions of molecules for one disease target at a time. Instead, we can specify only the properties we want, which can accelerate the process of finding a candidate drug."
Their model could also improve existing drug discovery pipelines. If a company has identified a particular molecule with the desired properties but cannot produce it, Mercado said, it could use this model to propose a synthesizable molecule that closely resembles it.
Now that they have validated their approach, the team plans to keep refining the chemical reaction templates to further enhance the model's performance. With additional templates, they can run more tests on particular disease targets and, eventually, apply the model to the drug discovery process.
"Ideally, we want algorithms that automatically design molecules and, at the same time, quickly give us the synthesis tree," said Marwin Segler, who leads a team working on machine learning for drug discovery at Microsoft Research Cambridge (UK) and was not involved in the work. "This elegant approach by Professor Coley and his team is an important step toward solving this problem. Although there were earlier proof-of-concept works on molecular design through synthesis tree generation, this team really made it work. For the first time, they demonstrated excellent performance on a meaningful scale, so it can have a practical impact in computer-aided molecular discovery."
"This work is also very exciting because it could eventually enable a new paradigm for computer-aided synthesis planning. It is likely to be a great inspiration for future research in this field."