Discover more from Deep into the Forest
Generalizing LLMs to Scientific Applications
Estimated Reading Time: 4 minutes
This week we briefly discuss evaluation of LLMs, then turn to a recent article from IBM on the use of deep generative foundation models to design small-molecule inhibitors.
Better Evaluations for LLMs
I came across this meme recently and it has stuck in my head. A lot of LLM evaluations rely on the prompter knowing the answer. We need better evaluations for LLMs that preserve “double blinding,” where the prompt writer can’t inadvertently leak the answer into the prompt. Such blind evaluations will become increasingly critical when applying LLMs to scientific problems. In the absence of rigorous evaluations, it will be easy to overestimate the capabilities of LLMs. Overestimating LLM capabilities feeds into dangerous narratives of “AI doom” that have arisen recently.
Deep Generative Framework for Designing Small Molecule Inhibitors
Due to the complexity of molecular design, inhibitor discovery faces significant challenges. Traditional approaches like docking and molecular simulation are computationally expensive when screening large numbers of compounds. Limited availability of critical information, such as crystal structures of the target protein or known inhibitors, can also hinder exploration of chemical space. As a result, it can be challenging to design inhibitors even for targets with known structures. When only the sequence is known, we face the additional challenge of modeling the protein structure.
IBM Research recently published a deep generative framework, CogMol, that designs small-molecule inhibitors for two SARS-CoV-2 targets — the spike protein receptor binding domain (RBD) and the main protease (Mpro) — using only protein sequence information.
CogMol uses a variational autoencoder (VAE) to learn a latent-space representation of molecules and performs attribute-conditioned sampling to generate molecules with desired properties. These molecules then undergo in silico screening for candidate prioritization.
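To give a feel for attribute-conditioned sampling, here is a toy, stdlib-only sketch (not IBM's code): draw latent vectors from the VAE's Gaussian prior and keep only those a property predictor scores favorably, a crude rejection-sampling analogue of what CLaSS does far more efficiently. The predictor, latent dimension, and threshold below are all hypothetical.

```python
import random

LATENT_DIM = 8  # hypothetical latent-space dimensionality

def toy_property_predictor(z):
    """Hypothetical stand-in for a trained attribute predictor:
    here it just scores a latent vector by the mean of its coordinates."""
    return sum(z) / len(z)

def sample_conditioned(n_samples, threshold=0.5, seed=0):
    """Rejection-sample latent vectors whose predicted attribute exceeds
    `threshold` -- a sketch of attribute-conditioned generation; CogMol's
    CLaSS scheme is a much more sample-efficient version of this idea."""
    rng = random.Random(seed)
    accepted = []
    while len(accepted) < n_samples:
        # Draw from the standard-normal prior over the latent space.
        z = [rng.gauss(0.0, 1.0) for _ in range(LATENT_DIM)]
        if toy_property_predictor(z) > threshold:
            accepted.append(z)
    return accepted

samples = sample_conditioned(5)
print(len(samples))  # 5
```

In the real framework the accepted latent vectors would then be decoded back into SMILES strings; this sketch stops at the sampling step.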
The VAE encoder-decoder architecture used by the framework is trained on molecules represented as SMILES strings. The model is optimized using a variational lower bound and is further trained to predict molecular attributes. Conditional Latent Space Sampling (CLaSS) is used for attribute-controlled molecule generation based on the molecule and protein sequence embeddings. The generated molecules are filtered on various criteria, such as molecular weight, chemical validity, and predicted affinity. Fewer than a thousand compounds were run through docking simulations, and molecular dynamics simulations of the binding sites were also run to deepen understanding of the binding modes of some compounds.
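The post-generation filtering stage can be sketched as a simple pass/fail gate over candidate records. Everything below is illustrative — the field names, thresholds, and example candidates are made up and do not reflect CogMol's actual cutoffs.

```python
# Toy post-generation filter: keep only candidates that are chemically
# valid, light enough, and predicted to bind well. All values are
# hypothetical placeholders for illustration.

def passes_filters(candidate,
                   max_mol_weight=500.0,
                   affinity_cutoff=-6.0):
    """A candidate survives if its SMILES is valid, its molecular weight
    is at most `max_mol_weight` (Da), and its predicted binding affinity
    (kcal/mol, lower is better) beats `affinity_cutoff`."""
    return (candidate["valid_smiles"]
            and candidate["mol_weight"] <= max_mol_weight
            and candidate["pred_affinity"] <= affinity_cutoff)

candidates = [
    {"smiles": "CCO", "valid_smiles": True,
     "mol_weight": 46.07, "pred_affinity": -7.2},
    {"smiles": "C(C(", "valid_smiles": False,   # unparsable SMILES
     "mol_weight": 0.0, "pred_affinity": -9.0},
    {"smiles": "C" * 60, "valid_smiles": True,  # too heavy
     "mol_weight": 842.0, "pred_affinity": -8.1},
]

shortlist = [c["smiles"] for c in candidates if passes_filters(c)]
print(shortlist)  # ['CCO']
```

In practice a cheminformatics toolkit such as RDKit would compute validity and molecular weight from the SMILES string itself; the hard-coded fields here just keep the sketch self-contained.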
IBM synthesized 4 potential Mpro inhibitors predicted by the generative framework and found that 2 of them showed inhibitory activity (43 and 34.2 μM). Achieving a 50% hit rate at designing Mpro inhibitors is impressive and provides a glimpse of a future in which inhibitor design is a reproducible engineering exercise. But it's worth noting that IBM did not compare against baseline methods (such as docking alone, free-energy perturbation (FEP), or ligand-based approaches), so we don't know whether the generative model is more effective than standard techniques as a design framework. It's also worth noting that human input was used to select the compounds to synthesize (Enamine helped narrow down from a set of 100 candidate compounds). For these reasons, while I like IBM's general design, I suspect that we are still a long way from truly automated small-molecule inhibitor design.
Interesting Links from Around the Web
https://www.nature.com/articles/d41586-023-02839-4: A new NSF center for uniting Indigenous knowledge with Western science.
https://www.tomshardware.com/news/samsung-says-it-will-beat-tsmc-to-4nm-production-in-the-us: Samsung is on track to produce 4nm chips in the US by the end of 2024.
Feedback and Comments
Please feel free to email me directly (firstname.lastname@example.org) with your feedback and comments!
Deep Into the Forest is a newsletter by Deep Forest Sciences, Inc. We’re a deep tech R&D company building Chiron, an AI-powered scientific discovery engine for the biotech/pharma industries. Deep Forest Sciences leads the development of the open source DeepChem ecosystem. Partner with us to apply our foundational AI technologies to hard real-world problems in drug discovery. Get in touch with us at email@example.com!
Lead Author: Bharath Ramsundar, Ph.D.
Editor: Sandya Subramanian, Ph.D.
Research and Writing: Rida Irfan