Abstract
Optimizing a candidate molecule’s physicochemical and functional properties is a critical task in drug and material design. Although the non-trivial task of balancing multiple (potentially conflicting) optimization objectives is considered well suited to artificial intelligence, several technical challenges, such as the scarcity of multiproperty-labelled training data, have long hindered the development of a satisfactory AI solution. Prompt-MolOpt is a tool for molecular optimization; it makes use of prompt-based embeddings, as used in large language models, to improve the transformer’s ability to optimize molecules for specific property adjustments. Notably, Prompt-MolOpt excels in working with limited multiproperty data (even under the zero-shot setting) by effectively generalizing causal relationships learned from single-property datasets. In comparative evaluations against established models such as JTNN, HierG2G and Modof, Prompt-MolOpt achieves over a 15% relative improvement in multiproperty optimization success rates compared with the leading Modof model. Furthermore, a variant of Prompt-MolOpt, named Prompt-MolOptP, can preserve pharmacophores or any user-specified fragments during structural transformation, further broadening its application scope. By constructing tailored optimization datasets with the protocol introduced in this work, Prompt-MolOpt steers molecular optimization towards ___domain-relevant chemical spaces, enhancing the quality of the optimized molecules. Real-world tests, such as those involving blood–brain barrier permeability optimization, underscore its practical relevance. Prompt-MolOpt offers a versatile approach for multiproperty and multi-site molecular optimization, suggesting its potential utility in chemistry research and drug and material discovery.
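For readers unfamiliar with prompt-based conditioning, the minimal sketch below illustrates the general idea in PyTorch: embeddings of property prompt tokens (for example, ‘BBBP’ or ‘lipop’) are prepended to the molecular token sequence so that the transformer can attend to the requested property adjustments while encoding the molecule. The class name, vocabulary sizes and toy tokenization are illustrative assumptions rather than the Prompt-MolOpt implementation, and the snippet assumes PyTorch ≥1.9 for the batch_first option.

```python
# Illustrative sketch only: property prompt tokens are embedded and prepended to the
# molecule token embeddings so the transformer can condition on the requested property
# adjustments. Vocabulary sizes and tokenization are toy assumptions, not Prompt-MolOpt.
import torch
import torch.nn as nn


class PromptConditionedEncoder(nn.Module):
    def __init__(self, smiles_vocab_size=64, prompt_vocab_size=8, d_model=128,
                 nhead=4, num_layers=2):
        super().__init__()
        self.smiles_embed = nn.Embedding(smiles_vocab_size, d_model)
        self.prompt_embed = nn.Embedding(prompt_vocab_size, d_model)  # one embedding per property prompt
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)

    def forward(self, smiles_tokens, prompt_tokens):
        # smiles_tokens: (batch, seq_len) integer ids of the molecule token sequence
        # prompt_tokens: (batch, n_prompts) integer ids of the property prompts
        x = self.smiles_embed(smiles_tokens)
        p = self.prompt_embed(prompt_tokens)
        # Prepend the prompt embeddings so every molecule token can attend to them.
        return self.encoder(torch.cat([p, x], dim=1))


if __name__ == "__main__":
    model = PromptConditionedEncoder()
    smiles_tokens = torch.randint(0, 64, (2, 20))  # two toy molecules, 20 tokens each
    prompt_tokens = torch.randint(0, 8, (2, 2))    # two property prompts per molecule
    out = model(smiles_tokens, prompt_tokens)
    print(out.shape)  # torch.Size([2, 22, 128]) -> prompt tokens + molecule tokens
```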
Data availability
The datasets used in this study and the data generated in this study are available at https://github.com/wzxxxx/Prompt-MolOpt and https://doi.org/10.5281/zenodo.11080951 (ref. 47). Source data are provided with this paper.
Code availability
The models were implemented using Python (v.3.6.13) with dgl (v.0.7.1) and PyTorch (v.1.6.0). The data processing and metrics calculation were implemented using Python (v.3.6.13) with scikit-learn (v.0.21.3), NumPy (v.1.19.2) and Pandas (v.1.1.5). The code for Prompt-MolOpt, Prompt-MolOptm and Prompt-MolOptP is publicly available at https://github.com/wzxxxx/Prompt-MolOpt and https://doi.org/10.5281/zenodo.11080951 (ref. 47).
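As a reproducibility aid, the short sketch below checks whether a local environment matches the package versions listed above; it is only a version comparison, and whether newer versions also work is not stated here.

```python
# Sanity check that the environment matches the versions reported above
# (Python 3.6.13, dgl 0.7.1, torch 1.6.0, scikit-learn 0.21.3, numpy 1.19.2, pandas 1.1.5).
import importlib
import sys

expected = {
    "dgl": "0.7.1",
    "torch": "1.6.0",
    "sklearn": "0.21.3",
    "numpy": "1.19.2",
    "pandas": "1.1.5",
}

print(f"Python {sys.version.split()[0]} (paper used 3.6.13)")
for module_name, wanted in expected.items():
    try:
        module = importlib.import_module(module_name)
        found = getattr(module, "__version__", "unknown")
        status = "OK" if found == wanted else f"differs (found {found})"
    except ImportError:
        status = "not installed"
    print(f"{module_name}: expected {wanted}, {status}")
```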
References
Fromer, J. C. & Coley, C. W. Computer-aided multi-objective optimization in small molecule discovery. Patterns 4, 100678 (2023).
Nicolaou, C. A. & Brown, N. Multi-objective optimization methods in drug design. Drug Discov. Today Technol. 10, e427–e435 (2013).
Jorgensen, W. L. Efficient drug lead discovery and optimization. Acc. Chem. Res. 42, 724–733 (2009).
Leelananda, S. P. & Lindert, S. Computational methods in drug discovery. Beilstein J. Org. Chem. 12, 2694–2718 (2016).
Zhang, X. et al. Efficient and accurate large library ligand docking with KarmaDock. Nat. Comput. Sci. 3, 789–804 (2023).
Shen, C. et al. Boosting protein–ligand binding pose prediction and virtual screening based on residue–atom distance likelihood potential and graph transformer. J. Med. Chem. 65, 10691–10706 (2022).
Maia, E. H. B., Assis, L. C., De Oliveira, T. A., Da Silva, A. M. & Taranto, A. G. Structure-based virtual screening: from classical to artificial intelligence. Front. Chem. 8, 343 (2020).
Gentile, F. et al. Artificial intelligence-enabled virtual screening of ultra-large chemical libraries with deep docking. Nat. Protoc. 17, 672–697 (2022).
Choung, O.-H., Vianello, R., Segler, M., Stiefl, N. & Jiménez-Luna, J. Extracting medicinal chemistry intuition via preference machine learning. Nat. Commun. 14, 6651 (2023).
Cheshire, D. R. How well do medicinal chemists learn from experience? Drug Discov. Today 16, 817–821 (2011).
Shan, J. & Ji, C. MolOpt: a web server for drug design using bioisosteric transformation. Curr. Comput. Aided Drug Des. 16, 460–466 (2020).
Yang, H. et al. ADMETopt: a web server for ADMET optimization in drug design via scaffold hopping. J. Chem. Inf. Model. 58, 2051–2056 (2018).
Dossetter, A. G., Griffen, E. J. & Leach, A. G. Matched molecular pair analysis in drug discovery. Drug Discov. Today 18, 724–731 (2013).
Tu, Z. & Coley, C. W. Permutation invariant graph-to-sequence model for template-free retrosynthesis and reaction prediction. J. Chem. Inf. Model. 62, 3503–3513 (2022).
Jin, W., Barzilay, R. & Jaakkola, T. Multi-objective molecule generation using interpretable substructures. In Proc. 37th International Conference on Machine Learning 4849–4859 (PMLR, 2020).
Kong, D. et al. Dual-space optimization: improved molecule sequence design by latent prompt transformer. Preprint at https://arxiv.org/abs/2402.17179 (2024).
Shi, C. et al. GraphAF: a flow-based autoregressive model for molecular graph generation. Preprint at https://arxiv.org/abs/2001.09382 (2020).
Zang, C. & Wang, F. Moflow: an invertible flow model for generating molecular graphs. In Proc. 26th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining 617–626 (ACM, 2020).
Jin, W., Barzilay, R. & Jaakkola, T. Junction tree variational autoencoder for molecular graph generation. In Proc. 35th International Conference on Machine Learning 2323–2332 (PMLR, 2018).
Podda, M., Bacciu, D. & Micheli, A. A deep generative model for fragment-based molecule generation. In Proc. 23rd International Conference on Artificial Intelligence and Statistics 2240–2250 (PMLR, 2020).
Chen, Z., Min, M. R., Parthasarathy, S. & Ning, X. A deep generative model for molecule optimization via one fragment modification. Nat. Mach. Intell. 3, 1040–1049 (2021).
Floridi, L. & Chiriatti, M. GPT-3: its nature, scope, limits, and consequences. Minds Mach. 30, 681–694 (2020).
Castro Nascimento, C. M. & Pimentel, A. S. Do large language models understand chemistry? A conversation with ChatGPT. J. Chem. Inf. Model. 63, 1649–1655 (2023).
Guo, H., Zhao, S., Wang, H., Du, Y. & Qin, B. Moltailor: tailoring chemical molecular representation to specific tasks via text prompts. Preprint at https://arxiv.org/abs/2401.11403 (2024).
Ye, G. et al. DrugAssist: a large language model for molecule optimization. Preprint at https://arxiv.org/abs/2401.10334 (2023).
Zhou, K., Yang, J., Loy, C. C. & Liu, Z. Conditional prompt learning for vision-language models. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition 16816–16825 (IEEE, 2022).
He, Y. et al. HyperPrompt: prompt-based task-conditioning of transformers. Preprint at https://arxiv.org/abs/2203.00759 (2022).
Liu, P. et al. Pre-train, prompt, and predict: a systematic survey of prompting methods in natural language processing. ACM Comput. Surv. 55, 1–35 (2023).
Zhang, X. et al. Clamp: prompt-based contrastive learning for connecting language and animal pose. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition 23272–23281 (IEEE, 2023).
Teterwak, P., Sun, X., Plummer, B. A., Saenko, K. & Lim, S.-N. CLAMP: contrastive language model prompt-tuning. Preprint at https://arxiv.org/abs/2312.01629 (2023).
Born, J. & Manica, M. Regression transformer enables concurrent sequence regression and generation for molecular language modelling. Nat. Mach. Intell. 5, 432–444 (2023).
Seidl, P., Vall, A., Hochreiter, S. & Klambauer, G. Enhancing activity prediction models in drug discovery with the ability to understand human language. In Proc. 40th International Conference on Machine Learning 30458–30490 (PMLR, 2023).
Wu, Z. et al. Chemistry-intuitive explanation of graph neural networks for molecular property prediction with substructure masking. Nat. Commun. 14, 2585 (2023).
Wu, Z. et al. Mining toxicity information from large amounts of toxicity data. J. Med. Chem. 64, 6924–6936 (2021).
Jin, W., Yang, K., Barzilay, R. & Jaakkola, T. Learning multimodal graph-to-graph translation for molecule optimization. In Proc. International Conference on Learning Representations 856 (ICLR, 2019).
Jin, W., Barzilay, R. & Jaakkola, T. Hierarchical generation of molecular graphs using structural motifs. In Proc. 37th International Conference on Machine Learning 4839–4848 (PMLR, 2020).
Bickerton, G. R., Paolini, G. V., Besnard, J., Muresan, S. & Hopkins, A. L. Quantifying the chemical beauty of drugs. Nat. Chem. 4, 90–98 (2012).
Delaney, J. S. ESOL: estimating aqueous solubility directly from molecular structure. J. Chem. Inf. Comput. Sci. 44, 1000–1005 (2004).
Xu, C. et al. In silico prediction of chemical Ames mutagenicity. J. Chem. Inf. Model. 52, 2840–2847 (2012).
Xiong, G. et al. ADMETlab 2.0: an integrated online platform for accurate and comprehensive predictions of ADMET properties. Nucleic Acids Res. 49, W5–W14 (2021).
Wu, Z. et al. MoleculeNet: a benchmark for molecular machine learning. Chem. Sci. 9, 513–530 (2018).
Cid, J. M. et al. Discovery of 3-cyclopropylmethyl-7-(4-phenylpiperidin-1-yl)-8-trifluoromethyl [1,2,4] triazolo [4,3-a] pyridine (JNJ-42153605): a positive allosteric modulator of the metabotropic glutamate 2 receptor. J. Med. Chem. 55, 8770–8789 (2012).
Wildman, S. A. & Crippen, G. M. Prediction of physicochemical parameters by atomic contributions. J. Chem. Inf. Comput. Sci. 39, 868–873 (1999).
Wishart, D. S. et al. DrugBank 5.0: a major update to the DrugBank database for 2018. Nucleic Acids Res. 46, D1074–D1082 (2018).
Olivecrona, M., Blaschke, T., Engkvist, O. & Chen, H. Molecular de-novo design through deep reinforcement learning. J. Cheminform. 9, 48 (2017).
Schlichtkrull, M. et al. Modeling relational data with graph convolutional networks. In Proc. 15th International Conference, ESWC 593–607 (Springer, 2018).
Wu, Z. et al. Leveraging language model for advanced multi-property molecular optimization via prompt engineering. Zenodo https://doi.org/10.5281/zenodo.11080951 (2023).
Acknowledgements
This study was financially supported by the National Key R&D Program of China (grant number 2021YFF1201400 to T.H.), the National Natural Science Foundation of China (grant numbers 22220102001 to T.H., 82404512 to Z.W. and 22303083 to J.W.), the Natural Science Foundation of Zhejiang Province of China (grant number LD22H300001 to T.H.), the China Postdoctoral Science Foundation (grant number 2023M742993 to Z.W.) and the Postdoctoral Fellowship Program of CPSF (grant number GZB20240671 to Z.W.).
Author information
Authors and Affiliations
Contributions
T.H., C.Y.H., D.C. and Z.W. designed the research study. Z.W. developed the method and wrote the code. Z.W., O.Z., X.W., L.F., H.Z., J.W., H.D., D.J. and Y.D. performed the analysis. Z.W., T.H., C.Y.H. and D.C. wrote the paper. All authors read and approved the paper.
Corresponding authors
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Nature Machine Intelligence thanks Arvind Ramanathan and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Extended data
Supplementary information
Supplementary Information
Supplementary Figs. 1–5, Discussion 1–5 and Tables 1–5.
Source data
Source Data Fig. 1
The structural optimization of compound 1. The table provides the Source Data for Fig. 1 and also includes the SlogP and BBBP prediction values of the original and optimized molecules. The table also presents information on the top-10 optimized molecules by Prompt-MolOptP on the basis of prompt tokens ‘BBBP’ and ‘BBBP lipop’, including the SMILES, SlogP and BBBP prediction values.
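A hypothetical sketch of how such a source data table might be loaded and cross-checked is given below; the file name and column header are assumptions (the actual source data file may differ), and SlogP is recomputed here with the Wildman–Crippen logP (ref. 69) as implemented in RDKit, purely for illustration.

```python
# Hypothetical sketch for inspecting the Fig. 1 source data: the file name and the
# "SMILES" column header below are assumptions, not the actual headers shipped with
# the paper. SlogP is recomputed with RDKit's Wildman-Crippen logP as a cross-check.
import pandas as pd
from rdkit import Chem
from rdkit.Chem import Crippen

df = pd.read_excel("source_data_fig1.xlsx")  # hypothetical file name

for smiles in df["SMILES"]:                  # hypothetical column name
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        print(f"Could not parse: {smiles}")
        continue
    print(smiles, round(Crippen.MolLogP(mol), 2))
```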
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Wu, Z., Zhang, O., Wang, X. et al. Leveraging language model for advanced multiproperty molecular optimization via prompt engineering. Nat Mach Intell 6, 1359–1369 (2024). https://doi.org/10.1038/s42256-024-00916-5