Leveraging language model for advanced multiproperty molecular optimization via prompt engineering

Abstract

Optimizing a candidate molecule’s physicochemical and functional properties is a critical task in drug and material design. Although the non-trivial task of balancing multiple (potentially conflicting) optimization objectives is considered ideal for artificial intelligence, technical challenges such as the scarcity of multiproperty-labelled training data have long hindered the development of a satisfactory AI solution. Prompt-MolOpt is a tool for molecular optimization; it makes use of prompt-based embeddings, as used in large language models, to improve the transformer’s ability to optimize molecules for specific property adjustments. Notably, Prompt-MolOpt excels in working with limited multiproperty data (even under the zero-shot setting) by effectively generalizing causal relationships learned from single-property datasets. In comparative evaluations against established models such as JTNN, hierG2G and Modof, Prompt-MolOpt achieves over a 15% relative improvement in multiproperty optimization success rates compared with the leading Modof model. Furthermore, a variant of Prompt-MolOpt, named Prompt-MolOptP, can preserve pharmacophores or any user-specified fragments under structural transformation, further broadening its application scope. By constructing tailored optimization datasets with the protocol introduced in this work, Prompt-MolOpt steers molecular optimization towards ___domain-relevant chemical spaces, enhancing the quality of the optimized molecules. Real-world tests, such as those involving blood–brain barrier permeability optimization, underscore its practical relevance. Prompt-MolOpt offers a versatile approach for multiproperty and multi-site molecular optimization, suggesting its potential utility in chemistry research and drug and material discovery.
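For readers who want a concrete picture of prompt-based conditioning, the following minimal PyTorch sketch illustrates the general idea of prepending learnable property-prompt embeddings to the encoder input of a sequence-to-sequence transformer. The class name, vocabulary size, prompt identifiers and dimensions are illustrative assumptions, and the sketch is written against a recent PyTorch release for brevity; it is not the authors’ Prompt-MolOpt implementation (see Code availability for that).

```python
# Minimal sketch: prompt-conditioned sequence-to-sequence model for
# molecular optimization. All names and sizes are illustrative only.
import torch
import torch.nn as nn


class PromptConditionedSeq2Seq(nn.Module):
    def __init__(self, vocab_size, n_prompts, d_model=256, nhead=8, num_layers=4):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        # One learnable embedding per optimization objective (e.g. a solubility
        # prompt, a BBBP prompt); these play the role of "prompt tokens".
        self.prompt_emb = nn.Embedding(n_prompts, d_model)
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=nhead,
            num_encoder_layers=num_layers, num_decoder_layers=num_layers,
            batch_first=True,
        )
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, src_tokens, prompt_ids, tgt_tokens):
        # src_tokens: (B, Ls) token ids of the input molecule's SMILES
        # prompt_ids: (B, P) ids of the properties to optimize for
        # tgt_tokens: (B, Lt) shifted token ids of the optimized molecule
        src = self.tok_emb(src_tokens)
        prompts = self.prompt_emb(prompt_ids)
        # Prepend the property prompts so the decoder is conditioned on
        # which properties should be adjusted.
        enc_in = torch.cat([prompts, src], dim=1)
        tgt = self.tok_emb(tgt_tokens)
        causal = self.transformer.generate_square_subsequent_mask(tgt.size(1))
        h = self.transformer(enc_in, tgt, tgt_mask=causal)
        return self.out(h)  # (B, Lt, vocab_size) next-token logits


# Shape check with random ids; real training would use tokenized SMILES pairs.
model = PromptConditionedSeq2Seq(vocab_size=64, n_prompts=4)
logits = model(torch.randint(0, 64, (2, 30)),   # source molecule tokens
               torch.tensor([[0, 2], [1, 3]]),  # two property prompts each
               torch.randint(0, 64, (2, 28)))   # target molecule tokens
print(logits.shape)  # torch.Size([2, 28, 64])
```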

Fig. 1: A real-world multiproperty and multi-site BBBP optimization case study of Prompt-MolOptP.
Fig. 2: The construction of the molecular optimization dataset.
Fig. 3: The overall workflow of Prompt-MolOpt.
Fig. 4: Overview of the molecular optimization framework of Prompt-MolOptP.

Data availability

The datasets used in this study and the data generated in this study are available at https://github.com/wzxxxx/Prompt-MolOpt and https://doi.org/10.5281/zenodo.11080951 (ref. 47). Source data are provided with this paper.

Code availability

The models were implemented using Python (v.3.6.13) with dgl (v.0.7.1) and PyTorch (v.1.6.0). The data processing and metrics calculation were implemented using Python (v.3.6.13) with scikit-learn (v.0.21.3), NumPy (v.1.19.2) and Pandas (v.1.1.5). The code for Prompt-MolOpt, Prompt-MolOptm and Prompt-MolOptP is publicly available at https://github.com/wzxxxx/Prompt-MolOpt and https://doi.org/10.5281/zenodo.11080951 (ref. 47).

References

  1. Fromer, J. C. & Coley, C. W. Computer-aided multi-objective optimization in small molecule discovery. Patterns 4, 100678 (2023).

  2. Nicolaou, C. A. & Brown, N. Multi-objective optimization methods in drug design. Drug Discov. Today Technol. 10, e427–e435 (2013).

  3. Jorgensen, W. L. Efficient drug lead discovery and optimization. Acc. Chem. Res. 42, 724–733 (2009).

  4. Leelananda, S. P. & Lindert, S. Computational methods in drug discovery. Beilstein J. Org. Chem. 12, 2694–2718 (2016).

  5. Zhang, X. et al. Efficient and accurate large library ligand docking with KarmaDock. Nat. Comput. Sci. 3, 789–804 (2023).

  6. Shen, C. et al. Boosting protein–ligand binding pose prediction and virtual screening based on residue–atom distance likelihood potential and graph transformer. J. Med. Chem. 65, 10691–10706 (2022).

  7. Maia, E. H. B., Assis, L. C., De Oliveira, T. A., Da Silva, A. M. & Taranto, A. G. Structure-based virtual screening: from classical to artificial intelligence. Front. Chem. 8, 343 (2020).

  8. Gentile, F. et al. Artificial intelligence-enabled virtual screening of ultra-large chemical libraries with deep docking. Nat. Protoc. 17, 672–697 (2022).

  9. Choung, O.-H., Vianello, R., Segler, M., Stiefl, N. & Jiménez-Luna, J. Extracting medicinal chemistry intuition via preference machine learning. Nat. Commun. 14, 6651 (2023).

  10. Cheshire, D. R. How well do medicinal chemists learn from experience? Drug Discov. Today 16, 817–821 (2011).

  11. Shan, J. & Ji, C. MolOpt: a web server for drug design using bioisosteric transformation. Curr. Comput. Aided Drug Des. 16, 460–466 (2020).

  12. Yang, H. et al. ADMETopt: a web server for ADMET optimization in drug design via scaffold hopping. J. Chem. Inf. Model. 58, 2051–2056 (2018).

  13. Dossetter, A. G., Griffen, E. J. & Leach, A. G. Matched molecular pair analysis in drug discovery. Drug Discov. Today 18, 724–731 (2013).

  14. Tu, Z. & Coley, C. W. Permutation invariant graph-to-sequence model for template-free retrosynthesis and reaction prediction. J. Chem. Inf. Model. 62, 3503–3513 (2022).

  15. Jin, W., Barzilay, R. & Jaakkola, T. Multi-objective molecule generation using interpretable substructures. In Proc. 37th International Conference on Machine Learning 4849–4859 (PMLR, 2020).

  16. Kong, D. et al. Dual-space optimization: improved molecule sequence design by latent prompt transformer. Preprint at https://arxiv.org/abs/2402.17179 (2024).

  17. Shi, C. et al. GraphAF: a flow-based autoregressive model for molecular graph generation. Preprint at https://arxiv.org/abs/2001.09382 (2020).

  18. Zang, C. & Wang, F. MoFlow: an invertible flow model for generating molecular graphs. In Proc. 26th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining 617–626 (ACM, 2020).

  19. Jin, W., Barzilay, R. & Jaakkola, T. Junction tree variational autoencoder for molecular graph generation. In Proc. 35th International Conference on Machine Learning 2323–2332 (PMLR, 2018).

  20. Podda, M., Bacciu, D. & Micheli, A. A deep generative model for fragment-based molecule generation. In Proc. 23rd International Conference on Artificial Intelligence and Statistics 2240–2250 (PMLR, 2020).

  21. Chen, Z., Min, M. R., Parthasarathy, S. & Ning, X. A deep generative model for molecule optimization via one fragment modification. Nat. Mach. Intell. 3, 1040–1049 (2021).

  22. Floridi, L. & Chiriatti, M. GPT-3: its nature, scope, limits, and consequences. Minds Mach. 30, 681–694 (2020).

  23. Castro Nascimento, C. M. & Pimentel, A. S. Do large language models understand chemistry? A conversation with ChatGPT. J. Chem. Inf. Model. 63, 1649–1655 (2023).

  24. Guo, H., Zhao, S., Wang, H., Du, Y. & Qin, B. Moltailor: tailoring chemical molecular representation to specific tasks via text prompts. Preprint at https://arxiv.org/abs/2401.11403 (2024).

  25. Ye, G. et al. DrugAssist: a large language model for molecule optimization. Preprint at https://arxiv.org/abs/2401.10334 (2024).

  26. Zhou, K., Yang, J., Loy, C. C. & Liu, Z. Conditional prompt learning for vision-language models. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition 16816–16825 (IEEE, 2022).

  27. He, Y. et al. HyperPrompt: prompt-based task-conditioning of transformers. Preprint at https://arxiv.org/abs/2203.00759 (2022).

  28. Liu, P. et al. Pre-train, prompt, and predict: a systematic survey of prompting methods in natural language processing. ACM Comput. Surv. 55, 1–35 (2023).

  29. Zhang, X. et al. Clamp: prompt-based contrastive learning for connecting language and animal pose. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition 23272–23281 (IEEE, 2023).

  30. Teterwak, P., Sun, X., Plummer, B. A., Saenko, K. & Lim, S.-N. CLAMP: contrastive language model prompt-tuning. Preprint at https://arxiv.org/abs/2312.01629 (2023).

  31. Born, J. & Manica, M. Regression transformer enables concurrent sequence regression and generation for molecular language modelling. Nat. Mach. Intell. 5, 432–444 (2023).

  32. Seidl, P., Vall, A., Hochreiter, S. & Klambauer, G. Enhancing activity prediction models in drug discovery with the ability to understand human language. In Proc. 40th International Conference on Machine Learning 30458–30490 (PMLR, 2023).

  33. Wu, Z. et al. Chemistry-intuitive explanation of graph neural networks for molecular property prediction with substructure masking. Nat. Commun. 14, 2585 (2023).

  34. Wu, Z. et al. Mining toxicity information from large amounts of toxicity data. J. Med. Chem. 64, 6924–6936 (2021).

  35. Jin, W., Yang, K., Barzilay, R. & Jaakkola, T. Learning multimodal graph-to-graph translation for molecule optimization. In Proc. International Conference on Learning Representations 856 (ICLR, 2019).

  36. Jin, W., Barzilay, R. & Jaakkola, T. Hierarchical generation of molecular graphs using structural motifs. In Proc. 37th International Conference on Machine Learning 4839–4848 (PMLR, 2020).

  37. Bickerton, G. R., Paolini, G. V., Besnard, J., Muresan, S. & Hopkins, A. L. Quantifying the chemical beauty of drugs. Nat. Chem. 4, 90–98 (2012).

  38. Delaney, J. S. ESOL: estimating aqueous solubility directly from molecular structure. J. Chem. Inf. Comput. Sci. 44, 1000–1005 (2004).

  39. Xu, C. et al. In silico prediction of chemical Ames mutagenicity. J. Chem. Inf. Model. 52, 2840–2847 (2012).

  40. Xiong, G. et al. ADMETlab 2.0: an integrated online platform for accurate and comprehensive predictions of ADMET properties. Nucleic Acids Res. 49, W5–W14 (2021).

  41. Wu, Z. et al. MoleculeNet: a benchmark for molecular machine learning. Chem. Sci. 9, 513–530 (2018).

  42. Cid, J. M. et al. Discovery of 3-cyclopropylmethyl-7-(4-phenylpiperidin-1-yl)-8-trifluoromethyl [1,2,4] triazolo [4,3-a] pyridine (JNJ-42153605): a positive allosteric modulator of the metabotropic glutamate 2 receptor. J. Med. Chem. 55, 8770–8789 (2012).

  43. Wildman, S. A. & Crippen, G. M. Prediction of physicochemical parameters by atomic contributions. J. Chem. Inf. Comput. Sci. 39, 868–873 (1999).

  44. Wishart, D. S. et al. DrugBank 5.0: a major update to the DrugBank database for 2018. Nucleic Acids Res. 46, D1074–D1082 (2018).

  45. Olivecrona, M., Blaschke, T., Engkvist, O. & Chen, H. Molecular de-novo design through deep reinforcement learning. J. Cheminform. 9, 48 (2017).

  46. Schlichtkrull, M. et al. Modeling relational data with graph convolutional networks. In Proc. 15th International Conference, ESWC 593–607 (Springer, 2018).

  47. Wu, Z. et al. Leveraging language model for advanced multi-property molecular optimization via prompt engineering. Zenodo https://doi.org/10.5281/zenodo.11080951 (2023).

Acknowledgements

This study was financially supported by the National Key R&D Program of China (grant number 2021YFF1201400 to T.H.), the National Natural Science Foundation of China (grant numbers 22220102001 to T.H., 82404512 to Z.W. and 22303083 to J.W.), the Natural Science Foundation of Zhejiang Province of China (grant number LD22H300001 to T.H.), the China Postdoctoral Foundation (grant number 2023M742993 to Z.W.) and the Postdoctoral Fellowship Program of CPSF (grant number GZB20240671 to Z.W.).

Author information

Authors and Affiliations

Authors

Contributions

T.H., C.Y.H., D.C. and Z.W. designed the research study. Z.W. developed the method and wrote the code. Z.W., O.Z., X.W., L.F., H.Z., J.W., H.D., D.J. and Y.D. performed the analysis. Z.W., T.H., C.Y.H. and D.C. wrote the paper. All authors read and approved the paper.

Corresponding authors

Correspondence to Dongsheng Cao, Chang-Yu Hsieh or Tingjun Hou.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Machine Intelligence thanks Arvind Ramanathan and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Table 1 The performance of Prompt-MolOpt on single-property optimization
Extended Data Table 2 The performance of Prompt-MolOpt on multiproperty optimization

Supplementary information

Supplementary Information

Supplementary Figs. 1–5, Discussion 1–5 and Tables 1–5.

Reporting Summary

Source data

Source Data Fig. 1

The structural optimization of compound 1. The table provides the Source Data for Fig. 1 and also includes the SlogP and BBBP prediction values of the original and optimized molecules. The table also presents information on the top-10 molecules optimized by Prompt-MolOptP on the basis of the prompt tokens ‘BBBP’ and ‘BBBP lipop’, including the SMILES, SlogP and BBBP prediction values.
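For readers working with the table, the short RDKit snippet below shows one way SlogP values like those listed could be recomputed from a SMILES string using the Wildman–Crippen estimator (ref. 43). The helper function and example SMILES are illustrative assumptions and are not part of the paper’s source-data pipeline.

```python
# Minimal sketch: recompute Wildman–Crippen logP (SlogP) from SMILES with RDKit.
from rdkit import Chem
from rdkit.Chem import Crippen


def slogp(smiles: str) -> float:
    """Return the Wildman–Crippen logP (SlogP) for a SMILES string."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"could not parse SMILES: {smiles}")
    return Crippen.MolLogP(mol)


# Example with an arbitrary molecule (phenol); replace with a SMILES from the table.
print(slogp("c1ccccc1O"))
```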

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Wu, Z., Zhang, O., Wang, X. et al. Leveraging language model for advanced multiproperty molecular optimization via prompt engineering. Nat Mach Intell 6, 1359–1369 (2024). https://doi.org/10.1038/s42256-024-00916-5
