Abstract
We explore machine unlearning in the ___domain of large language models (LLMs), referred to as LLM unlearning. This initiative aims to eliminate undesirable data influence (for example, sensitive or illegal information) and the associated model capabilities, while maintaining the integrity of essential knowledge generation and not affecting causally unrelated information. We envision LLM unlearning becoming a pivotal element in the life-cycle management of LLMs, potentially standing as an essential foundation for developing generative artificial intelligence that is not only safe, secure and trustworthy but also resource-efficient without the need for full retraining. We navigate the unlearning landscape in LLMs across conceptual formulation, methodologies, metrics and applications. In particular, we highlight the often-overlooked aspects of existing LLM unlearning research, for example, unlearning scope, data–model interaction and multifaceted efficacy assessment. We also draw connections between LLM unlearning and related areas such as model editing, influence functions, model explanation, adversarial training and reinforcement learning. Furthermore, we outline an effective assessment framework for LLM unlearning and explore its applications in copyright and privacy safeguards and sociotechnical harm reduction.
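As a concrete illustration of the conceptual formulation summarized above, the sketch below shows one common family of approximate unlearning objectives discussed in this literature: a "gradient difference" update that raises the language-modelling loss on a forget set while keeping the loss on a retain set low. It is a minimal, assumption-laden example rather than the authors' prescribed method; the model name, example texts and retain weight lam are placeholders.

# Illustrative sketch (not the authors' specific method): one "gradient difference"
# style approximate unlearning step for a causal LM. The update raises the loss on
# forget-set text while penalizing loss increases on retain-set text, so that
# causally unrelated capabilities are (ideally) preserved.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; any causal LM works in principle
tok = AutoTokenizer.from_pretrained(model_name)
tok.pad_token = tok.eos_token  # gpt2 ships without a pad token
model = AutoModelForCausalLM.from_pretrained(model_name)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

forget_texts = ["<text whose influence should be removed>"]
retain_texts = ["<text whose capability should be preserved>"]
lam = 1.0  # weight on the utility-preservation (retain) term

def lm_loss(texts):
    batch = tok(texts, return_tensors="pt", padding=True, truncation=True)
    labels = batch["input_ids"].clone()
    labels[batch["attention_mask"] == 0] = -100  # ignore padding in the loss
    return model(**batch, labels=labels).loss

model.train()
for _ in range(1):  # a single illustrative update step
    # Unbounded ascent on the forget loss can collapse the model; practical
    # methods bound or reshape this term (for example, preference-style objectives).
    loss = -lm_loss(forget_texts) + lam * lm_loss(retain_texts)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

Even for such a simple update, judging success requires the multifaceted assessment the article emphasizes: verifying that the targeted influence is removed, that retained capabilities remain intact and that causally unrelated knowledge is unaffected.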
Author information
Contributions
S.L. and Y.L. contributed to the conceptualization of the paper. S.L., Yuanshun Yao, J.J., Yuguang Yao and Y.L. prepared the initial draft. Major revisions to the initial draft were contributed by S.C., N.B., P.H., C.Y.L., K.R.V. and M.B., with additional feedback provided by X.X., H.L. and S.K. J.J. contributed to the experimental studies and analyses, and Yuguang Yao and J.J. developed Fig. 1. All authors participated in discussions, contributed to edits and approved the final version of the paper.
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Nature Machine Intelligence thanks Tom Hartvigsen, Wenxuan Zhou and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Liu, S., Yao, Y., Jia, J. et al. Rethinking machine unlearning for large language models. Nat Mach Intell 7, 181–194 (2025). https://doi.org/10.1038/s42256-025-00985-0