Custom CRISPR–Cas9 PAM variants via scalable engineering and machine learning

Abstract

Engineering and characterizing proteins can be time-consuming and cumbersome, motivating the development of generalist CRISPR–Cas enzymes1,2,3,4 to enable diverse genome-editing applications. However, such enzymes have caveats such as an increased risk of off-target editing3,5,6. Here, to enable scalable reprogramming of Cas9 enzymes, we combined high-throughput protein engineering with machine learning to derive bespoke editors that are more uniquely suited to specific targets. Through structure–function-informed saturation mutagenesis and bacterial selections, we obtained nearly 1,000 engineered SpCas9 enzymes and characterized their protospacer-adjacent motif (PAM)7 requirements to train a neural network that relates amino acid sequence to PAM specificity. By utilizing the resulting PAM machine learning algorithm (PAMmla) to predict the PAMs of 64 million SpCas9 enzymes, we identified efficacious and specific enzymes that outperform evolution-based and engineered SpCas9 enzymes as nucleases and base editors in human cells while reducing off-targets. An in silico-directed evolution method enables user-directed Cas9 enzyme design, including for allele-selective targeting of the RHO P23H allele in human cells and mice. Together, PAMmla integrates machine learning and protein engineering to curate a catalogue of SpCas9 enzymes with distinct PAM requirements, motivating a shift away from generalist enzymes towards safe and efficient bespoke Cas9 variants.

Fig. 1: Scalable characterization of hundreds of SpCas9 PAM variant enzymes.
Fig. 2: Development of a machine learning model to predict SpCas9 PAM preference from amino acid sequence.
Fig. 3: Characterization of the PAM requirements of PAMmla-predicted enzymes.
Fig. 4: Genome editing and off-target analysis in human cells with PAMmla-predicted enzymes.
Fig. 5: In silico-directed evolution of an allele-specific editor for the RHO P23H allele.

Data availability

Primary datasets for this study are available in Supplementary Table 1 (HT-PAMDA data), Supplementary Table 2 (PAMmla predictions), Supplementary Table 6 (GUIDE-seq2 data) and Supplementary Table 7 (source data). The HT-PAMDA training datasets are also available on GitHub (https://github.com/RachelSilverstein/PAMmla). NGS results are available through the NCBI Sequence Read Archive under the accession code PRJNA1169103. PAMmla predictions for all 64 million SpCas9(6AA) enzymes can be viewed through an online webtool (https://pammla.streamlit.app/). The UniRef100 dataset used to generate multiple sequence alignments for natural sequence models can be downloaded at UniProt (https://www.uniprot.org/uniref/).

Code availability

Custom scripts, the PAMmla source code and the in silico-directed evolution code are available on GitHub (https://github.com/RachelSilverstein/PAMmla and https://github.com/RachelSilverstein/multiplex_seq_analysis).

References

  1. Nishimasu, H. et al. Engineered CRISPR–Cas9 nuclease with expanded targeting space. Science 361, 1259–1262 (2018).

  2. Hu, J. H. et al. Evolved Cas9 variants with broad PAM compatibility and high DNA specificity. Nature 556, 57–63 (2018).

  3. Walton, R. T., Christie, K. A., Whittaker, M. N. & Kleinstiver, B. P. Unconstrained genome targeting with near-PAMless engineered CRISPR-Cas9 variants. Science 368, 290–296 (2020).

  4. Miller, S. M. et al. Continuous evolution of SpCas9 variants compatible with non-G PAMs. Nat. Biotechnol. 38, 471–481 (2020).

  5. Zhang, W. et al. In-depth assessment of the PAM compatibility and editing activities of Cas9 variants. Nucleic Acids Res. 49, 8785–8795 (2021).

  6. Hibshman, G. N. et al. Unraveling the mechanisms of PAMless DNA interrogation by SpRY-Cas9. Nat. Commun. 15, 3663 (2024).

  7. Mojica, F. J. M., Díez-Villaseñor, C., García-Martínez, J. & Almendros, C. Short motif sequences determine the targets of the prokaryotic CRISPR defence system. Microbiology 155, 733–740 (2009).

  8. Liu, G., Lin, Q., Jin, S. & Gao, C. The CRISPR–Cas toolbox and gene editing technologies. Mol. Cell 82, 333–347 (2022).

  9. Pacesa, M., Pelea, O. & Jinek, M. Past, present, and future of CRISPR genome editing technologies. Cell 187, 1076–1100 (2024).

  10. Anders, C., Niewoehner, O., Duerst, A. & Jinek, M. Structural basis of PAM-dependent target DNA recognition by the Cas9 endonuclease. Nature 513, 569–573 (2014).

  11. Sternberg, S. H., Redding, S., Jinek, M., Greene, E. C. & Doudna, J. A. DNA interrogation by the CRISPR RNA-guided endonuclease Cas9. Nature 507, 62–67 (2014).

  12. Jinek, M. et al. A programmable dual-RNA-guided DNA endonuclease in adaptive bacterial immunity. Science 337, 816–821 (2012).

  13. Jiang, W., Bikard, D., Cox, D., Zhang, F. & Marraffini, L. A. RNA-guided editing of bacterial genomes using CRISPR-Cas systems. Nat. Biotechnol. 31, 233–239 (2013).

  14. Kleinstiver, B. P. et al. Engineered CRISPR-Cas9 nucleases with altered PAM specificities. Nature 523, 481–485 (2015).

  15. Goldberg, G. W. et al. Engineered dual selection for directed evolution of SpCas9 PAM specificity. Nat. Commun. 12, 349 (2021).

  16. Chatterjee, P. et al. A Cas9 with PAM recognition for adenine dinucleotides. Nat. Commun. 11, 2474 (2020).

  17. Zhao, L. et al. PAM-flexible genome editing with an engineered chimeric Cas9. Nat. Commun. 14, 6175 (2023).

  18. Chatterjee, P. et al. An engineered ScCas9 with broad PAM range and high specificity and activity. Nat. Biotechnol. 38, 1154–1158 (2020).

  19. Kleinstiver, B. P. et al. Broadening the targeting range of Staphylococcus aureus CRISPR-Cas9 by modifying PAM recognition. Nat. Biotechnol. 33, 1293–1298 (2015).

  20. Huang, T. P. et al. High-throughput continuous evolution of compact Cas9 variants targeting single-nucleotide-pyrimidine PAMs. Nat. Biotechnol. 41, 96–107 (2022).

  21. Wu, Y. et al. Genome-wide analyses of PAM-relaxed Cas9 genome editors reveal substantial off-target effects by ABE8e in rice. Plant Biotechnol. J. 20, 1670–1682 (2022).

  22. Shi, H. et al. Rapid two-step target capture ensures efficient CRISPR-Cas9-guided genome editing. Mol. Cell 85, 1730–1742.e9 (2025).

  23. Yang, K. K., Wu, Z. & Arnold, F. H. Machine-learning-guided directed evolution for protein engineering. Nat. Methods 16, 687–694 (2019).

  24. Wu, Z., Jennifer Kan, S. B., Lewis, R. D., Wittmann, B. J. & Arnold, F. H. Machine learning-assisted directed protein evolution with combinatorial libraries. Proc. Natl Acad. Sci. USA 116, 8852–8858 (2019).

  25. Wittmann, B. J., Yue, Y. & Arnold, F. H. Informed training set design enables efficient machine learning-assisted directed protein evolution. Cell Syst. 12, 1026–1045.e7 (2021).

  26. Thean, D. G. L. et al. Machine learning-coupled combinatorial mutagenesis enables resource-efficient engineering of CRISPR-Cas9 genome editor activities. Nat. Commun. 13, 2219 (2022).

  27. Hie, B. L. & Yang, K. K. Adaptive machine learning for protein engineering. Curr. Opin. Struct. Biol. 72, 145–152 (2022).

  28. Wittmann, B. J., Johnston, K. E., Wu, Z. & Arnold, F. H. Advances in machine learning for directed evolution. Curr. Opin. Struct. Biol. 69, 11–18 (2021).

  29. Wu, Z., Johnston, K. E., Arnold, F. H. & Yang, K. K. Protein sequence design with deep generative models. Curr. Opin. Chem. Biol. 65, 18–27 (2021).

  30. Biswas, S., Khimulya, G., Alley, E. C., Esvelt, K. M. & Church, G. M. Low-N protein engineering with data-efficient deep learning. Nat. Methods 18, 389–396 (2021).

  31. Makowski, E. K., Chen, H.-T. & Tessier, P. M. Simplifying complex antibody engineering using machine learning. Cell Syst. 14, 667–675 (2023).

  32. Hie, B. L. et al. Efficient evolution of human antibodies from general protein language models. Nat. Biotechnol. 42, 275–283 (2024).

  33. Saka, K. et al. Antibody design using LSTM based deep generative model from phage display library for affinity maturation. Sci. Rep. 11, 5852 (2021).

  34. Mason, D. M. et al. Optimization of therapeutic antibodies by predicting antigen specificity from antibody sequence via deep learning. Nat. Biomed. Eng. 5, 600–612 (2021).

  35. Gupta, A. et al. An improved predictive recognition model for Cys2-His2 zinc finger proteins. Nucleic Acids Res. 42, 4800–4812 (2014).

  36. Aizenshtein-Gazit, S. & Orenstein, Y. DeepZF: improved DNA-binding prediction of C2H2-zinc-finger proteins by deep transfer learning. Bioinformatics 38, ii62–ii67 (2022).

  37. Ichikawa, D. M. et al. A universal deep-learning model for zinc finger design enables transcription factor reprogramming. Nat. Biotechnol. 41, 1117–1129 (2023).

  38. Bryant, D. H. et al. Deep diversification of an AAV capsid protein by machine learning. Nat. Biotechnol. 39, 691–696 (2021).

  39. Ogden, P. J., Kelsic, E. D., Sinai, S. & Church, G. M. Comprehensive AAV capsid fitness landscape reveals a viral gene and enables machine-guided design. Science 366, 1139–1143 (2019).

  40. Eid, F.-E. et al. Systematic multi-trait AAV capsid engineering for efficient gene delivery. Nat. Commun. 15, 6602 (2024).

  41. Malbranke, C. et al. Computational design of novel Cas9 PAM-interacting domains using evolution-based modelling and structural quality assessment. PLoS Comput. Biol. 19, e1011621 (2023).

  42. Kleinstiver, B. P. et al. High-fidelity CRISPR-Cas9 variants with undetectable genome-wide off-targets. Nature 529, 490–495 (2016).

  43. Joung, K. & Kleinstiver, B. US20230407277A1 — engineered CRISPR-Cas9 nucleases with altered PAM specificity. Google Patents https://patents.google.com/patent/US20230407277A1/en (2023).

  44. Chen, Z. & Zhao, H. A highly sensitive selection method for directed evolution of homing endonucleases. Nucleic Acids Res. 33, e154 (2005).

  45. Quick, J. et al. Multiplex PCR method for MinION and Illumina sequencing of Zika and other virus genomes directly from clinical samples. Nat. Protoc. 12, 1261–1276 (2017).

  46. Walton, R. T., Hsu, J. Y., Joung, J. K. & Kleinstiver, B. P. Scalable characterization of the PAM requirements of CRISPR–Cas enzymes using HT-PAMDA. Nat. Protoc. 16, 1511–1547 (2021).

  47. Georgiev, A. G. Interpretable numerical descriptors of amino acid space. J. Comput. Biol. 16, 703–723 (2009).

  48. Lundberg, S. M. & Lee, S.-I. A unified approach to interpreting model predictions. Preprint at https://arxiv.org/abs/1705.07874 (2017).

  49. Rees, H. A. & Liu, D. R. Base editing: precision chemistry on the genome and transcriptome of living cells. Nat. Rev. Genet. 19, 770–788 (2018).

  50. Richter, M. F. et al. Phage-assisted evolution of an adenine base editor with improved Cas domain compatibility and activity. Nat. Biotechnol. 38, 883–891 (2020).

  51. Neugebauer, M. E. et al. Evolution of an adenine base editor into a small, efficient cytosine base editor with low off-target activity. Nat. Biotechnol. 41, 673–685 (2022).

  52. Newby, G. A. et al. Base editing of haematopoietic stem cells rescues sickle cell disease in mice. Nature 595, 295–302 (2021).

  53. Lazzarotto, C. R. et al. Population-scale cellular GUIDE-seq-2 and biochemical CHANGE-seq-R profiles reveal human genetic variation frequently affects Cas9 off-target activity. Preprint at bioRxiv https://doi.org/10.1101/2025.02.10.637517 (2025).

  54. Sweeney, C. L. et al. Correction of X-CGD patient HSPCs by targeted CYBB cDNA insertion using CRISPR/Cas9 with 53BP1 inhibition for enhanced homology-directed repair. Gene Ther. 28, 373–390 (2021).

  55. De Ravin, S. S. et al. CRISPR-Cas9 gene repair of hematopoietic stem cells from patients with X-linked chronic granulomatous disease. Sci. Transl. Med. 9, eaah3480 (2017).

  56. Christie, K. A. et al. Towards personalised allele-specific CRISPR gene editing to treat autosomal dominant disorders. Sci. Rep. 7, 16174 (2017).

  57. Sung, C. H. et al. Rhodopsin mutations in autosomal dominant retinitis pigmentosa. Proc. Natl Acad. Sci. USA 88, 6481–6485 (1991).

  58. Dryja, T. P. et al. A point mutation of the rhodopsin gene in one form of retinitis pigmentosa. Nature 343, 364–366 (1990).

  59. Hartong, D. T., Berson, E. L. & Dryja, T. P. Retinitis pigmentosa. Lancet 368, 1795–1809 (2006).

  60. LaVail, M. M. et al. Ribozyme rescue of photoreceptor cells in P23H transgenic rats: long-term survival and late-stage therapy. Proc. Natl Acad. Sci. USA 97, 11488–11493 (2000).

  61. Li, P. et al. Allele-specific CRISPR-Cas9 genome editing of the single-base P23H mutation for rhodopsin-associated dominant retinitis pigmentosa. CRISPR J. 1, 55–64 (2018).

  62. Hsu, P. D. et al. DNA targeting specificity of RNA-guided Cas9 nucleases. Nat. Biotechnol. 31, 827–832 (2013).

  63. Shin, J. W. et al. Permanent inactivation of Huntington’s disease mutation by personalized allele-specific CRISPR/Cas9. Hum. Mol. Genet. 25, 4566–4576 (2016).

  64. Courtney, D. G. et al. CRISPR/Cas9 DNA cleavage at SNP-derived PAM enables both in vitro and in vivo KRT12 mutation-specific targeting. Gene Ther. 23, 108–112 (2015).

  65. Ciciani, M. et al. Automated identification of sequence-tailored Cas9 proteins using massive metagenomic data. Nat. Commun. 13, 6474 (2022).

  66. Pedrazzoli, E. et al. CoCas9 is a compact nuclease from the human microbiome for efficient and precise genome editing. Nat. Commun. 15, 3478 (2024).

  67. Li, L. et al. Machine learning optimization of candidate antibody yields highly diverse sub-nanomolar affinity antibody libraries. Nat. Commun. 14, 3454 (2023).

  68. Weinstein, J. Y. et al. Designed active-site library reveals thousands of functional GFP variants. Nat. Commun. 14, 2890 (2023).

  69. Frazer, J. et al. Disease variant prediction with deep generative models of evolutionary data. Nature 599, 91–95 (2021).

  70. Notin, P. et al. TranceptEVE: combining family-specific and family-agnostic models of protein sequences for improved fitness prediction. Preprint at bioRxiv https://doi.org/10.1101/2022.12.07.519495 (2022).

  71. Hopf, T. A. et al. The EVcouplings Python framework for coevolutionary sequence analysis. Bioinformatics 35, 1582–1584 (2019).

  72. Ruffolo, J. A. et al. Design of highly functional genome editors by modeling the universe of CRISPR-Cas sequences. Preprint at bioRxiv https://doi.org/10.1101/2024.04.22.590591 (2024).

  73. Kleinstiver, B. P., Fernandes, A. D., Gloor, G. B. & Edgell, D. R. A unified genetic, computational and experimental framework identifies functionally relevant residues of the homing endonuclease I-BmoI. Nucleic Acids Res. 38, 2411–2427 (2010).

  74. Gibson, D. G. et al. Enzymatic assembly of DNA molecules up to several hundred kilobases. Nat. Methods 6, 343–345 (2009).

  75. Alves, C. R. R. et al. Optimization of base editors for the functional correction of SMN2 as a treatment for spinal muscular atrophy. Nat. Biomed. Eng. 8, 118–131 (2024).

  76. Nelson, J. W. et al. Engineered pegRNAs improve prime editing efficiency. Nat. Biotechnol. 40, 402–410 (2022).

  77. Christie, K. A. et al. Precise DNA cleavage using CRISPR-SpRYgests. Nat. Biotechnol. 41, 409–416 (2022).

  78. Robichaux, M. A. et al. Subcellular localization of mutant P23H rhodopsin in an RFP fusion knock-in mouse model of retinitis pigmentosa. Dis. Model. Mech. 15, dmm049336 (2022).

  79. Chan, F., Bradley, A., Wensel, T. G. & Wilson, J. H. Knock-in human rhodopsin-GFP fusions as mouse models for human disease and targets for gene therapy. Proc. Natl Acad. Sci. USA 101, 9109–9114 (2004).

  80. Kleinstiver, B. P. et al. Engineered CRISPR–Cas12a variants with increased activities and improved targeting ranges for gene, epigenetic and base editing. Nat. Biotechnol. 37, 276–282 (2019).

  81. Rohland, N. & Reich, D. Cost-effective, high-throughput DNA sequencing libraries for multiplexed target capture. Genome Res. 22, 939–946 (2012).

  82. Tareen, A. & Kinney, J. B. Logomaker: beautiful sequence logos in Python. Bioinformatics 36, 2272–2274 (2020).

  83. Ofer, D. & Linial, M. ProFET: feature engineering captures high-level protein functions. Bioinformatics 31, 3429–3436 (2015).

  84. Chollet, F. keras. GitHub https://github.com/fchollet/keras (2015).

  85. Pedregosa, F. et al. Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011).

  86. Chen, H., Lundberg, S. M. & Lee, S. I. Explaining a series of models by propagating Shapley values. Nat. Commun. 13, 4512 (2022).

  87. Song, D., Xi, N. M., Li, J. J. & Wang, L. scSampler: fast diversity-preserving subsampling of large-scale single-cell transcriptomic data. Bioinformatics 38, 3126–3127 (2022).

  88. McInnes, L., Healy, J. & Melville, J. UMAP: uniform manifold approximation and projection for dimension reduction. Preprint at https://arxiv.org/abs/1802.03426 (2018).

  89. Anzalone, A. V. et al. Search-and-replace genome editing without double-strand breaks or donor DNA. Nature 576, 149–157 (2019).

  90. Clement, K. et al. CRISPResso2 provides accurate and rapid genome editing sequence analysis. Nat. Biotechnol. 37, 224–226 (2019).

  91. Amoli, M. M., Carthy, D., Platt, H. & Ollier, W. E. R. EBV immortalization of human B lymphocytes separated from small volumes of cryo-preserved whole blood. Int. J. Epidemiol. 37, i41–i45 (2008).

  92. Tsai, S. Q. et al. GUIDE-seq enables genome-wide profiling of off-target cleavage by CRISPR-Cas nucleases. Nat. Biotechnol. 33, 187–197 (2014).

  93. Picelli, S. et al. Tn5 transposase and tagmentation procedures for massively scaled sequencing projects. Genome Res. 24, 2033–2040 (2014).

  94. Tsai, S. Q., Topkar, V. V., Joung, J. K. & Aryee, M. J. Open-source guideseq software for analysis of GUIDE-seq data. Nat. Biotechnol. 34, 483 (2016).

  95. Emsley, P. & Cowtan, K. Coot: model-building tools for molecular graphics. Acta Crystallogr. D Biol. Crystallogr. 60, 2126–2132 (2004).

  96. Anders, C., Bargsten, K. & Jinek, M. Structural plasticity of PAM recognition by engineered variants of the RNA-guided endonuclease Cas9. Mol. Cell 61, 895–902 (2016).

  97. Goddard, T. D. et al. UCSF ChimeraX: meeting modern challenges in visualization and analysis. Protein Sci. 27, 14–25 (2018).

  98. Landrum, M. J. et al. ClinVar: public archive of relationships among sequence variation and human phenotype. Nucleic Acids Res. 42, D980–D985 (2014).

Acknowledgements

We thank L. Ma, E. Oliver, M. Prew and M. Welch for assistance with plasmid cloning; J. Lemanski, M. Talkowski and the Genomics and Technology Core in the Center for Genomic Medicine at the Massachusetts General Hospital (MGH) for technical support with the TapeStation; R. Mouro Pinto for access to and assistance with the Pippin Prep; J. Zhong and M. Suva for access to and assistance with the NextSeq2000; P. Chatterjee and G. Church for discussions; M. Ma at the NIH for technical assistance; Z. Hebert and M. Berkeley at the Dana-Farber Cancer Institute Molecular Biology Core Facilities for support with NextSeq500 sequencing; and W. Wang from St. Jude Protein Production Core Facility for recombinant Tn5. We acknowledge funding from Natural Sciences and Engineering Research Council of Canada Postgraduate Scholarship-Doctoral (PGS D-567791 to R.A.S.), a Chan Zuckerberg Initiative Award (Neurodegeneration Challenge Network, CZI2018-191853; to D.S.M.), a MGH Executive Committee on Research (ECOR) Fund for Medical Discovery Fundamental Research Fellowship Award (to K.A.C.), Peter und Traudl Engelhorn Stiftung (to M.P.), the Rappaport MGH Research Scholar Award 2024–2029 (to L.P.), the Fighting Blindness Foundation (to Q.L.), a MGH ECOR Howard M. Goodman Fellowship (to B.P.K.), the Kayden-Lambert MGH Research Scholar Award 2023–2028 (to B.P.K.), the Gilbert Family Foundation’s Gene Therapy Initiative Grant no. 521004 (to B.P.K.), and NIH grants TR01CA260415 (to D.S.M.), U01AI176470 (to S.Q.T.), R35HG010717 (to L.P.), UM1HG012010 (to L.P. and B.P.K.), R01EY033107 (to Q.L.), P30EY014104 (MEE core support), DP2CA281401 (to B.P.K.) and P01HL142494 (to B.P.K.).

Author information

Contributions

R.A.S. and B.P.K. conceived of and designed the study. R.A.S. created the PAMmla model. R.A.S., N.K., A.-S.K., R.T.W., B.K.S., K.A.C. and L.L.H. designed and/or performed experiments related to the engineering and/or characterization of CRISPR–Cas enzymes, analysed data, created cell lines or cloned essential plasmids. J.D., R.A.S. and L.P. created the web interface for PAMmla. R.J.M., A.B.C. and G.A.D. generated mRNAs encoding base editors for BCL experiments. C.R.L., Y.L., A.M., E.O.U. and S.Q.T. advised on establishing the GUIDE-seq2 method in the Kleinstiver laboratory in advance of publication. M.P. and B.E.C. modelled protein structures. R.M.B. and Q.L. designed and performed the in vivo mouse experiments. A.D.S. and D.S.M. performed comparisons with evolutionary models. S.S.D.R. performed the experiments in CYBB T362I BCLs. B.P.K. contributed to experimental design, data analysis and oversaw the study. R.A.S. and B.P.K. wrote the manuscript, with contributions and/or revisions from all authors.

Corresponding author

Correspondence to Benjamin P. Kleinstiver.

Ethics declarations

Competing interests

R.A.S. and B.P.K. are inventors on a patent application filed by Mass General Brigham (MGB) that describes the development of PAMmla. B.P.K. and R.T.W. are inventors on additional patents or patent applications filed by MGB that describe genome engineering technologies related to the current study. S.Q.T. is an inventor on a patent application for GUIDE-seq, and is a member of the scientific advisory boards of Ensoma and Prime Medicine. L.P. has financial interests in Edilytics and SeQure Dx. Q.L. is a consultant for Entrada Therapeutics. B.P.K. is a consultant for EcoR1 capital, Novartis Venture Fund and Jumble Therapeutics, and is on the scientific advisory boards of Acrigen Biosciences, Life Edit Therapeutics and Prime Medicine. B.P.K. has a financial interest in Prime Medicine, Inc., a company developing therapeutic CRISPR–Cas technologies for gene editing. The interests of L.P. and B.P.K. were reviewed and are managed by MGH and MGB in accordance with their conflict-of-interest policies. The other authors declare no competing interests.

Peer review

Peer review information

Nature thanks David Bikard, Lennart Randau and Alan Wong for their contribution to the peer review of this work. Peer reviewer reports are available.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data figures and tables

Extended Data Fig. 1 Targeting range and characterization of previous engineered SpCas9 PAM variant enzymes.

(a) Quantification of pathogenic and likely pathogenic single nucleotide variants (SNVs) from ClinVar (ref. 98) that are theoretically revertible using ABE or CBE based on their proximity to an NGG PAM. SNVs were considered editable if a GG dinucleotide PAM was available at the appropriate distance upstream on the correct DNA strand, positioning the SNV anywhere between positions 5 and 9 of the spacer sequence (counting from the PAM-distal end of the spacer; typically called the ‘edit window’ of base editors). (b–d) Heatmap representations of the PAM profiles of SpCas9 enzymes determined using the HT-PAMDA assay (refs. 3,46), for wild-type SpCas9 (panel b), for enzymes with altered PAM requirements (e.g. SpCas9-VRQR (refs. 14,42), SpCas9-VRER (ref. 14), and xCas9 (ref. 2); panel c), and for enzymes with relaxed PAM requirements (e.g. SpCas9-NG (ref. 1), SpG and SpRY (ref. 3), and SpCas9-NRRH/NRCH/NRTH (ref. 4); panel d). The log10 rate constants (k) are the mean of n = 2 replicate HT-PAMDA experiments performed using two distinct spacer sequences. Because the HT-PAMDA assay measures the relative depletion of substrates encoding various PAMs, it may underestimate rate constants for enzymes with highly relaxed PAM requirements such as SpRY and Cas9-NRRH.
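The editability criterion in panel a can be illustrated with a short sketch. This is not the authors' analysis code: it assumes a 20-nt spacer with the NGG PAM immediately 3′ of the spacer, and it assumes the target base (A for ABE, C for CBE) must lie on the strand that carries the protospacer. The function names and the toy sequence are hypothetical.

```python
SPACER_LEN = 20             # assumption: 20-nt spacer, PAM directly 3' of it
EDIT_WINDOW = range(5, 10)  # spacer positions 5-9, counted from the PAM-distal end

_COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def revcomp(seq):
    """Reverse complement of an A/C/G/T string."""
    return "".join(_COMPLEMENT[b] for b in reversed(seq))


def edit_window_hits(strand, idx, target_base):
    """Spacer positions (5-9) at which strand[idx] can be placed with an NGG PAM.

    Returns an empty list if strand[idx] is not the editable base.
    """
    if strand[idx] != target_base:
        return []
    hits = []
    for pos in EDIT_WINDOW:
        spacer_start = idx - (pos - 1)            # 0-based start of the spacer
        gg_start = spacer_start + SPACER_LEN + 1  # skip the N of NGG
        if spacer_start < 0 or gg_start + 2 > len(strand):
            continue
        if strand[gg_start:gg_start + 2] == "GG":
            hits.append(pos)
    return hits


def is_editable(seq, idx, target_base):
    """True if the base at idx is reachable from either DNA strand."""
    top = edit_window_hits(seq, idx, target_base)
    bottom = edit_window_hits(revcomp(seq), len(seq) - 1 - idx, target_base)
    return bool(top or bottom)


# Toy example (made-up sequence): an A at index 10 with a GG at indices 26-27
toy = "T" * 10 + "A" + "T" * 15 + "GG" + "T" * 12
print(edit_window_hits(toy, 10, "A"))
print(is_editable(toy, 10, "A"))
```

Replacing the fixed `"GG"` check with a set of PAMs recognized by a given variant would turn the same scan into a rough estimate of that variant's targeting range.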

Extended Data Fig. 2 Structure-informed saturation mutagenesis and bacterial positive selections for SpCas9 PAM variant enzymes.

(a) Structural representation of the PAM-interacting (PI) domain of SpCas9 showing amino acid residues interacting with a canonical NGG PAM (from PDB ID: 4UN3; ref. 10). (b) Schematic of the bacterial positive selection assay. A plasmid encoding the SpCas9(6AA) library (with randomized NNS codons at SpCas9 positions D1135, S1136, G1218, E1219, R1335, and T1337), a sgRNA expression cassette, and a chloramphenicol resistance gene is transformed into an E. coli strain harboring a selection plasmid encoding an inducible toxic gene and the Cas9 target site (with the protospacer adjacent to a non-canonical 4 nt PAM of interest). Selections were performed similarly to those previously described (refs. 14,44,73), where the ccdB gene (encoding a DNA gyrase toxin) on the selection plasmid is induced by plating on arabinose-containing media. Bacterial colonies survive the selection when they harbor a plasmid that expresses an SpCas9 enzyme variant capable of cleaving the selection plasmid (by recognizing a non-canonical PAM). The schematic of the flask with yellow liquid was adapted from Clker (https://www.clker.com). (c) Summary of the SpCas9 enzymes that survived the bacterial positive selections using selection plasmids encoding each of the 16 NGNN PAMs. The heatmaps depict the percent of SpCas9 enzymes from each of the 16 selections that contain each possible amino acid substitution at each of the six SpCas9(6AA) library positions. Each heatmap is labeled based on the PAM utilized in that set of bacterial selections; the number of enzymes selected from each set of selections is indicated. The bottom panel represents a summary of the composition of amino acid residues at each of the six positions of the SpCas9(6AA) library.
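As a worked check on the library design in panel b, the sketch below enumerates NNS codons (N = any base, S = C or G) under the standard genetic code and counts the resulting diversity across six randomized positions. The variable names are illustrative; the arithmetic reproduces the 64 million SpCas9(6AA) enzymes quoted elsewhere in the article.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons ordered over T, C, A, G
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# NNS codons: any base at positions 1 and 2, C or G at position 3
nns_codons = [c for c in CODON_TABLE if c[2] in "CG"]
encoded = {CODON_TABLE[c] for c in nns_codons}
stop_codons = [c for c in nns_codons if CODON_TABLE[c] == "*"]
amino_acids = sorted(encoded - {"*"})

N_POSITIONS = 6  # D1135, S1136, G1218, E1219, R1335, T1337
dna_variants = len(nns_codons) ** N_POSITIONS
protein_variants = len(amino_acids) ** N_POSITIONS

print(len(nns_codons), "NNS codons;", len(amino_acids), "amino acids; stops:", stop_codons)
print(f"{dna_variants:,} DNA-level variants")
print(f"{protein_variants:,} protein-level variants")
```

NNS randomization keeps all 20 amino acids while admitting only one of the three stop codons, which is a common reason for preferring it over fully random NNN codons.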

Extended Data Fig. 3 Machine learning models to predict PAM profile from amino acid sequence.

a, Comparison of machine learning model architectures (linear regression, random forest, and neural network) and amino acid encodings (one-hot, one-hot plus all pairwise amino acid combinations, and Georgiev47). The R2 value is shown between the experimentally determined k (via HT-PAMDA) and the predicted k (via each ML model) for an internal 5-fold cross-validation on the training set. Each validation set is sub-divided according to the minimum hamming distance (HD) of each variant to the nearest neighbor in the corresponding training set; thus, validation sets become more challenging as HD increases. b, Performance of the optimal PAM machine learning algorithm (PAMmla; comprised of a neural network with one hot encoding) on two additional 80%/20% random train-test splits. c, Proportion of test set SpCas9 enzymes that have a predominant preference for A, C, G, T in the 3rd position of the PAM, or are inactive (based on HT-PAMDA data). d, Comparison of test set ks broken down by nucleotide preference of each test variant at the 3rd position of the PAM (comparing ks experimentally determined by HT-PAMDA versus predicted by PAMmla). Nucleotide preference is defined as the 3rd position nucleotide of each enzyme variant’s most preferred PAM by HT-PAMDA. e, Proportion of test set SpCas9 enzymes that have a predominant preference for A, C, G, T in the 4th position of the PAM, or are inactive (based on HT-PAMDA data). f, Comparison of test set ks broken down by preference of each test set variant at the 4th position of the PAM (comparing ks experimentally determined by HT-PAMDA versus predicted by PAMmla). Nucleotide preference is defined as 4th position nucleotide of each enzyme variant’s most preferred PAM by HT-PAMDA. g, Effect of random over-sampling by most active PAM. The PAMmla model was trained with and without randomly over-sampling the training set to balance the number of enzyme variants with different PAM preferences. 
R² values for the two models were compared on subsets of variants within the test set with different preferences at the 3rd and 4th positions of the PAM. Over-sampling improved performance particularly for under-represented PAM classes (see panels c and e). h, Pearson’s correlations between HT-PAMDA replicates performed with distinct spacer sequences for a set of 28 inactive versus 28 active enzymes within the test set. Dashed line = data median. True labels for active versus inactive enzymes were determined using a cutoff value of 10^−4.3 for the maximum k on any PAM. Enzymes separated into active and inactive classes based on this criterion showed correlation between replicates only for active enzymes, indicating that HT-PAMDA data for enzymes with maximum ks below this cutoff are likely due to non-reproducible noise in the HT-PAMDA assay. i, Correlation between ks experimentally determined by HT-PAMDA versus predicted by PAMmla for inactive variants (maximum HT-PAMDA k < 10^−4.3) within the test set; PAMmla is not predictive of background noise in the HT-PAMDA-determined PAM profiles of inactive enzymes. For all panels that utilize HT-PAMDA data, the log10 rate constants (k) are the mean of n = 2 replicate HT-PAMDA experiments using two distinct spacer sequences. For all scatterplots, each datapoint represents the rate constant of one enzyme variant against one of 64 possible NNNN PAMs.
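
The Hamming-distance stratification in panel a and the random over-sampling in panel g can be illustrated with a minimal Python sketch. This is not the authors' code: the function names, the six-letter example variants, and the balancing rule (duplicate minority-class variants until every class matches the largest one) are assumptions made for illustration.

```python
import random
from collections import defaultdict


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def stratify_by_hd(validation, training):
    """Group validation variants by their minimum Hamming distance (HD)
    to any variant in the training set."""
    bins = defaultdict(list)
    for variant in validation:
        hd = min(hamming(variant, t) for t in training)
        bins[hd].append(variant)
    return dict(bins)


def oversample(variants, labels, seed=0):
    """Randomly duplicate variants of under-represented classes until
    every class is as large as the largest one (assumed balancing rule)."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for variant, label in zip(variants, labels):
        by_class[label].append(variant)
    target = max(len(members) for members in by_class.values())
    balanced = []
    for label, members in by_class.items():
        extra = [rng.choice(members) for _ in range(target - len(members))]
        balanced.extend((v, label) for v in members + extra)
    return balanced


# Hypothetical six-residue variant names, used only as example strings.
training = ["MRRWMR", "KWRQLC"]
validation = ["KRHWMR", "MRRWMA"]
hd_bins = stratify_by_hd(validation, training)  # {2: ["KRHWMR"], 1: ["MRRWMA"]}
```

Variants in higher HD bins share fewer residues with anything the model saw during training, which is why those validation subsets are the harder ones.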

Extended Data Fig. 4 PAMmla feature importance for enzymes targeting different PAM classes.

SHapley Additive exPlanations (SHAP)48 analysis to investigate the impact of amino acid substitutions (i.e., PAMmla features) on model output for each of the 16 NGNN PAMs. SHAP values are shown for 200 enzymes sampled from the training set. The top 10 features with the highest mean absolute SHAP values (greatest absolute impact on model output) are plotted for each PAM.
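
The ranking step described here (ordering features by mean absolute SHAP value and keeping the top 10) can be sketched in a few lines of Python. The SHAP values below are made-up toy numbers and the function name is hypothetical; computing real SHAP values would require the trained model and a SHAP implementation, neither of which is shown.

```python
def top_features_by_mean_abs_shap(shap_values, feature_names, k=10):
    """Rank features by mean absolute SHAP value across sampled enzymes.

    shap_values: one row per enzyme, one column per feature.
    Returns the k highest-ranked (feature, mean |SHAP|) pairs.
    """
    n_rows = len(shap_values)
    means = [
        sum(abs(row[j]) for row in shap_values) / n_rows
        for j in range(len(feature_names))
    ]
    ranked = sorted(zip(feature_names, means), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Toy values for three substitutions and two enzymes (illustrative only).
names = ["E1219Y", "R1335Q", "T1337R"]
toy_shap = [
    [0.0, 0.5, -0.1],
    [0.1, -0.3, 0.2],
]
ranking = top_features_by_mean_abs_shap(toy_shap, names, k=2)
```

Taking the absolute value before averaging means a substitution that strongly raises the predicted rate constant for some enzymes and strongly lowers it for others still ranks as influential.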

Extended Data Fig. 5 Homology models of PAMmla predicted PAM-altering mutations.

a, An E1219Y substitution may facilitate interaction with the amino group of bases in the 3rd position of the PAM. b, R1335Q permits major groove readout of both bases of a C-G pair in the 3rd position of the PAM. c, E1219C, R1335M, and T1337V substitutions form a hydrophobic pocket to promote van der Waals interactions with the methyl group of thymine in the 3rd position of the PAM. Representation of the protein surface is colored by lipophilicity potential. d, T1337R results in direct major groove readout of guanine in the 4th position of the PAM. e, T1337K facilitates major groove readout of the oxygen group of bases in the 4th position of the PAM. f, R1335L and T1337C substitutions form a hydrophobic pocket to promote recognition of thymine in the 4th position of the PAM. Protein surface is colored by lipophilicity potential. g, D1135L disrupts coordination with R1114, enabling improved flexibility of the R1114 side chain to contact the NTS backbone. WT SpCas9 is overlaid in grey. h, Substitution of G1218 to a positive residue establishes additional non-specific contacts with the NTS backbone. i, S1136W and D1135L result in a shift of the NTS and TS backbone towards the PAM-interacting domain, enabling novel base-specific interactions in nearby regions. WT SpCas9 is overlaid in grey. For panels a–i, amino acid and PAM DNA base substitutions were modeled on the structure of SpG (PDB: 8U3Y)6 using Coot95, except for substitutions T1337R, T1337K, and T1337C, which were modeled using SpCas9-VRER (PDB: 5FW3)50. Homology models were visualized using ChimeraX97.

Extended Data Fig. 6 Genome editing in human cells with PAMmla-predicted enzymes.

(a) PAMmla-predicted ks for NGNN PAMs for enzymes targeting seven PAM categories. Hamming distances to the most similar enzyme in the training set are indicated in parentheses for each enzyme. (b) Nuclease-mediated genome editing efficiencies for each of the enzymes in panel a at endogenous target sites in HEK 293T cells harboring the PAMs they are predicted to target by PAMmla. Editing efficiencies were assessed by targeted amplicon sequencing and analyzed using CRISPResso2; data points are the mean of n = 3 biological replicates for enzymes from the training set (Hamming distance = 0, shown with blue dots), enzymes predicted by PAMmla (shown in pink), SpG (gray), and wild-type (WT) SpCas9 (white); 3 to 10 genomic target sites were selected for characterization, where the black line represents median editing across all target sites for that enzyme; results at individual loci are shown in Supplementary Fig. 12a–g. (c,d) Base editing efficiencies for one PAMmla enzyme compared to SpG and SpRY, in the context of ABE8e and TadCBEd architectures (panels c and d, respectively). Base editing efficiencies were assessed by targeted amplicon sequencing for each enzyme at 3 endogenous target sites in HEK 293T cells; all edits at bases where any enzyme was observed to edit at >5% efficiency are shown; box minima, center, and maxima represent the 25th, 50th, and 75th percentiles of the data, respectively; whiskers represent the range of the data. A-to-G and C-to-T base editing results at individual loci are shown in Supplementary Figs. 13a–g and 14a–g, respectively.

Extended Data Fig. 7 Genome-wide off-target analysis of PAMmla predicted enzymes.

a, Quantification of GUIDE-seq-2 double-stranded oligodeoxynucleotide (dsODN) tag integration at the on-target site, in nuclease-based experiments with SpG, SpRY, and PAMmla predicted enzymes targeting endogenous target sites in HEK 293T cells. SpCas9 variant enzymes are named based on their amino acids at each of the six positions in the SpCas9(6AA) library. dsODN integration efficiency was assessed by targeted amplicon sequencing and modified reads were analyzed using CRISPResso2; mean, standard deviation, and individual datapoints are shown for n = 3 technical replicates. b, Venn diagram representations of the GUIDE-seq-2 detected off-target sites that are shared between or unique to PAMmla generated, SpG, and SpRY nucleases. c, Nucleotide composition of PAMs adjacent to off-target spacers detected in GUIDE-seq-2 experiments, not including the on-target reads. The y-axis represents the fraction of total off-target GUIDE-seq-2 reads containing each nucleotide at each position of the PAM. d, Quantification of GUIDE-seq-2 double-stranded oligodeoxynucleotide (dsODN) tag integration at the on-target site, in nuclease-based experiments with KWRQLC and SpG when using the CYBB T362I sgRNA but targeting the wild-type genome of HEK 293T cells. SpCas9 variant enzymes are named based on their amino acids at each of the six positions in the SpCas9(6AA) library. dsODN integration efficiency was assessed by targeted amplicon sequencing and modified reads were analyzed using CRISPResso2; mean, standard deviation, and individual datapoints are shown for n = 3 technical replicates. e, GUIDE-seq-2 genome-wide specificity outputs for KWRQLC and SpG nucleases using the CYBB T362I targeted sgRNA; note that HEK 293T cells harbor the wild-type copy of the CYBB gene and are therefore an imperfect match to the sgRNA. 
Mismatched positions in the spacers of the off-target sites are highlighted in color; GUIDE-seq read counts from consolidated unique molecular events for each variant are shown to the right of the sequence plots.
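
The y-axis quantity in panel c (the fraction of total off-target reads containing each nucleotide at each PAM position) can be computed as in the following Python sketch. The input layout, a list of (PAM, read count) pairs with on-target reads already excluded, and the example counts are assumptions for illustration rather than the GUIDE-seq-2 output format.

```python
from collections import Counter


def pam_composition(off_targets):
    """Read-weighted nucleotide fractions at each PAM position.

    off_targets: list of (pam_sequence, read_count) pairs, all PAMs the
    same length, on-target site excluded.
    Returns one dict per PAM position mapping nucleotide -> fraction.
    """
    total_reads = sum(reads for _, reads in off_targets)
    if total_reads == 0:
        raise ValueError("no off-target reads to summarize")
    pam_length = len(off_targets[0][0])
    composition = []
    for position in range(pam_length):
        counts = Counter()
        for pam, reads in off_targets:
            counts[pam[position]] += reads
        composition.append({nt: counts[nt] / total_reads for nt in "ACGT"})
    return composition


# Made-up example: two off-target sites with 30 and 10 reads.
fractions = pam_composition([("AGTG", 30), ("CGGG", 10)])
```

Weighting by read count rather than by site count means a single heavily cut off-target site dominates the profile, which matches the legend's wording of a fraction of total reads.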

Extended Data Fig. 8 Design and validation of in silico directed evolution.

a, Schematic of the in silico directed evolution (ISDE) pipeline to rapidly identify bespoke SpCas9 enzymes with user-specifiable PAM profiles. b–d, Effect of ISDE parameter values on the identification of optimized PAMmla-predicted enzymes, including varying the number of starting mutations per round (m) (panel b), the number of random variants generated per round (panel c), and the number of additional evolution rounds performed once a plateau is reached before decreasing m (panel d). Proof-of-concept PAMmla-ISDE runs were performed to identify enzymes with maximal activity against NGAT, NGCC, or NGTA PAMs. Aside from the parameter being tested, ISDE was run with default parameters of 1,000 random starting sequences, m = 4 starting mutations per enzyme, s = 1,000 sampled enzymes per round, n = 10 top variants to keep per round, and p = 1 additional round of evolution after a plateau is reached. The number of true top 10 predicted enzymes (determined by exhaustive sorting of PAMmla predictions) recovered by ISDE is shown. Top bar graphs represent the number of replicates in which the most optimal enzyme was recovered.
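
A minimal Python sketch of an evolution loop using the default parameters listed in this legend is given below. It is an interpretation, not the published ISDE implementation: the legend does not spell out the loop, so the plateau rule (no improvement in the best score), the stopping rule (m reaches zero), and the tie-breaking are assumptions. `toy_score` is a hypothetical stand-in for a PAMmla prediction, and the six-residue sequence length follows the SpCas9(6AA) library naming used elsewhere in these legends.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 residues per position


def mutate(seq, m, rng):
    """Redraw m randomly chosen positions (a redraw may repeat the old residue)."""
    chars = list(seq)
    for pos in rng.sample(range(len(seq)), m):
        chars[pos] = rng.choice(AMINO_ACIDS)
    return "".join(chars)


def top_n(candidates, score, n):
    """Best n distinct sequences; ties broken alphabetically for reproducibility."""
    return sorted(set(candidates), key=lambda s: (-score(s), s))[:n]


def isde(score, length=6, n_start=1000, m=4, s=1000, n=10, p=1, seed=0):
    """Assumed loop: sample s mutants of the current top n each round, and
    lower m after a plateau persists for p additional rounds."""
    rng = random.Random(seed)
    start = ["".join(rng.choice(AMINO_ACIDS) for _ in range(length))
             for _ in range(n_start)]
    top = top_n(start, score, n)
    best, stalled = score(top[0]), 0
    while m >= 1:
        children = [mutate(rng.choice(top), m, rng) for _ in range(s)]
        top = top_n(top + children, score, n)  # parents compete with children
        if score(top[0]) > best:
            best, stalled = score(top[0]), 0
        else:
            stalled += 1
            if stalled > p:  # plateau round plus p additional rounds
                m, stalled = m - 1, 0
    return top


def toy_score(seq, target="MRRWMR"):
    """Hypothetical objective: residues matching an arbitrary target."""
    return sum(a == b for a, b in zip(seq, target))


winners = isde(toy_score)
```

Because the current top variants are carried into each round, the best score never decreases; reducing m as the search stalls shifts it from broad exploration towards single-substitution refinement.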

Extended Data Fig. 9 Characterization of PAMmla-ISDE generated enzymes in human cells.

a, Nuclease-mediated genome editing at endogenous target sites in HEK 293T cells harboring different PAMs for wild-type (WT) SpCas9, SpG, and MRRWMR. b, Nuclease-mediated genome editing of the wild-type RHO or mutant RHO P23H alleles in a heterozygous RHO P23H HEK 293T cell line using wild-type SpCas9, SpG, and various PAMmla-generated enzymes. For reads containing indels that span the P23H mutation (and therefore could not be identified as WT or mutant), counts were distributed between WT and mutant alleles in the same ratio as the WT:mutant ratio observed for the identifiable edited reads. c, Nuclease-mediated genome editing of the RHO target site in wild-type HEK 293T cells using wild-type SpCas9, SpG, and various PAMmla-generated enzymes. d, Sequencing reads that could not be assigned to either the P23H or WT allele owing to deletions spanning the mutation, for the heterozygous P23H HEK 293T cell data shown in Fig. 5f; edited reads were distributed based on the balance in identifiable reads. e, Ratio of editing efficiencies observed on mutant (P23H) versus WT RHO alleles, for each editor tested in Fig. 5f. Editing efficiencies in panels a–c,e were assessed by targeted amplicon sequencing and modified reads were analyzed using CRISPResso2; mean, standard deviation, and individual datapoints are shown for n = 3 independent biological replicates.
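
The proportional allocation of unassignable reads described in panels b and d amounts to simple arithmetic, sketched here in Python. The function name and the read counts are hypothetical; only the rule (split ambiguous edited reads in the WT:mutant ratio of the identifiable edited reads) comes from the legend.

```python
def distribute_ambiguous(wt_edited, mut_edited, ambiguous):
    """Split ambiguous edited reads between alleles in the same ratio
    as the identifiable edited reads.

    Returns (wt_total, mut_total) after allocation.
    """
    identifiable = wt_edited + mut_edited
    if identifiable == 0:
        raise ValueError("no identifiable edited reads to define a ratio")
    wt_share = wt_edited / identifiable
    wt_total = wt_edited + ambiguous * wt_share
    mut_total = mut_edited + ambiguous * (1 - wt_share)
    return wt_total, mut_total


# Made-up counts: 30 WT and 90 mutant identifiable edited reads, 40 ambiguous.
wt_total, mut_total = distribute_ambiguous(30, 90, 40)  # (40.0, 120.0)
```

Because the ambiguous reads are split in the observed ratio, the allocation leaves the mutant:WT ratio of edited reads unchanged (3:1 in this example) and only scales the absolute counts.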

Extended Data Fig. 10 Specificity assessment of PAMmla-derived enzymes.

a, Quantification of GUIDE-seq-2 double-stranded oligodeoxynucleotide (dsODN) tag integration at on-target sites in nuclease-based experiments with MRRWMR, SpG, and SpRY and sgRNAs targeting two different endogenous sites in HEK 293T cells. SpCas9 variant enzymes are named based on their amino acids at each of the six positions in the SpCas9(6AA) library. dsODN integration efficiency was assessed by targeted amplicon sequencing and modified reads were analyzed using CRISPResso2; mean, standard deviation, and individual datapoints are shown for n = 3 technical replicates. b, Venn diagram representations of the GUIDE-seq-2 detected off-target sites that are shared between or unique to MRRWMR, SpG, and SpRY nucleases using the two sgRNAs targeted to sites with NGTG PAMs (similar to the RHO P23H on-target site). c, Fraction of GUIDE-seq-2 reads attributed to on- and off-target sites for MRRWMR, SpG, and SpRY from experiments using the NGTG-2 or NGTG-3 sgRNAs. d, Quantification of GUIDE-seq-2 dsODN tag integration at the on-target site for experiments in the homozygous RHO P23H cell line, when using the RHO P23H sgRNA and SpCas9-MRRWMR and -KRHWMR, SpG, and SpRY expression plasmids. dsODN integration efficiency was assessed by targeted amplicon sequencing and modified reads were analyzed using CRISPResso2; mean, standard deviation, and individual datapoints are shown for n = 3 technical replicates. e, GUIDE-seq-2 genome-wide specificity outputs for SpCas9-MRRWMR and -KRHWMR, SpG, and SpRY nucleases using the RHO P23H targeted sgRNA in homozygous RHO P23H HEK 293T cells. Mismatched positions in the spacers of the off-target sites are highlighted in color; GUIDE-seq read counts from consolidated unique molecular events for each variant are shown to the right of the sequence plots. f, Venn diagram representation of the GUIDE-seq-2 detected off-target sites that are shared between or unique to SpCas9-MRRWMR and -KRHWMR, SpG, and SpRY nucleases using the RHO P23H sgRNA. 
g, Sequencing reads that could not be attributed to either the WT or P23H allele owing to deletions spanning the base harboring the mutation, for data from heterozygous RHO P23H mice shown in Fig. 5l. h, Ratio of in vivo editing efficiencies observed on mutant (P23H) versus WT RHO alleles, for each SpCas9 nuclease tested in Fig. 5l.

Extended Data Fig. 11 Analysis of factors contributing to MRRWMR and KRHWMR PAM preferences.

a, Structural prediction of an alternative conformation of the S1136R mutation leading to additional hydrogen bonding with T at position 3 of the PAM. b–e, SHapley Additive exPlanations48 (SHAP) values for PAMmla predictions for MRRWMR (panels b,c) and KRHWMR (panels d,e) interacting with NGTG (panels b,d) or NGGG (panels c,e) PAMs. Feature values are shown in gray (1: mutation is present, 0: mutation is absent). Red represents features with a positive impact on the predicted rate constant and blue represents features with a negative impact on the predicted rate constant.

Supplementary information

Supplementary Information

This file contains Supplementary Notes 1–13, Supplementary Figures 1–19, and Supplementary References.

Reporting Summary

Supplementary Tables

This file contains Supplementary Tables 1–7.

Peer Review File

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Silverstein, R.A., Kim, N., Kroell, AS. et al. Custom CRISPR–Cas9 PAM variants via scalable engineering and machine learning. Nature (2025). https://doi.org/10.1038/s41586-025-09021-y

  • DOI: https://doi.org/10.1038/s41586-025-09021-y
