Leveraging a phased pangenome for haplotype design of hybrid potato

Cheng, Lin; Wang, Nan; Bao, Zhigui; Zhou, Qian; Guarracino, Andrea; Yang, Yuting; Wang, Pei; Zhang, Zhiyang; Tang, Dié; Zhang, Pingxian; Wu, Yaoyao; Zhou, Yao; Zheng, Yi; Hu, Yong; Lian, Qun; Ma, Zhaoxu; Lassois, Ludivine; Zhang, Chunzhi; Lucas, William J.; Garrison, Erik; Stein, Nils; Städler, Thomas; Zhou, Yongfeng; Huang, Sanwen

doi:10.1038/s41586-024-08476-9

Download PDF

Article
Open access
Published: 22 January 2025

Leveraging a phased pangenome for haplotype design of hybrid potato

Nature volume 640, pages 408–417 (2025)Cite this article

27k Accesses
11 Citations
61 Altmetric
Metrics details

Subjects

Abstract

The tetraploid genome and clonal propagation of the cultivated potato (Solanum tuberosum L.)^1,2 dictate a slow, non-accumulative breeding mode of the most important tuber crop. Transitioning potato breeding to a seed-propagated hybrid system based on diploid inbred lines has the potential to greatly accelerate its improvement³. Crucially, the development of inbred lines is impeded by manifold deleterious variants; explaining their nature and finding ways to eliminate them is the current focus of hybrid potato research^{4,5,6,7,8,9,10}. However, most published diploid potato genomes are unphased, concealing crucial information on haplotype diversity and heterozygosity^11,12,13. Here we develop a phased potato pangenome graph of 60 haplotypes from cultivated diploids and the ancestral wild species, and find evidence for the prevalence of transposable elements in generating structural variants. Compared with the linear reference, the graph pangenome represents a broader diversity (3,076 Mb versus 742 Mb). Notably, we observe enhanced heterozygosity in cultivated diploids compared with wild ones (14.0% versus 9.5%), indicating extensive hybridization during potato domestication. Using conservative criteria, we identify 19,625 putatively deleterious structural variants (dSVs) and reveal a biased accumulation of deleterious single nucleotide polymorphisms (dSNPs) around dSVs in coupling phase. Based on the graph pangenome, we computationally design ideal potato haplotypes with minimal dSNPs and dSVs. These advances provide critical insights into the genomic basis of clonal propagation and will guide breeders to develop a suite of promising inbred lines.

Genome evolution and diversity of wild and cultivated potatoes

Article Open access 08 June 2022

Haplotype-resolved genome analyses of a heterozygous diploid potato

Article Open access 28 September 2020

Cultivar-specific transcriptome and pan-transcriptome reconstruction of tetraploid potato

Article Open access 24 July 2020

Main

Potato (S. tuberosum L.) is the most important tuber crop, feeding around 1.3 billion people annually in more than 120 countries¹. Recent studies have decoded the genomes of monoploid, diploid and tetraploid potatoes, resolved self-incompatibility and dissected the genetic basis of inbreeding depression^{4,5,6,7,8,9,10,11,12,13,14,15}. These efforts have led to the development of a first generation of highly homozygous inbred lines and subsequently, the first uniform hybrid as a proof-of-concept³, demonstrating the potential of genome design for revolutionizing potato breeding towards a fast, iterative mode.

A central focus of hybrid potato breeding programmes is development of a better understanding of deleterious variants^{4,5,6,7,8,9,10}, which compromise growth and overall fitness⁹. dSNPs have been widely studied in many species such as humans, dogs and rice^16,17,18. Structural variants (SVs), however, are expected to affect more genomic regions and are more likely to have stronger effects on fitness than SNPs; nonetheless, dSVs have rarely been studied. Phenotype-based selection is neither efficient nor accurate in purging deleterious variants closely linked in repulsion phase in diploid potatoes⁵. Haplotype-based selection based on in-depth understanding of deleterious variants and knowledge-guided recombination strategies are essential for integrating superior genomic fragments from different donors, paving the way for the eventual development of ideal potato haplotypes (IPHs).

Thus far, our knowledge of haplotype diversity in potatoes is limited. Only a few genomes have been published and most diploid potatoes are assembled into a single chimeric haploid genome, with only a few examples of haplotype-resolved assemblies^11,12,13. Collapsed assemblies of two haplotypes from diploid genomes lose valuable information on haplotype diversity and the phase state, introducing multiple types of errors, including switch errors, false duplications and nucleotide consensus errors^19,20. Haplotype-resolved assemblies provide a more accurate representation of genetic diversity and promise to become essential for future potato breeding^5,7,14. Concurrently, there is a need to shift from single reference genomes to phased pangenome references that better capture diversity across multiple populations²¹.

In this study, we carried out de novo assembly for 60 haplotypes from 31 diploid potatoes, including 20 haplotypes from 10 wild accessions, 38 haplotypes from 19 domesticated diploid accessions and two haplotypes from two inbred lines. From these 60 potato haplotypes, we constructed a potato pangenome graph, which enabled us to uncover extensive non-reference sequences and document multiallelic SVs. We further reveal the origin and fate of SVs in evolution and domestication and design IPHs for future breeding. Our study provides in-depth understanding of the haplotype diversity of a clonal crop, establishing the theoretical foundation for the reinvention of potato.

The potato pangenome graph

To capture and characterize haplotype diversity in potato populations, we selected 31 accessions based on previous studies of genetic diversity and principal component analysis (PCA) of 193 accessions^3,5,9,13 (Fig. 1a and Supplementary Figs. 1 and 2). The selected accessions comprise 19 landraces covering three indigenous diploid cultivated groups (S. tuberosum Group Phureja (PHU), S. tuberosum Group Stenotomum (STN) and S. tuberosum Group Goniocalyx (GON)), two inbred lines (A6-26 and E4-63, the 5th-generation selfed offspring of diploid PG6359 and E86-69, respectively)³ and 10 wild accessions from Solanum candolleanum (CND), which is believed to be the progenitor of domesticated potatoes²² (Supplementary Table 1). We generated an average of 28.15 Gb (approximately 38-fold) PacBio high-fidelity (HiFi) reads and 110.43 Gb (approximately 149-fold) high‐throughput chromosome conformation capture technique (Hi-C) reads (Supplementary Tables 2 and 3). Genome surveys using the HiFi reads revealed an average of around 1.55% genomic heterozygosity across cultivated diploid potato genomes (excluding inbred lines; Supplementary Table 4). Our initial de novo assembled haplotypes have an average genome size of 811 Mb, with contig N50 of 12.25 Mb (Fig. 1b and Supplementary Table 4). Paired haplotypes of each heterozygous diploid were further scaffolded on the basis of Hi-C contact maps (Extended Data Fig. 1a and Supplementary Fig. 3), resulting in pseudo-chromosome assemblies with an average anchor rate of 95.17% (Supplementary Table 5). Assessment using BUSCO²³ demonstrated a high level of completeness with an average of 99.2% (Supplementary Table 5).

**Fig. 1: Phased potato pangenome reference (PPG-v.1.0).**

Three additional independent analyses confirm the quality of our assemblies: (1) the low switch error rates (0.21% on average) and Hamming errors (1.05% on average) indicate the high correctness of our assemblies (Fig. 1c); (2) the high read mapping rate (mean 99.99%) and consistent k-mer spectrum (Extended Data Fig. 1b and Supplementary Fig. 4) confirm the completeness of the assemblies; and (3) to further estimate the reliability of our assemblies, we also used Flagger²⁴ to estimate the potentially erroneous regions. This confirmed that 98.9% of our assemblies are not collapsed or duplicated (Fig. 1d and Supplementary Table 6). We annotated an average of 449.0 Mb (59.69%) as being composed of transposable elements (TEs) (Supplementary Table 7) and predict 36,421–40,781 gene structure models for each haplotype (Supplementary Table 8).

To better represent non-reference sequences in potato genomes, we built two pangenome graphs for 60 haplotype assemblies plus the reference DMv6.1²⁵ using the PanGenome Graph Builder (PGGB)²⁶ and Minigraph-Cactus²⁷, respectively. We found that the Minigraph-Cactus pipeline removed highly diverged and large inverted regions, resulting in discarding of an average of 14.5% of sequence per chromosome (Supplementary Fig. 5). Therefore, we utilized PGGB to build a potato pangenome graph, which reduces reference bias²⁸ and is more sensitive for variation calling than methods based on mapping to a linear reference genome (Supplementary Fig. 6 and Supplementary Tables 9 and 10). This unbiased potato pangenome graph (PPG-v.1.0) comprises 248.64 million nodes and 345.61 million edges with total sequence length of 3,076 Mb. Wild potato accessions contribute a larger portion to the pangenome graph (2,286 Mb) than cultivated potato assemblies (1,807 Mb), indicating a higher pangenome graph growth rate in wild potatoes (Supplementary Table 11). We determined that PPG-v.1.0 consists of 365 Mb conserved sequences (present in at least 90% of the haplotypes) and 2,711 Mb variable sequences (Fig. 1e). Using the DMv6.1 reference as a baseline, PPG-v.1.0 also reveals prevalent rearrangements (Extended Data Fig. 2). Furthermore, we annotated non-reference nodes in each haplotype path: 52.6% of them overlap with TEs, 22.7% overlap with satellite repeats, and 23.8% are classified as non-repetitive sequences (Fig. 1f). Notably, 13.3% of repetitive sequences exhibit characteristics of both transposons and satellites (Supplementary Table 12) potentially originating from atypical centromere or subtelomere repeats²⁹ in the potato genome.

Finally, we deconstructed genome variation from the graph structure for small variants (SNPs and insertion–deletion mutations (indels) smaller than 50 bp in size) and structural variants (SVs ≥ 50 bp) based on DMv6.1 coordinates. In total, the constructed potato haplotype variation map includes 46,012,502 SNPs, 3,587,850 indels and 133,264 SVs. Notably, 16.9% of indels and 87.0% of SVs are multiallelic, with three or more distinct alleles. The graph generated by the Minigraph-Cactus pipeline shows a similar trend (Extended Data Fig. 1c,d and Supplementary Table 13). Collectively, the PPG-v.1.0 generated from phased diploid assemblies provides a fundamental resource for in-depth exploration of haplotype diversity and its consequences for potato breeding.

TEs drive SV formation

To reveal the mechanistic origins and dynamics of SVs in potato genomes, we analysed the pangenome SV sequences and their residual flanking sequences (±100 bp), which we annotated as one of three types: TEs, tandem repeats (TRs) and segmental duplications. We found that TEs account for 90.6% of repetitive elements associated with SVs (Fig. 2a and Supplementary Table 14). Additionally, 28.2% of SVs are covered by a single TE sequence (Supplementary Fig. 7), suggesting that these features are likely to reflect TE insertion events^30,31. Recent reports suggest that TEs could serve as substrates for ectopic DNA repair leading to SV formation^32,33. Therefore, we examined SV breakpoints and found that 33.8% could be classified as being compatible with ectopic recombination events from TE-mediated rearrangement (TEMR), characterized by both breakpoints (100 bp) having the same TE class (Supplementary Table 15, Supplementary Fig. 8 and Methods). This observed TEMR rate is significantly higher than the expected rate of 7.7% (P < 2.2 × 10^–16, two-tailed Student’s t-test), supporting a significant contribution of homologous TEs in mediating SV formation. TEMR events were further classified on the basis of TE families, revealing that Gypsy long terminal repeat retrotransposon (LTR/Gypsy)-mediated TEMR (42.4%) is seven times more prevalent than that involving Copia long terminal repeat retrotransposons (LTR/Copia (7.0%)) (Fig. 2b). Moreover, SVs mediated by LTR/Gypsy TEMR are significantly longer than those mediated by LTR/Copia TEMR (average of 7,005 bp versus 1,899 bp; P < 2.2 × 10^–16, two-sided Student’s t-test), and exhibit more recent and frequent activity (Fig. 2c and Extended Data Fig. 3).

**Fig. 2: TE activity drives SVs in potato haplotypes.**

To investigate the potential recent TE activity in potato genomes, we also compared two homozygous inbred lines with their heterozygous founders. Notably, we identified an approximately 4.0 Mb (chromosome (chr.) 6: 41.7–45.7 Mb) long terminal repeat retrotransposon (LTR)-mediated de novo paracentric inversion (Fig. 2d,e and Supplementary Fig. 9) in the fifth-generation inbred line E4-63, but not in the two founder haplotypes of the parental E86-69. Therefore, it is advisable to approach this de novo inversion with caution when implementing recurrent selection in hybrid potato breeding^34,35.

Domestication boosts heterozygosity

Previous studies suggested that there was increased heterozygosity in cultivated potatoes compared to wild ones²², consistent with heterozygosity evaluated in our study on the basis of k-mer statistics using HiFi reads. To better describe heterozygosity and the loss of collinearity between the two haplotypes within the same diploid individual, we introduce the concept of genome heterozygosity by sequence length (GHSL). This approach complements existing k-mer-based methods by quantifying the total sequence length of non-redundant variants (SNPs, indels and SVs) between two haplotypes in phased diploid assemblies (Supplementary Fig. 10). We found the GHSL to be approximately 93.8 Mb in our phased assemblies, constituting 12.5% of the average haplotype length (Fig. 3a and Supplementary Table 16). This estimate is similar to estimates by read mapping in clonally propagated grapevine³⁶. Moreover, our analysis shows that inversions account for a large proportion of haplotype-specific SVs (59.4% on average) (Supplementary Table 16).

**Fig. 3: Potato domestication enhanced haplotype divergence.**

We found that haplotype divergence in cultivated potatoes, as quantified by GHSL, is 1.46 times greater than that observed across wild potatoes (approximately 105.2 Mb versus 72.1 Mb; Fig. 3b, P = 2.6 × 10⁻⁴, two-sided Student’s t-test). Notably, two haplotype-specific inversions are quite prevalent in cultivated potatoes (Extended Data Fig. 4). One inversion has previously been reported and is associated with a locus that regulates yellow tuber flesh¹³ (DMv6.1, chr. 3: 42.9–48.7 Mb). The other inversion (DMv6.1, chr. 10: 52.7–59.1 Mb) contains 661 genes, including StPIN7 (Soltu.DM.10G026500.1) and StPTB6 (Soltu.DM.10G026670.1), which may be associated with auxin transport³⁷ and tuberization³⁸, respectively (Extended Data Fig. 5).

To explore the extent of genome heterozygosity and haplotype divergence during potato domestication, we conducted PCA separately for each chromosome and found that the two haplotypes of each wild potato form ‘nearest neighbours’, whereas the two haplotypes of each cultivated potato exhibit a more dispersed pattern (Fig. 3c and Supplementary Fig. 11). We constructed haplotype trees for each chromosome pair based on single-copy genes, using three additional Solanum species as outgroups (Fig. 3d and Supplementary Fig. 12). Analogous to the PCA-based patterns, the two haplotypes of each cultivated potato typically reveal more pronounced genetic divergence than those of each wild potato. We conducted additional validation through pairwise Jaccard similarity analysis and alignment for split windows, both of which indicate that multiple hybridization events have affected the haplotype landscape in cultivated potatoes (Extended Data Fig. 6 and Supplementary Fig. 13). Molecular evidence in maize hybrids suggests that the length of nonsyntenic regions between parental lines is strongly correlated with levels of heterosis³⁹. Our findings of increased genome heterozygosity and haplotype divergence in cultivated potatoes are similar to those in grapevine, another clonal crop⁴⁰. The enhanced genome heterozygosity in cultivated potatoes compared with wild potatoes results in reduced homozygous deleterious burden (Extended Data Fig. 7a,b), indicating that potato domestication included an exploration of heterosis.

Fate of dSVs under breeding

To characterize the fitness effects of SVs in potato cultivars, we computed the unfolded site frequency spectrum on the basis of SVs and SNPs from domesticated potatoes (Fig. 3e). We found that SVs are over-represented for singletons and other minor-allele frequency classes, compared to both synonymous SNPs and even non-synonymous SNPs at minor frequencies, consistent with equivalent observations in grapevine genomes³⁶. Cultivated potatoes have around 2.7 times more heterozygous SVs (average of 20,613 per individual) than homozygous SVs (average of 7,561) (Extended Data Fig. 7a). Additionally, 39.4% of the heterozygous SVs are located either in the gene bodies or in putative promoter regions (Supplementary Table 17). These findings suggest that SVs are more likely to be strongly deleterious than SNPs and are typically present in heterozygous state with relatively low frequencies across diploid potato cultivars.

The presence of dSVs in heterozygous state could lead to strong inbreeding depression upon forced inbreeding, thus hindering the development of inbred lines, with important implications for hybrid potato breeding^3,9,41. Inferred purifying selection against SVs is stronger than against SNPs, implying that SV frequencies can be useful for an initial assessment of putative selection pressure^42,43. Recently, deep phylogenomic analyses of 92 species in the family Solanaceae assessed genome-wide evolutionary constraints (with highly constrained regions amounting to 31.7 Mb under the threshold genomic evolutionary rate profiling (GERP) score ≥ 2.0, including non-synonymous, synonymous and non-coding sites), forming the basis for identifying and quantifying dSNPs at genome-wide scales in potato¹⁰. Analogous to dSNP identification, we developed an unweighted approach to infer putative dSVs in the potato genome on the basis of the joint criteria of evolutionarily derived state, low allele frequency and being within gene-coding or evolutionarily constrained regions (Fig. 4a and Methods). In total, we characterized 19,625 dSVs, 50.4% of which are located within non-coding regions (Extended Data Fig. 7c,d and Supplementary Table 18). For example, we identified a 51.1-kb deleterious deletion occurring in only one haplotype of cultivar PG6090. This deletion leads to truncated proteins or gene loss and is associated with three genes (Supplementary Fig. 14).

**Fig. 4: Fate of dSVs in domestication and inbred line development.**

Excluding the inbred lines, we found that each diploid potato individual has an average of 843 dSVs, comprising 23.1 Mb of genomic sequence. Our analyses revealed that 73% of dSVs are in heterozygous state across wild potatoes, and this proportion increases to 97% in cultivated lines (Fig. 4b and Supplementary Fig. 15). These patterns indicate that dSVs tend to be sheltered in heterozygous state, potentially largely avoiding negative selection on them^36,44. Thus these heterozygous dSVs require attention from breeders when choosing suitable materials for breeding.

Purging dSVs is crucial for developing elite inbred lines in hybrid potato breeding. We established that the available inbred lines, A6-26 (88 dSVs) and E4-63 (303 dSVs), carry fewer dSVs compared to the average haplotypes from the two parents, PG6359 (135 dSVs) and E86-69 (351 dSVs), respectively (Fig. 4c and Supplementary Fig. 16). To investigate the process of purging dSVs at the haplotype level, we constructed a genome-wide recombination map based on inbred lines and their respective parental haplotypes (Extended Data Fig. 8). Our results track at least 16 recombination events in one of the inbred lines across 5 generations of self-fertilization. For example, the 2 haplotypes from accession PG6359 carry 17 and 11 dSVs on chromosome 2; recombination events decreased the number of dSVs to seven in the derivative inbred line A6-26. For chromosome 4 of PG6359 H2, which carries fewer dSVs than PG6359 H1 (14 versus 30), 85.2% of dSVs are retained in the inbred line A6-26 (chr. 4: 7.5–59.5 Mb; 14 dSVs) (Fig. 4d and Supplementary Table 19).

The ‘broken-window’ effect of dSVs

On the basis of the genome-wide map of deleterious variants, we found a significant positive correlation between the numbers of dSVs and dSNPs per haplotype (r = 0.78, P < 2.2 × 10^–16, Pearson correlation coefficient; Fig. 4e and Supplementary Tables 19 and 20). In diploid genomes, we define the coupling phase as the haplotype that contains a focal dSV, whereas the repulsion phase refers to the other haplotype. To assess the potential effect of dSVs on dSNPs, we compared the number of dSNPs surrounding dSVs in the coupling phase to the number of dSNPs in the syntenic region of the repulsion phase. We partitioned the genome into intervals on the basis of dSV distribution across both haplotypes and calculated statistical significance to identify affected genomic regions (Extended Data Fig. 9a and Methods). Our results suggest that dSNPs are significantly more frequently present in dSV coupling phase than in the dSV repulsion phase (Fig. 4f, Extended Data Fig. 9b and Supplementary Table 21). This accumulation signal of dSNPs tends to be more pronounced in close proximity to dSVs.

Furthermore, we investigated whether dSVs form clusters by calculating two types of distances: the distance between the focal dSV to the nearest dSV in coupling phase (same haplotype; dis-coupling), and its distance to the nearest dSV in repulsion phase, (dis-repulsion) (Extended Data Fig. 9c). Dis-repulsion is significantly larger than dis-coupling (Fig. 4g, 264.1 kb versus 202.1 kb, P < 2.2 × 10^–16, two-sided Student’s t-test), suggesting that dSVs tend to form clusters within the same haplotype (Extended Data Fig. 9d). We provide an illustrative example of this phenomenon using the diploid potato line E86-69 (Supplementary Fig. 17).

The presence of dSVs, enhancing the occurrence of dSVs and dSNPs in the same phase, is analogous to a sociological hypothesis called the broken-window effect⁴⁵. Therefore, we refer to this pattern as the ‘broken-window’ effect of dSVs. We validated this pattern by examining local assemblies, which cover the 10-kb regions flanking dSVs with an average coverage of 19 reads, as well as 6 long reads spanning the flanking regions. Theoretical considerations as well as empirical evidence suggest that SVs affect recombination rates of closely linked genomic regions^46,47,48. Those associations may result from insufficient purging around the dSV owing to reduced recombination, whereas the dSV repulsion phase tends to remain functional, thus preventing strong selection on deleterious recessive variants^9,49, leading to gradual differences in deleterious burden between the two haplotypes (Fig. 4h). The broken-window effect reveals that dSVs and dSNPs are not randomly distributed across potato genomes, as they tend to form clusters in coupling phase. Such deleterious clusters are important features of potato genomes, and they are easy to spot in our phased pangenome. We found that many deleterious variants were not efficiently purged during our previous development of two inbred lines³ (A6-26 and E4-63). Line A6-26 carries 88 dSVs and 61,017 dSNPs, whereas E4-63 has 303 dSVs and 71,200 dSNPs (Supplementary Tables 19 and 20). These deleterious variants in inbred lines substantially reduce their fitness³ and thus need to be purged as a foundation for future potato breeding.

Genome design of ideal haplotypes

Plant breeders adopted the concept of ideal plant architecture (IPA) to guide breeding of superior varieties by combining multiple desirable traits⁵⁰. Similar to the IPA, we propose the IPHs strategy to guide breeders to develop a suite of inbred lines that are as close as possible to the ideal genotype (see Supplementary Methods).

Based on PPG-v.1.0, the starting point of computationally designed IPHs-v.1.0 are two heterotic groups; nine cultivars from the STN and GON groups with the inbred line A6-26 (heterotic group A), and eight cultivars from the PHU group with the inbred line E4-63 (heterotic group E) (Extended Data Fig. 10a,b). In principle, recombination events can reduce genetic burden in simulated haplotype combinations (Extended Data Fig. 10c,d). On the basis of the recombination map of two inbred lines (Extended Data Fig. 8), we assumed no more than four recombination events per chromosome in our IPHs. The consideration of recombination breakpoints takes into account the recombination coldspots in certain regions, such as heterochromatic centromeric regions and inversions. We estimated the distribution of dSVs and dSNPs under various haplotype donors and recombination events to identify the optimal combinations of IPHs, IPHs-A and IPHs-E (Supplementary Fig. 18).

We found that ideally, all dSVs could be removed from IPHs-A and IPHs-E and the number of dSNPs could be reduced to 41,262 and 35,360, implying a reduction of 32.4% (A6-26) and 50.3% (E4-63) compared with the previous inbred lines (Fig. 5a,b and Supplementary Table 22). Haplotype distances predict that hypothetical F₁ hybrid genomes between IPHs-A and -E would carry only 10,830 homozygous dSNPs (Fig. 5c), implying a reduction of 54.5% compared with the previous F₁ hybrid potato³. By integrating the genetic linkage map⁵¹, we found that achieving a IPHs-A chromosome 1 would require at least five generations of recurrent selection from hybrid progenies, and that one of the designed breakpoints is located in a low-recombination region (near chr. 1: 44.3 Mb), requiring a population size of at least 1,321 individuals to obtain at least 1 copy of IPHs-A chromosome 1 (Supplementary Fig. 19). It is worth noting that relying solely on natural recombination is unlikely to be sufficient (Supplementary Fig. 20). New technologies such as targeted recombination, gene editing and synthetic biology to precisely eliminate deleterious variants may provide future pathways for approaching IPHs^52,53,54.

**Fig. 5: IPHs designed from the phased potato pangenome graph.**

Discussion

In this study, we developed a graph-based phased potato pangenome reference comprising 60 haplotype sequences, reducing reference bias and identifying more variants that remain unaccounted by reads-based methods^55,56. Our phased pangenome provides an example of a typical clonally propagated crop with a highly heterozygous genome. The integration of pedigree information and high-accuracy ultra-long reads can further improve phase accuracy in highly repetitive regions.

Consistent with observations in grapevine³⁶, SVs are over-represented among singletons and other minor-allele frequency classes, suggesting that SVs are under stronger purifying selection. SVs can be associated with agronomic traits^57,58,59, and by disrupting gene structure and cis-regulatory elements, dSVs can exert strong effects. We classified 14% of all SVs to be putative dSVs. However, inference of dSVs shows lower efficacy in certain cases owing to overlapping SVs and ambiguous ancestral state. Therefore, quantifying their fitness penalties will be crucial to guide breeding practices in the future.

Asexually reproducing crops tend to accumulate more heterozygous deleterious mutations^36,49. The broken-window effect of dSVs may be reinforced owing to very limited recombination under long-term clonal propagation. After one haplotype is hit by a deleterious mutation with major effects, it seems less likely that deleterious mutations accumulate in the dSV repulsion phase, otherwise the affected individual might not survive to reproduce. This inference underscores the importance of preserving at least one functional copy of genes to maintain genetic robustness^60,61. Eradicating dSVs is therefore a new focus in the development of inbred lines to generate diverse hybrid potato combinations, a principle that should also be relevant for other clonally propagated crops.

To develop inbred lines, it is more favourable to start from haplotypes of diploid landraces with a lower deleterious burden¹⁰. The relatively high deleterious burden per haplotype in a dihaploid generated from tetraploid varieties poses formidable obstacles for this step (Supplementary Fig. 21). Therefore, we plan to develop the first generation of inbred lines from diploid landraces concurrently with inducing dihaploids from elite tetraploid varieties. The IPHs strategy provides a blueprint for the combination of haplotypes to reduce the negative impact of deleterious variants. Although achieving IPHs at the chromosome level through crossing is possible, it is challenging to combine the ideal chromosomes into one genome without additional recombination (Supplementary Figs. 22 and 23). Currently, the primary value of IPHs lies in their ability to outline an optimal goal for haplotype recombination, allowing breeders to compare real haplotypes, including dSVs and dSNPs, against an idealized standard, thus facilitating refinement of breeding strategies.

Methods

Plant materials and sequencing

We selected 31 potato accessions based on phylogenetic trees generated by previous studies. Of these accessions, 10 are from S. candolleanum and 21 are from 3 groups of diploid potato cultivars. Among these potato cultivars, nine accessions represent S. tuberosum Group Phureja (including inbred line E4-63), eight accessions represent S. tuberosum Group Stenotomum (including inbred line A6-26), and four are from S. tuberosum Group Goniocalyx. Previous HiFi data are available for 21 accessions and previous Hi-C reads are available for five accessions (PRJNA754534). Additionally, we newly sequenced 10 accessions using HiFi and 25 accessions using Hi-C in this study. HiFi reads for 10 accessions were generated using the ccs program version 6.4.0 (https://github.com/PacificBiosciences/ccs) and subreads obtained from the Pacific Biosciences Sequel II platform, which were then converted to FASTQ format by SAMtools (v.1.17)⁶². In total, we generated 318.49 Gb of HiFi data, ranging from 27.72 to 35.32 Gb per sample. For Hi-C libraries, DNA was extracted from in vitro seedlings and digested with the restriction enzyme MboI using previously described Hi-C library preparation protocols^63,64. A total of 2.70 Tb of Hi-C data were generated based on the Illumina HiSeq platform. To facilitate genome annotation, we used RNA-sequencing (RNA-seq) data from six plant tissues (including roots, stems, leaves, stolon, tubers and flowers) from a previous study¹³.

De novo genome assembly of 60 haplotypes

For genome size and heterozygosity estimation, we used Jellyfish (v.2.3.0)⁶⁵ to obtain a frequency distribution of the k-mers and estimated the histograms by GenomeScope (v.2.0)⁶⁶. We then assembled haplotype-resolved assemblies with the HiFi reads and Hi-C reads using hifiasm (https://github.com/chhylp123/hifiasm) (v.0.16)⁶⁷ with default parameters. Subsequently, we aligned haplotypes using the Juicer pipeline and then generated Hi-C maps for the 3D-DNA pipeline (v.201008)⁶⁸ with parameter “-q 0”. The assembly from two haplotypes was scaffolded and ordered using the Hi-C data-based rough scaffold and compared with the reference (accession DM1-3 516 R44)²⁵. Two sets of pseudo-chromosomes were constructed for each accession, and this workflow was applied to all 29 diploid potato accessions, resulting in 58 haplotypes. For the two inbred lines E4-63 and A6-26, we generated haploid assemblies. The Hi-C format file was visualized using the Juicebox program (v.2.16.00)⁶⁹ and misassembled contigs were manually curated. Finally, each haplotype was integrated into the 3D-DNA pipeline based on the run-asm-pipeline-post-review.sh script. To ensure the accuracy of the assemblies, we filtered out contigs smaller than 50 kb and used chromosome-based haplotypes for subsequent analyses.

Evaluation of the genome assemblies

The contig N50 statistic was calculated using assembly-stats (https://github.com/sanger-pathogens/assembly-stats). BUSCO (v.5.4.4)²³ was used to evaluate assembly quality with the embryophyte_odb10 protein database. Both switch and Hamming errors were calculated using two variant call format (VCF) files produced by the pipeline calc_switchErr (https://github.com/tangerzhang/calc_switchErr), based on the ‘compare’ function of WhatsHap (v.1.1)⁷⁰. Genome completeness was estimated via the KAT (v.2.4.2)⁷¹ ‘comp’ command. Reliability of our assemblies was estimated by Flagger, a reads-mapping-based pipeline optimized for diploid assemblies²⁴. A detailed workflow and commands can be found in the Supplementary Information and GitHub (https://github.com/Chenglin20170390/Haplotype-diversity).

Annotation of repetitive elements

Different repeat elements are identified based on 60 haplotypes, including TEs, segmental duplications, TRs and satellite repeats. To identify TEs, we employed the Extensive de novo TE Annotator (EDTA) (v.2.1.0)⁷², including LTRs, DNA transposons with terminal inverted repeat sequences and helitron transposons. For the remaining transposon elements, RepeatModeler⁷³ was used to search for a second round. To search for large segmental duplications (>1 kb), we employed Asgart (v.2.4.0)⁷⁴ using a k-mer-based method with the parameter “-CRSV”. TRs were identified using Tandem Repeats Finder (v.4.09.1)⁷⁵ with default parameters. Satellite repeats were identified using Satellite Repeat Finder (SRF, commit e54ca8c; https://github.com/lh3/srf) with HiFi reads.

Prediction of protein-coding genes

A comprehensive strategy consisting of transcript evidence, ab initio prediction, and homology alignment was applied for gene prediction. First, we aligned RNA-seq reads to assembled haplotypes using HISAT2 (v.2.2.1)⁷⁶ with the “--dta” parameter and then assembled by StringTie (v.2.2.1)⁷⁷ with the “--rf” parameter. Subsequently, we used the BRAKER2 (v.2.1.5)⁷⁸ program to train the ab initio prediction model from AUGUSTUS (v.3.4.0)⁷⁹ (https://github.com/Gaius-Augustus/Augustus) and collected high-quality RNA-seq hints using the Hidden Markov Model (HMM) from GeneMark-ET (v.3.67)⁸⁰ with the parameter “--nocleanup --softmasking”. To improve gene structure prediction, we performed a homology search using a curated plant protein dataset downloaded from the UniProt Swiss-Prot database (https://www.uniprot.org/downloads). We merged these homologous proteins with previously published peptides from tomato⁸¹ and potato⁴ and eliminated potential redundancy using the CD-HIT-est (v.4.6.8)⁸² program with default parameters. The MAKER2 (v.3.01.03)⁸³ program was used to combine the homology search, expression evidence and ab initio prediction through two rounds. Finally, we used the Mikado (v.2.3.4)⁸⁴ program to identify the representative set of transcripts from transcript assemblies, before those were fed to the PASA pipeline (v.2.5.1)⁸⁵ to update gene structures.

To perform functional gene annotation, we utilized the InterProScan (v.5.34-73.0)⁸⁶ program, which identifies potential protein domains and Gene Ontology terms based on sequence signatures. We applied the following parameters to the program: “-cli -iprlookup -tsv -gotermd -appl Pfam”. In particular, we extracted protein ___domain information from Pfam by enabling the “-appl Pfam” parameter. For each of the genes, we assigned the functional description of the best hit.

Construction of the potato pangenome

The Minigraph-Cactus pipeline²⁷ and PGGB²⁶ were used to construct a pseudo-phased pangenome with all 61 haplotypes based on the whole-genome alignment (including the DMv6.1 reference genome). For the PGGB, we estimated the divergence of each chromosome with mash distances and confirmed chromosome community with wfmash⁸⁷ mapping. Then, we used “pggb -s 10000 -n 61 -p 90 -k 47 -P asm20 -O 0.001” to build each chromosome graph. We visualized the 1D layout of the graph and estimated presence and absence ratios to the DMv6.1 reference in 100-kb sliding windows using ODGI⁸⁸. The small variants and SVs were detected by vg deconstruct from snarls, and we only kept top-level and <1-Mb variants with vcfbub. For the Minigraph-Cactus pipeline, we assigned DMv6.1 as the guide for the paths, and progressively aligned the 60 haplotypes to it. We used the cactus-pangenome script with parameters “--gfa full --gbz full --vcfReference DMv6.1” to generate complete workflows and execute commands. The generated graph fragment assembly (GFA) format graph was used for edge, node and coverage statistics and subgraph generation from a BED input. The VCF output file comprises all graph variations based on the DMv6.1 reference, enabling the calculation of polymorphisms.

Pangenome size and growth

To fully capture the genome diversity of our potato populations, we used Panacus⁸⁹ to assess pangenome size and growth ratio, which estimates pangenome openness directly by applying the binomial formula. We calculated cumulative bases based on quorum (minimum fraction of haplotypes sharing a graph feature after haplotypes are sequentially added to the growth histogram), and proportion of conserved (≥90% of haplotypes) and variable sequences (<90% of haplotypes) in the pangenome graphs. These statistics were obtained using Panacus hist and growth scripts with parameters “-c bp -l 1,1,1 -q 0,0.1,0.5,0.9 -S”, and we selected different haplotype paths for group-specific statistics.

Phylogeny and synteny analysis

To build haplotype phylogenetic trees at the chromosome level, annotated protein sequences from the 60 newly assembled haplotypes and three published genomes (outgroups: S. wrightii¹⁰; S. etuberosum¹³ and S. lycopersicum⁹⁰) were aligned to produce all-versus-all alignments using Diamond (v.0.9.21)⁹¹. The gene families were inferred using the OrthoFinder (v.2.5.4)⁹² program, which utilizes the Markov cluster algorithm. Separately for each chromosome pair, all single-copy orthologous protein sequences were merged into a single FASTA file, which was then fed into IQ-TREE (v.2.0.6)⁹³ using the maximum-likelihood method. Each chromosome tree was visualized with the function ggtree (v.3.0.2)⁹⁴ in R (v.4.2.0). Synteny plots and pangenome annotations were generated by the R package Genespace (v.0.9)⁹⁵.

Identification of SNPs, indels and SVs

Assembly-based calling was used to detect SVs with a minimum length of 50 bp. Genome sequences were aligned to the DMv6.1 reference genome to produce alignment BAM files using Minimap2 (v.2.17)⁹⁶ with “-ax asm5”. SyRI (v.1.5)⁹⁷ was then utilized to call genome-wide variants and generate assemblies-based VCFs. We kept variants of Ins, Del, Inv and Dup from the output of SyRI as the individual SV dataset. The accuracy of the SV dataset was validated by randomly selecting and manually verifying 100 SVs longer than 100 kb using Hi-C maps. Finally, we merged the individual SVs based on 80% overlap using SURVIVOR with the parameters “0.8 1 1 -1 -1 -1” to produce the final population SVs file. For SNPs and indels, we merged the output of assembly-based variation calling from SyRI using BCFtools (v.1.13)⁹⁸.

Haplotype diversity and GHSL statistic

The length of the haplotype-specific (heterozygous) variants for each diploid accession was calculated by summing all variants present in only one haplotype. In detail, we aligned the two haplotypes of one accession to each other using Minimap2 (v 2.17), identified variants by SyRI (v.1.5), and then calculated the number of haplotype-specific SNPs and the length of haplotype-specific variants (indels and SVs). The GHSL approach considers reference bias and regions potentially affected by haplotype-specific variants (https://github.com/Chenglin20170390/Haplotype-diversity). PCA based on haplotypes (12 chromosomes) was performed using Plink (v.1.90)⁹⁹ on the VCF file with the parameter “--pca 5”; the results were visualized using the R package ggplot2¹⁰⁰. Pairwise Jaccard similarities between haplotypes were estimated using BEDTools (v.2.30.0) with the ‘jaccard’ command.

SV dynamics in potato domestication

To calculate the unfolded site frequency spectrum, we analysed 19 cultivated potato accessions (excluding the two inbred lines) from VCFs. The genome of S. lycopersicum (SL 5.0⁹⁰) was used as an outgroup to infer the ancestral states of genotypes. We distinguish different types of SNPs (synonymous and non-synonymous) and SVs (deletions, insertions, inversions and duplications) and counted the number of haplotypes carrying the derived variants. For each accession, we also calculated the incidences (numbers) of heterozygous and homozygous SVs and the sums of all SVs (twice the number of homozygous SVs plus the number of heterozygous SVs).

Identification of TEMRs

SV breakpoints with additional ±100-bp flanking sequences were used to identify repeat elements (TEs, segmental duplications and TRs) associated with SVs. To detect the potentially causal relationship between TE movement and SV formation, the documented genomic position of a focal SV and overlapping TEs determine the following categories: ‘No TE SV’ for SVs without any TE overlap, ‘Incomplete TE SVs’ for SVs overlapping with a single TE with <95% coverage, ‘Single TE SVs’ for SVs overlapping with a single TE with coverage ≥95%, and ‘Multi TE SVs’ for SVs overlapping with multiple TEs with coverage >95%. Since no single TE SV events were considered, insertions were excluded. To quantify the formation of SVs (deletions, inversions and duplications) by ectopic recombination via TEMRs, SV breakpoints (±100 bp) overlapping with the identical class of TEs were calculated. To assess the distribution of repeats in SV breakpoints, we randomly simulated SVs using the shuffle function of BEDTools with the number and length of SVs set according to the potato SV variation map. Two-sided Student’s t-tests were used to test the significance of the proportions between observed and simulated TE-related SVs. To infer the insertion times of TEMRs, we extracted insertion times from the pass.list of EDTA and kept intact TE sequences overlapping with the breakpoints of SVs. Sequence homology of TEMRs was calculated using 200-bp flanking sequences based on global pairwise sequence alignment from Needle (https://github.com/nanjiangshu/my_needle).

Haplotype recombination in inbred lines

The fifth-generation inbred lines A6-26 and E4-63 were generated from the diploid cultivars E86-69 and PG6359, respectively. To evaluate haplotype recombination events in each inbred line, we utilized two different methods: heterozygosity analysis and haplotype-specific k-mer analysis. Genome-wide heterozygous peaks were identified by aligning the sequences of the haplotypes. The different heterozygous peaks were compared between haplotypes from inbred lines and their founders. The regions inherited from the founders’ haplotypes in the alignment would be in homozygous state. These switch signals from homozygosity to heterozygosity were used to characterize putative recombination events. Haplotype-specific k-mers were calculated by the number of specific k-mers from the inbred line based on non-overlapping 500-kb regions and compared with parental haplotypes using the Meryl program (available at https://github.com/marbl/meryl).

Identification of dSVs and dSNPs

Putative dSVs were identified based on three criteria: (1) ancestral versus derived state of SVs. S. lycopersicum (SL 5.0)⁹⁰ was used as the outgroup for ancestral state inference, as phylogenetic analyses have placed this species in a clade relatively close to Solanum section Petota¹³. (2) SVs with frequencies below 0.05 in our 60 haplotypes were considered, as most deleterious variants occur at low population frequencies^16,42,43. (3) SVs in coding regions that may damage proteins’ function or SVs in the 92 Solanaceae evolutionarily constrained regions are considered to be deleterious¹⁰. We thus identified putative dSVs as being derived-state SVs with low-frequency (<0.05) that overlap with coding regions or evolutionarily constrained regions; we implemented this with the BEDTools ‘overlap’ command to infer putative dSVs. Putative dSNPs were identified by showing GERP values >2.75 and overlapping with evolutionarily constrained regions¹⁰.

Estimation of the broken-window effect

The Pearson correlation coefficient was used to examine the correlation between the number of dSNPs and dSVs per chromosome. Focal dSVs located in haplotype 1 (H1) indicates H1 was considered to be the dSV coupling phase and corresponding syntenic regions of H2 were considered to be the dSV repulsion phase. We calculated the distance of the focal dSV to the nearest dSV in coupling phase and to the nearest dSV in repulsion phase (Extended Data Fig. 9). To exclude the potential concurrent influence of dSVs on both coupling and repulsion phases, we segmented chromosomes into fragments based on the midpoint between adjacent dSVs (irrespective whether in coupling or in repulsion phase). Regions that may be affected by the broken-window effect were estimated based on 10-kb non-overlapping sliding windows. For each sliding window, we calculated the number of dSNPs flanking each dSV in the coupling and repulsion phases. To assess dSNP enrichment, for each sliding window we performed a two-tailed Student’s t-test for the numbers of associated dSNPs in coupling and repulsion phase, respectively (regions significantly affected by the broken-window effect of dSVs). The depth of reads near the dSV is calculated using the SAMtools (v.1.17) ‘depth’ command. Finally, we employed SafFire (https://github.com/mrvollger/SafFire) to visualize schematic representations of dSVs and dSNPs within haplotypes.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Data availability

Data supporting the findings of this work are available within the paper and its supplementary information files. Whole-genome sequencing data are accessible through NCBI under the BioProject accession number PRJNA1020967 and from the National Genomics Data Center (NGDC) Genome Sequence Archive (GSA) (https://ngdc.cncb.ac.cn/gsa/) with the BioProject accession number PRJCA020375. Genome sequences, gene annotations and variation maps were uploaded to the Potato Repositories Website (http://solomics.agis.org.cn/potato/) and Figshare (https://figshare.com/articles/dataset/IPHs/25846003 (ref. ¹⁰¹)). Publicly available sequencing data were downloaded from the NCBI with BioProject accession numbers PRJNA754534, PRJNA641265, PRJNA573826 and PRJNA766763.

Code availability

Scripts used in this Article are available at https://github.com/Chenglin20170390/Haplotype-diversity.

References

Stokstad, E. The new potato. Science 363, 574–577 (2019).
ADS CAS PubMed Google Scholar
Spooner, D. M., Ghislain, M., Simon, R., Jansky, S. H. & Gavrilenko, T. Systematics, diversity, genetics, and evolution of wild and cultivated potatoes. Bot. Rev. 80, 283–383 (2014).
Google Scholar
Zhang, C. et al. Genome design of hybrid potato. Cell 184, 3873–3883 (2021).
CAS PubMed Google Scholar
The Potato Genome Sequencing Consortium. Genome sequence and analysis of the tuber crop potato. Nature 475, 189–195 (2011).
Google Scholar
Zhou, Q. et al. Haplotype-resolved genome analyses of a heterozygous diploid potato. Nat. Genet. 52, 1018–1023 (2020).
CAS PubMed PubMed Central Google Scholar
Lian, Q. et al. Acquisition of deleterious mutations during potato polyploidization. J. Integr. Plant Biol. 61, 7–11 (2019).
CAS PubMed Google Scholar
Bao, Z. et al. Genome architecture and tetrasomic inheritance of autotetraploid potato. Mol. Plant 15, 1211–1226 (2022).
CAS PubMed Google Scholar
Jansky, S. H. et al. Reinventing potato as a diploid inbred line-based crop. Crop Sci. 56, 1412–1422 (2016).
CAS Google Scholar
Zhang, C. et al. The genetic basis of inbreeding depression in potato. Nat. Genet. 51, 374–378 (2019).
CAS PubMed Google Scholar
Wu, Y. et al. Phylogenomic discovery of deleterious mutations facilitates hybrid potato breeding. Cell 186, 2313–2328.e15 (2023).
CAS PubMed Google Scholar
van Lieshout, N. et al. Solyntus, the new highly contiguous reference genome for potato (Solanum tuberosum). G3 10, 3489–3495 (2020).
PubMed PubMed Central Google Scholar
Freire, R. et al. Chromosome-scale reference genome assembly of a diploid potato clone derived from an elite variety. G3 11, jkab330 (2021).
CAS PubMed PubMed Central Google Scholar
Tang, D. et al. Genome evolution and diversity of wild and cultivated potatoes. Nature 606, 535–541 (2022).
ADS CAS PubMed PubMed Central Google Scholar
Sun, H. et al. Chromosome-scale and haplotype-resolved genome assembly of a tetraploid potato cultivar. Nat. Genet. 54, 342–348 (2022).
CAS PubMed PubMed Central Google Scholar
Ye, M. et al. Generation of self-compatible diploid potato by knockout of S-RNase. Nat. Plants 4, 651–654 (2018).
CAS PubMed Google Scholar
Chun, S. & Fay, J. C. Identification of deleterious mutations within three human genomes. Genome Res. 19, 1553–1561 (2009).
CAS PubMed PubMed Central Google Scholar
Marsden, C. D. et al. Bottlenecks and selective sweeps during domestication have increased deleterious genetic variation in dogs. Proc. Natl Acad. Sci. USA 113, 152–157 (2016).
ADS CAS PubMed Google Scholar
Liu, Q., Zhou, Y., Morrell, P. L. & Gaut, B. S. Deleterious variants in Asian rice and the potential cost of domestication. Mol. Biol. Evol. 34, 908–924 (2017).
CAS PubMed Google Scholar
Zhang, X. et al. Haplotype-resolved genome assembly provides insights into evolutionary history of the tea plant Camellia sinensis. Nat. Genet. 53, 1250–1259 (2021).
CAS PubMed PubMed Central Google Scholar
Jarvis, E. D. et al. Semi-automated assembly of high-quality diploid human reference genomes. Nature 611, 519–531 (2022).
ADS CAS PubMed PubMed Central Google Scholar
Schreiber, M., Jayakodi, M., Stein, N. & Mascher, M. Plant pangenomes for crop improvement, biodiversity and evolution. Nat. Rev. Genet. 25, 563–577 (2024).
CAS PubMed PubMed Central Google Scholar
Hardigan, M. A. et al. Genome diversity of tuber-bearing Solanum uncovers complex evolutionary history and targets of domestication in the cultivated potato. Proc. Natl Acad. Sci. USA 114, E9999–E10008 (2017).
CAS PubMed PubMed Central Google Scholar
Simão, F. A., Waterhouse, R. M., Ioannidis, P., Kriventseva, E. V. & Zdobnov, E. M. BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs. Bioinformatics 31, 3210–3212 (2015).
PubMed Google Scholar
Liao, W.-W. et al. A draft human pangenome reference. Nature 617, 312–324 (2023).
ADS CAS PubMed PubMed Central Google Scholar
Pham, G. M. et al. Construction of a chromosome-scale long-read reference genome assembly for potato. GigaScience 9, giaa100 (2020).
PubMed PubMed Central Google Scholar
Garrison, E. et al. Building pangenome graphs. Nat. Methods 21, 2008–2012 (2024).
CAS PubMed Google Scholar
Hickey, G. et al. Pangenome graph construction from genome alignments with Minigraph-Cactus. Nat. Biotechnol. 42, 663–673 (2024).
CAS PubMed Google Scholar
Garrison, E. & Guarracino, A. Unbiased pangenome graphs. Bioinformatics 39, btac743 (2023).
CAS PubMed Google Scholar
Gong, Z. et al. Repeatless and repeat-based centromeres in potato: implications for centromere evolution. Plant Cell 24, 3559–3574 (2012).
CAS PubMed PubMed Central Google Scholar
Bozan, I. et al. Pangenome analyses reveal impact of transposable elements and ploidy on the evolution of potato species. Proc. Natl Acad. Sci. USA 120, e2211117120 (2023).
ADS CAS PubMed PubMed Central Google Scholar
Domínguez, M. et al. The impact of transposable elements on tomato diversity. Nat. Commun. 11, 4058 (2020).
ADS PubMed PubMed Central Google Scholar
Wicker, T. et al. A detailed look at 7 million years of genome evolution in a 439 kb contiguous sequence at the barley Hv‐eIF4E locus: recombination, rearrangements and repeats. Plant J. 41, 184–194 (2005).
CAS PubMed Google Scholar
Balachandran, P. et al. Transposable element-mediated rearrangements are prevalent in human genomes. Nat. Commun. 13, 7115 (2022).
ADS CAS PubMed PubMed Central Google Scholar
Fang, Z. et al. Megabase-scale inversion polymorphism in the wild ancestor of maize. Genetics 191, 883–894 (2012).
PubMed PubMed Central Google Scholar
Berdan, E. L., Blanckaert, A., Butlin, R. K. & Bank, C. Deleterious mutation accumulation and the long-term fate of chromosomal inversions. PLoS Genet. 17, e1009411 (2021).
CAS PubMed PubMed Central Google Scholar
Zhou, Y. et al. The population genetics of structural variants in grapevine domestication. Nat. Plants 5, 965–979 (2019).
PubMed Google Scholar
Roumeliotis, E., Kloosterman, B., Oortwijn, M., Visser, R. G. & Bachem, C. W. The PIN family of proteins in potato and their putative role in tuberization. Front. Plant Sci. 4, 524 (2013).
PubMed PubMed Central Google Scholar
Cho, S. K. et al. Polypyrimidine tract-binding proteins of potato mediate tuberization through an interaction with StBEL5 RNA. J. Exp. Bot. 66, 6835–6847 (2015).
CAS PubMed PubMed Central Google Scholar
Wang, B. et al. De novo genome assembly and analyses of 12 founder inbred lines provide insights into maize heterosis. Nat. Genet. 55, 312–323 (2023).
CAS PubMed Google Scholar
Xiao, H. et al. Adaptive and maladaptive introgression in grapevine domestication. Proc. Natl Acad. Sci. USA 120, e2222041120 (2023).
CAS PubMed PubMed Central Google Scholar
Bethke, P. C. et al. Diploid potatoes as a catalyst for change in the potato industry. Am. J. Potato Res. 99, 337–357 (2022).
Google Scholar
Mezmouk, S. & Ross-Ibarra, J. The pattern and distribution of deleterious mutations in maize. G3 4, 163–171 (2014).
PubMed Google Scholar
Li, Y. et al. Resequencing of 200 human exomes identifies an excess of low-frequency non-synonymous coding variants. Nat. Genet. 42, 969–972 (2010).
CAS PubMed Google Scholar
Wang, N. et al. Structural variation and parallel evolution of apomixis in citrus during domestication and diversification. Natl Sci. Rev. 9, nwac114 (2022).
CAS PubMed PubMed Central Google Scholar
Harcourt, B. E. Reflecting on the subject: a critique of the social influence conception of deterrence, the broken windows theory, and order-maintenance policing New York style. Mich. Law Rev. 97, 291 (1998).
Google Scholar
Morgan, A. P. et al. Structural variation shapes the landscape of recombination in mouse. Genetics 206, 603–619 (2017).
CAS PubMed PubMed Central Google Scholar
Porubsky, D. et al. Recurrent inversion polymorphisms in humans associate with genetic instability and genomic disorders. Cell 185, 1986–2005 (2022).
CAS PubMed PubMed Central Google Scholar
Rowan, B. A. et al. An ultra high-density Arabidopsis thaliana crossover map that refines the influences of structural variation and epigenetic features. Genetics 213, 771–787 (2019).
CAS PubMed PubMed Central Google Scholar
Ramu, P. et al. Cassava haplotype map highlights fixation of deleterious mutations during clonal propagation. Nat. Genet. 49, 959–963 (2017).
CAS PubMed Google Scholar
Jiao, Y. et al. Regulation of OsSPL14 by OsmiR156 defines ideal plant architecture in rice. Nat. Genet. 42, 541–544 (2010).
CAS PubMed Google Scholar
Jiang, X. et al. Genomic features of meiotic crossovers in diploid potato. Hortic. Res. 10, uhad079 (2023).
CAS PubMed PubMed Central Google Scholar
Hayes, B. J. et al. Potential approaches to create ultimate genotypes in crops and livestock. Nat. Genet. 56, 2310–2317 (2024).
CAS PubMed Google Scholar
Filler-Hayut, S., Kniazev, K., Melamed-Bessudo, C. & Levy, A. A. Targeted inter-homologs recombination in Arabidopsis euchromatin and heterochromatin. Int. J. Mol. Sci. 22, 12096 (2021).
CAS PubMed PubMed Central Google Scholar
Schmidt, C., Schindele, P. & Puchta, H. From gene editing to genome engineering: restructuring plant chromosomes via CRISPR/Cas. Abiotech 1, 21–31 (2020).
PubMed Google Scholar
Eizenga, J. M. et al. Pangenome graphs. Annu. Rev. Genomics Hum. Genet. 21, 139–162 (2020).
CAS PubMed PubMed Central Google Scholar
Rakocevic, G. et al. Fast and accurate genomic analyses using genome graphs. Nat. Genet. 51, 354–362 (2019).
CAS PubMed Google Scholar
Alonge, M. et al. Major impacts of widespread structural variation on gene expression and crop improvement in tomato. Cell 182, 145–161. e23 (2020).
CAS PubMed PubMed Central Google Scholar
Leffler, E. M. et al. Resistance to malaria through structural variation of red blood cell invasion receptors. Science 356, eaam6393 (2017).
PubMed PubMed Central Google Scholar
Yan, H. et al. Pangenomic analysis identifies structural variation associated with heat tolerance in pearl millet. Nat. Genet. 55, 507–518 (2023).
CAS PubMed PubMed Central Google Scholar
Conrad, D. F. et al. Origins and functional impact of copy number variation in the human genome. Nature 464, 704–712 (2010).
CAS PubMed Google Scholar
Xue, J. R. et al. The functional and evolutionary impacts of human-specific deletions in conserved elements. Science 380, eabn2253 (2023).
CAS PubMed PubMed Central Google Scholar
Li, H. et al. The sequence alignment/map format and SAMtools. Bioinformatics 25, 2078–2079 (2009).
PubMed PubMed Central Google Scholar
Burton, J. N. et al. Chromosome-scale scaffolding of de novo genome assemblies based on chromatin interactions. Nat. Biotechnol. 31, 1119–1125 (2013).
CAS PubMed PubMed Central Google Scholar
Kaplan, N. & Dekker, J. High-throughput genome scaffolding from in vivo DNA interaction frequency. Nat. Biotechnol. 31, 1143–1147 (2013).
CAS PubMed PubMed Central Google Scholar
Marçais, G. & Kingsford, C. A fast, lock-free approach for efficient parallel counting of occurrences of k-mers. Bioinformatics 27, 764–770 (2011).
PubMed PubMed Central Google Scholar
Ranallo-Benavidez, T. R., Jaron, K. S. & Schatz, M. C. GenomeScope 2.0 and Smudgeplot for reference-free profiling of polyploid genomes. Nat. Commun. 11, 1432 (2020).
ADS CAS PubMed PubMed Central Google Scholar
Cheng, H., Concepcion, G. T., Feng, X., Zhang, H. & Li, H. Haplotype-resolved de novo assembly using phased assembly graphs with hifiasm. Nat. Methods 18, 170–175 (2021).
CAS PubMed PubMed Central Google Scholar
Dudchenko, O. et al. De novo assembly of the Aedes aegypti genome using Hi-C yields chromosome-length scaffolds. Science 356, 92–95 (2017).
ADS CAS PubMed PubMed Central Google Scholar
Durand, N. C. et al. Juicebox provides a visualization system for Hi-C contact maps with unlimited zoom. Cell Syst. 3, 99–101 (2016).
CAS PubMed PubMed Central Google Scholar
Martin, M. et al. WhatsHap: fast and accurate read-based phasing. Preprint at BioRxiv https://doi.org/10.1101/085050 (2016).
Mapleson, D., Garcia Accinelli, G., Kettleborough, G., Wright, J. & Clavijo, B. J. KAT: a K-mer analysis toolkit to quality control NGS datasets and genome assemblies. Bioinformatics 33, 574–576 (2017).
CAS PubMed Google Scholar
Ou, S. et al. Benchmarking transposable element annotation methods for creation of a streamlined, comprehensive pipeline. Genome Biol. 20, 275 (2019).
CAS PubMed PubMed Central Google Scholar
Flynn, J. M. et al. RepeatModeler2 for automated genomic discovery of transposable element families. Proc. Natl Acad. Sci. USA 117, 9451–9457 (2020).
ADS CAS PubMed PubMed Central Google Scholar
Delehelle, F., Cussat-Blanc, S., Alliot, J.-M., Luga, H. & Balaresque, P. ASGART: fast and parallel genome scale segmental duplications mapping. Bioinformatics 34, 2708–2714 (2018).
CAS PubMed Google Scholar
Benson, G. Tandem repeats finder: a program to analyze DNA sequences. Nucleic Acids Res. 27, 573–580 (1999).
CAS PubMed PubMed Central Google Scholar
Kim, D., Langmead, B. & Salzberg, S. L. HISAT: a fast spliced aligner with low memory requirements. Nat. Methods 12, 357–360 (2015).
CAS PubMed PubMed Central Google Scholar
Pertea, M. et al. StringTie enables improved reconstruction of a transcriptome from RNA-seq reads. Nat. Biotechnol. 33, 290–295 (2015).
CAS PubMed PubMed Central Google Scholar
Brůna, T., Hoff, K. J., Lomsadze, A., Stanke, M. & Borodovsky, M. BRAKER2: automatic eukaryotic genome annotation with GeneMark-EP+ and AUGUSTUS supported by a protein database. NAR Genomics Bioinformatics 3, lqaa108 (2021).
PubMed PubMed Central Google Scholar
Stanke, M. et al. AUGUSTUS: ab initio prediction of alternative transcripts. Nucleic Acids Res. 34, W435–W439 (2006).
CAS PubMed PubMed Central Google Scholar
Lukashin, A. V. & Borodovsky, M. GeneMark. hmm: new solutions for gene finding. Nucleic Acids Res. 26, 1107–1115 (1998).
CAS PubMed PubMed Central Google Scholar
The Tomato Genome Consortium. The tomato genome sequence provides insights into fleshy fruit evolution. Nature 485, 635–641 (2012).
ADS Google Scholar
Li, W. & Godzik, A. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics 22, 1658–1659 (2006).
CAS PubMed Google Scholar
Holt, C. & Yandell, M. MAKER2: an annotation pipeline and genome-database management tool for second-generation genome projects. BMC Bioinformatics 12, 491 (2011).
PubMed PubMed Central Google Scholar
Venturini, L., Caim, S., Kaithakottil, G. G., Mapleson, D. L. & Swarbreck, D. Leveraging multiple transcriptome assembly methods for improved gene structure annotation. GigaScience 7, giy093 (2018).
PubMed PubMed Central Google Scholar
Haas, B. J. et al. Improving the Arabidopsis genome annotation using maximal transcript alignment assemblies. Nucleic Acids Res. 31, 5654–5666 (2003).
CAS PubMed PubMed Central Google Scholar
Jones, P. et al. InterProScan 5: genome-scale protein function classification. Bioinformatics 30, 1236–1240 (2014).
CAS PubMed PubMed Central Google Scholar
Marco-Sola, S., Moure, J. C., Moreto, M. & Espinosa, A. Fast gap-affine pairwise alignment using the wavefront algorithm. Bioinformatics 37, 456–463 (2021).
CAS PubMed Google Scholar
Guarracino, A., Heumos, S., Nahnsen, S., Prins, P. & Garrison, E. ODGI: understanding pangenome graphs. Bioinformatics 38, 3319–3326 (2022).
CAS PubMed PubMed Central Google Scholar
Parmigiani, L., Garrison, E., Stoye, J., Marschall, T. & Doerr, D. Panacus: fast and exact pangenome growth and core size estimation. Bioinformatics https://doi.org/10.1093/bioinformatics/btae720 (2024).
Zhou, Y. et al. Graph pangenome captures missing heritability and empowers tomato breeding. Nature 606, 527–534 (2022).
ADS CAS PubMed PubMed Central Google Scholar
Buchfink, B., Xie, C. & Huson, D. H. Fast and sensitive protein alignment using DIAMOND. Nat. Methods 12, 59–60 (2015).
CAS PubMed Google Scholar
Emms, D. M. & Kelly, S. OrthoFinder: phylogenetic orthology inference for comparative genomics. Genome Biol. 20, 238 (2019).
PubMed PubMed Central Google Scholar
Nguyen, L.-T., Schmidt, H. A., Von Haeseler, A. & Minh, B. Q. IQ-TREE: a fast and effective stochastic algorithm for estimating maximum-likelihood phylogenies. Mol. Biol. Evol. 32, 268–274 (2015).
CAS PubMed Google Scholar
Yu, G., Smith, D. K., Zhu, H., Guan, Y. & Lam, T. T. Y. ggtree: an R package for visualization and annotation of phylogenetic trees with their covariates and other associated data. Methods Ecol. Evol. 8, 28–36 (2017).
Google Scholar
Lovell, J. T. et al. GENESPACE tracks regions of interest and gene copy number variation across multiple genomes. eLife 11, e78526 (2022).
CAS PubMed PubMed Central Google Scholar
Li, H. New strategies to improve minimap2 alignment accuracy. Bioinformatics 37, 4572–4574 (2021).
CAS PubMed PubMed Central Google Scholar
Goel, M., Sun, H., Jiao, W.-B. & Schneeberger, K. SyRI: finding genomic rearrangements and local sequence differences from whole-genome assemblies. Genome Biol. 20, 277 (2019).
PubMed PubMed Central Google Scholar
Danecek, P. et al. Twelve years of SAMtools and BCFtools. GigaScience 10, giab008 (2021).
PubMed PubMed Central Google Scholar
Purcell, S. et al. PLINK: a tool set for whole-genome association and population-based linkage analyses. Am. J. Hum. Genet. 81, 559–575 (2007).
CAS PubMed PubMed Central Google Scholar
Wickham, H. ggplot2. Wiley Interdiscip. Rev. Comput. Stat. 3, 180–185 (2011).
Google Scholar
Cheng, L. IPHs. Figshare https://doi.org/10.6084/m9.figshare.25846003.v8 (2024).

Download references

Acknowledgements

This work was supported by the National Natural Science Foundation of China (32488302), Guangdong Major Project of Basic and Applied Basic Research (2021B03C1030004), the National Key Research and Development Program of China (2019FA0906200), Agricultural Science and Technology Innovation Program (CAAS-ZDRW202404), the National Natural Science Foundation of China (32422079) and Shenzhen Outstanding Talents Training Fund.

Author information

These authors contributed equally: Lin Cheng, Nan Wang, Zhigui Bao

Authors and Affiliations

National Key Laboratory of Tropical Crop Breeding, Shenzhen Branch, Guangdong Laboratory of Lingnan Modern Agriculture, Genome Analysis Laboratory of the Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen, China
Lin Cheng, Nan Wang, Zhigui Bao, Yuting Yang, Pei Wang, Zhiyang Zhang, Dié Tang, Pingxian Zhang, Yaoyao Wu, Yao Zhou, Yi Zheng, Yong Hu, Qun Lian, Zhaoxu Ma, Chunzhi Zhang, Yongfeng Zhou & Sanwen Huang
Plant Genetics and Rhizosphere Processes Laboratory, TERRA Teaching and Research Center, Gembloux Agro-Bio Tech, University of Liège, Gembloux, Belgium
Lin Cheng & Ludivine Lassois
National Key Laboratory of Tropical Crop Breeding, Chinese Academy of Tropical Agricultural Sciences, Haikou, China
Nan Wang, Yongfeng Zhou & Sanwen Huang
Department of Molecular Biology, Max Planck Institute for Biology Tübingen, Tübingen, Germany
Zhigui Bao
School of Agriculture and Biotechnology, Sun Yat-Sen University, Shenzhen, China
Qian Zhou
Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, TN, USA
Andrea Guarracino & Erik Garrison
Department of Genetics, Yale University School of Medicine, New Haven, CT, USA
Dié Tang
College of Horticulture, Nanjing Agricultural University, Nanjing, China
Yaoyao Wu
Key Laboratory of Plant Molecular Physiology, Institute of Botany, Chinese Academy of Sciences, University of Chinese Academy of Sciences, Beijing, China
Yao Zhou
Department of Plant Biology, College of Biological Sciences, University of California, Davis, Davis, CA, USA
William J. Lucas
Leibniz Institute of Plant Genetics and Crop Plant Research (IPK) Gatersleben, Seeland, Germany
Nils Stein
Crop Plant Genetics, Institute of Agricultural and Nutritional Sciences, Martin-Luther-University of Halle-Wittenberg, Halle (Saale), Germany
Nils Stein
Institute of Integrative Biology and Zurich-Basel Plant Science Center, ETH Zurich, Zurich, Switzerland
Thomas Städler

Authors

Lin Cheng
View author publications
Search author on:PubMed Google Scholar
Nan Wang
View author publications
Search author on:PubMed Google Scholar
Zhigui Bao
View author publications
Search author on:PubMed Google Scholar
Qian Zhou
View author publications
Search author on:PubMed Google Scholar
Andrea Guarracino
View author publications
Search author on:PubMed Google Scholar
Yuting Yang
View author publications
Search author on:PubMed Google Scholar
Pei Wang
View author publications
Search author on:PubMed Google Scholar
Zhiyang Zhang
View author publications
Search author on:PubMed Google Scholar
Dié Tang
View author publications
Search author on:PubMed Google Scholar
Pingxian Zhang
View author publications
Search author on:PubMed Google Scholar
Yaoyao Wu
View author publications
Search author on:PubMed Google Scholar
Yao Zhou
View author publications
Search author on:PubMed Google Scholar
Yi Zheng
View author publications
Search author on:PubMed Google Scholar
Yong Hu
View author publications
Search author on:PubMed Google Scholar
Qun Lian
View author publications
Search author on:PubMed Google Scholar
Zhaoxu Ma
View author publications
Search author on:PubMed Google Scholar
Ludivine Lassois
View author publications
Search author on:PubMed Google Scholar
Chunzhi Zhang
View author publications
Search author on:PubMed Google Scholar
William J. Lucas
View author publications
Search author on:PubMed Google Scholar
Erik Garrison
View author publications
Search author on:PubMed Google Scholar
Nils Stein
View author publications
Search author on:PubMed Google Scholar
Thomas Städler
View author publications
Search author on:PubMed Google Scholar
Yongfeng Zhou
View author publications
Search author on:PubMed Google Scholar
Sanwen Huang
View author publications
Search author on:PubMed Google Scholar

Contributions

S.H. designed the project. S.H. and Yongfeng Zhou supervised the project. L.C. performed analyses of genome assembly, annotation, TE-mediated SV formation and GHSL of SVs. L.C. and N.W. analysed deleterious variants and the construction of IPHs. N.W. simulated breeding routines to achieve IPHs. Z.B., A.G. and E.G. performed graph construction and evaluation. P.W. contributed to the greenhouse work. Y.Y., Z.Z., D.T., Y.H., Z.M. and Q.L. assisted in bioinformatics analyses. P.Z., Y.W., Yi Zheng, Yao Zhou and L.L. provided essential technology support. L.C., N.W. and Z.B. wrote the draft. Q.Z., A.G., C.Z., W.J.L., E.G., N.S., T.S., Yongfeng Zhou and S.H. revised the manuscript. All authors contributed to manuscript preparation and read, commented on and approved the manuscript.

Corresponding author

Correspondence to Sanwen Huang.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature thanks Todd Michael, Peter Morrell and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data figures and tables

Extended Data Fig. 1 Assembly strategy and validation of the potato pangenome.

a, Schematic diagram of the assembly process of the haplotype-resolved potato genome. b, Assessment of genome completeness by the read-mapping rate. c, Pangenome growth curves for 61 haplotypes (plus DMv6.1 reference) based on the Minigraph-Cactus (MC) graph, including conserved (present in at least 90% haplotypes) and variable sequence (<90% of haplotypes). d, Percentage of multiallelic SVs and indels for the PGGB and MC potato pangenomes.

Extended Data Fig. 2 Visualization of the potato chromosome 10 pangenome graph.

The visualization utilizes a 1D representation of the pangenome graph, with the DMv6.1 reference genome serving as the baseline for orientation. Graph nodes are arranged from left to right corresponding to their positions within the reference genome. The black lines under the paths are the links, which represent the graph topology. Black and red bars indicate regions oriented in the forward and reverse direction relative to the pangenome sequence (the complete sequence formed by concatenating all graph nodes). Access to visualizations of the 1D PGGB pangenome graphs for all chromosomes can be found in Figshare (https://figshare.com/articles/dataset/IPHs/25846003).

Extended Data Fig. 3 The statistics of TEMR mediated by LTR/Gypsy and LTR/Copia.

a, The density of estimated insertion time of TEMRs for LTR/Copia and LTR/Gypsy. b, The intact length of long terminal repeats in TEMRs for LTR/Copia and LTR/Gypsy. The boxes represent 75% and 25% quartiles, the central line denotes the median and the whiskers extend to 1.5 times the interquartile range. P < 2.2 × 10⁻¹⁶, two-tailed Student’s t-test, n = 70,000 for both LTR/Copia and LTR/Gypsy. Recently active TE events exhibit a higher degree of intact sequences and are associated with more recent TEMR events. c, The number and length of flanking homologous sequence of TEMRs for LTR/Copia and LTR/Gypsy. d, An example of a LTR/Gypsy-mediated deletion on chromosome 2 of the potato accession PG5003 genome.

Extended Data Fig. 4 The gene synteny ideogram constructed by protein sequences of all haplotypes.

Different chromosomes are distinguished by alternating light grey and dark grey. Two heterozygous inversions prevalent in cultivated potatoes are marked with dashed-line red rectangles.

Extended Data Fig. 5 An example of a heterozygous inversion in the PG6244 genome assembly.

a, Synteny plot for two haplotypes aligned to the DM genome. The haplotype H1 of PG6244 exhibits a 6.4-Mb inversion (INV894, chr10: 52.7–59.1 Mb) on chromosome 10 but PG6244 H2 shows synteny with the DMv6.1 genome. b, The Hi-C map for the assembly of the two PG6244 haplotypes for chromosome 10. An inversion signal between the two haplotypes is marked by the black rectangles.

Extended Data Fig. 6 Pairwise Jaccard similarity analysis of 60 potato haplotypes.

The heatmap indicates the correlation among 60 potato haplotypes. Paired haplotypes of each accession are indicated at the top and left with dashed lines. The rectangle colours represent different potato groups (see legend in upper-right corner).

Extended Data Fig. 7 The characteristics of SVs, dSNPs and dSVs in the potato genome.

a, The number of homozygous SVs, heterozygous SVs and the sum of all SVs (twice the number of homozygous SVs plus the number of heterozygous SVs) across our panel of potato genomes. b, The number of homozygous dSNPs, heterozygous dSNPs and the sum of all dSNPs across our panel of potato genomes. For a and b, the boxes represent 75% and 25% quartiles, the central line indicates the median and the whiskers extend to 1.5 times the interquartile range. P values are provided in each panel representing comparisons between wild (n = 10) and cultivated potato samples (n = 19), using two-tailed Student’s t-tests. c, Genome architecture of dSVs across the potato genome. d, The distribution of dSVs and dSNPs across the potato genome.

Extended Data Fig. 8 The recombination map of two inbred lines.

a, Recombination map of the parent E86-69 and its inbred offspring E4-63. The x-axis represents the chromosomes. The y-axis represents the number of heterozygous markers (SNPs and indels). Top: heterozygosity was identified by comparing E86-69 H1 and E86-69 H2; middle: heterozygosity was identified by comparing E86-69 H2 and E4-63; bottom: heterozygosity was identified by comparing E86-69 H1 and E4-63. The recombination map and events (solid triangles) were inferred based on the inheritance pattern of heterozygous variants between the parental (E86-69 H1 and H2) and their inbred offspring haplotypes (E4-63). b, Analogous recombination map of the parental line PG6359 and its inbred offspring A6-26.

Extended Data Fig. 9 Statistical significance of dSNP enrichment in the dSV coupling phase.

a, Diagram depicting our dSNP accumulation assessment. For calculation purposes, chromosomes were segmented into fragments based on the midpoint between adjacent dSVs (irrespective whether in coupling or repulsion phase). Statistical tests were performed on each 10-kb non-overlapping window within each fragment. b, Statistical significance for the number of dSNPs flanking dSVs using the 10-kb non-overlapping windows between dSV coupling and repulsion phases. The significance threshold of P = 0.05 (two-sided Student’s t-test) is indicated. c, Diagram illustrating the calculation of the distance of a focal dSV to the nearest dSV in coupling phase (Dis-coupling) and to the nearest dSV in repulsion phase (Dis-repulsion). The focal dSV is highlighted in red. d, The frequency distribution of Dis-coupling and Dis-repulsion in potato genomes.

Extended Data Fig. 10 The schematic diagram of IPHs-v1.0 for inbred lines.

a, The number of dSVs in the GON/STN group for inferring the ideal A inbred line. b, Similarly, the number of dSVs in the PHU group for designing the ideal E inbred line. c, Validating the power of recombination (simulated single recombination event) and haplotype selection in eliminating dSVs. We simulated selfing experiments and the distribution of dSVs in offspring. Two accessions (the optimal and worst-case scenarios) from heterotic group A, PG6359 and PG6029, are displayed. d, Similarly, two accessions (the optimal and worst-case scenarios) from heterotic group E, PG6245 and RH89-039-16. The simulated progenies show that recombination could dramatically reduce the number of dSVs at the haplotype level. e, Schematic diagram indicating the generation of ideal potato haplotypes (IPHs-v1.0) based on two heterotic groups including inbred lines E4-63 and A6-26. This approach prioritizes the minimization of the dSV count as its central objective. Recombination and haplotype selection could reduce the number of dSVs during potato recurrent selection.

Supplementary information

Supplementary Information

Reporting Summary

Supplementary Tables

Supplementary Tables 1–22.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

Reprints and permissions

About this article

Cite this article

Cheng, L., Wang, N., Bao, Z. et al. Leveraging a phased pangenome for haplotype design of hybrid potato. Nature 640, 408–417 (2025). https://doi.org/10.1038/s41586-024-08476-9

Download citation

Received: 04 January 2024
Accepted: 02 December 2024
Published: 22 January 2025
Issue Date: 10 April 2025
DOI: https://doi.org/10.1038/s41586-024-08476-9

This article is cited by

Mumemto: efficient maximal matching across pangenomes
- Vikram S. Shivakumar
- Ben Langmead
Genome Biology (2025)
Genus-wide plant pangenome could inform next-generation crop design
- Nadia Kamal
- Manuel Spannagl
Nature (2025)
Impacts of reproductive systems on grapevine genome and breeding
- Hua Xiao
- Yue Wang
- Yongfeng Zhou
Nature Communications (2025)