Main

Potato (S. tuberosum L.) is the most important tuber crop, feeding around 1.3 billion people annually in more than 120 countries1. Recent studies have decoded the genomes of monoploid, diploid and tetraploid potatoes, resolved self-incompatibility and dissected the genetic basis of inbreeding depression4,5,6,7,8,9,10,11,12,13,14,15. These efforts have led to the development of a first generation of highly homozygous inbred lines and subsequently, the first uniform hybrid as a proof-of-concept3, demonstrating the potential of genome design for revolutionizing potato breeding towards a fast, iterative mode.

A central focus of hybrid potato breeding programmes is development of a better understanding of deleterious variants4,5,6,7,8,9,10, which compromise growth and overall fitness9. dSNPs have been widely studied in many species such as humans, dogs and rice16,17,18. Structural variants (SVs), however, are expected to affect more genomic regions and are more likely to have stronger effects on fitness than SNPs; nonetheless, dSVs have rarely been studied. Phenotype-based selection is neither efficient nor accurate in purging deleterious variants closely linked in repulsion phase in diploid potatoes5. Haplotype-based selection based on in-depth understanding of deleterious variants and knowledge-guided recombination strategies are essential for integrating superior genomic fragments from different donors, paving the way for the eventual development of ideal potato haplotypes (IPHs).

Thus far, our knowledge of haplotype diversity in potatoes is limited. Only a few genomes have been published and most diploid potatoes are assembled into a single chimeric haploid genome, with only a few examples of haplotype-resolved assemblies11,12,13. Collapsed assemblies of two haplotypes from diploid genomes lose valuable information on haplotype diversity and the phase state, introducing multiple types of errors, including switch errors, false duplications and nucleotide consensus errors19,20. Haplotype-resolved assemblies provide a more accurate representation of genetic diversity and promise to become essential for future potato breeding5,7,14. Concurrently, there is a need to shift from single reference genomes to phased pangenome references that better capture diversity across multiple populations21.

In this study, we carried out de novo assembly for 60 haplotypes from 31 diploid potatoes, including 20 haplotypes from 10 wild accessions, 38 haplotypes from 19 domesticated diploid accessions and two haplotypes from two inbred lines. From these 60 potato haplotypes, we constructed a potato pangenome graph, which enabled us to uncover extensive non-reference sequences and document multiallelic SVs. We further reveal the origin and fate of SVs in evolution and domestication and design IPHs for future breeding. Our study provides in-depth understanding of the haplotype diversity of a clonal crop, establishing the theoretical foundation for the reinvention of potato.

The potato pangenome graph

To capture and characterize haplotype diversity in potato populations, we selected 31 accessions based on previous studies of genetic diversity and principal component analysis (PCA) of 193 accessions3,5,9,13 (Fig. 1a and Supplementary Figs. 1 and 2). The selected accessions comprise 19 landraces covering three indigenous diploid cultivated groups (S. tuberosum Group Phureja (PHU), S. tuberosum Group Stenotomum (STN) and S. tuberosum Group Goniocalyx (GON)), two inbred lines (A6-26 and E4-63, the 5th-generation selfed offspring of diploid PG6359 and E86-69, respectively)3 and 10 wild accessions from Solanum candolleanum (CND), which is believed to be the progenitor of domesticated potatoes22 (Supplementary Table 1). We generated an average of 28.15 Gb (approximately 38-fold) PacBio high-fidelity (HiFi) reads and 110.43 Gb (approximately 149-fold) high‐throughput chromosome conformation capture technique (Hi-C) reads (Supplementary Tables 2 and 3). Genome surveys using the HiFi reads revealed an average of around 1.55% genomic heterozygosity across cultivated diploid potato genomes (excluding inbred lines; Supplementary Table 4). Our initial de novo assembled haplotypes have an average genome size of 811 Mb, with contig N50 of 12.25 Mb (Fig. 1b and Supplementary Table 4). Paired haplotypes of each heterozygous diploid were further scaffolded on the basis of Hi-C contact maps (Extended Data Fig. 1a and Supplementary Fig. 3), resulting in pseudo-chromosome assemblies with an average anchor rate of 95.17% (Supplementary Table 5). Assessment using BUSCO23 demonstrated a high level of completeness with an average of 99.2% (Supplementary Table 5).

Fig. 1: Phased potato pangenome reference (PPG-v.1.0).
figure 1

a, PCA from 193 potato accessions; the 31 selected accessions are highlighted with solid dots. b, Assessment of genome contiguity by contig N50 for haplotype 1 (solid lines) and haplotype 2 (dashed lines). The Nx value represents x% of the total contigs length that is covered by the shortest contig length. c, Assessment of correctness by switch errors and Hamming errors for each diploid assembly. d, Reliability of the 29 diploid assemblies (excluding 2 inbred lines). For each haplotype-resolved assembly, the left bar represents haplotype H1 and the right bar represents H2. Regions identified as haploid are considered trustworthy (green). The y axis is broken to illustrate both the prominence of the reliable haploid component and the stratification of the unreliable segments. e, Pangenome growth curves for wild, cultivated, and wild plus cultivated groups. Pie charts represent conserved sequence (present in at least 90% of the haplotypes) and variable sequence in the PGGB pangenome graph. Numbers in the pie charts represent sequence length (Mb). f, Non-reference nodes (>50 bp) in the potato PGGB pangenome were annotated as four primary types, including TEs, satellite repeats, low-complexity (simple repeats) and non-repetitive sequences. Six different TE families were identified.

Three additional independent analyses confirm the quality of our assemblies: (1) the low switch error rates (0.21% on average) and Hamming errors (1.05% on average) indicate the high correctness of our assemblies (Fig. 1c); (2) the high read mapping rate (mean 99.99%) and consistent k-mer spectrum (Extended Data Fig. 1b and Supplementary Fig. 4) confirm the completeness of the assemblies; and (3) to further estimate the reliability of our assemblies, we also used Flagger24 to estimate the potentially erroneous regions. This confirmed that 98.9% of our assemblies are not collapsed or duplicated (Fig. 1d and Supplementary Table 6). We annotated an average of 449.0 Mb (59.69%) as being composed of transposable elements (TEs) (Supplementary Table 7) and predict 36,421–40,781 gene structure models for each haplotype (Supplementary Table 8).

To better represent non-reference sequences in potato genomes, we built two pangenome graphs for 60 haplotype assemblies plus the reference DMv6.125 using the PanGenome Graph Builder (PGGB)26 and Minigraph-Cactus27, respectively. We found that the Minigraph-Cactus pipeline removed highly diverged and large inverted regions, resulting in discarding of an average of 14.5% of sequence per chromosome (Supplementary Fig. 5). Therefore, we utilized PGGB to build a potato pangenome graph, which reduces reference bias28 and is more sensitive for variation calling than methods based on mapping to a linear reference genome (Supplementary Fig. 6 and Supplementary Tables 9 and 10). This unbiased potato pangenome graph (PPG-v.1.0) comprises 248.64 million nodes and 345.61 million edges with total sequence length of 3,076 Mb. Wild potato accessions contribute a larger portion to the pangenome graph (2,286 Mb) than cultivated potato assemblies (1,807 Mb), indicating a higher pangenome graph growth rate in wild potatoes (Supplementary Table 11). We determined that PPG-v.1.0 consists of 365 Mb conserved sequences (present in at least 90% of the haplotypes) and 2,711 Mb variable sequences (Fig. 1e). Using the DMv6.1 reference as a baseline, PPG-v.1.0 also reveals prevalent rearrangements (Extended Data Fig. 2). Furthermore, we annotated non-reference nodes in each haplotype path: 52.6% of them overlap with TEs, 22.7% overlap with satellite repeats, and 23.8% are classified as non-repetitive sequences (Fig. 1f). Notably, 13.3% of repetitive sequences exhibit characteristics of both transposons and satellites (Supplementary Table 12) potentially originating from atypical centromere or subtelomere repeats29 in the potato genome.

Finally, we deconstructed genome variation from the graph structure for small variants (SNPs and insertion–deletion mutations (indels) smaller than 50 bp in size) and structural variants (SVs ≥ 50 bp) based on DMv6.1 coordinates. In total, the constructed potato haplotype variation map includes 46,012,502 SNPs, 3,587,850 indels and 133,264 SVs. Notably, 16.9% of indels and 87.0% of SVs are multiallelic, with three or more distinct alleles. The graph generated by the Minigraph-Cactus pipeline shows a similar trend (Extended Data Fig. 1c,d and Supplementary Table 13). Collectively, the PPG-v.1.0 generated from phased diploid assemblies provides a fundamental resource for in-depth exploration of haplotype diversity and its consequences for potato breeding.

TEs drive SV formation

To reveal the mechanistic origins and dynamics of SVs in potato genomes, we analysed the pangenome SV sequences and their residual flanking sequences (±100 bp), which we annotated as one of three types: TEs, tandem repeats (TRs) and segmental duplications. We found that TEs account for 90.6% of repetitive elements associated with SVs (Fig. 2a and Supplementary Table 14). Additionally, 28.2% of SVs are covered by a single TE sequence (Supplementary Fig. 7), suggesting that these features are likely to reflect TE insertion events30,31. Recent reports suggest that TEs could serve as substrates for ectopic DNA repair leading to SV formation32,33. Therefore, we examined SV breakpoints and found that 33.8% could be classified as being compatible with ectopic recombination events from TE-mediated rearrangement (TEMR), characterized by both breakpoints (100 bp) having the same TE class (Supplementary Table 15, Supplementary Fig. 8 and Methods). This observed TEMR rate is significantly higher than the expected rate of 7.7% (P < 2.2 × 10–16, two-tailed Student’s t-test), supporting a significant contribution of homologous TEs in mediating SV formation. TEMR events were further classified on the basis of TE families, revealing that Gypsy long terminal repeat retrotransposon (LTR/Gypsy)-mediated TEMR (42.4%) is seven times more prevalent than that involving Copia long terminal repeat retrotransposons (LTR/Copia (7.0%)) (Fig. 2b). Moreover, SVs mediated by LTR/Gypsy TEMR are significantly longer than those mediated by LTR/Copia TEMR (average of 7,005 bp versus 1,899 bp; P < 2.2 × 10–16, two-sided Student’s t-test), and exhibit more recent and frequent activity (Fig. 2c and Extended Data Fig. 3).

Fig. 2: TE activity drives SVs in potato haplotypes.
figure 2

a, Proportion diagram of structural variation formation driven by different repetitive sequences, including TEs, TRs and segmental duplications (SDs). b, Left, percentage of TEMR through ectopic recombination by different families. Right, percentage of different TE families in TEMR among the three SV types. Del, deletion; Inv, inversion; Dup, duplication. Inserted sequences can be formed by TE movement and do not fall under our definition of TEMR. c, Comparison of SV length of TEMR for different TE families. A two-sided Student’s t-test was used to calculate the P values. NS, not significant; ***P < 2.2 × 10−16, n = 2,768 (LTR/Gypsy), 361 (LTR/Copia), 596 (DNA/Helitron) and 1,044 (DNA/TIR). d, Validation for the 4.0-Mb de novo inversion using Hi-C data from potato E86-69; the haplotype of inbred line E4-63 was used as the reference genome. e, Sequence analysis in the de novo inversion. Top, synteny plot of the de novo inversion. Middle, distribution of TEs in the de novo inversion. Bottom, the 466-bp homologous sequence stretch of the LTR mediating the de novo inversion.

To investigate the potential recent TE activity in potato genomes, we also compared two homozygous inbred lines with their heterozygous founders. Notably, we identified an approximately 4.0 Mb (chromosome (chr.) 6: 41.7–45.7 Mb) long terminal repeat retrotransposon (LTR)-mediated de novo paracentric inversion (Fig. 2d,e and Supplementary Fig. 9) in the fifth-generation inbred line E4-63, but not in the two founder haplotypes of the parental E86-69. Therefore, it is advisable to approach this de novo inversion with caution when implementing recurrent selection in hybrid potato breeding34,35.

Domestication boosts heterozygosity

Previous studies suggested that there was increased heterozygosity in cultivated potatoes compared to wild ones22, consistent with heterozygosity evaluated in our study on the basis of k-mer statistics using HiFi reads. To better describe heterozygosity and the loss of collinearity between the two haplotypes within the same diploid individual, we introduce the concept of genome heterozygosity by sequence length (GHSL). This approach complements existing k-mer-based methods by quantifying the total sequence length of non-redundant variants (SNPs, indels and SVs) between two haplotypes in phased diploid assemblies (Supplementary Fig. 10). We found the GHSL to be approximately 93.8 Mb in our phased assemblies, constituting 12.5% of the average haplotype length (Fig. 3a and Supplementary Table 16). This estimate is similar to estimates by read mapping in clonally propagated grapevine36. Moreover, our analysis shows that inversions account for a large proportion of haplotype-specific SVs (59.4% on average) (Supplementary Table 16).

Fig. 3: Potato domestication enhanced haplotype divergence.
figure 3

a, Genomic length affected by haplotype-specific (heterozygous) variants by comparing haplotype H1 to H2 across 29 diploid potato accessions. b, GHSL in wild and cultivated potatoes. The boxes represent 75% and 25% quartiles, the central line indicates the median and the whiskers extend to 1.5 times the interquartile range; P = 2.6 × 10−4, two-tailed Student’s t-test, n = 19 (cultivated) and 10 (wild). c, PCA of chromosome 1 haplotypes, with dotted lines connecting paired haplotypes in heterozygous diploid potatoes. d, The phylogeny of 63 haplotypes for chromosome 1 was constructed using the maximum-likelihood method based on 1,074 single-copy genes. Dashed lines (right) connect the paired haplotypes in each diploid genome. RH, RH89-039-16. S. lycopersicum, Solanum lycopersicum; S. etuberosum, Solanum etuberosum; S. wrightii, Solanum wrightii. e, Unfolded site frequency spectrum of four types of SVs and non-synonymous single nucleotide polymorphisms (nSNPs) compared to putatively neutral synonymous SNPs (sSNPs) across cultivated potatoes.

We found that haplotype divergence in cultivated potatoes, as quantified by GHSL, is 1.46 times greater than that observed across wild potatoes (approximately 105.2 Mb versus 72.1 Mb; Fig. 3b, P = 2.6 × 10−4, two-sided Student’s t-test). Notably, two haplotype-specific inversions are quite prevalent in cultivated potatoes (Extended Data Fig. 4). One inversion has previously been reported and is associated with a locus that regulates yellow tuber flesh13 (DMv6.1, chr. 3: 42.9–48.7 Mb). The other inversion (DMv6.1, chr. 10: 52.7–59.1 Mb) contains 661 genes, including StPIN7 (Soltu.DM.10G026500.1) and StPTB6 (Soltu.DM.10G026670.1), which may be associated with auxin transport37 and tuberization38, respectively (Extended Data Fig. 5).

To explore the extent of genome heterozygosity and haplotype divergence during potato domestication, we conducted PCA separately for each chromosome and found that the two haplotypes of each wild potato form ‘nearest neighbours’, whereas the two haplotypes of each cultivated potato exhibit a more dispersed pattern (Fig. 3c and Supplementary Fig. 11). We constructed haplotype trees for each chromosome pair based on single-copy genes, using three additional Solanum species as outgroups (Fig. 3d and Supplementary Fig. 12). Analogous to the PCA-based patterns, the two haplotypes of each cultivated potato typically reveal more pronounced genetic divergence than those of each wild potato. We conducted additional validation through pairwise Jaccard similarity analysis and alignment for split windows, both of which indicate that multiple hybridization events have affected the haplotype landscape in cultivated potatoes (Extended Data Fig. 6 and Supplementary Fig. 13). Molecular evidence in maize hybrids suggests that the length of nonsyntenic regions between parental lines is strongly correlated with levels of heterosis39. Our findings of increased genome heterozygosity and haplotype divergence in cultivated potatoes are similar to those in grapevine, another clonal crop40. The enhanced genome heterozygosity in cultivated potatoes compared with wild potatoes results in reduced homozygous deleterious burden (Extended Data Fig. 7a,b), indicating that potato domestication included an exploration of heterosis.

Fate of dSVs under breeding

To characterize the fitness effects of SVs in potato cultivars, we computed the unfolded site frequency spectrum on the basis of SVs and SNPs from domesticated potatoes (Fig. 3e). We found that SVs are over-represented for singletons and other minor-allele frequency classes, compared to both synonymous SNPs and even non-synonymous SNPs at minor frequencies, consistent with equivalent observations in grapevine genomes36. Cultivated potatoes have around 2.7 times more heterozygous SVs (average of 20,613 per individual) than homozygous SVs (average of 7,561) (Extended Data Fig. 7a). Additionally, 39.4% of the heterozygous SVs are located either in the gene bodies or in putative promoter regions (Supplementary Table 17). These findings suggest that SVs are more likely to be strongly deleterious than SNPs and are typically present in heterozygous state with relatively low frequencies across diploid potato cultivars.

The presence of dSVs in heterozygous state could lead to strong inbreeding depression upon forced inbreeding, thus hindering the development of inbred lines, with important implications for hybrid potato breeding3,9,41. Inferred purifying selection against SVs is stronger than against SNPs, implying that SV frequencies can be useful for an initial assessment of putative selection pressure42,43. Recently, deep phylogenomic analyses of 92 species in the family Solanaceae assessed genome-wide evolutionary constraints (with highly constrained regions amounting to 31.7 Mb under the threshold genomic evolutionary rate profiling (GERP) score ≥ 2.0, including non-synonymous, synonymous and non-coding sites), forming the basis for identifying and quantifying dSNPs at genome-wide scales in potato10. Analogous to dSNP identification, we developed an unweighted approach to infer putative dSVs in the potato genome on the basis of the joint criteria of evolutionarily derived state, low allele frequency and being within gene-coding or evolutionarily constrained regions (Fig. 4a and Methods). In total, we characterized 19,625 dSVs, 50.4% of which are located within non-coding regions (Extended Data Fig. 7c,d and Supplementary Table 18). For example, we identified a 51.1-kb deleterious deletion occurring in only one haplotype of cultivar PG6090. This deletion leads to truncated proteins or gene loss and is associated with three genes (Supplementary Fig. 14).

Fig. 4: Fate of dSVs in domestication and inbred line development.
figure 4

a, The pipeline for identifying dSVs includes the joint criteria of derived state of low-frequency SVs (≤0.05) in gene-coding and/or evolutionarily constrained regions. b, The percentage of homozygous and heterozygous dSVs in wild and cultivated potato populations. c, The number of dSVs per chromosome in the founder PG6359 (average of the two haplotypes) and its derivative inbred line A6-26. d, Schematic diagram of recombination events in chromosome 2 (left) and haplotype selection in chromosome 4 (right). The potential recombination events are marked with dashed lines. Deleterious variants from the H1 and H2 haplotypes are shown in different colours. e, Pearson correlation coefficient (r = 0.78) and P value (2.6 × 10−16) between numbers of dSNPs and dSVs across all chromosomes (n = 696). The shaded area represents the bootstrapped 95% confidence interval. f, Distribution of the number of dSNPs flanking focal dSVs for 10-kb non-overlapping windows in the dSV coupling phase (orange) and the dSV repulsion phase (grey). The points represent means of 58 haplotypes and error bars represent 95% confidence intervals. g, Average distance of a focal dSV to the nearest dSV in coupling phase (same haplotype; dis-coupling) and to the nearest dSV in syntenic regions of the other haplotype (dis-repulsion). Only distances <1 Mb are included. The boxes represent 75% and 25% quartiles, the central line indicates the median and the whiskers extend to 1.5 times the interquartile range; P < 2.2 × 10−16, two-tailed Student’s t-test, n = 13,717 (dis-coupling) and 11,679 (dis-repulsion). h, A zoomed-in view of biased dSNP accumulation (blue bars) around a heterozygous dSV (red triangle) within a 20-kb segment of chromosome 4 in E86-69 H2.

Excluding the inbred lines, we found that each diploid potato individual has an average of 843 dSVs, comprising 23.1 Mb of genomic sequence. Our analyses revealed that 73% of dSVs are in heterozygous state across wild potatoes, and this proportion increases to 97% in cultivated lines (Fig. 4b and Supplementary Fig. 15). These patterns indicate that dSVs tend to be sheltered in heterozygous state, potentially largely avoiding negative selection on them36,44. Thus these heterozygous dSVs require attention from breeders when choosing suitable materials for breeding.

Purging dSVs is crucial for developing elite inbred lines in hybrid potato breeding. We established that the available inbred lines, A6-26 (88 dSVs) and E4-63 (303 dSVs), carry fewer dSVs compared to the average haplotypes from the two parents, PG6359 (135 dSVs) and E86-69 (351 dSVs), respectively (Fig. 4c and Supplementary Fig. 16). To investigate the process of purging dSVs at the haplotype level, we constructed a genome-wide recombination map based on inbred lines and their respective parental haplotypes (Extended Data Fig. 8). Our results track at least 16 recombination events in one of the inbred lines across 5 generations of self-fertilization. For example, the 2 haplotypes from accession PG6359 carry 17 and 11 dSVs on chromosome 2; recombination events decreased the number of dSVs to seven in the derivative inbred line A6-26. For chromosome 4 of PG6359 H2, which carries fewer dSVs than PG6359 H1 (14 versus 30), 85.2% of dSVs are retained in the inbred line A6-26 (chr. 4: 7.5–59.5 Mb; 14 dSVs) (Fig. 4d and Supplementary Table 19).

The ‘broken-window’ effect of dSVs

On the basis of the genome-wide map of deleterious variants, we found a significant positive correlation between the numbers of dSVs and dSNPs per haplotype (r = 0.78, P < 2.2 × 10–16, Pearson correlation coefficient; Fig. 4e and Supplementary Tables 19 and 20). In diploid genomes, we define the coupling phase as the haplotype that contains a focal dSV, whereas the repulsion phase refers to the other haplotype. To assess the potential effect of dSVs on dSNPs, we compared the number of dSNPs surrounding dSVs in the coupling phase to the number of dSNPs in the syntenic region of the repulsion phase. We partitioned the genome into intervals on the basis of dSV distribution across both haplotypes and calculated statistical significance to identify affected genomic regions (Extended Data Fig. 9a and Methods). Our results suggest that dSNPs are significantly more frequently present in dSV coupling phase than in the dSV repulsion phase (Fig. 4f, Extended Data Fig. 9b and Supplementary Table 21). This accumulation signal of dSNPs tends to be more pronounced in close proximity to dSVs.

Furthermore, we investigated whether dSVs form clusters by calculating two types of distances: the distance between the focal dSV to the nearest dSV in coupling phase (same haplotype; dis-coupling), and its distance to the nearest dSV in repulsion phase, (dis-repulsion) (Extended Data Fig. 9c). Dis-repulsion is significantly larger than dis-coupling (Fig. 4g, 264.1 kb versus 202.1 kb, P < 2.2 × 10–16, two-sided Student’s t-test), suggesting that dSVs tend to form clusters within the same haplotype (Extended Data Fig. 9d). We provide an illustrative example of this phenomenon using the diploid potato line E86-69 (Supplementary Fig. 17).

The presence of dSVs, enhancing the occurrence of dSVs and dSNPs in the same phase, is analogous to a sociological hypothesis called the broken-window effect45. Therefore, we refer to this pattern as the ‘broken-window’ effect of dSVs. We validated this pattern by examining local assemblies, which cover the 10-kb regions flanking dSVs with an average coverage of 19 reads, as well as 6 long reads spanning the flanking regions. Theoretical considerations as well as empirical evidence suggest that SVs affect recombination rates of closely linked genomic regions46,47,48. Those associations may result from insufficient purging around the dSV owing to reduced recombination, whereas the dSV repulsion phase tends to remain functional, thus preventing strong selection on deleterious recessive variants9,49, leading to gradual differences in deleterious burden between the two haplotypes (Fig. 4h). The broken-window effect reveals that dSVs and dSNPs are not randomly distributed across potato genomes, as they tend to form clusters in coupling phase. Such deleterious clusters are important features of potato genomes, and they are easy to spot in our phased pangenome. We found that many deleterious variants were not efficiently purged during our previous development of two inbred lines3 (A6-26 and E4-63). Line A6-26 carries 88 dSVs and 61,017 dSNPs, whereas E4-63 has 303 dSVs and 71,200 dSNPs (Supplementary Tables 19 and 20). These deleterious variants in inbred lines substantially reduce their fitness3 and thus need to be purged as a foundation for future potato breeding.

Genome design of ideal haplotypes

Plant breeders adopted the concept of ideal plant architecture (IPA) to guide breeding of superior varieties by combining multiple desirable traits50. Similar to the IPA, we propose the IPHs strategy to guide breeders to develop a suite of inbred lines that are as close as possible to the ideal genotype (see Supplementary Methods).

Based on PPG-v.1.0, the starting point of computationally designed IPHs-v.1.0 are two heterotic groups; nine cultivars from the STN and GON groups with the inbred line A6-26 (heterotic group A), and eight cultivars from the PHU group with the inbred line E4-63 (heterotic group E) (Extended Data Fig. 10a,b). In principle, recombination events can reduce genetic burden in simulated haplotype combinations (Extended Data Fig. 10c,d). On the basis of the recombination map of two inbred lines (Extended Data Fig. 8), we assumed no more than four recombination events per chromosome in our IPHs. The consideration of recombination breakpoints takes into account the recombination coldspots in certain regions, such as heterochromatic centromeric regions and inversions. We estimated the distribution of dSVs and dSNPs under various haplotype donors and recombination events to identify the optimal combinations of IPHs, IPHs-A and IPHs-E (Supplementary Fig. 18).

We found that ideally, all dSVs could be removed from IPHs-A and IPHs-E and the number of dSNPs could be reduced to 41,262 and 35,360, implying a reduction of 32.4% (A6-26) and 50.3% (E4-63) compared with the previous inbred lines (Fig. 5a,b and Supplementary Table 22). Haplotype distances predict that hypothetical F1 hybrid genomes between IPHs-A and -E would carry only 10,830 homozygous dSNPs (Fig. 5c), implying a reduction of 54.5% compared with the previous F1 hybrid potato3. By integrating the genetic linkage map51, we found that achieving a IPHs-A chromosome 1 would require at least five generations of recurrent selection from hybrid progenies, and that one of the designed breakpoints is located in a low-recombination region (near chr. 1: 44.3 Mb), requiring a population size of at least 1,321 individuals to obtain at least 1 copy of IPHs-A chromosome 1 (Supplementary Fig. 19). It is worth noting that relying solely on natural recombination is unlikely to be sufficient (Supplementary Fig. 20). New technologies such as targeted recombination, gene editing and synthetic biology to precisely eliminate deleterious variants may provide future pathways for approaching IPHs52,53,54.

Fig. 5: IPHs designed from the phased potato pangenome graph.
figure 5

a, Chimeric map of the IPHs-A genome and the most ideal graph path of 12 chromosomes from heterotic A group (left). The numbers of dSVs and dSNPs for each chromosome are indicated. Haplotypes for chromosome 1 are shown in detail (right), and numbers of dSVs and dSNPs for each haplotype are represented by five parts derived from the ideal designed recombination events. The chimeric graph path of IPHs-A ‘ideal’ chromosome 1 is highlighted with red frames. Inversions and centromeres are coloured in genomic collinear blocks and chromosomes, respectively. b, Similar to a, the chimeric map of the IPHs-E genome and the ideal chimeric graph paths from the heterotic E group. c, The number of dSNPs of IPHs-A and IPHs-E, the overlap indicating the number of homozygous dSNPs in the inferred ideal F1 hybrid generated from IPHs-A and IPHs-E.

Discussion

In this study, we developed a graph-based phased potato pangenome reference comprising 60 haplotype sequences, reducing reference bias and identifying more variants that remain unaccounted by reads-based methods55,56. Our phased pangenome provides an example of a typical clonally propagated crop with a highly heterozygous genome. The integration of pedigree information and high-accuracy ultra-long reads can further improve phase accuracy in highly repetitive regions.

Consistent with observations in grapevine36, SVs are over-represented among singletons and other minor-allele frequency classes, suggesting that SVs are under stronger purifying selection. SVs can be associated with agronomic traits57,58,59, and by disrupting gene structure and cis-regulatory elements, dSVs can exert strong effects. We classified 14% of all SVs to be putative dSVs. However, inference of dSVs shows lower efficacy in certain cases owing to overlapping SVs and ambiguous ancestral state. Therefore, quantifying their fitness penalties will be crucial to guide breeding practices in the future.

Asexually reproducing crops tend to accumulate more heterozygous deleterious mutations36,49. The broken-window effect of dSVs may be reinforced owing to very limited recombination under long-term clonal propagation. After one haplotype is hit by a deleterious mutation with major effects, it seems less likely that deleterious mutations accumulate in the dSV repulsion phase, otherwise the affected individual might not survive to reproduce. This inference underscores the importance of preserving at least one functional copy of genes to maintain genetic robustness60,61. Eradicating dSVs is therefore a new focus in the development of inbred lines to generate diverse hybrid potato combinations, a principle that should also be relevant for other clonally propagated crops.

To develop inbred lines, it is more favourable to start from haplotypes of diploid landraces with a lower deleterious burden10. The relatively high deleterious burden per haplotype in a dihaploid generated from tetraploid varieties poses formidable obstacles for this step (Supplementary Fig. 21). Therefore, we plan to develop the first generation of inbred lines from diploid landraces concurrently with inducing dihaploids from elite tetraploid varieties. The IPHs strategy provides a blueprint for the combination of haplotypes to reduce the negative impact of deleterious variants. Although achieving IPHs at the chromosome level through crossing is possible, it is challenging to combine the ideal chromosomes into one genome without additional recombination (Supplementary Figs. 22 and 23). Currently, the primary value of IPHs lies in their ability to outline an optimal goal for haplotype recombination, allowing breeders to compare real haplotypes, including dSVs and dSNPs, against an idealized standard, thus facilitating refinement of breeding strategies.

Methods

Plant materials and sequencing

We selected 31 potato accessions based on phylogenetic trees generated by previous studies. Of these accessions, 10 are from S. candolleanum and 21 are from 3 groups of diploid potato cultivars. Among these potato cultivars, nine accessions represent S. tuberosum Group Phureja (including inbred line E4-63), eight accessions represent S. tuberosum Group Stenotomum (including inbred line A6-26), and four are from S. tuberosum Group Goniocalyx. Previous HiFi data are available for 21 accessions and previous Hi-C reads are available for five accessions (PRJNA754534). Additionally, we newly sequenced 10 accessions using HiFi and 25 accessions using Hi-C in this study. HiFi reads for 10 accessions were generated using the ccs program version 6.4.0 (https://github.com/PacificBiosciences/ccs) and subreads obtained from the Pacific Biosciences Sequel II platform, which were then converted to FASTQ format by SAMtools (v.1.17)62. In total, we generated 318.49 Gb of HiFi data, ranging from 27.72 to 35.32 Gb per sample. For Hi-C libraries, DNA was extracted from in vitro seedlings and digested with the restriction enzyme MboI using previously described Hi-C library preparation protocols63,64. A total of 2.70 Tb of Hi-C data were generated based on the Illumina HiSeq platform. To facilitate genome annotation, we used RNA-sequencing (RNA-seq) data from six plant tissues (including roots, stems, leaves, stolon, tubers and flowers) from a previous study13.

De novo genome assembly of 60 haplotypes

For genome size and heterozygosity estimation, we used Jellyfish (v.2.3.0)65 to obtain a frequency distribution of the k-mers and estimated the histograms by GenomeScope (v.2.0)66. We then assembled haplotype-resolved assemblies with the HiFi reads and Hi-C reads using hifiasm (https://github.com/chhylp123/hifiasm) (v.0.16)67 with default parameters. Subsequently, we aligned haplotypes using the Juicer pipeline and then generated Hi-C maps for the 3D-DNA pipeline (v.201008)68 with parameter “-q 0”. The assembly from two haplotypes was scaffolded and ordered using the Hi-C data-based rough scaffold and compared with the reference (accession DM1-3 516 R44)25. Two sets of pseudo-chromosomes were constructed for each accession, and this workflow was applied to all 29 diploid potato accessions, resulting in 58 haplotypes. For the two inbred lines E4-63 and A6-26, we generated haploid assemblies. The Hi-C format file was visualized using the Juicebox program (v.2.16.00)69 and misassembled contigs were manually curated. Finally, each haplotype was integrated into the 3D-DNA pipeline based on the run-asm-pipeline-post-review.sh script. To ensure the accuracy of the assemblies, we filtered out contigs smaller than 50 kb and used chromosome-based haplotypes for subsequent analyses.

Evaluation of the genome assemblies

The contig N50 statistic was calculated using assembly-stats (https://github.com/sanger-pathogens/assembly-stats). BUSCO (v.5.4.4)23 was used to evaluate assembly quality with the embryophyte_odb10 protein database. Both switch and Hamming errors were calculated using two variant call format (VCF) files produced by the pipeline calc_switchErr (https://github.com/tangerzhang/calc_switchErr), based on the ‘compare’ function of WhatsHap (v.1.1)70. Genome completeness was estimated via the KAT (v.2.4.2)71 ‘comp’ command. Reliability of our assemblies was estimated by Flagger, a reads-mapping-based pipeline optimized for diploid assemblies24. A detailed workflow and commands can be found in the Supplementary Information and GitHub (https://github.com/Chenglin20170390/Haplotype-diversity).

Annotation of repetitive elements

Different repeat elements are identified based on 60 haplotypes, including TEs, segmental duplications, TRs and satellite repeats. To identify TEs, we employed the Extensive de novo TE Annotator (EDTA) (v.2.1.0)72, including LTRs, DNA transposons with terminal inverted repeat sequences and helitron transposons. For the remaining transposon elements, RepeatModeler73 was used to search for a second round. To search for large segmental duplications (>1 kb), we employed Asgart (v.2.4.0)74 using a k-mer-based method with the parameter “-CRSV”. TRs were identified using Tandem Repeats Finder (v.4.09.1)75 with default parameters. Satellite repeats were identified using Satellite Repeat Finder (SRF, commit e54ca8c; https://github.com/lh3/srf) with HiFi reads.

Prediction of protein-coding genes

A comprehensive strategy consisting of transcript evidence, ab initio prediction, and homology alignment was applied for gene prediction. First, we aligned RNA-seq reads to assembled haplotypes using HISAT2 (v.2.2.1)76 with the “--dta” parameter and then assembled by StringTie (v.2.2.1)77 with the “--rf” parameter. Subsequently, we used the BRAKER2 (v.2.1.5)78 program to train the ab initio prediction model from AUGUSTUS (v.3.4.0)79 (https://github.com/Gaius-Augustus/Augustus) and collected high-quality RNA-seq hints using the Hidden Markov Model (HMM) from GeneMark-ET (v.3.67)80 with the parameter “--nocleanup --softmasking”. To improve gene structure prediction, we performed a homology search using a curated plant protein dataset downloaded from the UniProt Swiss-Prot database (https://www.uniprot.org/downloads). We merged these homologous proteins with previously published peptides from tomato81 and potato4 and eliminated potential redundancy using the CD-HIT-est (v.4.6.8)82 program with default parameters. The MAKER2 (v.3.01.03)83 program was used to combine the homology search, expression evidence and ab initio prediction through two rounds. Finally, we used the Mikado (v.2.3.4)84 program to identify the representative set of transcripts from transcript assemblies, before those were fed to the PASA pipeline (v.2.5.1)85 to update gene structures.

To perform functional gene annotation, we utilized the InterProScan (v.5.34-73.0)86 program, which identifies potential protein domains and Gene Ontology terms based on sequence signatures. We applied the following parameters to the program: “-cli -iprlookup -tsv -gotermd -appl Pfam”. In particular, we extracted protein ___domain information from Pfam by enabling the “-appl Pfam” parameter. For each of the genes, we assigned the functional description of the best hit.

Construction of the potato pangenome

The Minigraph-Cactus pipeline27 and PGGB26 were used to construct a pseudo-phased pangenome with all 61 haplotypes based on the whole-genome alignment (including the DMv6.1 reference genome). For the PGGB, we estimated the divergence of each chromosome with mash distances and confirmed chromosome community with wfmash87 mapping. Then, we used “pggb -s 10000 -n 61 -p 90 -k 47 -P asm20 -O 0.001” to build each chromosome graph. We visualized the 1D layout of the graph and estimated presence and absence ratios to the DMv6.1 reference in 100-kb sliding windows using ODGI88. The small variants and SVs were detected by vg deconstruct from snarls, and we only kept top-level and <1-Mb variants with vcfbub. For the Minigraph-Cactus pipeline, we assigned DMv6.1 as the guide for the paths, and progressively aligned the 60 haplotypes to it. We used the cactus-pangenome script with parameters “--gfa full --gbz full --vcfReference DMv6.1” to generate complete workflows and execute commands. The generated graph fragment assembly (GFA) format graph was used for edge, node and coverage statistics and subgraph generation from a BED input. The VCF output file comprises all graph variations based on the DMv6.1 reference, enabling the calculation of polymorphisms.

Pangenome size and growth

To fully capture the genome diversity of our potato populations, we used Panacus89 to assess pangenome size and growth ratio, which estimates pangenome openness directly by applying the binomial formula. We calculated cumulative bases based on quorum (minimum fraction of haplotypes sharing a graph feature after haplotypes are sequentially added to the growth histogram), and proportion of conserved (≥90% of haplotypes) and variable sequences (<90% of haplotypes) in the pangenome graphs. These statistics were obtained using Panacus hist and growth scripts with parameters “-c bp -l 1,1,1 -q 0,0.1,0.5,0.9 -S”, and we selected different haplotype paths for group-specific statistics.

Phylogeny and synteny analysis

To build haplotype phylogenetic trees at the chromosome level, annotated protein sequences from the 60 newly assembled haplotypes and three published genomes (outgroups: S. wrightii10; S. etuberosum13 and S. lycopersicum90) were aligned to produce all-versus-all alignments using Diamond (v.0.9.21)91. The gene families were inferred using the OrthoFinder (v.2.5.4)92 program, which utilizes the Markov cluster algorithm. Separately for each chromosome pair, all single-copy orthologous protein sequences were merged into a single FASTA file, which was then fed into IQ-TREE (v.2.0.6)93 using the maximum-likelihood method. Each chromosome tree was visualized with the function ggtree (v.3.0.2)94 in R (v.4.2.0). Synteny plots and pangenome annotations were generated by the R package Genespace (v.0.9)95.

Identification of SNPs, indels and SVs

Assembly-based calling was used to detect SVs with a minimum length of 50 bp. Genome sequences were aligned to the DMv6.1 reference genome to produce alignment BAM files using Minimap2 (v.2.17)96 with “-ax asm5”. SyRI (v.1.5)97 was then utilized to call genome-wide variants and generate assemblies-based VCFs. We kept variants of Ins, Del, Inv and Dup from the output of SyRI as the individual SV dataset. The accuracy of the SV dataset was validated by randomly selecting and manually verifying 100 SVs longer than 100 kb using Hi-C maps. Finally, we merged the individual SVs based on 80% overlap using SURVIVOR with the parameters “0.8 1 1 -1 -1 -1” to produce the final population SVs file. For SNPs and indels, we merged the output of assembly-based variation calling from SyRI using BCFtools (v.1.13)98.

Haplotype diversity and GHSL statistic

The length of the haplotype-specific (heterozygous) variants for each diploid accession was calculated by summing all variants present in only one haplotype. In detail, we aligned the two haplotypes of one accession to each other using Minimap2 (v 2.17), identified variants by SyRI (v.1.5), and then calculated the number of haplotype-specific SNPs and the length of haplotype-specific variants (indels and SVs). The GHSL approach considers reference bias and regions potentially affected by haplotype-specific variants (https://github.com/Chenglin20170390/Haplotype-diversity). PCA based on haplotypes (12 chromosomes) was performed using Plink (v.1.90)99 on the VCF file with the parameter “--pca 5”; the results were visualized using the R package ggplot2100. Pairwise Jaccard similarities between haplotypes were estimated using BEDTools (v.2.30.0) with the ‘jaccard’ command.

SV dynamics in potato domestication

To calculate the unfolded site frequency spectrum, we analysed 19 cultivated potato accessions (excluding the two inbred lines) from VCFs. The genome of S. lycopersicum (SL 5.090) was used as an outgroup to infer the ancestral states of genotypes. We distinguish different types of SNPs (synonymous and non-synonymous) and SVs (deletions, insertions, inversions and duplications) and counted the number of haplotypes carrying the derived variants. For each accession, we also calculated the incidences (numbers) of heterozygous and homozygous SVs and the sums of all SVs (twice the number of homozygous SVs plus the number of heterozygous SVs).

Identification of TEMRs

SV breakpoints with additional ±100-bp flanking sequences were used to identify repeat elements (TEs, segmental duplications and TRs) associated with SVs. To detect the potentially causal relationship between TE movement and SV formation, the documented genomic position of a focal SV and overlapping TEs determine the following categories: ‘No TE SV’ for SVs without any TE overlap, ‘Incomplete TE SVs’ for SVs overlapping with a single TE with <95% coverage, ‘Single TE SVs’ for SVs overlapping with a single TE with coverage ≥95%, and ‘Multi TE SVs’ for SVs overlapping with multiple TEs with coverage >95%. Since no single TE SV events were considered, insertions were excluded. To quantify the formation of SVs (deletions, inversions and duplications) by ectopic recombination via TEMRs, SV breakpoints (±100 bp) overlapping with the identical class of TEs were calculated. To assess the distribution of repeats in SV breakpoints, we randomly simulated SVs using the shuffle function of BEDTools with the number and length of SVs set according to the potato SV variation map. Two-sided Student’s t-tests were used to test the significance of the proportions between observed and simulated TE-related SVs. To infer the insertion times of TEMRs, we extracted insertion times from the pass.list of EDTA and kept intact TE sequences overlapping with the breakpoints of SVs. Sequence homology of TEMRs was calculated using 200-bp flanking sequences based on global pairwise sequence alignment from Needle (https://github.com/nanjiangshu/my_needle).

Haplotype recombination in inbred lines

The fifth-generation inbred lines A6-26 and E4-63 were generated from the diploid cultivars E86-69 and PG6359, respectively. To evaluate haplotype recombination events in each inbred line, we utilized two different methods: heterozygosity analysis and haplotype-specific k-mer analysis. Genome-wide heterozygous peaks were identified by aligning the sequences of the haplotypes. The different heterozygous peaks were compared between haplotypes from inbred lines and their founders. The regions inherited from the founders’ haplotypes in the alignment would be in homozygous state. These switch signals from homozygosity to heterozygosity were used to characterize putative recombination events. Haplotype-specific k-mers were calculated by the number of specific k-mers from the inbred line based on non-overlapping 500-kb regions and compared with parental haplotypes using the Meryl program (available at https://github.com/marbl/meryl).

Identification of dSVs and dSNPs

Putative dSVs were identified based on three criteria: (1) ancestral versus derived state of SVs. S. lycopersicum (SL 5.0)90 was used as the outgroup for ancestral state inference, as phylogenetic analyses have placed this species in a clade relatively close to Solanum section Petota13. (2) SVs with frequencies below 0.05 in our 60 haplotypes were considered, as most deleterious variants occur at low population frequencies16,42,43. (3) SVs in coding regions that may damage proteins’ function or SVs in the 92 Solanaceae evolutionarily constrained regions are considered to be deleterious10. We thus identified putative dSVs as being derived-state SVs with low-frequency (<0.05) that overlap with coding regions or evolutionarily constrained regions; we implemented this with the BEDTools ‘overlap’ command to infer putative dSVs. Putative dSNPs were identified by showing GERP values >2.75 and overlapping with evolutionarily constrained regions10.

Estimation of the broken-window effect

The Pearson correlation coefficient was used to examine the correlation between the number of dSNPs and dSVs per chromosome. Focal dSVs located in haplotype 1 (H1) indicates H1 was considered to be the dSV coupling phase and corresponding syntenic regions of H2 were considered to be the dSV repulsion phase. We calculated the distance of the focal dSV to the nearest dSV in coupling phase and to the nearest dSV in repulsion phase (Extended Data Fig. 9). To exclude the potential concurrent influence of dSVs on both coupling and repulsion phases, we segmented chromosomes into fragments based on the midpoint between adjacent dSVs (irrespective whether in coupling or in repulsion phase). Regions that may be affected by the broken-window effect were estimated based on 10-kb non-overlapping sliding windows. For each sliding window, we calculated the number of dSNPs flanking each dSV in the coupling and repulsion phases. To assess dSNP enrichment, for each sliding window we performed a two-tailed Student’s t-test for the numbers of associated dSNPs in coupling and repulsion phase, respectively (regions significantly affected by the broken-window effect of dSVs). The depth of reads near the dSV is calculated using the SAMtools (v.1.17) ‘depth’ command. Finally, we employed SafFire (https://github.com/mrvollger/SafFire) to visualize schematic representations of dSVs and dSNPs within haplotypes.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.