PacBio genome assembly of Olea europaea L. subsp. europaea cultivars Frantoio and Leccino

Sebastiani, Luca; Sarfraz, Iqra; Francini, Alessandra; Celii, Mirko; Wing, Rod A.; Zuccolo, Andrea

doi:10.1038/s41597-025-05363-4

Data Descriptor
Open access
Published: 01 July 2025

Scientific Data volume 12, Article number: 1095 (2025) Cite this article

Abstract

We present two high-quality genome assemblies for Olea europaea L. cultivars ‘Frantoio’ and ‘Leccino,’ leveraging PacBio HiFi sequencing to achieve approximately 30× genome coverage for each cultivar. The assemblies span 1.18 Gbp and 1.43 Gbp with contig N50 values of 1.78 Mbp and 45.88 Mbp for ‘Frantoio’ and ‘Leccino,’ respectively. BUSCO analysis revealed high genome completeness (~97.9%), surpassing many earlier Olea europaea assemblies and on par with the most recent one. Repetitive content accounted for ~67.5% of the assembly in ‘Frantoio’ and ~70.8% in ‘Leccino,’ with long terminal repeat (LTR) elements dominating. Notably, a tandem repeat family, Satellite 1, represented ~16.9% and ~8.6% of the ‘Leccino’ and ‘Frantoio’ genomes, respectively. Structural variant (SV) analysis identified cultivar-specific genomic differences, emphasizing the diversity within domesticated olives. This comprehensive analysis provides valuable resources for studying olive genome evolution, domestication, and genetic improvement, underscoring the utility of long-read sequencing for resolving complex genomic features.

Background & Summary

Olive (Olea europaea L. subsp. europaea) is an economically relevant and widely distributed fruit crop in the Mediterranean Basin. The origins of this iconic tree are linked to the beginnings of ancient Mediterranean civilizations more than six millennia ago¹, and its domestication is still debated in the scientific literature². Thanks to the health-related and organoleptic properties of extra virgin olive oil, olive orchards have spread in recent decades to many other warm-temperate regions of the world, such as North and South America, Australia, New Zealand, and South Africa, and even to monsoon areas like China and India.

Despite the environmental, cultural, economic and scientific value of this species, high-quality genome data for relevant cultivars are still scarce. Additionally, the selection of olive varieties using traditional breeding practices is a time-consuming and largely random process, since precise molecular information on gene location and structure is largely missing. Today, third-generation sequencing techniques³, which generate very long reads, allow the high-quality, complete assembly of complex genomes. These genome assemblies in turn enable a deeper understanding of genome structure and provide foundational data sets for functional genomics, genetic engineering and molecular breeding. All these aspects are particularly relevant in woody crops like olive, where genome data are missing or scarce or, when present, are often of lower quality than those of other plants and crops. Additionally, olive has a complex mid-size genome characterized by high heterozygosity and high repeat content.

In olive, the first research aimed at obtaining deeper genomic information used Sanger and 454 pyrosequencing technologies and targeted the transcriptome⁴. Using this approach, 2 million reads were obtained from 12 cDNA libraries derived from several olive cultivars and progenies. These libraries were prepared from fruit at different developmental stages, vegetative organs (stems, leaves, roots) and buds. In 2016, Cruz et al.⁵ used a blend of fosmid and whole-genome shotgun libraries and Illumina sequencing technology to sequence the genome of a single 1200-year-old Mediterranean olive tree (Olea europaea L. subsp. europaea var. europaea cultivar ‘Farga’). This genome assembly has a total length of 1.31 Gb, which corresponds to 95% of the estimated 1.38 Gb olive genome size, and 56,349 unique protein-coding genes were predicted. This first draft genome of Olea europaea was a milestone for the study of the evolution and domestication of olive and gave new insight into the genetic bases of key phenotypic traits relevant to agronomic and stress-tolerance characteristics⁶. More recently, the olive cultivar Arbequina, suitable for mechanized harvesting and dense planting, was sequenced using Oxford Nanopore third-generation sequencing⁷. The authors assembled 1.1 Gb of sequence organized in 23 pseudochromosomes and predicted 53,518 protein-coding genes. The greater contiguity of this genome assembly allowed the identification of 202 genes belonging to the oleuropein biosynthesis pathway, twice the number identified from previous genomic data. Most recently, Lv et al.⁸ provided a gapless genome assembly for O. europaea cultivar ‘Leccino’ and exploited that resource, together with transcriptomic and metabolomic data, to unveil a pivotal regulatory mechanism in oil biosynthesis. This evidence once more underlines the advantages provided by high-quality genome assemblies for genomics studies.

The accessibility of high-quality genome sequences today provides an unprecedented opportunity to compare cultivars, enabling the identification of genetic variations underlying traits such as yield, disease resistance, and environmental adaptation. Building on this potential, we targeted the high-quality genome assembly of two important olive cultivars, ‘Leccino’ and ‘Frantoio’, using PacBio HiFi long reads with 27× and 29× coverage, respectively, along with their baseline gene and repeat annotation, including transposable elements.

Methods

Sampling and genome sequencing

Young leaves were sampled from 1-year-old Olea europaea L. cultivar ‘Frantoio’ and ‘Leccino’ plants (supplied by Società Pesciatina d’Olivicoltura and certified according to the Community Agricultural Conformity), preserved in liquid nitrogen and stored at −80 °C until subsequent analysis. High molecular weight (HMW) DNA was extracted using a modified CTAB protocol⁹. The quality of the extracted DNA was assessed through pulsed-field gel electrophoresis (CHEF) on 1% agarose gels to evaluate fragment size and restriction enzyme digestibility. Quantification was performed using Qubit fluorometry (Thermo Fisher Scientific, Waltham, MA). Sequencing was performed on a PacBio Revio System at the Arizona Genomics Institute (Tucson, AZ), using Revio SMRT cells¹⁰. Sequencing generated 37.5 Gbp of HiFi reads for ‘Frantoio’ and 38.7 Gbp for ‘Leccino’, with read N50 values of 5.9 kbp and 16.3 kbp, respectively. These data provided approximately 30× genome coverage for both cultivars, supporting robust genome assemblies.
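
The coverage figures quoted for the two cultivars depend on which genome size is used as the denominator, and the article does not state which one was chosen. The short Python sketch below is an illustration only, not the authors' calculation: it divides the reported HiFi yield by both the K-mer-based size estimates and the assembly sizes reported elsewhere in this article. The function and variable names are hypothetical.

```python
def coverage(total_bases_gbp: float, genome_size_gbp: float) -> float:
    """Sequencing depth = total read bases / genome size (same units)."""
    return total_bases_gbp / genome_size_gbp


# HiFi yield (Gbp) reported in this section.
hifi_yield = {"Frantoio": 37.5, "Leccino": 38.7}
# Two candidate denominators reported in the article (Gbp); which one
# the authors used for their coverage values is an assumption here.
kmer_size = {"Frantoio": 1.29, "Leccino": 1.31}
assembly_size = {"Frantoio": 1.18, "Leccino": 1.43}

for cultivar, bases in hifi_yield.items():
    by_kmer = coverage(bases, kmer_size[cultivar])
    by_assembly = coverage(bases, assembly_size[cultivar])
    print(f"{cultivar}: {by_kmer:.1f}x (K-mer size), "
          f"{by_assembly:.1f}x (assembly size)")
```

Both denominators give values in the range of roughly 27× to 32×, consistent with the "approximately 30×" statement.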

Genome size and heterozygosity estimation

K-mer analysis was performed using Jellyfish v2.3.0¹¹ with a K-mer length of 31. The resulting K-mer frequency distributions were processed using GenomeScope 2.0¹². The K-mer frequency distributions showed two distinct peaks (Fig. 1): the first peak, at approximately 12×–13.3× coverage, corresponded to heterozygous regions, while the second peak, at around 24×–26× coverage, reflected homozygous regions. In ‘Frantoio’, the heterozygosity rate (ab) was estimated at 2.2%, with a duplication rate (dup) of 0.0914; in ‘Leccino’ these values were 1.85% and 0.153, respectively. These heterozygosity values are higher than the 1.09% reported for Arbequina⁷, suggesting greater diversity in ‘Frantoio’ and ‘Leccino’. The K-mer analysis also indicated high repetitive content, with unique sequences accounting for 51.6% and 52.2% of the ‘Frantoio’ and ‘Leccino’ genomes, respectively. The substantial repetitive content of both genomes emphasizes the need for PacBio HiFi sequencing to resolve complex regions and achieve contiguous, accurate assemblies.
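
To make the idea of a K-mer frequency distribution concrete, the following toy Python sketch counts K-mers in a handful of reads and builds the multiplicity histogram that tools such as Jellyfish produce at scale. It is a simplified illustration, not the Jellyfish or GenomeScope algorithm: it does not collapse reverse complements, and the genome size function is the naive "total K-mers divided by homozygous peak depth" estimate rather than GenomeScope's fitted model. All names are hypothetical.

```python
from collections import Counter


def kmer_spectrum(reads, k):
    """Return {multiplicity: number of distinct k-mers with that count}."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return Counter(counts.values())


def naive_genome_size(spectrum, homozygous_peak):
    """Total k-mer occurrences divided by the homozygous peak depth.

    A rough estimate only; it ignores sequencing errors and repeats.
    """
    total = sum(mult * n for mult, n in spectrum.items())
    return total / homozygous_peak


spectrum = kmer_spectrum(["ACGTACGT"], k=4)
print(dict(spectrum))
```

In a heterozygous diploid genome, K-mers present on only one haplotype are sampled at about half the depth of K-mers shared by both, which is why the heterozygous peak sits at roughly half the coverage of the homozygous peak.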

Genome assembly and scaffolding

PacBio HiFi reads were assembled using hifiasm/v0.19.8¹³ with default parameters, and redundant haplotigs were removed using “purge haplotigs”¹⁴. The contigs contained in the primary assembly were scaffolded using RagTag/v2.1.0 as described by Alonge et al.¹⁵. For the whole-genome alignment, the built-in Minimap2 aligner was used¹⁶. The genome assembly for ‘Frantoio’ was around 1.18 Gbp with 1,726 contigs and an N50 of 1.78 Mbp, whereas that for ‘Leccino’ was around 1.43 Gbp with 103 contigs and an N50 of 45.86 Mbp (Table 1).
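
For readers unfamiliar with the N50 statistic quoted above, the sketch below shows the standard definition in Python: the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the assembly. The function name and the example lengths are illustrative and are not taken from the assemblies.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Stop at the first contig that brings the running sum to >= 50%.
        if running * 2 >= total:
            return length
    raise ValueError("no contig lengths given")


print(n50([2, 3, 4, 5, 6]))
```

A higher N50 with fewer contigs, as in ‘Leccino’ compared with ‘Frantoio’, indicates a more contiguous assembly.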

Table 1 Main statistics of draft genome assembly for Olea europaea cultivars ‘Frantoio’ and ‘Leccino’.

The sizes of our genome assemblies compared well with those estimated by the K-mer analysis (~1.29 Gbp for ‘Frantoio’ and ~1.31 Gbp for ‘Leccino’) and aligned closely with those of previously published olive genome assemblies such as Olea europaea L. subsp. europaea cv. ‘Farga’ (~1.31 Gb)⁵, Arbequina (~1.3 Gb)⁷, and the recently released ‘Leccino’ (~1.28 Gb)⁸. These estimates confirm that the genome of cultivated olive is smaller than that of wild oleaster (Olea europaea subsp. sylvestris), which is approximately 1.48 Gb⁷.

The scaffolding of the ‘Frantoio’ and ‘Leccino’ assemblies was carried out in a stepwise approach using multiple reference genomes to assess assembly quality and accuracy. These references include the wild olive (Olea europaea subsp. sylvestris RefSeq assembly GCF_002742605.1, O_europea_v1), the cultivated olive ‘Farga’ (Olea europaea subsp. europaea genome assembly OLEA9, 2020) and the recently published ‘Leccino’ genome (Olea europaea subsp. europaea genome assembly GCA_902713445.1)⁸.

In the first step, Olea europaea subsp. sylvestris was used as a reference. Scaffolding was performed using RagTag scaffold v2.1.0¹⁵. The scaffold coverage achieved was 86.02% for ‘Frantoio’ and 84.58% for ‘Leccino’ (Fig. 2). A notable difference was observed in the number of structural gaps, with 605 gaps in ‘Frantoio’ and 13 gaps in ‘Leccino’.

In the second step, the ‘Farga’ genome was used as a reference for scaffolding. Scaffold coverage decreased to 40.92% for ‘Frantoio’ and 29.5% for ‘Leccino’, both lower than the coverage achieved with Olea europaea subsp. sylvestris (Fig. 3). The number of structural gaps was lower in ‘Frantoio’ (504 gaps) and unchanged in ‘Leccino’ (13 gaps).

In the final step, the ‘Frantoio’ and ‘Leccino’ genome assemblies were scaffolded using the published ‘Leccino’ assembly⁸ as a reference. The scaffold coverage was 82.43% for ‘Frantoio’ and 93.37% for ‘Leccino’ (this study), with 1,208 structural gaps in ‘Frantoio’ and only 33 gaps in ‘Leccino’ (this study) (Table 2 and Fig. 4).

Table 2 Scaffolding results of ‘Frantoio’ and ‘Leccino’ assemblies using ‘Farga’, Olea europaea subsp. sylvestris and the published ‘Leccino’ assembly as references.

The final scaffold of ‘Frantoio’ displays a large gap on Chr01 (Fig. 4b). However, analysis of the alignment of the unplaced contigs against the reference shows that contig ptg000955_1 has homology to a fraction of the gap region (Fig. 5a), and we therefore decided to incorporate this contig into the Chr01 scaffold. The AGP file generated by RagTag was modified by inserting the contig ptg000955_1 after ptg0031901_1 on the + strand of Chr01, which shifted all the subsequent contigs of that chromosome. The edited scaffold was then re-scaffolded against the reference to verify the partial gap filling (Fig. 5b).
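
The kind of AGP edit described above can be sketched in Python as follows. This is a hedged illustration and not the authors' procedure: the function name is hypothetical, the inserted gap row (type `U`, 100 bp, `scaffold yes align_genus`) is an assumption about how RagTag-style gap lines look, and whether the authors added a gap before the inserted contig is not stated. The sketch inserts a gap plus the new contig after a named component and then recomputes the object coordinates and part numbers of the edited scaffold, which is what "shifted all the subsequent contigs" amounts to.

```python
def insert_contig(agp_rows, scaffold, after_id, new_id, new_len,
                  strand="+", gap_len=100):
    """Insert a gap and a new contig after `after_id` within `scaffold`.

    `agp_rows` is a list of 9-column AGP rows (lists of strings).
    Rows belonging to other scaffolds are copied unchanged.
    """
    out, pos, part = [], 0, 0
    for row in agp_rows:
        if row[0] != scaffold:
            out.append(list(row))
            continue
        is_gap = row[4] in ("N", "U")
        # Gap rows store their length in column 6; component rows store
        # the component start/end in columns 7 and 8.
        length = int(row[5]) if is_gap else int(row[7]) - int(row[6]) + 1
        pending = [(row, length)]
        if not is_gap and row[5] == after_id:
            # Assumed gap layout; adjust to match the real AGP file.
            gap = [scaffold, "", "", "", "U", str(gap_len),
                   "scaffold", "yes", "align_genus"]
            new = [scaffold, "", "", "", "W", new_id,
                   "1", str(new_len), strand]
            pending += [(gap, gap_len), (new, new_len)]
        for r, n in pending:
            part += 1
            out.append([r[0], str(pos + 1), str(pos + n), str(part)]
                       + list(r[4:]))
            pos += n
    return out


# Hypothetical three-row scaffold: contig, gap, contig.
rows = [
    ["Chr01", "1", "1000", "1", "W", "ctgA", "1", "1000", "+"],
    ["Chr01", "1001", "1100", "2", "U", "100", "scaffold", "yes", "align_genus"],
    ["Chr01", "1101", "1600", "3", "W", "ctgB", "1", "500", "-"],
]
for r in insert_contig(rows, "Chr01", "ctgA", "ctgX", 300):
    print("\t".join(r))
```

After such an edit, the scaffold FASTA has to be regenerated from the modified AGP file so that sequence and coordinates stay in sync.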

Benchmarking universal single copy orthologs (BUSCO)

BUSCO/v5.7.1¹⁷ with the eudicotyledons_odb10 database¹⁸, which comprises 1,614 orthologous genes, was used to assess the completeness of the genome assemblies by calculating the percentage of single-copy, duplicated, fragmented, and missing genes (Table 3).

Table 3 BUSCO main statistics as percentage (%) of the total (n = 1,614) BUSCO groups searched in Olea europaea L. genome cultivars ‘Frantoio’ and ‘Leccino’.

The numbers of complete BUSCO genes were almost equal (Fig. 6): 1,580 for ‘Frantoio’ and 1,581 for ‘Leccino’, corresponding to approximately 97.9% of the entire BUSCO gene set.

These results are higher than the values reported in reference studies⁵,¹⁹ for ‘Farga’ (92.99%), ‘Arbequina’ (92.87%) and Olea europaea subsp. sylvestris (85.50%), and approach the value (99.93%) obtained by Lv et al.⁸ for the cultivar ‘Leccino’. Single-copy orthologs accounted for 83.1% (1,342 copies) in ‘Frantoio’ and 82.9% (1,338 copies) in ‘Leccino’. Duplicated orthologs accounted for 14.7% (238 copies) in ‘Frantoio’ and 15.1% (243 copies) in ‘Leccino’ of the total BUSCO groups searched. These values are comparable to the duplication rates reported for ‘Arbequina’ (20.38%) and ‘Farga’ (18.15%) and are in good agreement with the 12.8% of duplicated genes reported for the cultivar ‘Leccino’ by Lv et al.⁸. In contrast, the wild relative Olea europaea subsp. sylvestris exhibited a much higher duplication rate of 37.98%⁷.

Fragmented orthologs accounted for 1.4% (22 copies) in both cultivars, while missing copies were limited to 0.8% (12 copies) in ‘Frantoio’ and 0.6% (11 copies) in ‘Leccino’. These values are lower than those observed in other accessions; for example, 2.42% fragmentation is reported in ‘Arbequina’, while Olea europaea subsp. sylvestris exhibited 6.69% fragmentation and 7.81% missing BUSCOs⁷. This likely reflects the high contiguity and accuracy achieved in these assemblies, which reduces the incidence of fragmented and missing gene copies and enhances the completeness of the genome. Altogether, the BUSCO statistics attest to the high completeness and contiguity achieved in our genome assemblies.
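
The BUSCO categories above are related by simple arithmetic: complete equals single-copy plus duplicated, and the four categories sum to the number of groups searched. The Python sketch below recomputes percentages from the copy counts given in the text, as a consistency check rather than a reproduction of BUSCO's own report. The function name is hypothetical, and values computed this way may differ in the last decimal from the published figures depending on how rounding was applied.

```python
def busco_summary(single, duplicated, fragmented, missing):
    """Return (total groups, {category: percentage rounded to 0.1})."""
    total = single + duplicated + fragmented + missing
    counts = {
        "complete": single + duplicated,
        "single": single,
        "duplicated": duplicated,
        "fragmented": fragmented,
        "missing": missing,
    }
    return total, {k: round(100 * v / total, 1) for k, v in counts.items()}


# Copy counts as stated in the text above.
for cultivar, counts in {
    "Frantoio": (1342, 238, 22, 12),
    "Leccino": (1338, 243, 22, 11),
}.items():
    total, pct = busco_summary(*counts)
    print(cultivar, total, pct)
```

For both cultivars the four categories add up to the 1,614 groups searched, which confirms that the reported counts are internally consistent.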

Repeats and transposable elements identification

The Extensive de novo TE Annotator (EDTA)/v2.1.0 was used to generate a transposable element (TE) library²⁰. All Helitron predictions were removed from the TE library because their identification is imprecise and prone to false positives²⁰. RepeatMasker/v4.1.4²¹ was then run with default settings to mask the repetitive sequences in the genome assemblies.

The TE content of the Olea europaea L. assemblies was estimated (Table 4) to be 67.47% in ‘Frantoio’ and 70.84% in ‘Leccino’. These values correspond to ~769 Mb and ~1,006 Mb in ‘Frantoio’ and ‘Leccino’, respectively. This difference in the amount of repetitive and TE-related sequences explains most of the difference in genome size between the two cultivars. The TE and repetitive sequence fraction is in roughly the same range as that of previously sequenced olive genomes (66.30% for Olea europaea cv. ‘Leccino’⁸ and 59% for cultivated Olea europaea cv. ‘Picual’²²), but considerably larger than that of Olea europaea subsp. sylvestris, in which 51% of the genome is composed of repetitive DNA¹⁹.

Table 4 Transposable elements classification according to Wicker et al.³⁷ in Olea europaea cultivars ‘Frantoio’ and ‘Leccino’.

In both cultivars the largest TE class is represented by LTR retrotransposons (LTR-RT), accounting for 34.07% and 29.62% of the genome size in ‘Frantoio’ and ‘Leccino’, respectively. In both cultivars, Ty3-gypsy elements outnumber Ty1-copia elements. Altogether, Class 2 DNA TEs represent 15.6% and 14.23% of the genome assembly size in ‘Frantoio’ and ‘Leccino’, respectively. In both cases the most abundant family appears to be that of Mutator-like elements, totalling 8.1% of the ‘Frantoio’ and 8.61% of the ‘Leccino’ assembly. Several elements initially classified as Mutator DNA TEs by EDTA exhibited characteristics inconsistent with this TE family. In particular, these elements appeared to be arranged in tandem over long stretches of DNA, as demonstrated by dot plot analysis (Fig. 7), which is not typical of Mutator DNA TEs, as they are usually scattered throughout the genome. We reanalyzed these sequences using BLASTN searches against the NCBI nr database and the PlantSat database²³, which revealed that they matched known satellite repeats; they were subsequently reclassified.

This was the case for three abundant satellite sequences, which we identified and named Satellite_1, Satellite_2 and Satellite_3 (Table 5).

Table 5 The sequences of the three most abundant satellite repeats identified in ‘Frantoio’ and ‘Leccino’ genomes.

Satellite_1 is an 80 bp long minisatellite that represents approximately 8.62% and 16.90% of the ‘Frantoio’ and ‘Leccino’ genome assemblies, respectively. This corresponds to about 102 Mbp in the ‘Frantoio’ and 242 Mbp in the ‘Leccino’ genome assembly. The copy number of Satellite_1 is 1.28 million in ‘Frantoio’ and 3.03 million in ‘Leccino’. Satellite_2 is a tandem repeat characterized by a 141 bp long monomer, which covers approximately 4.65% of the ‘Frantoio’ genome, corresponding to 55,565,954 bp, and 5.78% of the ‘Leccino’ genome, corresponding to 82,789,861 bp. The number of Satellite_2 copies is estimated to be 394,000 and 587,000 in ‘Frantoio’ and ‘Leccino’, respectively. Finally, Satellite_3 has a 107 bp long monomer. It covers 41.01 Mbp and 61.74 Mbp of the ‘Frantoio’ and ‘Leccino’ genome assemblies, respectively. The estimated copy number is ~383,000 in ‘Frantoio’ and ~604,000 in ‘Leccino’ (Table 4). This evidence confirms the finding that a large portion of the olive genome is composed of a few families of tandemly arranged repeats²⁴.
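
The copy numbers quoted above follow from dividing the total span occupied by a satellite by its monomer length. The Python sketch below illustrates this for Satellite_1 and Satellite_2, using only figures stated in the text; the function name is hypothetical, and it assumes that copy number was obtained by this simple division, which the article does not state explicitly.

```python
def copy_number(span_bp: int, monomer_bp: int) -> int:
    """Approximate number of monomer copies in a tandem array span."""
    return span_bp // monomer_bp


# (span in bp, monomer length in bp), taken from the text above.
satellites = {
    ("Satellite_1", "Frantoio"): (102_000_000, 80),
    ("Satellite_1", "Leccino"): (242_000_000, 80),
    ("Satellite_2", "Frantoio"): (55_565_954, 141),
    ("Satellite_2", "Leccino"): (82_789_861, 141),
}

for (name, cultivar), (span, monomer) in satellites.items():
    print(f"{name} {cultivar}: ~{copy_number(span, monomer):,} copies")
```

The Satellite_1 spans are rounded to the nearest Mbp in the text, so the resulting copy numbers are correspondingly approximate.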

Gene prediction and functional annotation

Gene prediction was carried out using Augustus²⁵ as implemented in the OmicsBox suite²⁶. The genome assemblies, soft-masked for repeats, were used as the primary input, and the Arabidopsis thaliana gene structure was used as the model. Functional annotation was performed by comparing the predicted genes to the nr division of GenBank using Diamond blast/v2.1.8²⁷ and analyzing them with InterProScan/v5.61–93.0²⁸. The parameters used for Diamond blast were Blast Mode = blastp, Sensitivity Mode = Standard, Database = NR, Blast e-value = 1.0E-3. For InterProScan we used the Cloud InterProScan with up-to-date EMBL-EBI InterPro data, including CDD, HMM-Pfam and HMMPIR models.

Gene prediction found 59,777 genes in ‘Frantoio’. Of these, 47,201 were considered highly reliable (posterior probability > 0.4), and 37,061 of those were successfully functionally annotated. In ‘Leccino’, 67,103 genes were found; of the 53,302 considered highly reliable (posterior probability > 0.4), 37,606 were functionally annotated (Table 6).

Table 6 Functional annotation of predicted genes for ‘Frantoio’ and ‘Leccino’.

The discrepancy in the number of genes predicted in the two cultivars likely reflects the greater fragmentation of the ‘Frantoio’ genome assembly. These values are comparable with those of Cruz et al.⁵, who found a set of 56,349 protein-coding genes, with 89,982 transcripts encoding 79,910 unique protein products, and with those of the recently published ‘Leccino’ genome assembly, in which 70,138 protein-coding genes were identified⁸. In the Arbequina⁷ genome assembly, 53,518 protein-coding genes were predicted, and the authors successfully annotated 50,969 genes using the GO, Kyoto Encyclopedia of Genes and Genomes (KEGG), Eukaryotic Orthologous Groups (KOG), TrEMBL, and Nonredundant (Nr) databases. Overall, gene predictions and annotations showed that in Olea europaea L. subsp. europaea the number of protein-coding genes ranges from 53,518 to 67,103, with the highest value in ‘Leccino’ and the lowest in ‘Arbequina’.

Structural variant analysis

The ‘Frantoio’ genome assembly was aligned to the ‘Leccino’ assembly using minimap2/v2.24¹⁶ with default parameters. The resulting SAM file was sorted and indexed using samtools/v1.16.1²⁹. Subsequently, SVIM-asm was used to detect SVs, including insertions, inversions, deletions, duplications, interspersed duplications and tandem duplications³⁰.

The total numbers of deletions, insertions and inversions in the Olea europaea genome of cultivar ‘Leccino’ in comparison to cultivar ‘Frantoio’ are presented in Fig. 8. Deletions and insertions are very similar in number (22,469 deletions and 21,218 insertions), while the number of inversions is very low (n = 33). More than 85% of the insertions and deletions exhibit high similarity to TEs or other repetitive sequences. The enrichment of TEs in structural variants, compared to the entire genome, highlights the contribution of these elements to genome variation. According to these data, the two cultivars ‘Frantoio’ and ‘Leccino’ exhibit a considerable amount of genetic variation, consistent with previous findings obtained using Simple Sequence Repeat (SSR) markers³¹.
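
Counts of SV classes like those reported above can be tallied from a structural variant VCF file. The Python sketch below is a minimal illustration, assuming (as is conventional for SV callers, and to our understanding for SVIM-asm) that each record carries an `SVTYPE=` entry in the INFO column. The function name and the example records are hypothetical and do not come from the deposited variant data.

```python
from collections import Counter


def count_sv_types(vcf_lines):
    """Count records per SVTYPE value found in the VCF INFO column."""
    counts = Counter()
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 8:
            continue  # malformed record
        for field in cols[7].split(";"):
            if field.startswith("SVTYPE="):
                counts[field.split("=", 1)[1]] += 1
                break
    return counts


example = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "Chr01\t100\tsv1\tN\t<DEL>\t.\tPASS\tSVTYPE=DEL;END=200;SVLEN=-100",
    "Chr01\t900\tsv2\tN\t<INS>\t.\tPASS\tSVTYPE=INS;END=900;SVLEN=250",
    "Chr02\t50\tsv3\tN\t<DEL>\t.\tPASS\tSVTYPE=DEL;END=400;SVLEN=-350",
]
print(count_sv_types(example))
```

In practice the lines would be read from the VCF file written by the SV caller, for example by iterating over an open file handle.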

In conclusion, the ‘Frantoio’ and ‘Leccino’ assemblies provided in this study could be a valuable dataset for studying cultivar differentiation in traits that are useful for olive genetic improvement. Together with other published genomes, these data will enable further understanding of olive evolution and domestication processes.

Data Records

All the raw sequencing data and the genome assemblies have been submitted to NCBI.

The PacBio HiFi reads of Olea europaea cv. ‘Leccino’ are available in the NCBI Sequence Read Archive under SRP551229³².

The Olea europaea cv. ‘Leccino’ genome assembly is available in NCBI GenBank under GCA_048165045.1³³.

The PacBio HiFi reads of Olea europaea cv. ‘Frantoio’ are available in the NCBI Sequence Read Archive under SRP551226³⁴.

The Olea europaea cv. ‘Frantoio’ genome assembly is available in NCBI GenBank under GCA_048169195.1³⁵.

The variant data for this study have been deposited in the European Variation Archive (EVA) at EMBL-EBI under accession number ERP172050³⁶.

Technical Validation

The quality and concentration of the extracted DNA were assessed using a NanoDrop spectrophotometer and a FEMTO size profile before genome sequencing.

After the genome assembly was completed, the assembly results were evaluated:

i) The HiFi reads used for the genome assemblies were mapped back onto the assembled genomes. The alignment rate of reads was higher than 99% in both cases, showing high consistency between the reads and the assembled genomes.
ii) The completeness of the genome assemblies was evaluated using BUSCO with the eudicotyledons_odb10 database.

Code availability

No custom code was used to generate or process the data described, except in the final scaffolding step, which used the published ‘Leccino’ assembly as a reference: the AGP file generated by RagTag was modified by inserting the contig ptg000955_1 after ptg0031901_1 on the + strand of Chr01, which shifted all the subsequent contigs of that chromosome. The final scaffold was then constructed using RagTag’s agp2fasta tool. Software and pipelines were executed according to their manuals and protocols. Figures 2a, 3a and 4a were created by plotting the scaffold lengths and gap positions from the respective ragtag.scaffold.agp files against the reference chromosome lengths. Figures 2b, 3b, 4b and 5b were created by plotting the positions of the alignment segments of each scaffold from the respective ragtag.scaffold.asm.paf files. Similarly, Fig. 5a was created by plotting the positions of the alignment segments of the 20 longest unplaced contigs >100 Kb from the respective ragtag.scaffold.asm.paf files. Complete code is available at https://github.com/mirkocelii/GS-viewer.
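
As an illustration of how alignment segments can be read from a PAF file for plotting, the following Python sketch parses the standard leading PAF columns (query name, query length, query start and end, strand, target name, target length, target start and end). It is not the GS-viewer code referenced above; the class and function names are hypothetical, and the assumption that assembly contigs appear as the query and the reference as the target should be checked against the actual files.

```python
from typing import NamedTuple


class Segment(NamedTuple):
    query: str
    q_start: int
    q_end: int
    strand: str
    target: str
    t_start: int
    t_end: int


def parse_paf(lines, min_query_len=0):
    """Return alignment segments whose query is at least min_query_len bp."""
    segments = []
    for line in lines:
        f = line.rstrip("\n").split("\t")
        # A valid PAF record has at least 12 mandatory columns.
        if len(f) < 12 or int(f[1]) < min_query_len:
            continue
        segments.append(
            Segment(f[0], int(f[2]), int(f[3]), f[4],
                    f[5], int(f[7]), int(f[8]))
        )
    return segments


example = [
    "ctg1\t200000\t10\t1000\t+\tChr01\t5000000\t500\t1490\t950\t990\t60",
    "short\t50\t0\t50\t-\tChr01\t5000000\t0\t50\t50\t50\t60",
]
for seg in parse_paf(example, min_query_len=100_000):
    print(seg)
```

Each segment can then be drawn as a line from (t_start, q_start) to (t_end, q_end), with reverse-strand segments drawn in the opposite direction, to obtain a dot-plot-style view.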

References

1. Kaniewski, D. et al. Primary domestication and early uses of the emblematic olive tree: palaeobotanical, historical and molecular evidences from the Middle East. Biological Reviews 87, 885–899 (2012).
2. Besnard, G., Terral, J. F. & Cornille, A. On the origins and domestication of the olive: a review and perspectives. Annals of Botany 121, 385–403 (2018).
3. Rhoads, A. & Au, K. F. PacBio Sequencing and Its Applications. Genomics Proteomics Bioinformatics 13, 278–289 (2015).
4. Muñoz-Mérida, A. et al. De Novo Assembly and Functional Annotation of the Olive (Olea europaea) Transcriptome. DNA Research 20, 93–108 (2013).
5. Cruz, F. et al. Genome sequence of the olive tree, Olea europaea. GigaScience 5, 29 (2016).
6. Sebastiani, L. & Gucci, R. in: The Olive: Botany and Production (ed. Fabbri, A, Baldoni, Caruso T., Famiani F.) Ch 7.1 (CABI Digital Library, 2023).
7. Rao, G. et al. De novo assembly of a new Olea europaea genome accession using nanopore sequencing. Horticulture Research 8, 64 (2021).
8. Lv, J. et al. The gapless genome assembly and multi-omics analyses unveil a pivotal regulatory mechanism of oil biosynthesis in the olive tree. Horticulture Research 11, uhae168 (2024).
9. Porebski, S., Bailey, L. G. & Baum, B. R. Modification of a CTAB DNA extraction protocol for plants containing high polysaccharide and polyphenol components. Plant Molecular Biology Reporter 15, 8–15 (1997).
10. PacBio Revio [WWW Document] URL https://genohub.com/ngs-sequencer/4/pacbio-revio/ (accessed 11.14.24) (2024).
11. Marçais, G. & Kingsford, C. A fast, lock-free approach for efficient parallel counting of occurrences of k-mers. Bioinformatics 27, 764–770 (2011).
12. Ranallo-Benavidez, T. R., Jaron, K. S. & Schatz, M. C. GenomeScope 2.0 and Smudgeplot for reference-free profiling of polyploid genomes. Nature Communications 11, 1432 (2020).
13. Cheng, H. et al. Haplotype-resolved assembly of diploid genomes without parental data. Nature Biotechnology 40, 1332–1335 (2022).
14. Roach, M. J., Schmidt, S. A. & Borneman, A. R. Purge Haplotigs: allelic contig reassignment for third-gen diploid genome assemblies. BMC Bioinformatics 19, 460 (2018).
15. Alonge, M. et al. Automated assembly scaffolding using RagTag elevates a new tomato system for high-throughput genome editing. Genome Biology 23, 258 (2022).
16. Li, H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 34, 3094 (2018).
17. Simão, F. A., Waterhouse, R. M., Ioannidis, P., Kriventseva, E. V. & Zdobnov, E. M. BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs. Bioinformatics 31, 3210–3212 (2015).
18. Gurevich, A., Saveliev, V., Vyahhi, N. & Tesler, G. QUAST: quality assessment tool for genome assemblies. Bioinformatics 29, 1072–1075 (2013).
19. Unver, T. et al. Genome of wild olive and the evolution of oil biosynthesis. Proceedings of the National Academy of Sciences of the United States of America 114, E9413–E9422 (2017).
20. Ou, S. et al. Benchmarking transposable element annotation methods for creation of a streamlined, comprehensive pipeline. Genome Biology 20, 275 (2019).
21. Tempel, S. In: Mobile Genetic Elements: Protocols and Genomic Applications. (ed. Bigot, Y.) Ch 2 (Humana Press, Totowa, NJ, 2012).
22. Jiménez-Ruiz, J. et al. Transposon activation is a major driver in the genome evolution of cultivated olive trees (Olea europaea L.). Plant Genome 13, e20010 (2020).
23. Macas, J., Mészáros, T. & Nouzová, M. PlantSat: a specialized database for plant satellite repeats. Bioinformatics 18, 28–35 (2002).
24. Barghini, E. et al. The peculiar landscape of repetitive sequences in the olive (Olea europaea L.) genome. Genome Biology and Evolution 6, 776–791 (2014).
25. Stanke, M. et al. AUGUSTUS: ab initio prediction of alternative transcripts. Nucleic Acids Research 34, W435–W439 (2006).
26. OmicsBox – Bioinformatics Made Easy, BioBam Bioinformatics (2019).
27. Buchfink, B., Reuter, K. & Drost, H. G. Sensitive protein alignments at tree-of-life scale using DIAMOND. Nature Methods 18, 366–368 (2021).
28. Jones, P. et al. InterProScan 5: genome-scale protein function classification. Bioinformatics 30, 1236–1240 (2014).
29. Danecek, P. et al. Twelve years of SAMtools and BCFtools. GigaScience 10, giab008 (2021).
30. Heller, D. & Vingron, M. SVIM-asm: structural variant detection from haploid and diploid genome assemblies. Bioinformatics 26, 22–23 (2020).
31. Bracci, T. et al. SSR markers reveal the uniqueness of olive cultivars from the Italian region of Liguria. Scientia Horticulturae 122, 209–215 (2009).
32. NCBI Sequence Read Archive https://identifiers.org/ncbi/insdc.sra:SRP551229 (2024).
33. NCBI GenBank https://identifiers.org/ncbi/insdc.gca:GCA_048165045.1 (2025).
34. NCBI Sequence Read Archive https://identifiers.org/ncbi/insdc.sra:SRP551226 (2024).
35. NCBI GenBank https://identifiers.org/ncbi/insdc.gca:GCA_048169195.1 (2025).
36. EMBL-EBI European Variation Archive https://identifiers.org/ena.embl:ERP172050 (2025).
37. Wicker, T. et al. A unified classification system for eukaryotic transposable elements. Nature Reviews Genetics 8, 973–982 (2007).

Acknowledgements

This study was carried out within the Agritech National Research Center and received funding from the European Union Next-GenerationEU (PIANO NAZIONALE DI RIPRESA E RESILIENZA (PNRR)–MISSIONE 4 COMPONENTE 2, INVESTIMENTO 14–DD 1032 17/06/2022, CN00000022). This manuscript reflects only the authors’ views and opinions; neither the European Union nor the European Commission can be considered responsible for them. The Agrobioscience PhD scholarship of Iqra Sarfraz was funded by Programma Operativo Nazionale Ricerca e Innovazione 2014–2020 (CCI 2014IT16M2OP005), FSE REACT-EU, Azione IV.4 “Dottorati e contratti di ricerca su tematiche dell’innovazione” e Azione IV.5 “Dottorati su tematiche Green”.

Author information

Authors and Affiliations

Institute of Crop Science, Scuola Superiore Sant’Anna, Piazza Martiri della Libertà 33, 56127, Pisa, Italy
Luca Sebastiani, Iqra Sarfraz, Alessandra Francini & Andrea Zuccolo
King Abdullah University of Science & Technology, Thuwal, Saudi Arabia
Mirko Celii, Rod A. Wing & Andrea Zuccolo
Arizona Genomics Institute, School of Plant Sciences, University of Arizona, Tucson, AZ, USA
Rod A. Wing

Contributions

L.S.: Conceptualization, Methodology, Resources, Formal analysis, Data curation, Software, Writing-original draft, Project administration, Funding acquisition, Writing-review & editing. I.S.: Methodology, Investigation, Validation, Formal analysis, Data curation, Software, Visualization, Writing-original draft, Writing-review & editing. A.Z.: Conceptualization, Methodology, Validation, Resources, Formal analysis, Data curation, Software, Visualization, Funding acquisition, Writing-review & editing. A.F.: Conceptualization, Writing-review & editing. M.C.: Methodology, Validation, Data curation, Software, Visualization, Writing-review & editing. R.A.W.: Methodology, Resources, Writing-review & editing.

Corresponding authors

Correspondence to Luca Sebastiani or Iqra Sarfraz.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

About this article

Cite this article

Sebastiani, L., Sarfraz, I., Francini, A. et al. PacBio genome assembly of Olea europaea L. subsp. europaea cultivars Frantoio and Leccino. Sci Data 12, 1095 (2025). https://doi.org/10.1038/s41597-025-05363-4

Received: 20 January 2025
Accepted: 06 June 2025
Published: 01 July 2025
DOI: https://doi.org/10.1038/s41597-025-05363-4