Abstract
We present two high-quality genome assemblies for Olea europaea L. cultivars ‘Frantoio’ and ‘Leccino,’ leveraging PacBio HiFi sequencing to achieve approximately 30× genome coverage for each cultivar. The assemblies span 1.18 Gbp and 1.43 Gbp with contig N50 values of 1.78 Mbp and 45.88 Mbp for ‘Frantoio’ and ‘Leccino,’ respectively. BUSCO analysis revealed high genome completeness (~97.9%), surpassing many earlier Olea europaea assemblies and on par with the most recent one. Repetitive content accounted for ~67.5% in ‘Frantoio’ and ~70.8% in ‘Leccino,’ with long terminal repeats (LTRs) dominating. Notably, a tandem repeat family, Satellite 1, represented ~16.9% and ~8.6% of the ‘Leccino’ and ‘Frantoio’ genomes, respectively. The structural variant (SV) analysis identified cultivar-specific genomic differences, emphasizing the diversity within domesticated olives. This comprehensive analysis provides valuable resources for studying olive genome evolution, domestication, and genetic improvement, underscoring the utility of long-read sequencing for resolving complex genomic features.
Background & Summary
Olive (Olea europaea L. subsp. europaea) is an economically relevant and widely distributed fruit crop in the Mediterranean Basin. The origins of this iconic tree are linked to the beginning of ancient Mediterranean civilizations more than six millennia ago1, and its domestication is still debated in the scientific literature2. Thanks to the health-promoting and organoleptic properties of extra virgin olive oil, in recent decades olive orchards have spread to many other warm-temperate regions of the world, such as North and South America, Australia, New Zealand, and South Africa, and even to monsoon areas like China and India.
Despite the environmental, cultural, economic and scientific value of this species, high-quality genome data for relevant cultivars are still scarce. Additionally, the selection of olive varieties using traditional breeding practices is a time-consuming and largely random process, since precise molecular information on gene location and structure is largely missing. Today, third-generation sequencing techniques3 generating very long reads allow the high-quality, complete assembly of complex genomes. These genome assemblies in turn enable a deeper understanding of genome structure and provide foundational data sets for functional genomics, genetic engineering and molecular breeding. All these aspects are particularly relevant in woody crops like olive, where genome data are missing or scarce or, when present, are often of lower quality than those of other plants and crops. Additionally, olive has a complex mid-size genome characterized by high heterozygosity and high repeat content.
In olive, the first research aimed at obtaining deeper genomic information used Sanger and 454 pyrosequencing technologies and targeted the transcriptome4. Using this approach, 2 million reads from 12 cDNA libraries were obtained from several olive cultivars and progenies. These libraries were prepared from fruit at different developmental stages, vegetative organs (stems, leaves, roots) and buds. In 2016, Cruz et al.5 sequenced the genome of a single 1200-year-old Mediterranean olive tree (Olea europaea L. subsp. europaea var. europaea cultivar ‘Farga’) using a blend of fosmid and whole genome shotgun libraries and Illumina sequencing technology. This genome assembly has a total length of 1.31 Gb, which corresponds to 95% of the estimated 1.38 Gb olive genome size, and about 56,349 unique protein-coding genes were predicted. This first assembled draft genome of Olea europaea was a milestone for the study of the evolution and domestication of olive and gave new insight into the genetic bases of key phenotypic traits relevant to agronomic and stress tolerance characteristics6. More recently, the olive cultivar Arbequina, suitable for mechanized harvesting and dense planting, was sequenced using Oxford Nanopore third-generation sequencing7. The authors assembled 1.1 Gb of sequence organized in 23 pseudochromosomes and predicted 53,518 protein-coding genes. The greater contiguity of this genome assembly allowed the identification of 202 genes belonging to the oleuropein biosynthesis pathway, twice the number identified from previous genomic data. Most recently, Lv et al.8 provided a gapless genome assembly for O. europaea cultivar ‘Leccino’ and exploited that resource, together with transcriptomic and metabolomic data, to unveil a pivotal regulatory mechanism in oil biosynthesis. This evidence once more underlines the advantages that high-quality genome assemblies provide for genomics studies.
The accessibility of high-quality genome sequences today provides an unprecedented opportunity to compare cultivars, enabling the identification of genetic variations underlying traits such as yield, disease resistance, and environmental adaptation. Building on this potential, we targeted the high-quality genome assembly of two important olive cultivars, ‘Leccino’ and ‘Frantoio’, using PacBio HiFi long reads with 27X and 29X coverage, respectively, along with their baseline gene and repeat annotation, including transposable elements.
Methods
Sampling and genome sequencing
Young leaves were sampled from 1-year-old Olea europaea L. cultivar ‘Frantoio’ and ‘Leccino’ plants (supplied by Società Pesciatina d’Olivicoltura and certified according to the Community Agricultural Conformity), preserved in liquid nitrogen and stored at −80 °C until subsequent analysis. High Molecular Weight (HMW) DNA was extracted using a modified CTAB protocol9. The quality of the extracted DNA was assessed through pulsed-field gel electrophoresis (CHEF) on 1% agarose gels to evaluate fragment size and restriction enzyme digestibility. Quantification was performed using Qubit fluorometry (Thermo Fisher Scientific, Waltham, MA). Sequencing was performed using the PacBio Revio System at the Arizona Genomics Institute (Tucson, AZ), on Revio SMRT cells10. Sequencing generated 37.5 Gbp of HiFi reads for ‘Frantoio’ and 38.7 Gbp for ‘Leccino’, with read N50 values of 5.9 kbp and 16.3 kbp, respectively. These data provided approximately 30x genome coverage for both cultivars, supporting robust genome assemblies.
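The coverage figures can be checked with simple arithmetic: depth is approximately the total HiFi yield divided by the genome size. The sketch below is illustrative only; it assumes the k-mer-based genome size estimates reported in this paper (~1.29 Gbp for ‘Frantoio’ and ~1.31 Gbp for ‘Leccino’) as denominators, and the function name is hypothetical. Using the assembly length as denominator instead would give a somewhat different depth.

```python
def coverage(total_bases_gbp: float, genome_size_gbp: float) -> float:
    """Approximate sequencing depth as read yield divided by genome size."""
    if genome_size_gbp <= 0:
        raise ValueError("genome size must be positive")
    return total_bases_gbp / genome_size_gbp


# HiFi yields are taken from the text; genome sizes are the k-mer estimates
# (an assumed choice of denominator, not necessarily the authors' own).
DATASETS = {
    "Frantoio": (37.5, 1.29),
    "Leccino": (38.7, 1.31),
}

for cultivar, (yield_gbp, size_gbp) in DATASETS.items():
    print(f"{cultivar}: {coverage(yield_gbp, size_gbp):.1f}x")
```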
Genome size and heterozygosity estimation
The K-mer analysis was performed using Jellyfish v2.3.011 with the K-mer length set to 31. The resulting K-mer frequency distributions were processed using Genomescope212. The K-mer frequency distribution showed two distinct peaks (Fig. 1): the first peak, at approximately 12x-13.3x coverage, corresponded to heterozygous regions, while the second peak, at around 24x-26x coverage, reflected homozygous regions. In ‘Frantoio’, the heterozygosity rate (ab) was estimated at 2.2%, with a duplication rate (dup) of 0.0914; in ‘Leccino’ these values were 1.85% and 0.153, respectively. These heterozygosity values are higher than the 1.09% reported for Arbequina7, suggesting greater diversity in ‘Frantoio’ and ‘Leccino’. The K-mer analysis also indicated high repetitive content, with only 51.6% and 52.2% of unique sequence in ‘Frantoio’ and ‘Leccino’, respectively. The substantial repetitive content in both genomes emphasizes the need for PacBio HiFi sequencing to resolve complex regions and achieve contiguous, accurate assemblies.
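The principle behind k-mer-based genome size estimation can be illustrated with a deliberately naive calculation: the total number of k-mer occurrences divided by the depth of the homozygous peak. GenomeScope2 fits a mixture model rather than using this shortcut, so the sketch below is not the method used here. The histogram is toy data, not olive data, and the function name and `min_depth` cutoff are assumptions made for illustration.

```python
def estimate_genome_size(histogram: dict, homozygous_peak: int, min_depth: int = 5) -> float:
    """Naive estimate: total k-mer occurrences / homozygous peak depth.

    histogram maps k-mer depth -> number of distinct k-mers seen at that depth.
    Depths below min_depth are treated as sequencing errors and ignored.
    """
    total = sum(depth * count for depth, count in histogram.items() if depth >= min_depth)
    return total / homozygous_peak


# Toy histogram: error k-mers at depth 1-2, a heterozygous peak at 12x,
# a homozygous peak at 24x and a small repeat shoulder at 48x.
toy = {1: 5000, 2: 800, 12: 100, 24: 300, 48: 20}
print(estimate_genome_size(toy, homozygous_peak=24))
```

Note that the heterozygous peak sits at half the depth of the homozygous peak, which is the pattern described for both cultivars.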
Olea europaea L. subsp. europaea cv. ‘Frantoio’ (a) and ‘Leccino’ (b) genome size estimation (len) using Jellyfish and GenomeScope2 k-mer30 displaying homozygosity (aa), heterozygosity (ab), mean k-mer coverage for heterozygous bases (kcov), read error rate (err), the average rate of read duplications (dup), k-mer size used on the run (k), and ploidy (p).
Genome assembly and scaffolding
PacBio HiFi reads were assembled using hifiasm/v0.19.813 with default parameters and redundant haplotigs were removed using Purge Haplotigs14. The contigs contained in the primary assembly were scaffolded using the tool RagTag/v2.1.0 as described by Alonge et al.15. For the whole-genome alignment, the built-in Minimap2 aligner was used16. The genome assembly for ‘Frantoio’ was around 1.18 Gbp with 1,726 contigs and an N50 of 1.78 Mbp, whereas ‘Leccino’ was around 1.43 Gbp with 103 contigs and an N50 of 45.86 Mbp (Table 1).
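For readers unfamiliar with the N50 statistic quoted above, the following minimal sketch shows how it is defined: contigs are sorted from longest to shortest and the N50 is the length of the contig at which the running total first reaches half of the assembly size. The contig lengths are toy values, and the function is an illustration rather than the tool used to produce Table 1.

```python
def n50(lengths):
    """Return the N50: the contig length at which the cumulative sum of
    lengths, taken from longest to shortest, reaches half the total."""
    if not lengths:
        raise ValueError("no contig lengths given")
    half = sum(lengths) / 2
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half:
            return length


toy_contigs = [50, 40, 30, 20, 10]  # toy lengths, arbitrary units
print(n50(toy_contigs))
```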
The genome sizes of our assemblies compared well with those estimated by the K-mer analysis (~1.29 Gbp for ‘Frantoio’ and ~1.31 Gbp for ‘Leccino’) and align closely with those of previously published olive genome assemblies such as Olea europaea L. subsp. europaea cv. ‘Farga’ (~1.31 Gb)5, Arbequina (~1.3 Gb)7, and the recently released ‘Leccino’ (~1.28 Gb)8. These estimates confirm that the genome of cultivated olive is smaller than that of wild oleaster (Olea europaea subsp. sylvestris), which is approximately 1.48 Gb7.
The scaffolding of the ‘Frantoio’ and ‘Leccino’ assemblies was carried out in a stepwise approach using multiple reference genomes to assess assembly quality and accuracy. These references include the wild olive (Olea europaea subsp. sylvestris RefSeq assembly GCF_002742605.1, O_europea_v1), the cultivated olive ‘Farga’ (Olea europaea subsp. europaea genome assembly OLEA9, 2020) and the recently published ‘Leccino’ genome (Olea europaea subsp. europaea genome assembly GCA_902713445.1)8.
In the first step, Olea europaea subsp. sylvestris was used as a reference. Scaffolding was performed using RagTag scaffold 2.1.015. The scaffold coverage achieved was 86.02% for ‘Frantoio’ and 84.58% for ‘Leccino’ (Fig. 2). The number of structural gaps differed markedly, with 605 gaps in ‘Frantoio’ and 13 gaps in ‘Leccino’.
Comparison of ‘Frantoio’ and ‘Leccino’ scaffolding using Olea europaea subsp. sylvestris as the reference genome. (a) Scaffold length vs reference length, showing scaffold sizes and gaps for each chromosome. (b) Scaffold coverage on the reference genome, illustrating the percentage of reference genome covered by scaffolds from each cultivar. Only Scaffolds >2 Mb are shown.
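One plausible way to derive a reference coverage value from a PAF alignment file is sketched below: target intervals are merged per reference sequence and the covered bases are divided by the total reference length. This is an assumption about the calculation, not a description of how the percentages in this study were produced; the function name and the toy PAF records are hypothetical. Only the standard PAF columns (target name, length, start, end in columns 6 to 9) are used.

```python
from collections import defaultdict


def reference_coverage(paf_lines):
    """Fraction of reference bases covered by at least one alignment.

    Only reference sequences that appear in the PAF input contribute
    to the denominator.
    """
    intervals = defaultdict(list)
    ref_len = {}
    for line in paf_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:  # skip malformed or empty lines
            continue
        tname, tlen = fields[5], int(fields[6])
        tstart, tend = int(fields[7]), int(fields[8])
        ref_len[tname] = tlen
        intervals[tname].append((tstart, tend))

    covered = 0
    for spans in intervals.values():
        spans.sort()
        cur_start, cur_end = spans[0]
        for start, end in spans[1:]:
            if start <= cur_end:  # overlapping or adjacent: extend the current block
                cur_end = max(cur_end, end)
            else:  # disjoint: close the block and start a new one
                covered += cur_end - cur_start
                cur_start, cur_end = start, end
        covered += cur_end - cur_start

    total = sum(ref_len.values())
    return covered / total if total else 0.0


# Toy PAF records: two overlapping alignments on a 1,000 bp reference.
toy_paf = [
    "ctg1\t500\t0\t500\t+\tChr01\t1000\t0\t500\t500\t500\t60",
    "ctg2\t400\t0\t400\t+\tChr01\t1000\t400\t800\t400\t400\t60",
]
print(f"{100 * reference_coverage(toy_paf):.2f}%")
```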
In the second step, the ‘Farga’ genome was used as a reference for scaffolding. Scaffold coverage decreased to 40.92% for ‘Frantoio’ and 29.5% for ‘Leccino’, both lower than the coverage achieved with Olea europaea subsp. sylvestris (Fig. 3). The number of structural gaps was 504 in ‘Frantoio’, lower than in the first step, and 13 in ‘Leccino’.
Comparison of ‘Frantoio’ and ‘Leccino’ scaffolding using ‘Farga’ (Olea europaea subsp. europaea) as the reference genome. (a) Scaffold length vs reference length, comparing scaffold sizes and gaps for each chromosome. (b) Scaffold coverage on the reference genome, illustrating the percentage of reference genome covered by scaffolds from each cultivar. Only Scaffolds >2 Mb are shown.
In the final step, the ‘Frantoio’ and ‘Leccino’ genome assemblies were scaffolded using the published ‘Leccino’ genome8 as a reference. The scaffold coverage was 82.43% for ‘Frantoio’ and 93.37% for ‘Leccino’ (this study), with 1,208 structural gaps in ‘Frantoio’ and only 33 gaps in ‘Leccino’ (this study) (Table 2 and Fig. 4).
Comparison of ‘Frantoio’ and ‘Leccino’ (this study data) scaffolding using published ‘Leccino’ as the reference genome8. (a) Scaffold length vs reference length, comparing scaffold sizes and gaps for each chromosome. (b) Scaffold coverage on the reference genome, illustrating the percentage of reference genome covered by scaffolds from each cultivar. Only Scaffolds >2 Mb are shown.
The final scaffold of ‘Frantoio’ displays a large gap on Chr01 (Fig. 4b). However, analysis of the alignment of the unplaced contigs against the reference shows that contig ptg000955_1 has homology to a fraction of the gap region (Fig. 5a), and we therefore decided to incorporate this contig into the Chr01 scaffold. The AGP file generated by RagTag was modified by inserting the contig ptg000955_1 after ptg0031901_1 on the + strand of Chr01, which shifted all the subsequent contigs of that chromosome. The edited scaffold was then scaffolded again against the reference to verify the partial gap filling (Fig. 5b).
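The kind of AGP edit described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure actually used: it assumes the new contig is added together with a 100 bp padding gap of unknown size (the handling of gaps in the real edit is not described), it assumes all rows of the edited object are contiguous in the file, and the function name and toy contig names (`ctgA`, `ctgB`, `ctgX`) are hypothetical. Coordinates and part numbers of the edited object are recomputed, which is what shifts all subsequent contigs.

```python
def insert_contig(agp_rows, obj, after_id, new_id, new_len, orient="+", gap_len=100):
    """Insert a contig (preceded by a padding gap) after `after_id` in `obj`.

    agp_rows: list of AGP rows, each a list of 9 string fields.
    Rows of other objects are passed through unchanged.
    """
    out, pos, part, found = [], 1, 1, False
    for row in agp_rows:
        if row[0] != obj:
            out.append(row)
            continue
        pending = [row]
        if row[4] == "W" and row[5] == after_id:
            found = True
            # Assumed padding gap, followed by the newly placed contig.
            pending.append([obj, "", "", "", "U", str(gap_len), "scaffold", "yes", "align_genus"])
            pending.append([obj, "", "", "", "W", new_id, "1", str(new_len), orient])
        for r in pending:
            # Component size: gap length for N/U rows, contig span otherwise.
            size = int(r[5]) if r[4] in ("N", "U") else int(r[7]) - int(r[6]) + 1
            out.append([obj, str(pos), str(pos + size - 1), str(part)] + r[4:])
            pos += size
            part += 1
    if not found:
        raise ValueError(f"{after_id} not found in {obj}")
    return out


# Toy AGP: two contigs separated by a 100 bp gap.
toy_agp = [
    ["Chr01", "1", "1000", "1", "W", "ctgA", "1", "1000", "+"],
    ["Chr01", "1001", "1100", "2", "U", "100", "scaffold", "yes", "align_genus"],
    ["Chr01", "1101", "1600", "3", "W", "ctgB", "1", "500", "+"],
]
for row in insert_contig(toy_agp, "Chr01", "ctgA", "ctgX", 300):
    print("\t".join(row))
```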
Benchmarking universal single copy orthologs (BUSCO)
BUSCO/v5.7.117 with eudicotyledons_odb10 database18, which comprises 1,614 orthologous genes, was used to assess the completeness of the genome assembly, calculating the percentage of single copy, duplicated, fragmented, and missing genes (Table 3).
The numbers of complete BUSCO genes were almost equal (Fig. 6): 1,580 for ‘Frantoio’ and 1,581 for ‘Leccino’, corresponding to approximately 97.9% of the entire BUSCO gene set.
These results are higher than the values reported in the reference studies5,19 for ‘Farga’ (92.99%), ‘Arbequina’ (92.87%) and Olea europaea subsp. sylvestris (85.50%), and closer to the value (99.93%) obtained by Lv et al.8 for the cultivar ‘Leccino’. Single-copy orthologs accounted for 83.1% (1,342 copies) in ‘Frantoio’ and 82.9% (1,338 copies) in ‘Leccino’. Duplicated orthologs accounted for 14.7% (238 copies) in ‘Frantoio’ and 15.1% (243 copies) in ‘Leccino’ of the total BUSCO groups searched. These values are comparable to the duplication rates reported for ‘Arbequina’ (20.38%) and ‘Farga’ (18.15%), and in good agreement with the 12.8% of duplicated genes reported for the cultivar ‘Leccino’ by Lv et al.8. In contrast, the wild relative Olea europaea subsp. sylvestris exhibited a much higher duplication rate of 37.98%7.
Fragmented orthologs accounted for 1.4% (22 copies) in both cultivars, while missing orthologs were limited to 0.8% (12 copies) in ‘Frantoio’ and 0.6% (11 copies) in ‘Leccino’. These values are lower than those observed in other accessions; for example, 2.42% fragmentation is reported in ‘Arbequina’, while Olea europaea subsp. sylvestris exhibited 6.69% fragmentation and 7.81% missing BUSCOs7. This likely reflects the high contiguity and accuracy achieved in these assemblies, which reduce the incidence of fragmented and missing gene copies and enhance the completeness of the genome. Altogether, the BUSCO statistics attest to the high completeness and contiguity of our genome assemblies.
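The BUSCO percentages follow directly from the category counts and the size of the eudicotyledons_odb10 set (1,614 genes). The short sketch below recomputes them from the counts given in the text; the function and variable names are illustrative, and small differences from the published values can arise from rounding.

```python
TOTAL_BUSCOS = 1614  # size of eudicotyledons_odb10, as stated in the text

# Category counts as reported in the text.
counts = {
    "Frantoio": {"single": 1342, "duplicated": 238, "fragmented": 22, "missing": 12},
    "Leccino": {"single": 1338, "duplicated": 243, "fragmented": 22, "missing": 11},
}


def busco_summary(category_counts, total=TOTAL_BUSCOS):
    """Return the percentage per category plus 'complete' (single + duplicated)."""
    if sum(category_counts.values()) != total:
        raise ValueError("category counts do not add up to the BUSCO set size")
    pct = {name: 100 * n / total for name, n in category_counts.items()}
    pct["complete"] = pct["single"] + pct["duplicated"]
    return pct


for cultivar, category_counts in counts.items():
    summary = busco_summary(category_counts)
    print(cultivar, ", ".join(f"{k}={v:.2f}%" for k, v in summary.items()))
```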
Repeats and transposable elements identification
The Extensive de novo TE Annotator (EDTA)/v2.1.0 was used to generate a Transposable Element (TE) library20. All Helitron predictions were removed from the TE library because their identification is imprecise and prone to false positives20. RepeatMasker/v4.1.421 was then run with default settings to mask the repetitive sequences in the genome assemblies.
The TE content of the Olea europaea L. assemblies was estimated (Table 4) to be 67.47% in ‘Frantoio’ and 70.84% in ‘Leccino’. These values correspond to ~769 Mb and ~1,006 Mb in ‘Frantoio’ and ‘Leccino’, respectively. This difference in the amount of repetitive and TE-related sequences explains most of the difference in genome size between the two cultivars. The TE and repetitive sequence fraction is in a similar range to that of previously sequenced olive genomes (66.30% for Olea europaea cv. ‘Leccino’8 and 59% for cultivated Olea europaea cv. ‘Picual’22), but considerably larger than in Olea europaea subsp. sylvestris, in which 51% of the genome is composed of repetitive DNA19.
In both cultivars the largest TE class is represented by LTR-RTs, accounting for 34.07% and 29.62% of the genome size in ‘Frantoio’ and ‘Leccino’, respectively. In both cultivars Ty3-gypsy elements outnumber Ty1-copia elements. Altogether, Class 2 DNA TEs represent 15.6% and 14.23% of the genome assembly size in ‘Frantoio’ and ‘Leccino’, respectively. In both cases the most abundant family appears to be that of Mutator-like elements, totalling 8.1% in the ‘Frantoio’ and 8.61% in the ‘Leccino’ assembly. However, several elements initially classified as Mutator DNA-TEs by EDTA exhibited characteristics inconsistent with this TE family. In particular, these elements appeared to be arranged in tandem over long stretches of DNA, as demonstrated by dot plot analysis (Fig. 7), which is not typical of Mutator DNA-TEs, as they are usually scattered throughout the genome. We reanalyzed these sequences using BLASTN searches against the NCBI nr database and the PlantSat database23, which revealed that they matched known satellite repeats; they were subsequently reclassified.
This was the case of three abundant satellite sequences we identified and named Satellite_1, Satellite_2 and Satellite_3 (Table 5).
Satellite_1 is an 80 bp long minisatellite that represents approximately 8.62% and 16.90% of the genome assemblies of ‘Frantoio’ and ‘Leccino’, respectively. This corresponds to about 102 Mbp in the ‘Frantoio’ and 242 Mbp in the ‘Leccino’ genome assembly. The copy number of Satellite_1 is 1.28 million in ‘Frantoio’ and 3.03 million in ‘Leccino’. Satellite_2 is a tandem repeat characterized by a 141 bp long monomer, which covers approximately 4.65% of the ‘Frantoio’ genome, corresponding to 55,565,954 bp, and 5.78% of the ‘Leccino’ genome, corresponding to 82,789,861 bp. The number of Satellite_2 copies is estimated to be 394,000 and 587,000 in ‘Frantoio’ and ‘Leccino’, respectively. Finally, Satellite_3 has a 107 bp long monomer. It covers 41.01 Mbp and 61.74 Mbp of the ‘Frantoio’ and ‘Leccino’ genome assemblies, respectively. The estimated copy number is ~383,000 in ‘Frantoio’ and ~604,000 in ‘Leccino’ (Table 4). This evidence confirms the finding that a large portion of the olive genome is composed of a few different families of tandemly arranged repeats24.
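The relationship between the annotated span of a tandem repeat and its copy number is a simple division by the monomer length. The sketch below applies it to the Satellite_2 values given in the text; it is an illustration of the arithmetic under the assumption that copies are full-length monomers, and the function name is hypothetical.

```python
def copy_number(total_bp: int, monomer_bp: int) -> int:
    """Estimate tandem-repeat copies as annotated span / monomer length."""
    if monomer_bp <= 0:
        raise ValueError("monomer length must be positive")
    return round(total_bp / monomer_bp)


# Satellite_2: 141 bp monomer, spans in bp as reported in the text.
SATELLITE_2_BP = {"Frantoio": 55_565_954, "Leccino": 82_789_861}

for cultivar, span in SATELLITE_2_BP.items():
    print(f"{cultivar}: ~{copy_number(span, 141):,} copies")
```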
Gene prediction and functional annotation
Gene prediction was carried out using Augustus25 as implemented in the Omicsbox suite26. The genome assemblies, soft-masked for repeats, were used as the primary input and the gene structure model of Arabidopsis thaliana was used. Functional annotation was performed by comparing the predicted genes to the nr division of GenBank using Diamond blast /v2.1.827 and analyzing them with InterProScan/v5.61–93.028. The parameters used for Diamond blast were Blast Mode = blastp, Sensitivity Mode = Standard, Database = NR, Blast e-value = 1.0E−3. For InterProScan we used the Cloud InterProScan with up-to-date EMBL-EBI InterPro data, including the CDD, HMM-Pfam and HMMPIR models.
Gene prediction found 59,777 genes in ‘Frantoio’. Out of the 47,201 of them considered highly reliable (posterior probability > 0.4) 37,061 were successfully functionally annotated. In ‘Leccino’, 67,103 genes were found and out of the 53,302 with high quality (posterior probability > 0.4) 37,606 were functionally annotated (Table 6).
The discrepancy in the number of genes predicted in the two cultivars likely reflects the greater fragmentation of the ‘Frantoio’ genome assembly. These values are comparable with those of Cruz et al.5, who found a set of 56,349 protein-coding genes, with 89,982 transcripts encoding 79,910 unique protein products, and with those of the recently published ‘Leccino’ genome assembly, in which 70,138 protein-coding genes were identified8. In the Arbequina7 genome assembly, 53,518 protein-coding genes were predicted and the authors successfully annotated 50,969 of them using the GO, Kyoto Encyclopedia of Genes and Genomes (KEGG), Eukaryotic Orthologous Groups (KOG), TrEMBL, and Nonredundant (Nr) databases. Overall, gene predictions and annotations showed that in Olea europaea L. subsp. europaea the number of protein-coding genes ranges from 53,518 to 67,103, with the lowest value in ‘Arbequina’ and the highest in ‘Leccino’.
Structural variant analysis
The ‘Frantoio’ genome assembly was aligned to the ‘Leccino’ assembly using minimap2/v2.24 216 under default parameters. The resulting SAM file was sorted and indexed using samtools/v1.16.129. Subsequently, the SVIM-asm tool was used to detect SVs, including insertions, inversions, deletions, duplications, interspersed duplications and tandem duplications30.
The total numbers of deletions, insertions and inversions in the genome of cultivar ‘Leccino’ relative to cultivar ‘Frantoio’ are presented in Fig. 8. Deletions and insertions are very similar in number (22,469 deletions and 21,218 insertions), while the number of inversions is very low (n = 33). More than 85% of the insertions and deletions exhibit high similarity to TEs or other repetitive sequences. The enrichment of TEs in structural variants, compared to the entire genome, highlights the contribution of these elements to genome variation. According to these data, the two cultivars ‘Frantoio’ and ‘Leccino’ exhibit a considerable amount of genetic variation, consistent with previous findings obtained using Simple Sequence Repeat (SSR) markers31.
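Per-type SV counts such as those above can be tallied from a VCF file by reading the `SVTYPE` key of the INFO column, which is the standard VCF convention for structural variants. The sketch below is a minimal, dependency-free illustration on toy records; it is not the script used to produce Fig. 8, and the function name and example records are hypothetical.

```python
from collections import Counter


def count_sv_types(vcf_lines):
    """Count VCF records per SVTYPE value found in the INFO column."""
    counts = Counter()
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 8:
            continue  # not a complete VCF record
        info = dict(item.split("=", 1) for item in fields[7].split(";") if "=" in item)
        counts[info.get("SVTYPE", "UNKNOWN")] += 1
    return counts


# Toy VCF content (not real data from this study).
toy_vcf = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "Chr01\t100\tsv1\tN\t<DEL>\t.\tPASS\tSVTYPE=DEL;END=400;SVLEN=-300",
    "Chr01\t900\tsv2\tN\t<INS>\t.\tPASS\tSVTYPE=INS;END=900;SVLEN=250",
    "Chr02\t50\tsv3\tN\t<DEL>\t.\tPASS\tSVTYPE=DEL;END=120;SVLEN=-70",
]
print(dict(count_sv_types(toy_vcf)))
```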
In conclusion, the assemblies of ‘Frantoio’ and ‘Leccino’ provided in this study are a valuable dataset for studying cultivar differentiation in traits that are useful for olive genetic improvement. Together with other published genomes, these data will enable further understanding of olive evolution and domestication processes.
Data Records
All the raw sequencing data and the genome assemblies have been submitted to NCBI.
PacBio HiFi reads of Olea europaea cv. ‘Leccino’ are available in the NCBI Sequence Read Archive under SRP55122932.
The Olea europaea cv. ‘Leccino’ genome assembly is available in NCBI GenBank under GCA_048165045.133.
PacBio HiFi reads of Olea europaea cv. ‘Frantoio’ are available in the NCBI Sequence Read Archive under SRP55122634.
The Olea europaea cv. ‘Frantoio’ genome assembly is available in NCBI GenBank under GCA_048169195.135.
The variant data for this study have been deposited in the European Variation Archive (EVA) at EMBL-EBI under accession number ERP17205036.
Technical Validation
The quality and concentration of the extracted DNA were assessed using a NanoDrop spectrophotometer and a FEMTO size profile before genome sequencing.
After the genome assembly was completed, the assembly results were evaluated:
i) The HiFi reads used for the genome assemblies were mapped back onto the assembled genomes. The alignment rate of reads was in both cases higher than 99%, showing high consistency between the reads and the assembled genomes.
ii) The completeness of the genome assemblies was evaluated using BUSCO with the eudicotyledons_odb10 database.
Code availability
No custom code was used to generate or process the data described, except in the final scaffolding of ‘Frantoio’ using ‘Leccino’ as a reference: the AGP file generated by RagTag was modified by inserting the contig ptg000955_1 after ptg0031901_1 on the + strand of Chr01, which shifted all the subsequent contigs of that chromosome. The final scaffold was then constructed using RagTag’s agp2fasta tool. Software and pipelines were executed according to their manuals and protocols. Figures 2a, 3a and 4a were created by plotting the scaffold lengths and gap positions from the respective ragtag.scaffold.agp files versus the reference chromosome length. Figures 2b, 3b, 4b and 5b were created by plotting the positions of the alignment segments of each scaffold from the respective ragtag.scaffold.asm.paf files. Similarly, Fig. 5a was created by plotting the positions of the alignment segments of the 20 longest unplaced contigs >100 Kb from the respective ragtag.scaffold.asm.paf files. Complete code is available at https://github.com/mirkocelii/GS-viewer.
References
Kaniewski, D. et al. Primary domestication and early uses of the emblematic olive tree: palaeobotanical, historical and molecular evidences from the Middle East. Biological Reviews 87, 885–899 (2012).
Besnard, G., Terral, J. F. & Cornille, A. On the origins and domestication of the olive: a review and perspectives. Annals of Botany 121, 385–403 (2018).
Rhoads, A. & Au, K. F. PacBio Sequencing and Its Applications. Genomics Proteomics Bioinformatics 13, 278–289 (2015).
Muñoz-Mérida, A. et al. De Novo Assembly and Functional Annotation of the Olive (Olea europaea) Transcriptome. DNA Research 20, 93–108 (2013).
Cruz, F. et al. Genome sequence of the olive tree, Olea europaea. GigaScience 5, 29 (2016).
Sebastiani, L. & Gucci, R. in: The Olive: Botany and Production (ed. Fabbri, A, Baldoni, Caruso T., Famiani F.) Ch 7.1 (CABI Digital Library, 2023).
Rao, G. et al. De novo assembly of a new Olea europaea genome accession using nanopore sequencing. Horticulture Research 8, 64 (2021).
Lv, J. et al. The gapless genome assembly and multi-omics analyses unveil a pivotal regulatory mechanism of oil biosynthesis in the olive tree. Horticulture Research 11, uhae168 (2024).
Porebski, S., Bailey, L. G. & Baum, B. R. Modification of a CTAB DNA extraction protocol for plants containing high polysaccharide and polyphenol components. Plant Molecular Biology Reporter 15, 8–15 (1997).
PacBio Revio [WWW Document] URL https://genohub.com/ngs-sequencer/4/pacbio-revio/ (accessed 11.14.24) (2024).
Marçais, G. & Kingsford, C. A fast, lock-free approach for efficient parallel counting of occurrences of k-mers. Bioinformatics 27, 764–770 (2011).
Ranallo-Benavidez, T. R., Jaron, K. S. & Schatz, M. C. GenomeScope 2.0 and Smudgeplot for reference-free profiling of polyploid genomes. Nature Communications 11, 1432 (2020).
Cheng, H. et al. Haplotype-resolved assembly of diploid genomes without parental data. Nature Biotechnology 40, 1332–1335 (2022).
Roach, M. J., Schmidt, S. A. & Borneman, A. R. Purge Haplotigs: allelic contig reassignment for third-gen diploid genome assemblies. BMC Bioinformatics 19, 460 (2018).
Alonge, M. et al. Automated assembly scaffolding using RagTag elevates a new tomato system for high-throughput genome editing. Genome Biology 23, 258 (2022).
Li, H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 34, 3094 (2018).
Simão, F. A., Waterhouse, R. M., Ioannidis, P., Kriventseva, E. V. & Zdobnov, E. M. BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs. Bioinformatics 31, 3210–3212 (2015).
Gurevich, A., Saveliev, V., Vyahhi, N. & Tesler, G. QUAST: quality assessment tool for genome assemblies. Bioinformatics 29, 1072–1075 (2013).
Unver, T. et al. Genome of wild olive and the evolution of oil biosynthesis. The Proceedings of the National Academy of Sciences of the United States of America 114, E9413–E9422 (2017).
Ou, S. et al. Benchmarking transposable element annotation methods for creation of a streamlined, comprehensive pipeline. Genome Biology 20, 275 (2019).
Tempel, S. In: Mobile Genetic Elements: Protocols and Genomic Applications. (ed. Bigot, Y.) Ch 2 (Humana Press, Totowa, NJ, 2012).
Jiménez-Ruiz, J. et al. Transposon activation is a major driver in the genome evolution of cultivated olive trees (Olea europaea L.). Plant Genome 13, e20010 (2020).
Macas, J., Mészáros, T. & Nouzová, M. PlantSat: a specialized database for plant satellite repeats. Bioinformatics 18, 28–35 (2002).
Barghini, E. et al. The peculiar landscape of repetitive sequences in the olive (Olea europaea L.) genome. Genome Biology and Evolution 6, 776–791 (2014).
Stanke, M. et al. AUGUSTUS: ab initio prediction of alternative transcripts. Nucleic Acids Research 34, W435–W439 (2006).
OmicsBox – Bioinformatics Made Easy, BioBam Bioinformatics (2019).
Buchfink, B., Reuter, K. & Drost, H. G. Sensitive protein alignments at tree-of-life scale using DIAMOND. Nature Methods 18, 366–368 (2021).
Jones, P. et al. InterProScan 5: genome-scale protein function classification. Bioinformatics 30, 1236–1240 (2014).
Danecek, P. et al. Twelve years of SAMtools and BCFtools. GigaScience 10, giab008 (2021).
Heller, D. & Vingron, M. SVIM-asm: structural variant detection from haploid and diploid genome assemblies. Bioinformatics 36, 5519–5521 (2020).
Bracci, T. et al. SSR markers reveal the uniqueness of olive cultivars from the Italian region of Liguria. Scientia Horticulturae 122, 209–215 (2009).
NCBI Sequence Read Archive https://identifiers.org/ncbi/insdc.sra:SRP551229 (2024).
NCBI GenBank https://identifiers.org/ncbi/insdc.gca:GCA_048165045.1 (2025).
NCBI Sequence Read Archive https://identifiers.org/ncbi/insdc.sra:SRP551226 (2024).
NCBI GenBank https://identifiers.org/ncbi/insdc.gca:GCA_048169195.1 (2025).
EMBL-EBI European Variation Archive https://identifiers.org/ena.embl:ERP172050 (2025).
Wicker, T. et al. A unified classification system for eukaryotic transposable elements. Nature Reviews Genetics 8, 973–982 (2007).
Acknowledgements
This study was carried out within the Agritech National Research Center and received funding from the European Union Next-GenerationEU (PIANO NAZIONALE DI RIPRESA E RESILIENZA (PNRR)–MISSIONE 4 COMPONENTE 2, INVESTIMENTO 14–DD 1032 17/06/2022, CN00000022). This manuscript reflects only the authors’ views and opinions, neither the European Union nor the European Commission can be considered responsible for them. Iqra Sarfraz Agrobioscience PhD Scholarship was funded by Programma Operativo Nazionale Ricerca e Innovazione 2014–2020 (CCI 2014IT16M2OP005), FSE REACT-EU, Azione IV.4 “Dottorati e contratti di ricerca su tematiche dell’innovazione” e Azione IV.5 “Dottorati su tematiche Green”.
Author information
Contributions
L.S: Conceptualization, Methodology, Resources, Formal analysis, Data curation, Software, Writing-original draft, Project administration, Funding acquisition, Writing- review & editing. I.S.: Methodology, Investigation, Validation, Formal analysis, Data curation, Software, Visualization, Writing-original draft, Writing- review & editing. A.Z.: Conceptualization, Methodology, Validation, Resources, Formal analysis, Data curation, Software, Visualization, Funding acquisition, Writing- review & editing. A.F.: Conceptualization, Writing- review & editing. M.C.: Methodology, Validation, Data curation, Software, Visualization, Writing- review & editing. R.A.W.: Methodology, Resources, Writing- review & editing.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Sebastiani, L., Sarfraz, I., Francini, A. et al. PacBio genome assembly of Olea europaea L. subsp. europaea cultivars Frantoio and Leccino. Sci Data 12, 1095 (2025). https://doi.org/10.1038/s41597-025-05363-4