Background & Summary

Olive (Olea europaea L. subsp. europaea) is an economically relevant and widely distributed fruit crop in the Mediterranean Basin. The origins of this iconic tree are linked to the beginning of ancient Mediterranean civilizations, more than six millennia ago1, and its domestication is still debated in the scientific literature2. Thanks to the health-promoting and organoleptic properties of extra virgin olive oil, olive orchards have spread in recent decades to many other warm-temperate regions of the world, such as North and South America, Australia, New Zealand, and South Africa, and even to monsoon areas like China and India.

Despite the environmental, cultural, economic and scientific value of this species, high-quality genome data for relevant cultivars are still scarce. Additionally, the selection of olive varieties using traditional breeding practices is a time-consuming and largely random process, since precise molecular information on gene location and structure is largely missing. Today, third generation sequencing techniques3 that generate very long reads allow the high-quality, complete assembly of complex genomes. These genome assemblies in turn enable a deeper understanding of genome structure and provide foundational data sets for functional genomics, genetic engineering and molecular breeding. All these aspects are particularly relevant in woody crops like olive, where genome data are missing or scarce or, when present, are often of lower quality than those of other plants and crops. Additionally, olive has a complex mid-size genome characterized by high heterozygosity and high repeat content.

In olive, the first effort to obtain deeper genomic information used Sanger and 454 pyrosequencing technologies and targeted the transcriptome4. With this approach, 2 million reads were obtained from 12 cDNA libraries derived from several olive cultivars and progenies. The libraries represented fruit at different developmental stages, vegetative organs (stems, leaves, roots) and buds. In 2016, Cruz et al.5 sequenced the genome of a single 1200-year-old Mediterranean olive tree (Olea europaea L. subsp. europaea var. europaea cultivar ‘Farga’) using a blend of fosmid and whole-genome shotgun libraries and Illumina sequencing technology. This genome assembly has a total length of 1.31 Gb, corresponding to 95% of the estimated 1.38 Gb olive genome size, and 56,349 unique protein-coding genes were predicted. This first draft genome of Olea europaea was a milestone for the study of the evolution and domestication of olive and gave new insight into the genetic bases of key phenotypic traits relevant to agronomic and stress tolerance characteristics6. More recently, the olive cultivar ‘Arbequina’, suitable for mechanized harvesting and dense planting, was sequenced using Oxford Nanopore third generation sequencing7. The authors assembled 1.1 Gb of sequence organized in 23 pseudochromosomes and predicted 53,518 protein-coding genes. The greater contiguity of this genome assembly allowed the identification of 202 genes of the oleuropein biosynthesis pathway, twice the number identified from previous genomic data. More recently, Lv et al.8 provided a gapless genome assembly for O. europaea cultivar ‘Leccino’ and used that resource, together with transcriptomic and metabolomic data, to unveil a pivotal regulatory mechanism in oil biosynthesis. This evidence once more underlines the advantages of high-quality genome assemblies for genomic studies.

The accessibility of high-quality genome sequences today provides an unprecedented opportunity to compare cultivars, enabling the identification of genetic variation underlying traits such as yield, disease resistance, and environmental adaptation. Building on this potential, we targeted the high-quality genome assembly of two important olive cultivars, ‘Leccino’ and ‘Frantoio’, using PacBio HiFi long reads with 27X and 29X coverage, respectively, along with their baseline gene and repeat annotation, including transposable elements.

Methods

Sampling and genome sequencing

Young leaves were sampled from 1-year-old Olea europaea L. cultivar ‘Frantoio’ and ‘Leccino’ plants (supplied by Società Pesciatina d’Olivicoltura and certified according to the Community Agricultural Conformity), preserved in liquid nitrogen and stored at −80 °C until subsequent analysis. High Molecular Weight (HMW) DNA was extracted using a modified CTAB protocol9. The quality of the extracted DNA was assessed through pulsed-field gel electrophoresis (CHEF) on 1% agarose gels to evaluate fragment size and restriction enzyme digestibility. Quantification was performed using Qubit fluorometry (Thermo Fisher Scientific, Waltham, MA). Sequencing was performed on a PacBio Revio System at the Arizona Genomics Institute (Tucson, AZ), using Revio SMRT cells10. Sequencing generated 37.5 Gbp of HiFi reads for ‘Frantoio’ and 38.7 Gbp for ‘Leccino’, with read N50 values of 5.9 kbp and 16.3 kbp, respectively. These data provided approximately 30x genome coverage for both cultivars, supporting robust genome assemblies.
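The coverage figure quoted above follows from dividing the total HiFi yield by the genome size. The snippet below is a minimal sketch of that arithmetic; the function name is hypothetical, and the genome sizes used as denominators are the k-mer estimates reported elsewhere in the text (~1.29 Gbp and ~1.31 Gbp), which is an assumption about which size the coverage was computed against.

```python
def fold_coverage(total_bases_gbp: float, genome_size_gbp: float) -> float:
    """Return sequencing depth as total read bases divided by genome size."""
    if genome_size_gbp <= 0:
        raise ValueError("genome size must be positive")
    return total_bases_gbp / genome_size_gbp


# (HiFi yield in Gbp, assumed genome size in Gbp from the k-mer estimate)
datasets = {
    "Frantoio": (37.5, 1.29),
    "Leccino": (38.7, 1.31),
}

coverage = {
    name: fold_coverage(yield_gbp, size_gbp)
    for name, (yield_gbp, size_gbp) in datasets.items()
}

for name, cov in coverage.items():
    print(f"{name}: {cov:.1f}x")
```

Using the assembly length instead of the k-mer estimate as the denominator would give slightly different values, so the result should be read as approximate.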

Genome size and heterozygosity estimation

The K-mer analysis was performed using Jellyfish v2.3.011, setting the K-mer length at 31. The resulting K-mer frequency distributions were processed using GenomeScope212. The K-mer frequency distribution showed two distinct peaks (Fig. 1): the first peak, at approximately 12x-13.3x coverage, corresponded to heterozygous regions, while the second peak, at around 24x-26x coverage, reflected homozygous regions. In ‘Frantoio’, the heterozygosity rate (ab) was estimated at 2.2%, with a duplication rate (dup) of 0.0914; in ‘Leccino’ these values were 1.85% and 0.153, respectively. These heterozygosity values are higher than that reported for ‘Arbequina’ (1.09%)7, suggesting greater diversity in ‘Frantoio’ and ‘Leccino’. The K-mer analysis indicated high repetitive content, with 51.6% and 52.2% of unique sequences in ‘Frantoio’ and ‘Leccino’, respectively. The substantial repetitive content of both cultivars’ genomes emphasizes the need for PacBio HiFi sequencing to resolve complex regions and achieve contiguous, accurate assemblies.

Fig. 1 Olea europaea L. subsp. europaea cv. ‘Frantoio’ (a) and ‘Leccino’ (b) genome size estimation (len) using Jellyfish and GenomeScope2 k-mer30 displaying homozygosity (aa), heterozygosity (ab), mean k-mer coverage for heterozygous bases (kcov), read error rate (err), the average rate of read duplications (dup), k-mer size used on the run (k), and ploidy (p).
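To make the notion of a K-mer frequency distribution concrete, the following toy sketch counts canonical k-mers in a handful of reads and builds the histogram (multiplicity versus number of distinct k-mers) that profiling tools take as input. It is an illustration only, not how Jellyfish is implemented; the function names are hypothetical, and the example uses k = 3 for readability, whereas the analysis above used k = 31.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer: str) -> str:
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def kmer_spectrum(reads, k):
    """Map each multiplicity to the number of distinct canonical k-mers seen that often."""
    counts = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Skip k-mers containing N or any other non-ACGT symbol.
            if set(kmer) <= set("ACGT"):
                counts[canonical(kmer)] += 1
    return Counter(counts.values())


toy_reads = ["ACGTA", "ACGTA"]
spectrum = kmer_spectrum(toy_reads, k=3)
for multiplicity in sorted(spectrum):
    print(multiplicity, spectrum[multiplicity])
```

In a real heterozygous diploid data set, k-mers present on only one haplotype pile up at roughly half the multiplicity of k-mers shared by both haplotypes, which is why the two peaks described above sit at about 12x-13x and 24x-26x.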

Genome assembly and scaffolding

PacBio HiFi reads were assembled using hifiasm/v0.19.813 with default parameters, and redundant haplotigs were removed using “purge haplotigs”14. The contigs contained in the primary assembly were scaffolded using RagTag/v2.1.0 as described by Alonge et al.15. For the whole-genome alignment, the built-in Minimap2 aligner was used16. The genome assembly for ‘Frantoio’ was around 1.18 Gbp with 1,726 contigs and an N50 of 1.78 Mbp, whereas that of ‘Leccino’ was around 1.43 Gbp with 103 contigs and an N50 of 45.86 Mbp (Table 1).
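For readers unfamiliar with the N50 statistic quoted above, the sketch below shows the usual definition: the contig length at which the cumulative sum of contigs, sorted from longest to shortest, first reaches half of the total assembly length. The function name and the toy lengths are illustrative and are not taken from the assemblies.

```python
def n50(lengths):
    """Return the length L such that contigs of length >= L hold at least half of all bases."""
    if not lengths:
        raise ValueError("no contig lengths given")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Compare 2 * running with total to avoid fractional halves.
        if 2 * running >= total:
            return length


toy_contigs = [2, 2, 2, 3, 3, 4, 8, 8]
print(n50(toy_contigs))
```

Because the statistic depends only on the length distribution, an assembly made of a few very long contigs, as reported for ‘Leccino’, yields a much larger N50 than one of similar size split into many shorter contigs.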

Table 1 Main statistics of draft genome assembly for Olea europaea cultivars ‘Frantoio’ and ‘Leccino’.

The sizes of our genome assemblies compared well with those estimated by the K-mer analysis (~1.29 Gbp for ‘Frantoio’ and ~1.31 Gbp for ‘Leccino’) and align closely with those of previously published olive genome assemblies such as Olea europaea L. subsp. europaea cv. ‘Farga’ (~1.31 Gb)5, ‘Arbequina’ (~1.3 Gb)7, and the recently released ‘Leccino’ (~1.28 Gb)8. These estimates confirm that the genome of cultivated olive is smaller than that of wild oleaster (Olea europaea subsp. sylvestris), which is approximately 1.48 Gb7.

The scaffolding of the ‘Frantoio’ and ‘Leccino’ assemblies was carried out in a stepwise approach using multiple reference genomes to assess assembly quality and accuracy. These references include the wild olive (Olea europaea subsp. sylvestris RefSeq assembly GCF_002742605.1, O_europea_v1), the cultivated olive ‘Farga’ (Olea europaea subsp. europaea genome assembly OLEA9, 2020) and the recently published ‘Leccino’ genome (Olea europaea subsp. europaea genome assembly GCA-902713445.1)8.

In the first step, Olea europaea subsp. sylvestris was used as a reference. Scaffolding was performed using RagTag scaffold v2.1.015. The scaffold coverage achieved was 86.02% for ‘Frantoio’ and 84.58% for ‘Leccino’ (Fig. 2). A notable difference was observed in the number of structural gaps, with 605 gaps in ‘Frantoio’ and 13 gaps in ‘Leccino’.

Fig. 2 Comparison of ‘Frantoio’ and ‘Leccino’ scaffolding using Olea europaea subsp. sylvestris as the reference genome. (a) Scaffold length vs reference length, showing scaffold sizes and gaps for each chromosome. (b) Scaffold coverage on the reference genome, illustrating the percentage of the reference genome covered by scaffolds from each cultivar. Only scaffolds >2 Mb are shown.

In the second step, the ‘Farga’ genome was used as a reference for scaffolding. Scaffold coverage decreased to 40.92% for ‘Frantoio’ and 29.5% for ‘Leccino’, both lower than the coverage achieved with Olea europaea subsp. sylvestris (Fig. 3). The number of structural gaps in ‘Frantoio’ decreased to 504, while ‘Leccino’ again showed 13 gaps.

Fig. 3 Comparison of ‘Frantoio’ and ‘Leccino’ scaffolding using ‘Farga’ (Olea europaea subsp. europaea) as the reference genome. (a) Scaffold length vs reference length, comparing scaffold sizes and gaps for each chromosome. (b) Scaffold coverage on the reference genome, illustrating the percentage of the reference genome covered by scaffolds from each cultivar. Only scaffolds >2 Mb are shown.

In the final step, the ‘Frantoio’ and ‘Leccino’ genome assemblies were scaffolded using the published ‘Leccino’ assembly8 as a reference. Scaffold coverage was 82.43% for ‘Frantoio’ and 93.37% for ‘Leccino’ (this study), with 1,208 structural gaps in ‘Frantoio’ and only 33 gaps in ‘Leccino’ (this study) (Table 2 and Fig. 4).

Table 2 Scaffolding results of ‘Frantoio’ and ‘Leccino’ assemblies using ‘Farga’, Olea europaea subsp. sylvestris and the published ‘Leccino’ assembly as references.

Fig. 4 Comparison of ‘Frantoio’ and ‘Leccino’ (this study) scaffolding using the published ‘Leccino’ assembly as the reference genome8. (a) Scaffold length vs reference length, comparing scaffold sizes and gaps for each chromosome. (b) Scaffold coverage on the reference genome, illustrating the percentage of the reference genome covered by scaffolds from each cultivar. Only scaffolds >2 Mb are shown.

The final scaffold of ‘Frantoio’ displays a large gap on Chr01 (Fig. 4b). However, alignment of the unplaced contigs to the reference showed that contig ptg000955_1 is homologous to a fraction of the gap region (Fig. 5a), and we therefore decided to incorporate this contig into the Chr01 scaffold. The AGP file generated by RagTag was modified by inserting contig ptg000955_1 after ptg0031901_1 on the + strand of Chr01, which shifted all the subsequent contigs of that chromosome. The edited scaffold was then scaffolded again against the reference to verify the partial gap filling (Fig. 5b).

Fig. 5 (a) Alignment of ‘Frantoio’ unplaced contigs to the published ‘Leccino’8 reference genome. (b) Comparison of the ‘Frantoio’ scaffolds with the published ‘Leccino’ reference8 before (red) and after (blue) AGP file editing to validate the correct positioning of ptg000955_1 on Chr01.
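The manual AGP edit described above amounts to adding a component line after a chosen contig and then renumbering the coordinates of everything downstream. The sketch below illustrates that bookkeeping on toy records; it is not the procedure actually used in this study, whose exact steps are not given. The function name, the contig names `ctgA`, `ctgB`, `ctgC`, and the choice to separate the new contig with a 100 bp gap of unknown size (type U, evidence `align_genus`) are all assumptions made for illustration.

```python
def insert_contig(rows, after_id, new_id, new_len, gap_len=100, orient="+"):
    """Insert a contig after `after_id` in one AGP object and renumber coordinates.

    `rows` is an ordered list of 9-field AGP records (lists of strings)
    belonging to a single object such as one chromosome.
    """
    out, found = [], False
    for row in rows:
        out.append(list(row))
        if row[4] == "W" and row[5] == after_id:
            found = True
            obj = row[0]
            # Placeholder coordinates ("0") are overwritten by the renumbering below.
            out.append([obj, "0", "0", "0", "U", str(gap_len),
                        "scaffold", "yes", "align_genus"])
            out.append([obj, "0", "0", "0", "W", new_id,
                        "1", str(new_len), orient])
    if not found:
        raise ValueError(f"component {after_id!r} not found")

    position = 1
    for part_number, row in enumerate(out, start=1):
        if row[4] in ("N", "U"):           # gap line: column 6 holds the gap length
            length = int(row[5])
        else:                              # component line: columns 7-8 hold begin/end
            length = int(row[7]) - int(row[6]) + 1
        row[1] = str(position)
        row[2] = str(position + length - 1)
        row[3] = str(part_number)
        position += length
    return out


toy_rows = [
    ["Chr01", "1", "1000", "1", "W", "ctgA", "1", "1000", "+"],
    ["Chr01", "1001", "1100", "2", "U", "100", "scaffold", "yes", "align_genus"],
    ["Chr01", "1101", "1600", "3", "W", "ctgB", "1", "500", "-"],
]
edited = insert_contig(toy_rows, after_id="ctgA", new_id="ctgC", new_len=300)
for record in edited:
    print("\t".join(record))
```

Recomputing every coordinate from the component lengths, rather than adding a fixed offset, keeps the object coordinates and part numbers consistent regardless of where the insertion happens.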

Benchmarking universal single copy orthologs (BUSCO)

BUSCO/v5.7.117 with eudicotyledons_odb10 database18, which comprises 1,614 orthologous genes, was used to assess the completeness of the genome assembly, calculating the percentage of single copy, duplicated, fragmented, and missing genes (Table 3).

Table 3 BUSCO main statistics as percentage (%) of the total (n = 1,614) BUSCO groups searched in Olea europaea L. genome cultivars ‘Frantoio’ and ‘Leccino’.

The numbers of complete BUSCO genes were almost equal (Fig. 6): 1,580 for ‘Frantoio’ and 1,581 for ‘Leccino’, corresponding to approximately 97.9% of the entire BUSCO gene set.

Fig. 6 Comparison of genome completeness among different Olea europaea subsp. using BUSCO assessment. The figure demonstrates the percentage of complete (C), single (S), duplicated (D), fragmented (F) and missing (M) BUSCO orthologs identified in different genome assemblies. Total BUSCO genes n = 1,614.

These values are higher than those reported in the reference studies5,19 for ‘Farga’ (92.99%), ‘Arbequina’ (92.87%) and Olea europaea subsp. sylvestris (85.50%), and closer to the value (99.93%) obtained by Lv et al.8 for the cultivar ‘Leccino’. Single-copy orthologs accounted for 83.1% (1,342 copies) in ‘Frantoio’ and 82.9% (1,338 copies) in ‘Leccino’. Duplicated orthologs accounted for 14.7% (238 copies) in ‘Frantoio’ and 15.1% (243 copies) in ‘Leccino’ of the total BUSCO groups searched. These rates are comparable to those reported for ‘Arbequina’ (20.38%) and ‘Farga’ (18.15%) and in good agreement with the 12.8% of duplicated genes reported for the cultivar ‘Leccino’ by Lv et al.8. In contrast, the wild relative Olea europaea subsp. sylvestris exhibited a much higher duplication rate of 37.98%7.

Fragmented orthologs accounted for 1.4% (22 copies) in both cultivars, while missing orthologs were limited to 0.8% (12 copies) in ‘Frantoio’ and 0.6% (11 copies) in ‘Leccino’. These values are lower than those observed in other accessions; for example, 2.42% fragmentation is reported in ‘Arbequina’, while Olea europaea subsp. sylvestris exhibited 6.69% fragmented and 7.81% missing BUSCOs7. This likely reflects the high contiguity and accuracy achieved in these assemblies, which reduce the incidence of fragmented and missing gene copies and enhance the completeness of the genome. Altogether, the BUSCO statistics testify to the high completeness and contiguity achieved in our genome assemblies.
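The BUSCO categories reported above are related by simple arithmetic: complete genes are the sum of single-copy and duplicated genes, and all four categories add up to the 1,614 groups searched. The sketch below recomputes the percentages from the counts given in the text; the function and variable names are hypothetical, and the recomputed percentages may differ in the last digit from the rounded values quoted above.

```python
TOTAL_BUSCOS = 1614

# Counts taken from the text: single-copy, duplicated, fragmented, missing.
busco_counts = {
    "Frantoio": {"single": 1342, "duplicated": 238, "fragmented": 22, "missing": 12},
    "Leccino": {"single": 1338, "duplicated": 243, "fragmented": 22, "missing": 11},
}


def busco_summary(counts, total=TOTAL_BUSCOS):
    """Return (number of complete BUSCOs, percentage per category)."""
    if sum(counts.values()) != total:
        raise ValueError("category counts do not add up to the total searched")
    complete = counts["single"] + counts["duplicated"]
    percentages = {name: 100 * value / total for name, value in counts.items()}
    percentages["complete"] = 100 * complete / total
    return complete, percentages


for cultivar, counts in busco_counts.items():
    complete, pct = busco_summary(counts)
    print(f"{cultivar}: C={complete} ({pct['complete']:.2f}%), "
          f"S={pct['single']:.2f}%, D={pct['duplicated']:.2f}%, "
          f"F={pct['fragmented']:.2f}%, M={pct['missing']:.2f}%")
```

The consistency check inside the function is a cheap way to catch transcription errors when copying counts out of a BUSCO summary.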

Repeats and transposable elements identification

The Extensive de novo TE Annotator (EDTA)/v2.1.0 was used to generate a Transposable Element (TE) library20. All Helitron predictions were removed from the TE library because their identification is imprecise and prone to false positives20. RepeatMasker/v4.1.421 was then run with the default settings to mask the repetitive sequences in the genome assemblies.

The TE content of the Olea europaea L. assemblies was estimated (Table 4) to be 67.47% in ‘Frantoio’ and 70.84% in ‘Leccino’. These values correspond to ~769 Mb and ~1,006 Mb in ‘Frantoio’ and ‘Leccino’, respectively. This difference in the amount of repetitive and TE-related sequences explains most of the difference in genome size between the two cultivars. The TE and repetitive sequence fraction is in a similar range to that of previously sequenced olive genomes (66.30% for Olea europaea cv. ‘Leccino’8 and 59% for cultivated Olea europaea cv. ‘Picual’22), but considerably larger than that of Olea europaea subsp. sylvestris, in which 51% of the genome is composed of repetitive DNA19.

Table 4 Transposable elements classification according to Wicker et al.37 in Olea europaea cultivars ‘Frantoio’ and ‘Leccino’.

In both cultivars the largest TE class is represented by LTR-RTs, accounting for 34.07% and 29.62% of the genome size in ‘Frantoio’ and ‘Leccino’, respectively. In both cultivars the Ty3-gypsy superfamily outnumbers the Ty1-copia superfamily. Altogether, Class 2 DNA TEs represent 15.6% and 14.23% of the genome assembly size in ‘Frantoio’ and ‘Leccino’, respectively. In both cases the most abundant family appears to be that of Mutator-like elements, totalling 8.1% of the ‘Frantoio’ and 8.61% of the ‘Leccino’ assembly. However, several elements initially classified as Mutator DNA-TEs by EDTA exhibited characteristics inconsistent with this TE family. In particular, these elements appeared to be arranged in tandem over long stretches of DNA, as demonstrated by dot plot analysis (Fig. 7), which is not typical of Mutator DNA-TEs, as these are usually scattered throughout the genome. We reanalyzed these sequences using BLASTN searches against the NCBI nr database and the PlantSat database23, which revealed that they matched known satellite repeats; they were subsequently reclassified.

Fig. 7 Dotplot self-comparison of a genomic tract composed of multiple tandemly arranged copies of Sat1. (a) 40 kbp long region; (b) detailed view of the region included in the white circle in (a).
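A tandem array matches itself when shifted by one monomer length, which is what produces the evenly spaced parallel diagonals in a self dot plot. The toy sketch below captures that idea by scoring, for each offset, the fraction of positions that match after shifting the sequence against itself, and reporting the smallest best-scoring offset. It is an illustration of the principle only, not the dot plot software used here; the function names and the 8 bp monomer are hypothetical.

```python
def self_match_fraction(seq: str, offset: int) -> float:
    """Fraction of positions i where seq[i] equals seq[i + offset]."""
    compared = len(seq) - offset
    if offset < 1 or compared < 1:
        raise ValueError("offset must be between 1 and len(seq) - 1")
    matches = sum(1 for i in range(compared) if seq[i] == seq[i + offset])
    return matches / compared


def best_period(seq: str, max_period: int) -> int:
    """Return the smallest offset with the highest self-match fraction."""
    if len(seq) < 2:
        raise ValueError("sequence too short")
    offsets = range(1, min(max_period, len(seq) - 1) + 1)
    # max() keeps the first maximal element, i.e. the smallest offset on ties.
    return max(offsets, key=lambda p: self_match_fraction(seq, p))


toy_monomer = "ACGTTGCA"
toy_array = toy_monomer * 10
print(best_period(toy_array, max_period=20))
```

Dispersed elements such as typical Mutator-like TEs would not show this regular self-similarity over long stretches, which is the contrast the dot plot analysis relies on.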

The reclassified sequences include three abundant satellite sequences, which we named Satellite_1, Satellite_2 and Satellite_3 (Table 5).

Table 5 The sequences of the three most abundant satellite repeats identified in ‘Frantoio’ and ‘Leccino’ genomes.

Satellite_1 is an 80 bp long minisatellite that represents approximately 8.62% and 16.90% of the genome assemblies of ‘Frantoio’ and ‘Leccino’, respectively. This corresponds to about 102 Mbp in the ‘Frantoio’ and 242 Mbp in the ‘Leccino’ genome assembly. The copy number of Satellite_1 is 1.28 million in ‘Frantoio’ and 3.03 million in ‘Leccino’. Satellite_2 is a tandem repeat characterized by a 141 bp long monomer, which covers approximately 4.65% of the ‘Frantoio’ genome, corresponding to 55,565,954 bp, and 5.78% of the ‘Leccino’ genome, corresponding to 82,789,861 bp. The number of Satellite_2 copies is estimated to be 394,000 and 587,000 in ‘Frantoio’ and ‘Leccino’, respectively. Finally, Satellite_3 has a 107 bp long monomer. It covers 41.01 Mbp and 61.74 Mbp of the ‘Frantoio’ and ‘Leccino’ genome assemblies, respectively. The estimated copy number is ~383,000 in ‘Frantoio’ and ~604,000 in ‘Leccino’ (Table 4). This evidence confirms the finding that a large portion of the olive genome is composed of a few different families of tandemly arranged repeats24.
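A simple way to relate the masked base pairs of a satellite to its copy number is to divide the covered length by the monomer length. The sketch below applies that estimate to the Satellite_2 figures given above; it is a back-of-the-envelope check under the assumption that all masked bases belong to full-length monomers, and the function name is hypothetical.

```python
def estimated_copy_number(covered_bp: int, monomer_bp: int) -> int:
    """Estimate the monomer copy number as covered length divided by monomer length."""
    if monomer_bp <= 0:
        raise ValueError("monomer length must be positive")
    return round(covered_bp / monomer_bp)


SATELLITE_2_MONOMER_BP = 141
satellite_2_covered_bp = {
    "Frantoio": 55_565_954,
    "Leccino": 82_789_861,
}

satellite_2_copies = {
    cultivar: estimated_copy_number(bp, SATELLITE_2_MONOMER_BP)
    for cultivar, bp in satellite_2_covered_bp.items()
}

for cultivar, copies in satellite_2_copies.items():
    print(f"{cultivar}: ~{copies:,} copies")
```

Truncated or diverged monomers make this an approximation, so small departures from copy numbers counted directly from the annotation are to be expected.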

Gene prediction and functional annotation

Gene prediction was carried out using Augustus25 as implemented in the OmicsBox suite26. The genome assemblies, soft-masked for repeats, were used as the primary input, and the gene structure model of Arabidopsis thaliana was used. Functional annotation was performed by comparing the predicted genes to the nr division of GenBank using Diamond blast/v2.1.827 and by analyzing them with InterProScan/v5.61–93.028. The parameters used for Diamond blast were Blast Mode = blastp, Sensitivity Mode = Standard, Database = NR, Blast e-value = 1.0E−3. For InterProScan we used the Cloud InterProScan with up-to-date EMBL-EBI InterPro data, including CDD, HMM-Pfam and HMMPIR models.

Gene prediction found 59,777 genes in ‘Frantoio’. Of these, 47,201 were considered highly reliable (posterior probability > 0.4), and 37,061 of the latter were successfully functionally annotated. In ‘Leccino’, 67,103 genes were found; of the 53,302 considered highly reliable (posterior probability > 0.4), 37,606 were functionally annotated (Table 6).

Table 6 Functional annotation of predicted genes for ‘Frantoio’ and ‘Leccino’.

The discrepancy in the number of genes predicted in the two cultivars likely reflects the greater fragmentation of the ‘Frantoio’ genome assembly. These values are comparable with those of Cruz et al.5, who found a set of 56,349 protein-coding genes, with 89,982 transcripts encoding 79,910 unique protein products, and with those of the recently published ‘Leccino’ genome assembly, in which 70,138 protein-coding genes were identified8. In the ‘Arbequina’7 genome assembly, 53,518 protein-coding genes were predicted, and the authors successfully annotated 50,969 genes using the GO, Kyoto Encyclopedia of Genes and Genomes (KEGG), Eukaryotic Orthologous Groups (KOG), TrEMBL, and Nonredundant (Nr) databases. Overall, gene predictions and annotations showed that in Olea europaea L. subsp. europaea the number of protein-coding genes ranges from 53,518 to 67,103, with the highest value in ‘Leccino’ and the lowest in ‘Arbequina’.

Structural variant analysis

The ‘Frantoio’ genome assembly was aligned to the ‘Leccino’ assembly using minimap2/v2.2416 under default parameters. The resulting SAM file was sorted and indexed using samtools/v1.16.129. Subsequently, the SVIM-asm tool was used to detect SVs, including insertions, inversions, deletions, duplications, interspersed duplications and tandem duplications30.

The total numbers of deletions, insertions and inversions in the Olea europaea genome of cultivar ‘Leccino’ in comparison to cultivar ‘Frantoio’ are presented in Fig. 8. Deletions and insertions are very similar in number (22,469 deletions and 21,218 insertions), while the number of inversions is very low (n = 33). More than 85% of the insertions and deletions exhibit high similarity to TEs or other repetitive sequences. The enrichment of TEs in structural variants, compared to the entire genome, highlights the contribution of these elements to genome variation. According to these data, the two cultivars ‘Frantoio’ and ‘Leccino’ exhibit a considerable amount of genetic variation, consistent with previous findings obtained using Simple Sequence Repeat (SSR) markers31.

Fig. 8 Number of deletions, insertions and inversions in Olea europaea genome of cultivar ‘Leccino’ (reference = Ref.) in comparison to cultivar ‘Frantoio’.
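Per-class totals like those summarized above can be tallied directly from a structural variant call set. The sketch below assumes the calls are stored in a VCF file whose INFO column carries an `SVTYPE` key, as the VCF specification defines for structural variants; the function name and the three toy records are hypothetical and are not taken from the deposited data.

```python
from collections import Counter


def count_sv_types(vcf_lines):
    """Tally records per SVTYPE value found in the INFO column of VCF lines."""
    counts = Counter()
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 8:
            continue  # not a complete VCF record
        info = {}
        for item in fields[7].split(";"):
            key, _, value = item.partition("=")
            info[key] = value
        counts[info.get("SVTYPE", "UNKNOWN")] += 1
    return counts


# Toy records for illustration only.
toy_vcf = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "Chr01\t100\tsv1\tN\t<DEL>\t.\tPASS\tSVTYPE=DEL;END=250;SVLEN=-150",
    "Chr01\t900\tsv2\tN\t<INS>\t.\tPASS\tSVTYPE=INS;END=900;SVLEN=320",
    "Chr02\t50\tsv3\tN\t<DEL>\t.\tPASS\tSVTYPE=DEL;END=4050;SVLEN=-4000",
]
sv_counts = count_sv_types(toy_vcf)
for sv_type, n in sorted(sv_counts.items()):
    print(sv_type, n)
```

For a file on disk, the same function can be fed an open file handle, since it only iterates over lines.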

In conclusion, the ‘Frantoio’ and ‘Leccino’ assemblies provided in this study could be a valuable dataset for studying cultivar differentiation in traits that are useful for olive genetic improvement. Together with other published genomes, these data will enable further understanding of olive evolution and domestication.

Data Records

All the raw sequencing data and the genome assemblies have been submitted to NCBI.

PacBio HiFi reads of Olea europaea cv. ‘Leccino’ are available in the NCBI Sequence Read Archive under SRP55122932.

The Olea europaea cv. ‘Leccino’ genome assembly is available in NCBI GenBank under GCA_048165045.133.

PacBio HiFi reads of Olea europaea cv. ‘Frantoio’ are available in the NCBI Sequence Read Archive under SRP55122634.

The Olea europaea cv. ‘Frantoio’ genome assembly is available in NCBI GenBank under GCA_048169195.135.

The variant data for this study have been deposited in the European Variation Archive (EVA) at EMBL-EBI under accession number ERP17205036.

Technical Validation

The quality and concentration of the extracted DNA were assessed using a NanoDrop spectrophotometer and a FEMTO size profile before genome sequencing.

After the genome assembly was completed, the assembly results were evaluated as follows:

i) The HiFi reads used for the genome assemblies were mapped back onto the assembled genomes. The alignment rate of the reads was higher than 99% in both cases, showing high consistency between the reads and the assembled genomes.

ii) The completeness of the genome assemblies was evaluated using BUSCO with the eudicotyledons_odb10 database.