Introduction

Since the beginning of genetic engineering, scientists have aimed to target heterologous DNA to genomic regions that allow transcription without perturbing endogenous gene expression. The emergence of synthetic biology has further driven the development of strains that can express polygenic traits of interest, often needing the insertion of complete metabolic pathways and genetic circuits. Consequently, researchers seek to preferentially direct foreign DNA to landing pads known as Genomic Safe Harbors1 (GSHs), Genomic Safe Havens2, or Neutral Sites3. GSHs are intragenic or intergenic regions expected to accommodate and transcribe foreign DNA inserts with no or minimal perturbation on the general gene expression levels of the host genome1. The concept is that GSHs provide protected locations for inserting DNA, reducing unpredictable phenotypes.

The first studies identifying GSHs arose for higher eukaryote genomes, traditionally relying on empirical methods such as viral insertion sites4, analysis of gene function loss5, or gene orthology6. Notably, many GSHs are in intragenic loci and genomic regions of high gene density, often near oncogenes in the context of the human genome1, raising concerns about their impact on neighboring gene expression. With the advent of gene editing tools such as zinc-finger nucleases7 and CRISPR-Cas systems8,9 or DNA integrases10, which enable biallelic transgene insertion, the risk of functional knockouts is further aggravated.

Several studies have proposed criteria to characterize and systematically identify GSHs in human cells1, rice11, Schistosoma mansoni12, Cryptococcus neoformans2, Saccharomyces cerevisiae13, Komagataella phaffii (Pichia pastoris)14, and Aspergillus fumigatus15. Despite their increased importance, all sites currently used as GSHs have yet to be entirely validated16, and this identification is not standardized. The orientation of neighboring genes, the length of intergenic regions, and off-target insertions have not been verified, and validation is limited to cell growth and indirect expression of the insert, with no assessment of interference in genomic expression.

These concerns challenge the development of new rational, standardized, and systematic strategies, leading to the emergence of bioinformatics programs to meet these demands. However, fewer than a handful are available to identify these sites, and two are restricted to the human genome17,18, with the vast majority of the identified GSH candidates not fully experimentally validated17,18,19. A common concern is that none of the cited works assessed the number of copies of the reporter gene in the genome and the possibility of off-target insertions. Thus, any expression characterization may come from insertions in other locations. They also showed that heterologous DNA insertions lead to some genomic expression perturbation17,18. Furthermore, the definition of GSH criteria is challenging when considering eukaryotic organisms due to their genomic differences, making it difficult to develop generic programs for predicting GSHs. This challenge is further compounded by the potential variation in GSH characteristics among distantly related species.

To address the need to expand GSHs to a broad range of organisms, the Safe Harbor Identification Program (SHIP) provides a graphical user interface that enables the prediction of GSH candidate regions in any annotated eukaryotic genome by incorporating defined criteria for refining the search for new safe integration sites. SHIP brings unique advantages by identifying all GSHs in intergenic regions, while allowing users to select the intergenic length range and neighboring genes orientation. To assist this selection, the software displays the distribution of intergenic regions throughout the genome. Users have the flexibility to choose the genetic parts and regulatory characteristics of interest. The output includes a comprehensive report detailing the regulatory characteristics of each GSH candidate, assisting the researcher in selecting the most suitable candidates for their studies. As proof of concept, we validated the GSHs identified by SHIP in the chassis organism S. cerevisiae through reporter gene expression evaluation (Fig. 1A).

Results

SHIP identifies putative GSHs in eukaryote genomes using general feature data

This work evaluated three eukaryotic model organisms: S. cerevisiae (R64), Homo sapiens (GRCh38), and Mus musculus (GRCm38). As canonical criteria, all identifiable GSH regions are located in intergenic regions. Other criteria, such as region length and neighborhood gene orientation, may be tailored to fit the specifications of the target organism genome (Fig. 1B). Afterwards, the program applies several qualitative filters based on the length of intergenic regions and the orientation of the flanking genes. Using annotations from external databases (UCSC), it selects intergenic regions that do not overlap with annotated features such as genes, regulatory regions or transposable elements, and telomeres or centromeres (see the Materials and methods section). The output is ordered by the genomic coordinate of putative GSH (pGSH) and is presented without any particular ranking. Notably, the convergent orientation of the neighboring genes is strongly recommended to reduce possible interference in the flanking promoter regions once the majority of regulator factors are upstream sequences20,21. As eukaryotic genomes present a wide range of sizes and gene distributions, GSH length is expected to change with the target organism. Just as the insertion of transgenes within very short GSHs could lead to neighborhood gene interference, very long GSHs could contain an uncharacterized gene or other genetic parts, possibly leading to genomic expression perturbation. To minimize these unwanted events, a GSH length range was established depending on the organism’s average gene length distribution2.

Fig. 1
figure 1

SHIP identification of putative genomic safe harbors (pGSHs). (A) Strategy overview of the eukaryotic genomic Safe Harbor Identification Program. The first step involved the design of the Safe Harbor Identification Program (SHIP), an algorithm for searching genomic safe harbors in eukaryotes from general feature data (Design); S. cerevisiae was chosen for the in vivo validation of SHIP, resulting in its genetic transformation of two overlapping BioBricks composed of a reporter gene and an auxotrophic marker (Test); followed by the analysis steps (Analysis). (B) Representative scheme of the SHIP software. As inputs, a genomic annotation (.gff3), a regulatory annotation (.gff3), and two files (.json) with the indication of genetic parts to be considered for pGSH selection. As output files, the algorithm returns a table and a graph (.png) with the distribution of intergenic distances and a file (.txt) with the list of intergenic regions and regulatory aspects. (C) Histogram with the genomic distribution of the intergenic regions between the three possible arrangements of the flanking genes. (D) Chromosome number, coordinates, neighboring genes, and size of the intergenic regions identified as pGSH. (E) Ideogram marking the identified pGSHs generated with Ideogram.js76.

As previously shown, S. cerevisiae presents a condensed genome22, with most intergenic regions shorter than 1 kb with flaking genes in tandem (Fig. 1C). The yeast genome has an average of 1 gene for every 1,728 bp, directing the putative GSH (pGSH) size range between 1.2 and 1.7 kb. The H. sapiens (GRCh38.p13) pGSHs range was defined as 50,300–50,800 bp, and M. musculus (GRCm38.p6), with a pGSH size range of 48,700–49,200 bp. Notably, even for organisms with knowingly not uniform gene distribution, such as humans and mice, the canonical previously established criteria that locate GSHs in intergenic regions allowed the joint use of the average gene length distribution, expanding the identification parameters.

For S. cerevisiae, the program predicted six pGSHs distributed throughout five chromosomes (Fig. 1D and Table S1). The SHIP program identified 16 pGSHs in H. sapiens and 11 in M. musculus (Fig. S1, S2). These results included the name, description, and cross-references of neighboring genes and the FASTA sequence of pGSHs (see the Data availability section) https://figshare.com/s/7fcfade8b77f4d41140c. For multicellular organisms, as practical examples, H. sapiens and M. musculus, along with the pGSH list, the program also returns the pGSHs’ regulatory features in several cell types for the given species. In addition, parameters already defined by previous studies1,17 must be taken into consideration. Importantly, all regions were manually curated, confirming the algorithm’s functionalities and precision.

Supplementary Table S1 presents detailed information about each pGSH identified in S. cerevisiae. pGSH-1 is on chromosome 2, and it is 3,780 bp from the telomere. pGSH-2 is on chromosome V, 7,895 bp away from the telomere. From the Saccharomyces Genome Database (SGD)23, this region has the epigenetic mark H3K4me3, which is related to high transcriptional levels24. pGSHs 3 and 4 belong to chromosome V. pGSH-3 is 12,337 bp from the telomere, presenting DNAseI hypersensitive sites. pGSH-4 is located in the middle of the left arm of the chromosome. pGSH-5 is on chromosome XI, 14,902 bp from the telomere. pGSH-6 is at the middle region of chromosome XV, marked by H3K4me3. Notably, pGSH-2, 3, and 5, in addition to being close to subtelomeric regions, one of the flanking genes belongs to gene families.

Generation of S. cerevisiae strains with reporter genes in each GSH

Five of the six pGSHs identified by SHIP were selected for in vivo validation. Due to its relative proximity to the telomeric region, pGSH-1 was discarded from the experimental analysis. The BioBricks (ymUKG1 and URA3) were directed centrally within each pGSH to reduce the probability of gene expression perturbations. Furthermore, the BioBricks were aimed at sites with favorable epigenetic marks for open chromatin states, such as H3K4me3 and DNase I hypersensitive sites, to improve insertion efficiency and expression. These BioBricks were delivered as separate DNA pieces, simulating the assembling approaches to build metabolic pathways and genetic circuits. Following PCR identification of the clones with correct BioBricks insertion (Fig. S3), three clones were selected for each pGSH, and their PCR amplicon was sequenced to confirm their insertion (Fig. 2 and Fig. S4). Each pGSH clone exhibited the expected insertion site, confirming their suitability for further validation as true GSHs (see the Data availability section) https://figshare.com/s/fc3b89fe7077edac4e2b.

Fig. 2
figure 2

Sanger sequencing of the clones demonstrates correct insertion in the region of each GSH in the 5 lineages. (A) Representation of PCR amplification for sequencing using primers targeting the genomic region outside homology arms (HR) marked in yellow. ymUkG1 is marked in green and URA3 in pink. (B) S2U. (C) S3U. (D) S4U. (E) S5U. (F) S6U. Multisequence alignment was performed with the MAFFT v7 program77 using the following parameters: Gap opening penalty of 1.53; Gap extension penalty equal to 0 and quick direction adjustment function enabled.

Identified GSHs present genomic stability without fitness loss and an increasing copy number near subtelomeric regions

One of the primordial characteristics of a GSH is its stability and capacity to keep the exogenous insertion through generations with no adverse fitness effects. Thus, the strains transformed with the reporter genes were cultivated in liquid YPD for ten days continuously, and the BioBrick’s presence within each GSH was verified by PCR using genomic primers (Fig. 3A). All the GSHs presented genomic stability after approximately 100 mitotic generations (Fig. 3B and Fig. S5).

Fig. 3
figure 3

GSHs genomic characterization and Growth curve for each GSH cell line. (A) Schematic strategy of approximately 100 mitotic generations. (B) PCR analysis of each one of the five GSH cell lines after approximately 100 generations with genomic primers specific for complete amplification of the insert. For each biological triplicate, fifteen colonies were randomly collected. BY4741 as positive control (C+) and H2O as negative control (C−). A cut of each gel is shown, removing wells and unused parts. Full-length gels are included in the Data availability section. https://figshare.com/s/c2d9bcc901d4114e11e8 (C) Schematic strategy for the growth rate analysis of the five GSH cell lines compared to the BY4741, as control. (D) Average and standard deviations of growth curves measured at 0, 2, 4, 6, 12, 24, and 36 h with an initial OD600 of 0.1. (E) Copy number analysis of the ymUkG1 gene in 3 clones of each GSH cell line. (F) Graphic bars showing the mean and standard deviation of the percentage of cells expressing ymUKG1 on all GSH cell lines. Experiments performed in technical and biological replicates on three independent days. Bioicons from Servier Medical Art licensed under CC BY 4.0.

The growth curve for each GSH strain was comparable but distinct from the control (untransformed BY4741) after 36 h in YPD medium (Fig. 3C, D). The calculated growth rates were between 0.35 and 0.39 for the GSH lines and 0.30 for the control (Fig. S6), indicating a detectable difference in adaptive value between them, being higher for the lines where uracil prototrophy was restored and demonstrating no fitness loss.

Previous studies have demonstrated that subtelomeric regions are double-strand break hotspots with a high recombination rate25,26. This activity contributes to the emergence and evolution of gene families27,28. Since SHIP identified three GSHs near these regions, we assessed the copy number of the reporter transgene using qPCR. The results showed that GSH-4 and GSH-6 had a single copy at the central region of their chromosome arms. The GSHs located at subtelomeric regions presented more than one copy of the transgene (Fig. 3E). GSH-2, the one closest to the telomere (7.9 kb away), presented an average of seven copies. In contrast, GSH-3 (12.3 kb) had three copies, and GSH-5 (14.9 kb) presented two copies, potentially indicating to the user that targeting inserts to GSHs close to subtelomeric regions can lead to the insertion of more than one copy into the genome.

GSH cell lines expressed the reporter gene in more than 90% of the population, preserving the morphological characteristics

Flow cytometry analysis revealed that the GSH lineages expressing ymUkg1 exhibited an average detectable green fluorescence in more than 90% of the population (Fig. 3F). This percentage was achieved and maintained from 2 to 36 h of growth. No detectable variation was observed in the fluorescent population throughout cultivation and among GSHs (Fig. S7 and Fig. S8), indicating that each strain was capable of expressing and maintaining stable and predictable expression of the transgene (see the Data availability section).

Scanning electron microscopy (SEM) was used to investigate any morphological impact on the GSH cell lines. Compared to BY4741, the GSH strains did not display relevant cell surface changes, being indistinguishable from the control and from each other (Fig. S9 and Fig. S10).

Differentially expressed genes are not related to the insertion site in GSH cell lines

One of the fundamental characteristics of a GSH is its capacity to ensure predictable BioBrick transcription with no or minimal interference on the local and global gene expression. Therefore, we analyzed the expression of the transgene and of genes directly in contact with the GSHs (Neighboring Genes) using RT-qPCR (Fig. 4A). The results demonstrate an accumulation of the ymUKG1 transcript for all GSH strains. pGSH-5 which has 2 copies of the transgene, had the highest ymUKG1 transcript relative expression, followed by pGSH-4 with one copy. Although pGSH-2 and 3 (the ones closest to the telomeres) have the highest number of ymUKG1 copies, their relative expression is equivalent to pGSH-6, with a single copy (Fig. S11). Notably, regardless of the GSH used to host the transgene, four neighboring genes across all the predicted GSHs exhibited persistent variation in expression, which was not correlated to the specific GSH employed (Fig. 4B–F).

Fig. 4
figure 4

Expression dynamics of neighboring genes of GSHs cell lines by RT-qPCR. (A) Schematic representation of the genes analyzed. Histogram of the relative expression of all neighboring genes for each GSH lineage. (B) S2U. (C) S3U. (D) S4U. (E) S5U. (F) S6U. Green column indicates expression of the ymUKG1 gene. Colored columns highlight neighboring genes (Purple at 5’ and Blue at 3’) of the GSH of the analyzed strain and gray columns show the neighboring genes of the other GSHs. The cracked columns represent values from the untransformed control (BY4741). Relative expression on the Y axis and genes analyzed on the X axis. Asterisks indicate significant changes in the analysis of variance (ANOVA). (*) p < 0.05, (**) p < 0.01, (***) p < 0.001 and (****) p < 0.0001.

Finally, we assessed possible interferences from BioBrick insertion in overall host genomic expression. Global gene expression analysis was performed using RNA-Seq on one clone of each GSH cell line once the insertion occurred precisely at the same site. We analyzed the transcriptome of each GSH construct in biological replicates against the control strain BY4741. All GSH cell lines exhibited around 16% of genes differentially expressed compared to the control, displaying a high degree of similarity among themselves (Fig. 5A and Fig. S12). Both RT-qPCR and RNA-Seq analysis corroborated previous studies in which differential expression (DE) analysis revealed significant differences for several genes between GSH and control human cell lines, and similarity among them17.

Fig. 5
figure 5

Genomic expression dynamics of GSHs cell lines by RNA-Seq. (A) PCA graph of the differential gene expression of the GSHs lines (S2U, S3U, S4U, S5U and S6U) in relation to the untransformed control BY4741. Vulcanos Plots of differential expression of GSHs cell lines by RNA-Seq. (B) S2U. (C) S3U. (D) S4U. (E) S5U. (F) S6U. Colored dots represent differentially downregulated (blue) and upregulated (red) genes. Described genes are identified with their standard name and undescribed or putative genes are identified with their systematic name.

URA3, intentionally recovered as one of the reporter genes, was the most positively regulated gene, whereas URA1 was the most negatively regulated in all experimentsFig. . 5B–F), as expected29,30. Moreover, it is important to report that according to previous works, the recovery of auxotrophy vias affects genomic expression, again corroborating our DE analysis data17.

Most of the differentially expressed genes are shared among all GSH cell lines

Considering that the expression profile of GSHs was similar between cell lines and that the BioBricks were in different loci and chromosomes, the reintroduction of URA3 could cause this apparent genomic perturbation, as already observed by previous studies31,32,33. To elucidate this possibility, the level of sharing of differentially expressed genes between the GSHs and whether, among the shared genes, there is some functional enrichment, the results of the DE analysis of the RNA-seq experiment for each GSH were processed. For all GSH lines, we performed a paired DE analysis against the untransformed control BY4741.

The group of upregulated genes from all experiments totaled 1257 upregulated genes (Fig. S13A). Of these, 423 (33.65%) had increased DE in all experiments, being the most frequent subgroup of genes among the clone combinations. It is worth mentioning that the GSH-6 and 4 clones presented 120 and 83 upregulated genes, respectively, which were not shared with the other lineages. In all experiments, 1,086 unique genes showed a significant decrease in expression (Fig. S13B). Of these, 257 (23.66%) were downregulated in all cell lines, constituting the most frequent subgroup of genes.

To verify whether these groups of shared genes, positively or negatively regulated, were related to GSHs, the data was mapped to chromosomal regions. The sites with differentially expressed genes were spread throughout the yeast genome, showing no positional correlation with any of the identified GSHs (Fig. S13, C–G). Moreover, there is no overlap between these regions of differentially expressed genes and the GSH loci, further corroborating previous studies17,18 and indicating no link between transgene insertion and differential expression.

Lack of gene ontology enrichment in differentially expressed genes

A gene ontology enrichment analysis of the differentially expressed genes shared among all strains was performed to verify possible functional perturbations in the host genome. The upregulated gene set displayed significant enrichment for the term cellular component (CC) of the ribosomal unit and molecular function (MF) related to the structural component of the ribosome (Fig. S14A, B). In general, this might suggest more prominent translational activity. There was no enrichment of gene ontology terms for downregulated genes shared across all GSH constructs (Fig. S14C).

Even without substantial ontological enrichment, the most differentially expressed genes relative to the untransformed control were analyzed. They showed again that gene expression distribution was generally affected, with URA3 being the most strongly upregulated and URA1 being the most strongly downregulated across all the experiments (Fig. S15). However, most of the genes were unannotated, putative, or with dubious ORFs, comprising 73.33% of the positively regulated genes and 80% of the negatively regulated ones (Table S3 and Table S4), suggesting that fundamental biological processes and essential metabolic and stress response pathways were not affected.

Generation of triplex GSH cell line expressing ymUKG1, ymBeRFP, and α-amylase

As proof that the predicted GSHs can serve as landing pads for the simultaneous expression of entire biological pathways or complex genetic circuits, we constructed a triplex strain expressing ymUKG1, ymBeRFP34, and α-AMY35. To construct the triplex, the strain containing ymUKG1 within GSH-2 (Strain S2U) was transformed with α-AMY targeting GSH-5 (Strain S2U5A), and sequentially, ymBeRFP was inserted in GSH-6 (Strain S2U5A6B) (Fig. 6A).

Fig. 6
figure 6

Functional and expression analysis of the S2U5A6M triplex GSH cell line. (A) Sanger sequencing of the S2U5A6M strain. (B) Growth and amylase activity of S2U5A6M clones and growth of S2U6M (no-amylase control) on minimal medium containing soluble starch. The plate was stained with iodine vapor and clear halos indicate starch hydrolysis. (C) Amylase enzyme activity assay. All clones significantly differed from BY4741 according to a Mann-Whitney U test but did not show statistically significant differences between themselves. (**) p < 0.01 and (***) p < 0.001. (D) Expression dynamics of neighboring genes by RT-qPCR. Green, yellow, and red columns respectively indicate expression of the ymUKG1, ymBeRFP, and α-amylase (α-AMY). Colored columns highlight neighboring genes (Blue at 5’ and Purple at 3’) of the GSH used and gray columns show the neighboring genes of the other GSHs. The cracked columns represent values from the untransformed control (BY4741). Relative expression on the Y axis and genes analyzed on the X axis. Asterisks indicate significant changes in the analysis of variance (ANOVA). (***) p < 0.001 and (****) p < 0.0001.

S2U5A6B presented a capacity to degrade starch (Fig. 6B and Fig. S16) despite a modest amylolytic activity (Fig. 6C). It also exhibited expression of α-AMY and ymBeRFP, without affecting the neighboring genes (Fig. 6D). Most importantly, the result showed that it was possible to simulate the construction of polygenic circuits of interest using the GSHs described by SHIP.

Discussion

We used the program SHIP to identify six regions with characteristics of GSHs in S. cerevisiae and validated five of them (GSH-2 to 6) in vivo (Fig. 1). These fulfilled fundamental criteria for a safe landing pad:1 accessibility to transgene integration (Fig. 2)2, genomic stability through 100 mitotic generations (Fig. 3B) without reducing cell viability (Fig. 3D and Fig. S6)3, predictable and constant expression of the transgene (Fig. 3F and Fig. S7), and4 minimal interference with genomic and flanking gene expression (Figs. 4 and 5). As GSH cell lines restored the uracil biosynthetic pathway by reintroducing URA3, a higher growth rate than the control strain was expected (Fig. 3D and fig. S6)31,32. Flow cytometry data showed that approximately 90% of the cells accumulated the reporter protein (Fig. 3F and Fig. S7), demonstrating high transgene expression with no morphological effects (Figs. S9, S10). qPCR results revealed that targeted insertion of BioBricks into GSHs near telomeric regions (GSH-2, 3, and 6) resulted in more than one gene copy. In contrast, those in the middle of the chromosome arms (GSH-4 and 6) resulted in one copy (Fig. 3E). These additional inserts found in clones with GSH near the telomeric regions can be considered off-targets. Although these follow biological characteristics of subtelomeric regions25,26 further investigations will be needed to identify the locations of these insertions and to determine whether they occur during or after the transformation process. Analysis of the four annotated genes flanking GSH-2, 3, and 5 revealed that three belong to gene families (PUG136, ARN137, and COS938). However, the number of copies did not correlate to the ymUKG1 transcript accumulation (Fig. S11) since GSHs-2 and -3, probably due to their telomere proximity, had the lowest relative expression profile. This information gives the user of the SHIP tool flexibility in selecting putative GSHs based on the intended transcript accumulation, allowing fine-tuning of gene expression control.

RT-qPCR and RNA-Seq results from our study corroborate findings that transgene insertion affects the cell’s transcriptional program17,18. Previous studies aimed to develop software to identify GSHs in humans detected significant differential expression of several genes spread across different chromosomes. Furthermore, most DE genes are shared between cell lines, even at different integration sites. Approximately 33% of genes were upregulated, and around 25% were downregulated. Among the differentially expressed genes, most were detected across all GSH cell lineages (Fig. S13A, B), most being unannotated, putative, or with dubious ORFs.

Additionally, no functional enrichment was observed among these, without indication of a coordinated cellular response, such as a response to cellular stress or cell division reprogramming (Fig. S14). Considering that the expression profiles of GSH cell lines were very similar and the BioBricks were at different loci, the genomic perturbations did not follow a clear trend linked to the transgene insertion. Restoring uracil metabolism led to higher growth rates than the control and possibly contributed to this differentiation in genomic expression, as shown by previous studies31,32,33. As demonstrated by Alam, auxotrophic lineages presented 85% of the transcriptome with differential expression when compared to prototrophic lineage33. Nevertheless, this does not exclude the possibility of long-term effects of stress, particularly regarding epigenetic changes that may modulate genomic expression39,40,41,42.

Corroborating our data, three of the five GSHs identified by the SHIP program have been previously used as regions for transgene insertion in S. cerevisiae, indicating consistency in our predictions. Previous studies that identified these regions searched for intergenic sequences but did not conduct further analyses to validate the use of GSH-243, GSH-544, and GSH-645.

Furthermore, the GSHs described in this study supported multiplex insertion, paving the way for polygenic assemblies (Fig. 6). Despite this, the insertion of an entire metabolic pathway is a more complex approach that will demand deeper investigation related to the length limit of each BioBrick inserted in each GSH to avoid flanking gene perturbations and possible variation in insert expression depending on the GSH’s tridimensional position in the nucleus.

Finally, the SHIP program will significantly contribute to the rational and systematic identification of GSHs for eukaryotic organisms with an annotated genome, returning an overview of the genomic density and a detailed description of each region, leading to time and cost reduction for obtaining lineages of interest. It is worth highlighting that the algorithm deals with the genome as a conjunction of annotated parts. As it does not make any priori predictions based on the DNA sequence, the SHIP program has some shortcomings that may affect its applicability to other species. It depends on high-quality genome assemblies, ideally at the chromosome level, as well as accurate annotations. A lack of assembly contiguity and imprecision in gene, repeats, or regulatory region prediction can directly impact the results. Additionally, specific parameter adjustments are necessary to account for the characteristics of the taxa being studied, such as the range of intergenic lengths. The cross-species applicability of the software also requires careful analysis, considering the specific characteristics of each genome. Therefore, as new species are analyzed, it will be essential to incorporate specific modules in future updates. Furthermore, there may be cases where predicted GSHs do not pass experimental validation or exhibit behaviors that were not anticipated due to the presence of unannotated genetic parts. In addition, future improvements should include aspects of genome 3D structure and epigenetic marks related to chromatin states as an option in the SHIP software. In conclusion, the SHIP returns a list of putative GSHs requesting a critical analysis to choose the pGSH that best meets the users’ needs, providing the cornerstone to characterize and explore such genomic regions.

Materials and methods

Experimental design

The objectives of this work were the development of a software for the identification GSH in eukaryote genomes and the experimental validation of this tool in S. cerevisiae. The first step was the design of the Safe Harbor Identification Program (SHIP), an algorithm for identifying genomic safe harbors in eukaryotes using general feature data (Design). S. cerevisiae was selected for in vivo validation, leading to its genetic transformation with two overlapping BioBricks containing a reporter gene and an auxotrophic marker (Test), followed by subsequent results analysis (Analysis) (Fig. 1A).

Rules for predicting GSHs

The rule for predicting GSHs must change depending on the target genome, but two characteristics are constant: the orientation of neighboring genes and the length of the intergenic region. The orientation of neighboring genes was convergent for all three organisms analyzed in this study. The length of the intergenic region was determined as the rounded value of the average gene per bp as the upper limit and 500 bp less as the lower limit.

For the S. cerevisiae genome, the criteria for predicting GSHs were the absence of annotated gene, ncRNA, pseudogene, centromere, telomere, long terminal repeat (LTR), mobile genetic element, origin of replication, transposable element gene, meiotic recombination region, sequence annotated feature, autonomously replicating sequence, telomeric region sequence, mating-type cassette, retrotransposon mRNA, expressed sequence tag (EST), and predicted transcription factor binding sites. Additionally, for H. sapiens excluded regions with transfer RNA (tRNAs) and long intergenic non-coding RNAs (lincRNAs), and for M. musculus, transfer RNA (tRNAs), protein-coding and non-protein-coding genes (ncbiRefSeq). The SHIP algorithm included these rules for predicting pGSHs as Features and Tracks files.

SHIP algorithm

The Safe Harbor Identification Program (SHIP) is written in Python and searches genomic safe harbors in eukaryotes with complete and annotated genome sequences available in public databases (NCBI or Ensembl).

The inputs to SHIP are1 a genomic annotation in general feature format file (.gff);2 an optional regulatory annotation in (.gff) to characterize the GSH;3 two files (.json) with the indication of genetic parts to be analyzed (Features) and to be considered as GSH exclusion filter after verification of its presence in the University of California Santa Cruz (UCSC) Genome Browser46 (Tracks);4 characteristics desired by the user when searching for GSHs (Intergenic size range (bp), Neighboring genes orientation, and Analysis options). As outputs for sequential analysis, SHIP returns tabular and graphical reports with the distribution of intergenic distances and frequency of neighboring genes, the list of GSHs regions with the neighboring genes with their information from several databases, such as Uniprot47, RefSeq48, GO49, UniParc50, String, and among others. The program is available at www.github.com/MCLeitao/Ship.

The modular auxiliary files (Features and Tracks) are required to expand the algorithm’s scope and give flexibility based on user necessity, annotation quality, and genomic particulars of each organism. Features and Tracks used for each organism are listed in the supplementary (Table S2).

The SHIP algorithm has the following steps:

  1. 1.

    Organize the genome. Sort and filter the GFF, select the parts considered for the next steps (indicated by the user through the features.json file) and remove the completely overlapping annotation.

  2. 2.

    Table assembly and data charting. At this point, the algorithm identifies the intergenic regions by calculating the difference between the downstream gene start and the end of the upstream gene. A table is returned for the user with the number of chromosomes and genes identified, the number of intergenic intervals in total and within each neighboring gene orientation (Tandem, Divergent, and Convergent), and the disregarded amount of totally overlapping genes. The data is charted to facilitate the user visualization of the target genome distribution and organization.

  3. 3.

    Defining the GSHs characteristics. With the chart containing the target genome distribution and organization, the user can select the flanking gene orientation and the minimum and maximum size of the GSHs. Suppose the user has a GFF file with all regulatory features, such as DNA methylation, histone modifications, chromatin conformation, transcription factor binding sites, enhancers, repressors, and microarray annotations. In that case, the features within each GSH will be recorded for further processing.

  4. 4.

    Annotation tracks analysis. Based on the track file, the algorithm looks for these annotation tracks (indicated by the user) within each selected GSH through the genomic coordinate data from the UCSC Genome database. If the presence has been confirmed, the GSH is discarded.

  5. 5.

    Flanking genes analysis. Using the Ensembl REST API51, all genetic features linked to the neighborhood genes ID and all external references are recorded. It includes information related to the genes’ transcripts and translations. Especially when dealing with a multicellular organism, the algorithm also searches for the Regulatory Feature recorded in each Epigenome for the given species.

Strains and media

S. cerevisiae BY4741 (MATa, met15Δ, his3Δ, ura3Δ, leu2Δ)52 was obtained from the EUROSCARF collection (Y00000), and it was used to generate all the yeast cell line strains described in this study. Yeast was routinely grown in YPD medium (10 g L’ yeast extract, 20 g L’ peptone, and 20 g L’ glucose). The solid medium used 2% agar. MD medium used 0.34% Yeast Nitrogen Base (YNB), 1% (NH4)2SO4, 2% glucose and was supplied with 0,002% methionine, 0.002% histidine, 0,002% uracil or 0,01% leucine when needed.

DNA cloning of the single-copy reference gene TAF10, constructed from plasmid pPCV, was performed using chemically competent Escherichia coli XL-10 Gold (Agilent Technologies) grown in Luria-Bertani (LB) medium (5 g L’ yeast extract, 10 g L’ peptone, and 10 g L’ NaCl, pH 7.2). When needed, agar was added to a final concentration of 1.5%.

BioBricks assembly and yeast transformation

BioBricks (Table S5) design followed the homologous recombination strategy of 60 bp overlap arms and convergent gene orientation (see Table S6 for a list of primers). BioBrick one was composed of the monomeric version of a coral-derived green fluorescent protein coding gene53, codon-optimized for S. cerevisiae (ymUKG1)54, driven by TEF1 constitutive promoter. The second one was the auxotrophic marker URA3 from the plasmid pYC21055 with its promoter and terminator. Following the manufacturer’s recommendations, both were amplified using Platinum™ Taq DNA Polymerase High Fidelity (Invitrogen). PCR reactions were purified using the ReliaPrep™ DNA Clean-Up and Concentration System (Promega), concentrated in the Speed Vacuum, quantified in NanoDrop™ OneC Spectrophotometer (Thermo Fisher), and visualized on an agarose gel to verify integrity, size, and corroborate quantification.

S. cerevisiae BY4741 was used as the parental strain to generate the yeast strains with the BioBricks within each tested genomic safe harbor. BioBricks were assembled by in vivo homologous recombination through a transformation of yeast cells using the following protocols: the High-efficiency transformation by electroporation56 and the LiAc/SS-DNA/PEG transformation57 with the modifications58.

PCR identification and Sanger sequencing

After transformation, colonies were randomly selected to identify the correct insertion clones. Each transformed cell was grown in YPD medium overnight and had the genomic DNA extracted using the Wizard® Genomic DNA Purification Kit (Promega) following the manufacturer’s recommendations. The samples were amplified by PCR with the GoTaq® G2 DNA Polymerase (Promega) using genomic primers (Table S7) for complete amplification of the insert. Three clones of each GSH were selected for posterior analyses.

For the samples with correct insertion, PCR replicates were performed with the Platinum™ Taq DNA Polymerase High Fidelity (Invitrogen), genomic oligonucleotides, and following the manufacturer’s instructions. Then, the samples were analyzed using 1% agarose gel electrophoresis to confirm the size of the amplicon and separate any non-specific or partial amplifications. All image acquisitions were made using the Gel Doc EZ System 1,708,270 (Bio-Rad) and the Image Lab™ software with the default parameters. The gels were not processed. The bands of the expected size were cut from the gel, and the DNA was purified using the QIAquick PCR & Gel Cleanup Kit (Qiagen). Sanger sequencing was performed by ACTGene Análises Moleculares (Brazil) using the AB 3500 Genetic Analyzer (Thermo Fisher).

Genomic stability test

The genomic stability analysis of the GSHs regions after BioBricks insertion was performed using the continuous cell passage method58. The colonies were initially inoculated in 5 mL of YPD and grown overnight. Then, 5 µL of this culture was transferred to 5 mL of fresh YPD medium. This cycle was repeated for ten days, resulting in approximately 100 generations. Subsequently, the cells were plated on YPD, and fifteen clones were randomly selected and analyzed by PCR using the genomic primers (Table S7) for complete amplification of the insert for each technical triplicate.

Copy number analysis

The number of copies of the reporter gene integrated into each GSH was analyzed by qPCR using primers for the ymUKG1 gene and the single-copy reference gene TAF1059 (Table S8). qPCR was performed using Applied Biosystems’ Fast SYBR Green Master Mix and the QIAGEN Rotor Gene Q thermal cycler. Each qPCR reaction used 10 ng of genomic DNA. The number of copies of ymUKG1 in each genome was analyzed by absolute quantification using plasmid-based standard curves for calculating the absolute number of copies of each gene (ymUKG1 and reference gene TAF10) in the samples, according to the thermal cycler manufacturer’s protocol60. The standard curve for ymUKG1 used plasmid YCUkGy, constructed in this study and 6972 bp in size, while the standard curve for TAF10 used plasmid pPCV-TAF10, constructed in this study from plasmid pPCV and 3,012 bp in size61. The number of copies of ymUKG1 in each genome was obtained by normalizing the results for the reporter and reference genes.

Flow cytometry and growth rate analysis

The accumulation of ymUkg1 protein was analyzed by flow cytometry using the BD FACSVerse™ Cell Analyzer (BD Biosciences) configuration 4-Blue 2-Red (ROU) and flow between 500 and 900 events per second. Pre-inoculated clones grown overnight were inoculated in 20 mL of YPD in 125 mL Erlenmeyers with an initial OD600 of 0.1. Growth was measured at 0, 2, 4, 6, 12, 24, and 36 h using the NanoDropTM OneC (Thermo Fisher). For each evaluated time, 1 × 106 cells were collected and fixed with 4% Formaldehyde and 1% Phosphate-buffered saline (PBS) for 30 min at room temperature and stored at 4 °C until use. Analysis of the cytometer occurred on the same day of fixation or within two days. Each assay was conducted using technical and biological triplicates, analyzing 25,000 events. Data was analyzed using the FlowJo™ v.10.9.0 software (BD Biosciences). Data were validated by FlowAI62 to select events with SSC-A/FSC-A patterns compatible with viable cells, then cells with FSC-H/FSC-A patterns of single and non-overlapping cells were analyzed for fluorescence detection by the FITC-A channel. Once these analysis gates were determined, all samples were selected and processed under the same gates in the same workspace.

Electron microscopy

The GSH cell lines were analyzed under a scanning electron microscope (SEM) to visualize possible cell shape and surface changes. The cells were fixed with Karnovsky fixative solution (2% glutaraldehyde, 2% paraformaldehyde, and 3% sucrose in sodium cacodylate buffer 0.1 M pH 7.2) overnight at 4 °C. Cell cells were washed with 0.1 M sodium cacodylate buffer, pH 7.2 the following day. A drop of the cells was seeded on 18 × 18 mm coverslips covered with Poly-D-Lysine for 3 min. After adhesion, the coverslips with cells were incubated in vapor of sodium tetroxide 2% osmium for 30 min and then washed with distilled water. Dehydration was carried out in increasing series with acetone (50–100%) and, finally, drying to a critical point using CPD 030 (Balzers) and SCD 500 metalization (Leica), to be analyzed in a JSM 7001 F scanning electron microscope (15 kV) (Jeol).

Total RNA extraction

For each of the GSH cell lines and the control BY4741, one clone was selected to extract total RNA in biological replicates performed on independent days. The strains were cultivated in 20 mL of YPD within 125 mL Erlenmeyers for 16 h. Afterward, 2 mL of each cultivation was employed for total RNA extraction with the RNeasy Plus Micro Kit (Qiagen), following the manufacturer’s instructions. The extracted RNAs were quantified using the Qubit 2.0 Fluorometer (Thermo Fisher) and a quality check using electrophoresis and the 2100 Bioanalyzer (Agilent). Finally, a purity analysis was conducted with NanoDropTM OneC (Thermo Fisher) to detect possible DNA contamination.

RT-qPCR

For the RT-qPCR, the cDNA library was assembled with the High-Capacity cDNA Reverse Transcription Kit (Applied Biosystems). Random Primers and 2 µg of each RNA sample as templates were used, heeding the kit protocol. The cDNA was diluted 10x, and 2 µL of this dilution was used for each qPCR reaction in a final volume of 10 µL. A negative control was included for each gene analyzed (NTCs). qPCR reactions were performed using Syber Green Master Mix (Thermo Fisher) following the manufacturer’s instructions in the QuantStudio 3 Real-Time System (Thermo Fisher). Replicates were run on the same plate for each sample, and to avoid systematic differences within the PCR procedure, samples were randomly distributed on the plate. Genes ACT1, ALG6, and TAF10 were used as internal controls to normalize the data59. The Ct value was used to estimate the specificity of the amplified products. The primers used and cycling conditions are listed in (Table S8).

RNA-Seq sequencing and data processing

RNA-Seq analysis of each of the GSH cell lines and BY4741 was performed by GenOne Biotechnologies (Brazil) using the NovaSeq 6000 Sequencing System (Illumina). Supplementary Table S9 presents the total count for each yeast cell line analyzed in this study. All analyses were performed using the reference genome of S. cerevisiae S288C (R64.5.1).

For sequence quality control, the fastp v0.23.4 software was used for adapter removal and low-quality sequencing trimming using Q30 as the quality threshold63. Sequence mapping against the reference genome was done with the STAR 2.7.11b software64, and the software htseq-count 2.0.365 was used to estimate the transcript counts. Differential expression estimation was performed in the R environment with the libraries edgeR 4.2.066 and DESeq2 1.44.067. The samples were filtered with a value of statistics p < 0.05 and adjusted to control the false discovery rate (FDR) the Benjamini-Hochberg method (R package stats 3.6.2) by setting FDR < 0.001. The data was visualized (Heatmaps and PCAs) using the ClustVis library 0.7.768 and Volcanos Plots with ggplot 2 3.5.169, with a limit of log2FC ± 0.6 to highlight the significant differences. We used pheatmap 1.0.1270 for visualization and plotting.

Ontology enrichment analysis

Gene ontology enrichment analysis of the set of differentially expressed genes shared among all GSHs lineages was carried out using the program g: Profiler71, with a significance threshold of FDR < 0.01.

Amylase enzyme activity

The transformed strains and BY4741 were grown in YPD medium containing 0.4% of Asp-Glu buffer for 72 h (with 2% glucose feeds every 24 h) at 28 ºC and 200 rpm. 1 mL of each culture was centrifuged at 4,000 rpm for 2 min, and the supernatant was used in the enzyme activity assay. Enzyme activity was measured using a variation of the starch-iodine method for quantifying soluble starch as the remaining substrate after amylase activity72. 100 µL of a 0.5% starch solution was heated at 40 ºC along with 40 µL of 0.5 M sodium acetate buffer. 60 µL of the supernatant of each sample was added to the mixture and incubated at 10–30 min at 40 ºC. After incubation, 200 µL of 1 M acetic acid was added to stop the reaction, along with 200 µL of an iodine solution (0.2% I2, 2% KI). 4.4 mL of water were added to each reaction and the results were read in a Spectramax M5 Spectrophotometer at 660 nm. A standard curve of starch concentration was constructed, ranging from 0.1 to 0.5 mg, and one enzyme unit was defined as the amount of enzyme necessary to hydrolyze 0.1 mg of starch per minute.

Statistical analysis

The candidate gene relative expression analysis was conducted using the Biogazelle qBase +3.0 software73. The normalization of gene expression was carried out using the most stable reference genes identified by the geNorm74. geNorm was used to compute the paired variation (Vn/n + 1) between each reference gene, determining the optimal number of reference genes required for accurate normalization. The relative expression results were further analyzed using the REST© 200975. GraphPad Prism (version 9.0) was employed to perform a two-way ANOVA with a 95% confidence interval for the differences and a significance level of p < 0.05 between the groups.