Introduction

Cotton is an essential plant cultivated globally for its fiber as raw materials for fabric production, seeds for feed and food supplements, and straw for bioenergy generation1,2,3,4. Nevertheless, cotton production is constrained by disease-causing microbes, including fungi, which are estimated to exert approximately 15–20% yield loss annually5,6,7. Some fungal diseases of cotton include Fusarium and Verticillium wilt8, Alternaria leaf spot (foliar disease)9, and boll rot caused by Corynespora cassiicola10. Foliar-related infections form approximately 80–90% of cotton diseases9,11.

Verticillium (genus) dahliae(species) belonging to the phylum Ascomycota, class Sordariomycetes, and family Plectosphaerellaceae is an established member of the filamentous fungi genera classified by Nees von Esenbeck in 181712. V. albo-atrum was the first pathogenic strain isolated in 1879 from potatoes in Germany12. The diversity of Verticillium has now expanded, forming several groups, including saprophytes, plants, and animal disease-causing Verticillium spp13. Current DNA sequencing and phylogenetic tools have accelerated the discovery of the evolutionary histories of these species and resolved several outstanding controversies in their identifications14. Apart from cotton and potato plants, V. dahlia has a broad range of hosts, such as watermelon, tomato, strawberry, lettuce, eggplant, chilli pepper, and cabbage12,15. The symptoms accompanying V. dahliaeinfections include leaf wilting, plant dysplasia, vascular bundle browning, yellowing, and early death16. These symptoms highlight the economic importance of the pathogen.

Several whole-genome sequencing studies of diverse strains of V. dahliae with varying genome sizes, scaffold lengths, and effector proteins have been identified. The defoliating and nondefoliating strains of V. dahliaeXJ592 and XJ511 showed varied genome characteristics, such as 35 and 34 million each of genome sizes, respectively. The GC contents in the XJ592 and XJ511 strains also exhibited slight variations of 35.28% and 53.98%, respectively17. The secretory proteome was predicted to be 793 in XJ592 and 794 in XJ511, indicating the closeness of these strains. Identifying these proteins is significant in designing appropriate control factors against the diseases they cause. For example, Verticillium dahliae SnodProt1-Like Protein, VdCP1 expression increases throughout the infection process of V. dahliae, and its mutant lines showed no effect on pathogenesis18.

Different fungi strain harbor varying pathogenicity effects and can interact differently with host immunity19. These variations can be elucidated at the genomic level via genome sequencing analysis. Here, we report the genome-sequencing of a novel V. dahliae strain, herein referred to as huangweibingjun, isolated from diseased cotton roots. We showed the genome architecture, predicted proteome, and virulence factors governing the pathogenicity of huangweibingjun.

Results

Morphological and evolutionary analysis of Huangweibingjun

Naturally infected root samples of cotton plants exhibiting wilting symptoms were collected from the field, surface-sterilized, inoculated on potato dextrose agar (PDA), and incubated in darkness. Observations of the huangweibingjun strain under a stereomicroscope (MZ16 F, Leica Microsystems, Germany) revealed fluffy, creamy white mycelia densely surrounding the inoculated roots and gradually spreading (Fig. 1a). Numerous oval conidia were produced, appearing detached from the conidiophores (Fig. 1a). Further evolutionary analysis (Maximum Likelihood phylogenetic tree) (Fig. 1b) of the protein sequences of huangweibingjun after genome sequencing showed that it is closely related to the getta_getta strain of V. dahliae but distantly related to the Gwydir1 A3, V152, s12251, and the VdLs.17 (Fig. 1b). A petal map of the pan and core genes shared among the reference and assembled genomes showed that 6,594 genes were shared among the seven species (Fig. 1c).

Fig. 1
figure 1

Morphological and molecular characterization of huangweibingjun strains. (a) Macro- and microscopic features of Verticillium. dahliae (huangweibingjun) grown on potato Dextrose Agar (b) ML phylogenetic tree characterizing huangweibingjun strain compared to other Verticillium dahliae strains. The red font shows the target strain, while black fonts are the strains that share a close identity with huangweibingjun. The origin of strain in the tree is also shown along a comparison of the Assembly size (AS), N50, L50, GC percentage, and the secreted proteome (SP) count (c) A petal diagram illustrating the shared genes among the seven strains. The core denotes the total number of common genes shared among these fungal strains.

Genome sequencing and assembly of the V. dahliae (huangweibingjun)

The huangweibingjun genome sequencing generated 57.0 Mb raw reads. After quality control analysis, 47.25 Mb clean reads were obtained. The genome comprises six chromosomes (scaffold7561, scaffold7329, scaffold7795, scaffold5491, scaffold5473, and scaffold4511). The start and end positions of each chromosome are shown in Table S1. The assembly statistics also showed that 10,328 scaffolds were ≥ 0 bp, 3228 scaffolds ≥ 500 bp, and 2293 scaffolds ≥ 1 kbp. The longest scaffold was 177,267 bp long. The N50 length (36,867) of huangweibingjun is shorter than its closest neighbor, Getta_Getta (35, 4481), but longer than Gwydir1 A3 (22058) and VdLs.17 (43302). However, the L50 length of huangweibingjun (286) is longer than the Getta_Getta (30) and VdLs.17 (218) strains. Additionally, the GC content (59.85%) of huangweibingjun is higher compared to the Getta_Getta (55.44), Gwydir1 A3 (55.76), s12251 (55.96), VdLa.17 (55.79), and VDG1 (56.15) (Table 1; Fig. 1c). We further assessed the completeness of the assembly, using Illumina reads and BUSCO, achieving approximately 97.57% of Illumina reads mapping to the new assembly and a high level of completeness of 96.12% in the BUSCO analysis.

Table 1 Genome features of Verticillium dahliae (huangweibingjun).

Gene prediction and functional annotation

Following the genome sequencing of V. dahliae, the GeneMark-ES (v4.33) software20 was used to predict 9,901 genes with a total length of 14,120,809 bp and an average gene length of 1,426.2 bp (Table 1). Non-coding RNAs (ncRNAs), which do not carry translatable information but are biologically significant in several other activities, were also annotated. 93% of these ncRNAs were ribosome-bound (rRNA), 7% were sRNA, but 0% were detected for tRNA (Supplementary Fig. S1). The sequence repeats in our V. dahliaegenome, which might result in non-specific gene hits, was masked using the RepeatMasker before gene annotation21. Furthermore, tandem and interspersed repeats, including microsatellite sequences, satellite DNA, and small satellite sequences, are illustrated (Table S2).

Common functional annotation of genes

The predicted genes were subjected to functional annotations using various tools. Among the 9,901 proteins in the V. dahliae genome, 9,831 proteins, representing 99.29%, were annotated on the non-redundant database (NR), while 6,419, representing 64.83%, were annotated on the Swissprot database. The KEGG database also annotated 3,414 (34.48%), KOG 4,898 (49.47%), and 8,072 (81.53%) proteins were annotated on the eggNOG. In addition, 6,298, representing 63.61%, were annotated on the GO database, while only ten (10) genes (0.10%) were annotated on Pfam (Table S3). The annotation of genes also showed that 2,761 genes were common among KEGG, KOG, GO, Swissprot, eggNOG, and NR, while 1,913 genes intersected among KOG, GO, Swissprot, eggNOG, and NR (Fig. 2). The distribution of annotated genes of the top ten species was also performed to determine the species huangweibingjun strain conform with. The results showed that huangweibingjun shared 54.98% gene similarity with V. dahliae VdLs.17, 38.99% with V. longisporum, and 4.57% with V. alfalfae VaMs (Supplementary Fig. S2).

Fig. 2
figure 2

Upset plot illustrating common genes among various gene annotation analyses, including Pfam, KEGG, KOG, GO, Swissprot, eggNOG, and NR. The number of genes annotated in each database is shown at the end of each color bar corresponding to the database. The number of common annotated genes is also illustrated at the top of each.

EggNoG and gene ontology (GO) annotation of genes

A quick view of the functional annotation of gene sequences was studied using the eggNOG mapper. The A-Z evolutionary genealogy of genes (Fig. 3a) illustrates the number of genes annotated under each category. A total of 8,262 genes were annotated. Out of these, 637 genes were annotated for carbohydrate transport and metabolism, 458 genes were identified to be responsible for the posttranslational modification, protein turnover, and chaperones, and no gene target was found among the query for extracellular structures, but 3,896 were annotated as genes with unknown functions (Fig. 3a). The GO annotation was also performed to identify the genes enriched under each ontology term: molecular functions, cellular components, and biological processes. The GO annotation (Fig. 3b) showed that 1,444 genes were enriched under biological regulation, five (5) genes under cell killing, 1,236 were enriched for establishing localization, and 342 genes were annotated under signaling. In addition, 79 protein-binding transcription factors were enriched under molecular function, while 27 genes were enriched under receptor activity (Fig. 3b). The GO annotations revealed several virulence-related genes.

Fig. 3
figure 3

Functional annotation of predicted genes. (a) eggNOG classification of V. dahliae genes. The number of genes under each class is shown in the bubble plots. The vertical axis denotes the number of genes, while the horizontal axis represents eggNOG-classified genes. The interpretation of the alphabetical labeling is shown in the legends. (b) Gene ontology classification of predicted genes in the huangweibingjun genome. The x-axis represents the number of genes, and the y-axis represents the GO terms. Green bars represent molecular function, pink bars denote cellular components, and dark blue bars represent biological processes. The number of enriched genes is illustrated at the end of each bar. (c) KEGG annotation of huangweibingjun predicted genes. The vertical axis represents the enriched KEGG pathways, and the horizontal axis represents the number of enriched genes in each pathway, as illustrated at the end of each bar. Predicted genes enriched under KEGG are categorized into metabolism (green bars), genetic information processing (pink bars), environmental information processing (dark blue bars), and cellular process (magenta bars). (d) Classification of carbohydrate-associated enzyme (CaZy) in huangweibingjun genome. The vertical axis represents the number of annotated genes, while the horizontal axis denotes the annotated CaZy family proteins in the huangweibingjun genome. The number of genes in each CaZy family is shown at the top of each bar.

Kyoto encyclopedia of genes and genomes (KEGG) annotation

The predicted genes were also subjected to the KEGG annotation to identify the metabolic pathways the genes encode. A total of 3,308 genes were annotated into 24 pathways. Among these include carbohydrate metabolism (356 genes), xenobiotic biodegradation and metabolism (80 genes), and glycan biosynthesis and metabolism (80 genes). Fungi secondary metabolites (63 genes) and metabolism of terpenoids and polyketides (36 genes) pathways were also enriched (Fig. 3c). The enriched pathways highlight the potential for successful environmental adaptation and pathogenicity of huangweibingjun.

The carbohydrate-associated enzyme (CaZy)

Carbohydrate-active enzymes (CaZy) break down complex carbohydrates and are responsible for phytopathogenic fungi nutrient acquisition from plant hosts, infection, and colonizing in the host22,23. The carbohydrate-active enzymes database was inferred for the CaZy annotation24 (Fig. 3d). The CaZy database constitutes over 300 protein families, including glycoside hydrolases (GHs), which comprise almost half of the CaZy protein families in genomes and are crucial for the hydrolysis and/or transglycosylation of glycosidic bonds. In the huangweibingjun genome, 265 members were annotated, making GHs the most abundant (44.3%) CaZy in the genome. The glycosyltransferases (GTs) catalyze the formation of glycosidic bonds from phospho-active sugar donors25,26,27. The V. dahliaegenome harbors 92 members of GTs. In addition, thirty-six (36) polysaccharide lyases (PLs) members are present, forming 6% of the CaZy family in the huangweibingjun genome and are responsible for the cleavage of uronic acid-containing glycosidic bond28. The carbohydrate esterases (CEs) enhance GHs activities by eliminating ester-based modifications in mono-, oligo-, and poly-saccharides carbohydrates. Ninety-three (93) members of CEs are present in the V. dahliae genome, forming 15.5% of the CaZy members in the fungus. Carbohydrate-binding modules (CBMs) and the auxiliary activities comprise 5 and 107 members in the V. dahliae genome, respectively (Fig. 3d).

Fungal pathogen virulence factors predicted in V. dahliae (huangweibingjun)

Virulence factors (VF) centrally regulate the interactions between plants and fungi during infection29,30. VFs are mostly chemical signals pathogens generate to enhance colonization in their host (biotrophic stage) or damage the host’s immune responses (necrotrophic stage). To determine the potential virulence factors in V. dahliae, protein sequences were queried on the Database of Fungal Virulence Factors (DFVF)31, and 1,057 proteins were predicted to be involved in huangweibingjun pathogenicity (Table S4). Compared to the closest species, V. dahliae (XJ592) genome contains 10,305 VF, and V. dahliae(XJ511) harbors 10,462 VF proteins17. Fifty-seven (57) VF proteins in huangweibingjun share ≥ 70% similarity to several virulence factors involved in various disease proliferation (Table 2). For example, 8647_t protein encodes a virulence factor (Q96 VA7_FUSOX) with 100% sequences consensus with the virulence factor in Fusarium oxysporum and involved in blight disease. In addition, 6038_t encodes a virulence protein belonging to the small GTPase superfamily Rho family and is 94.5% similar to the M. oryzae virulence factor responsible for causing rice blast infection. We found that 6074_t is 90.87% identical to the leaf spot-inducing protein harbored by Colletotrichum lindemuthianum (Table 2). These results indicate that huangweibingjun has high virulence factors similar to those already reported by VF, which could enhance the proliferation of several plant diseases.

Table 2 Verticillium dahliae (huangweibingjun) virulent factors with ≥ 70% similarity index.

Furthermore, we analyzed the evolutionary antecedence of the predicted virulence proteins with ≥ 50% sequence identities using a neighbor-joining phylogenetic tree. The evolutionary tree cladded the VF proteins into eight groups (I-VIII). Moreover, we found that some predicted VF in huangweibingjun are cladded together, indicating they may have evolved from a common ancestry. For example, the leaf spot virulence factor secreted by Cochliobolus carboum, blight virulence from Fusarium oxysporum, the streaking leaf virulence produced from Botrytis elliptica, and rice blast virulence from Magnaporthe oryzae are evolutionarily related in our newly isolated huangweibingjun (Clad I) (Fig. 4). Similarly, the rot-inducing protein elicited by Gibberella publicaries, the leaf spot proteins from M. grisea and Mycospopharella graminicola, and the blight virulence factor secreted by M. grisea might have evolved from a common ancestry (Clad III) (Fig. 4), suggesting their common virulence history might provide clues to engineering programs towards mitigating the disease they cause.

Fig. 4
figure 4

Evolutionarily analysis of Verticillium dahliae virulent factors (≥ 50% similarity). Protein sequences of Verticillium dahliae virulence factors were clustered on MEGA 11, and the evolutionary tree was constructed using the ML method.

Pathogens interact with their potential host at the cellular, molecular, or organismal level to induce their virulence effect32,33,34. Several proteins are involved in these interactions to enhance the success of pathogens. Therefore, the pathogenicity predictive tool, the Pathogen-Host interaction (PHI) database35, was inferred to predict gene orthologs in huangweibingjun with increased virulence. 2,496 genes were annotated to be involved in PHI (Table S5), of which 33 genes were predicted to harbor a hypervirulence effect. Some notable PHIs factors are effector proteins, pectate lyase, and transcription factors (Table 3).

Table 3 Pathogen-host interaction (PHI) factors predicted in V. dahliae (huangweibingjun).

V. dahliae (huangweibingjun) effectors and PAMPs-inducing proteins

Plant pathogens directly inject effector proteins into their host cells to induce host cell signaling or cause the suppression of the innate immune systems of their host30,36,37. However, plants have evolved intracellular or cell surface receptors for pathogens’ recognition through effector perception termed pathogen-associated molecular patterns (PAMPs)37,38,39,40. The PAMPs-inducing factors in huangweibingjun are highlighted below.

Fungi secretory proteins significantly contribute to their environmental adaptation and successful colonization in their target hosts30,41,42. These secretory proteins harbor domains that stringently involve carbohydrates and protein degradations such as pectate lyase, glycoside hydrolase, protease, etc43,44. For example, FoEG1, a glycoside hydrolase family 12 protein isolated from Fusarium oxysporum, induces apoplastic cell death and promotes infections in cotton and tomatoes45. The secretory proteome of huangweibingjun was analyzed using the signal peptide predictive tool (signalP, version 4.1)46, and 688 proteins were predicted (Table S6).

These effector proteins share sequence similarity with other secreted proteins annotated in other pathogens. To further understand the classification of our huangweibingjun-specfic secreted effector proteins, the protein sequences were extracted and aligned using MUSCLE, and a neighbor-joining phylogenetic tree was constructed using MEGA 11 (Fig. 5). Four members of the pectate lysate genes involved in host tissues degrading and nutrient foraging for pathogen colonization were found among the effectors classified. The Glycoside hydrolase superfamily (Glyco_hydro) proteins are members of the CaZy enzyme family known for their involvement in the hydrolysis of glycosidic linkages mainly responsible for cell wall degradation to enhance pathogen colonization47,48. Based on amino acid sequence similarities, five families comprising 17 proteins classified as glycoside hydrolase 7, 10, 11, 28, and 31 were predicted in V. dahliae. A recent investigation demonstrated that VdCE11 promotes cotton susceptibility by increasing the hydrolase activity of GhAP149. In addition, silencing the hydrophobin 2 protein, VdHP1, from V. dahliae stifled microsclerotia formation, enhanced hydrophilicity, and promoted V. dahliae’ssensitivity to NaCl. In contrast, the wild type induced cell death and promoted the pathogenicity of the fungi50. Our secreted proteome mining identified two homolog proteins encoding hydrophobin 2 (Fig. 5 and Tables S7 and S8).

Fig. 5
figure 5

Phylogenetic classification of Verticillium striae secretory proteome. Each protein’s locus ID, Pfam ID, and short description are used on the leaf. E.g., 7620 t (locus ID) ID: PF01061(Pfam ID) ABC2 membrane (short description). The gradient shows the bootstrap scale.

Secondary metabolic and mycotoxin pathways

Filamentous fungi are crucial producers of mycotoxins and other secondary metabolites, which perhaps aid their exploitation for niche by fighting off competitors. The metabolic pathways analysis of V. dahliae (huangweibingjun) showed a potential for aflatoxin biosynthesis as three proteins were identified to enrich the aflatoxin synthetic pathway (3244_t, 4106_t, 6737_t) (Supplementary Fig. S3a). Whereas 36 genes were predicted to be responsible for terpenoids and polyketides biosynthesis in huangweibingjun (Supplementary Fig.s S3b and S4), 23 genes were found in the genome of M. grisea51. Microbial niche exploitation has also been suggested to involve efficient detoxification of xenobiotic compounds to colonize a toxic environment by depending on carbon, sulfur, phosphorus, and nitrogen. A further prediction of genes showed that huangweibingjun could exhibit cosmopolitan characteristics as 144 genes were predicted to encode Xenobiotics biodegradation and metabolism.

Furthermore, glycine, serine, and threonine metabolism influence fungi growth and pathogenicity, and enhanced levels of these metabolites promote fatty acid biosynthesis and the TCA cycle for fungal nutrient provision52. Therefore, their potential metabolism in huangweibingjun could influence their adaptation to a broad spectrum of niches (Supplementary Fig. S5a). Cysteine and methionine are sulfur-containing amino acids crucial to protein synthesis and cell life. For instance, the metabolism of these metabolites in Fusarium graminearum is regulated by FgMet3 and FgMet14for vegetative growth, sexual development, pigment formation, and penetrability in the host53. The success of pathogens relies on efficient reproduction. Therefore, the cysteine and methionine enrichment in huangweibingjun may influence its pathogenicity (Supplementary Fig. S5b).

Discussion

Cotton production is constrained by several biotic factors54,55. In recent times, V. dahliae, arguably the most devastating pathogen in cotton production, was estimated to exert a 32.49% loss in cotton yield in China in 202156. Although the genome of this fungus is sequenced, enhancing our understanding of its pathogenicity, the evolution of several strains harboring varying pathogenicity effects poses limitations to conquering this V. dahliae disease in plants. This study presents the genome characteristics of a novel strain, huangweibingjun, and its distinguishing features with its closer and distant neighbors. The secretory proteome of huangweibingjun is highly diverse and includes a protein family that aids in its pathogenesis. For example, four members (3, 7, 11, and 28) of the glycosyl hydrolase family proteins were identified in the huangweibingjun strain. Glycosyl hydrolase family 28, in particular, have been functionally elucidated as extracellular proteins that hydrolyze glycosidic bonds in pectin to enhance the delivery of the virulence factors of pathogenic fungi57. Additionally, VdGH7a, a member of GH7, exhibits hydrolytic activity against 1,4-β-glucan and induces cell death in N. benthamiana leaves58.

The cell walls of plants composed of cellulose, hemicellulose, pectin, etc., are the external protective barrier to foreign invasion59. Successful pathogens overcome this barrier by secreting cell wall-degrading enzymes such as pectate lyase, cutinases, pectinases, etc., to digest these carbohydrates23. As previously reported, all these proteins were predicted in the huangweibingjun genome, suggesting its potential pathogenicity60. Brito, et al.61 reported that Xyn11 (xylanase), a GH11 protein isolated from Botrytis cinerea, induces cell death in leaves and enhances the virulence of B. cinerea. In V. dahliae, two GH12 proteins, VdEG1 and VdEG3, were identified as cellulase-producing protein that triggers PAMPs, cause cell death, and upregulate immune responses in N. benthamiana37,62. Furthermore, VdPEL1, a pectate lyase, and VdCUT11, a cutinase, enhance the virulence of V. dahliae and activate PAMP activities in plants63,64.

The potential of huangweibingjun to elicit some secondary metabolites was also predicted. These secondary metabolites are categorized into terpenoids, polyketides, non-ribosomal peptides, and shikimic acid-derived compounds65. Whereas the biosynthesis of these metabolites was predicted in the huangweibingjun genome, previous findings suggest fungi secondary metabolites are responsive to stressors such as oxidative stress66. Mycosporines and polyol metabolites secreted by fungi are also pathogenicity factors and enhance fungi resistance against abiotic stresses, including UV and temperature67. Similar to Aspergillus and Fusariumspps., huangweibingjun could be a mycotoxin producer68,69, as three genes encoding aflatoxin biosynthesis were predicted.

The detailed genome characteristics of huangweibingjun provide a crucial foundation for future research on mitigating its impact and understanding its disease-causing effects in cotton plants.

Materials and methods

Isolation and identification of V. dahliae (huangweibingjun)

Naturally-infested cotton roots were sterilized in 75% ethanol for 10 min, rinse in distilled water 5 times and inoculated on potato dextrose agar plates for 48 h at 28 °C. Mycelia were harvested and subcultured on fresh PDA plates until pure cultures were obtained. Pure cultures were allowed to grow for seven days before they were microscopically observed using a light microscope. The identity of the isolate was determined through genomic DNA extraction using the Sangon DNA extraction kit and gene amplification of the Internal Transcribed Spacer (ITS) region, using primer pairs ITS5: GGAAGTAAAAGTCGTAACAAGG and ITS4: TCCTCCGCTTATTGATATGC, as described70.

Evolutionary analysis of V. dahliae (huangweibingjun)

The evolutionary antecedent of huangweibingjun was inferred by aligning protein sequences with orthologous single-copy genes for all species. The ParaAT (version 20)71and the RAxML72 were used to construct the ML evolutionary tree based on the General Time Reversible (GTR) and the GAMMA-distributed model.

Library construction and sequencing

The genomic DNA from huangweibingjun was extracted using a kit and protocol from Omega Fungal DNA Kit D3390-02. The genomic DNA was subjected to electrophoresis quantification. The quantified DNA samples were subsequently fragmented into 350 bp by Covaris, and the sequencing libraries were built using the TruSeq DNA LT Sample Prep kit. The DNA fragments underwent end repair, A-tailing, sequencing adapters, purification, PCR amplification., and finally, the library construction. After the library is qualified, the sequencer was used for double-end sequencing.

Genome assembly

Contig assembly

The de Bruijn graph approach was used to assemble the genome based on the generated paired-end sequencing data73,74. Sequencing reads from all small-fragment libraries were trimmed into shorter sequences, and de Bruijn graphs were constructed based on their overlapping relationships. To simplify the de Bruijn graph, branches that were unexpendable or had low coverage, primarily due to sequencing errors, were removed. Read alignment information was used to resolve bifurcations caused by repeat regions, and a random selection strategy was employed to merge a limited number of heterozygous sites. Despite simplification, the de Bruijn graph retains numerous unresolved fork sites, prompting sequence fragmentation at each fork to generate the initial contigs73,74.

Scaffold assembly

The reads obtained from all library sequencing were aligned back to the preliminary contigs, and the contigs were assembled into scaffolds using the ligation relationship between the reads and the insert size information73,74.

Raw data processing and quality control

Quality control analysis on raw read was performed using the Trimmomatic software75. The sequencing data with low-quality reads, ambiguous bases, and sequence adaptors were removed. The clean reads obtained were used for further downstream analysis.

Estimation of genome size by K-mer method

The consistency of the sequenced genome size to that of the predicted genome sizes, repeat structure, heterozygous rate, and sequencing depth was predicted using the K-mer analysis76.

Gene prediction and functional annotation

The Illumina-generated data were subjected to downstream bioinformatics analysis. The fungi genome is rich in repetitive sequences and transposable elements; therefore, the GeneMark-ES (v4.33) tool was applied to predict the genes in the Huangweibingjun genome20. Depending on their classes, different predictive tools were also used to predict the non-coding RNA (ncRNA). For tRNA, the tRNAscan-SE (v1.3.1) tool was used77, while rRNA and sRNA in the huangweibingjun genome were predicted using RNAmmer (v1.2) and Rfam (v10.0) tools, respectively78,79. Further, we predicted the repetitive nucleotide sequences in the huangweibingjun genome using the RepeatMasker (v4.0.7) software21. The prophages in the genome of huangweibingjun were predicted using the PhiSpy (v2.3) software80.

The sources of sequence redundancy (HTGS, EST, GSS, STS) were removed using the non-redundant database (https://www.ncbi.nlm.nih.gov/). The SwissProt functional annotation was performed using the online tool http://www.uniprot.org/. Further, the Kyoto Encyclopedia of Genes and Genomes (KEGG) database (http://www.genome.jp/kegg/pathway.html) was also used to annotate the enriched pathway among the predicted genes in the huangweibingjun genome. The EuKaryotic Orthologous Groups, or KOGs, functional annotation of genes was performed using the online tool at https://www.creative-proteomics.com/services/kog-annotation-analysis-service.htm. To further understand the evolutionary histories of genes and their functions, the EggNOG annotation was performed using the online tool http://eggnog.embl.de/. The Gene Ontology annotation of genes classified predicted genes into three ontology terms, including biological processes, molecular functions, and cellular components, using the GO term finder.

The http://www.cazy.org/database was also used to analyze and classify the carbohydrate-associated enzymes in the huangweibingjun genome24. The annotation of the virulence factors harbored by huangweibingjun was also performed using the Fungal Pathogen Virulence Factors (DFVF) database31. In contrast, the signal peptide predictive tool SignalP (v4.1) was used to predict the secretory proteome46. In addition, the pathogen-host interaction factors were annotated using the Pathogen-host interaction (PHI) database81.