Introduction

Chemical derivatization is a common practice in analytical chemistry to enable the analysis of compounds that otherwise cannot be easily separated or detected using chromatographic or spectroscopic technologies1,2,3. When derivatization is deployed in liquid chromatography-mass spectrometry (LC-MS) analysis, which is widely used in biomarker discovery4,5,6, untargeted metabolomics7,8,9 and environmental research10,11,12, high-throughput identification of chemically derived molecules (CDMs) presents an unmet challenge in data mining and molecule identification. This is because typical pipelines to interpret MS and tandem MS (MS/MS) data rely on spectrum matching in libraries (Fig. 1a), and only molecules for which a reference sample has been measured and its spectrum has been recorded can be confidently annotated13,14. However, only a very limited number of CDM reference spectra exist, and there is no specialized spectral library for CDMs so far. Moreover, CDMs often generate significantly different MS/MS spectra from the original molecules. For instance, a low weighted cosine similarity (WCS) of only 0.02 ± 0.01 (median, 95% CI) was found between the reference spectra of 231 underived molecules recorded in the MassBank of North America (MoNA), MassBank and Human Metabolome Database (HMDB) and the experimentally obtained spectra of their dansylated forms (Fig. S1a). Consequently, non-targeted analysis of CDMs often results in low confidence levels (level 4–5)15,16. Therefore, there is an urgent need to build specialized spectral libraries for CDMs, but experimentally measuring MS/MS spectra of CDMs at a repository scale remains unrealistic.

Fig. 1: Barrier lies in the liquid chromatography-mass spectrometry (LC-MS) based non-targeted analysis of chemically derived molecules (CDMs).

a A schematic workflow of molecule annotation in LC-MS based non-targeted analysis of CDMs. A spectral library for CDMs is the key to fast and high-confidence molecule annotation from massive data. b Current prediction tools are not designed for predicting spectra of CDMs, and the resulting low-quality predicted spectra cannot be used for building a specialized in silico spectral library. c The development of a specialized in silico spectral library for CDMs by DeepCDM. Transfer learning can cost-effectively reuse a model for a new purpose43,55,56, and can thus establish a new prediction model specialized for CDMs with improved accuracy, allowing the rapid construction of an in silico spectral library. The colored dots represent original molecules and the colored stars are chemical derivatizing reagents.

Recently, in silico methods have been developed, allowing rapid prediction of MS/MS spectra in a scalable manner (Fig. 1b). Current prediction tools are based on quantum chemistry simulation or machine learning. Quantum chemistry approaches, such as quantum chemistry electron ionization mass spectra (QCEIMS), are computationally expensive, which hampers their efficiency and application17,18,19. Machine learning approaches are more efficient17. However, existing machine learning tools, including neural electron-ionization mass spectrometry (NEIMS)20 and competitive fragmentation modeling for metabolite identification (CFM-ID)21,22,23, are generic tools built for underived molecules and offer low predictability for CDMs (Fig. S1b). Herein, we describe DeepCDM, a deep learning approach that enables accurate prediction of electrospray ionization tandem mass spectra (ESI-MS/MS) for CDMs. Rather than developing a new prediction tool from scratch, DeepCDM focuses on re-training existing algorithms using a small set of experimentally obtained CDM spectra via transfer learning. We demonstrate that DeepCDM is accurate and scalable, enabling the establishment of a dansylation-specialized model, Dns-MS, which improves upon generic tools, such as NEIMS and CFM-ID, in the compound-to-spectrum (C2MS) benchmark of dansylated molecules, and allowing the rapid construction of a specialized in silico spectral library containing 294,647 MS/MS spectra for dansylated molecules (Fig. 1c). Moreover, we demonstrate that DeepCDM is highly versatile and can be generalized to predict MS/MS spectra of other classes of CDMs by switching the training set of CDMs. We further demonstrate that DeepCDM is fully compatible with non-targeted analysis, enabling the rapid discovery of new hazardous chemicals in authentic wastewaters from the leather industry.

Results

DeepCDM workflow for dansylated molecules

As a proof-of-concept, we aimed to develop and validate DeepCDM for predicting MS/MS spectra of dansylated molecules. Dansylation is one of the most widely used chemical derivatization approaches in LC-MS based non-targeted analysis9,12,24, but there are only 315 MS/MS spectra of dansylated molecules available in MYCompoundID25 and 6 in MassBank26. To accurately predict MS/MS spectra of dansylated molecules, DeepCDM retrained the multilayer perceptron (MLP) of NEIMS into a dansylation-specialized MS/MS prediction model (Dns-MS). The goal of Dns-MS is to predict collision-energy-merged ESI-MS/MS spectra of dansylated molecules because such spectra are most often acquired in non-targeted analysis27,28,29,30,31,32. As NEIMS is an electron ionization (EI) spectrum prediction model20, we first trained the MLP of NEIMS using 6790 ESI-positive MS/MS spectra collected from MoNA, so that this pre-trained model (ESI-MLP) was suitable for ESI-MS/MS spectrum prediction (Tables S1, S2). Additionally, multiple MoNA spectra of one molecule at different collision energies were merged to generate collision-energy-merged spectra for training. ESI-MLP was further fine-tuned into Dns-MS using a training set of 507 experimentally obtained spectra of dansylated molecules, which were synthesized by labeling chemicals containing amine, hydroxyl, and carboxyl groups with dansyl chloride (DnsCl) and/or dansyl hydrazine (DnsHz) (Fig. 2a). The original molecular weights of the training set ranged from 59.0735 to 665.6982 Da, covering 62 classes of molecules such as benzene and substituted derivatives, carboxylic acids and derivatives, and phenols. The molecules had an average pairwise Tanimoto similarity of 0.11 on their Extended-Connectivity Fingerprints (ECFP4) (Fig. S2). For each dansylated molecule, spectra were collected at multiple collision energies ranging from 10 eV to 50 eV and merged automatically by the HESI Q-Exactive Plus Orbitrap (Thermo Scientific, USA).

Fig. 2: DeepCDM workflow for dansylated molecules and model evaluation.

a The illustration of DeepCDM to establish a dansylation-specialized spectrum predicting model (Dns-MS) for predicting collision-energy-merged electrospray ionization tandem mass spectra (ESI-MS/MS) of dansylated molecules. The multilayer perceptron (MLP) of neural electron-ionization mass spectrometry (NEIMS) was first trained using ESI-positive MS/MS spectra collected from the MassBank of North America (MoNA) into a pre-trained model, ESI-MLP, which was further fine-tuned into Dns-MS with a small training set of experimentally obtained spectra of dansylated molecules. The colored dots represent original molecules and the colored stars are dansylating reagents. b Weighted cosine similarity (WCS) comparisons between predicted spectra and experimental spectra of 167 dansylated molecules in a hold-out test set computed by Dns-MS and ESI-MLP. c WCSs of Dns-MS predicted spectra for dansylated molecules with amine, hydroxyl and carboxyl in the test set. Amines: 0.66 ± 0.03, n = 51; hydroxyls: 0.69 ± 0.03, n = 74 and carboxyls: 0.72 ± 0.04, n = 42 (median, 95% CI). No significant difference was observed between the 3 groups (p = 0.1467) through one-way analysis of variance (ANOVA). d Compound-to-spectrum (C2MS) capability of Dns-MS was benchmarked on hold-out dansylated molecules in the test set with generic tools. WCSs of predicted spectra generated by NEIMS, CFM-ID 4.0 and Dns-MS for dansylated molecules in the test set matching with the 167 experimental spectra. NEIMS: 0.38 ± 0.01, CFM-ID 4.0: 0.46 ± 0.02 and Dns-MS: 0.69 ± 0.02 (median, 95% CI). Dns-MS had significantly enhanced WCSs in comparison with CFM-ID 4.0 (p = 1.20e-38) and NEIMS (p = 8.19e-78). Unpaired two-tailed t-tests were performed to evaluate statistical differences between 2 groups (n = 167 in every group). e The predicted spectra of dansylated 4-isopropyl-3-methylphenol generated by NEIMS, CFM-ID 4.0 and Dns-MS, matching with its experimental spectrum. ns: p > 0.05, ****p ≤ 0.0001. 
Source data are provided as a Source Data file.

Fine-tuning at the last layer of ESI-MLP showed slightly better performance over fine-tuning at all layers and was thus employed as an optimal fine-tuning strategy for Dns-MS (Fig. S3). Besides Dns-MS, we also developed Dns-MS-β by directly training the MLP of NEIMS using a combined set of 6790 MoNA spectra and 507 experimental spectra of dansylated molecules (Fig. S4). Nevertheless, Dns-MS demonstrated much better predictability over Dns-MS-β (Figs. S4, S5), suggesting that fine-tuning using even a small set of spectra of CDMs is critical to ensure the success of DeepCDM.

Evaluation of DeepCDM for dansylated molecules

The compound-to-spectrum (C2MS) capability of Dns-MS was examined using a hold-out test set containing 167 experimental ESI-MS/MS spectra of dansylated molecules. Weighted cosine similarity (WCS) between Dns-MS predicted spectra and experimental spectra in the test set was determined to evaluate the predictability of Dns-MS (Table S3). Although fine-tuned with only a small training set of dansylated molecules, Dns-MS achieved much improved predictability (median WCS = 0.69 ± 0.02, 95% CI, n = 167) over ESI-MLP (median WCS = 0.37 ± 0.03, 95% CI, n = 167) (Figs. 2b, S4, S5). Moreover, Dns-MS showed equally high spectrum predictability for dansylated molecules synthesized through labeling chemicals with amine, hydroxyl and carboxyl groups (Fig. 2c), suggesting that Dns-MS can be broadly used with no structural bias.

In a 4-fold cross-validation, Dns-MS demonstrated robust performance across 4 randomly split dansylated datasets (Fig. S6a). The Dns-MS predicted spectrum and the experimental spectrum of every dansylated molecule in the test set were also matched against the experimental spectrum of the structurally most similar molecule in the training set. The predicted spectra of query molecules had low similarity to the experimental spectra of their structurally similar molecules, with a median WCS of 0.27 ± 0.02 (95% CI, n = 167). A similar observation was made between the experimental spectra of query molecules in the test set and those of structurally similar molecules in the training set (median WCS = 0.22 ± 0.03, 95% CI, n = 167, Fig. S6b, c). This confirmed the good predictability of Dns-MS for dansylated molecules that have not been exposed to the model.

Then, we compared the performance of Dns-MS with two widely used spectrum prediction tools, NEIMS and CFM-ID 4.0 (web server), using the test set of 167 experimental spectra. Figure 2d and Fig. S7 revealed better predictability of Dns-MS, with a median WCS value of 0.69, over the two other machine learning-based tools, with median WCS values of 0.38 for NEIMS and 0.46 for CFM-ID 4.0. Figure 2e shows the spectra of a representative molecule, dansylated 4-isopropyl-3-methylphenol, predicted by Dns-MS, NEIMS, and CFM-ID 4.0, each matched with the experimental spectrum of this dansylated standard. NEIMS had low predictability for dansylated 4-isopropyl-3-methylphenol, with over-prediction of small fragments (WCS = 0.36). CFM-ID 4.0 produced a clean spectrum but with several mismatches (WCS = 0.54). By contrast, the spectrum predicted by Dns-MS showed an excellent match with the standard, returning a WCS value of 0.94. The spectrum matching results of all 167 dansylated molecules are summarized in Supplementary Data-01, and 73.65% of the spectra predicted by Dns-MS were of high quality, with WCS values over 0.6.

To evaluate the capability of DeepCDM to selectively enhance the MS/MS spectrum predictability of CDMs, we used Dns-MS to predict spectra for underived molecules in the MoNA dataset (n = 6790). As shown in Fig. S8, the predictability of Dns-MS was drastically decreased for underived molecules, with a median WCS of 0.38, confirming that our DeepCDM approach is specialized for the desired CDMs.

Versatility of DeepCDM

To demonstrate that DeepCDM is a generalizable approach for different classes of CDMs, we also established a spectrum prediction model for benzoylation, another widely used chemical derivatization method for amine- and hydroxyl-containing molecules33,34,35. This model, termed Bz-MS, was first trained using the MoNA dataset containing 6790 merged ESI-MS/MS positive spectra of underived molecules, and then fine-tuned using a training set containing 503 experimental spectra of benzoylated molecules (Fig. S9). Evaluated with a hold-out test set containing 168 benzoylated molecules, Bz-MS achieved a median WCS value of 0.74 ± 0.03 (95% CI, n = 168) in predicting spectra for benzoylated molecules, which was significantly improved from the pre-trained model, ESI-MLP (median WCS = 0.53 ± 0.02, 95% CI, n = 168) (Fig. S10a). It also outperformed NEIMS and CFM-ID 4.0 with an over 1.5-fold increase in median WCS in a C2MS benchmark (Figs. S10, S11). Moreover, there was no bias in spectrum prediction between benzoylated amine- and hydroxyl-containing molecules (Fig. S10b). The robustness of Bz-MS was confirmed in a 4-fold cross-validation with 4 randomly split benzoylated datasets (Fig. S12). Bz-MS was also a highly specialized tool for benzoylated molecules, with predictability significantly better than for underived molecules and dansylated molecules (Fig. S13). Collectively, our success in establishing Bz-MS demonstrates that DeepCDM is a highly versatile approach generalizable from one class of CDMs to another.

DeepCDM-based spectral library for dansylated molecules

Having demonstrated that DeepCDM is a highly robust and versatile approach to enable MS spectrum prediction models for a desired class of CDMs, we next aimed to establish a database specialized for a given class of CDMs. Here, a dansylation-specialized spectral library (DnsBank) was established, which contained 294,647 ESI-MS/MS spectra predicted by Dns-MS. To construct DnsBank, we first screened 178,629 molecules from a public chemical database, Distributed Structure-Searchable Toxicity (DSSTox)36, with three criteria: (1) molecules composed of 11 elements: H, C, O, N, P, S, Cl, F, Br, I, Si; (2) molecules that were not ionized and did not contain ionized adducts; (3) molecules that contained amine, hydroxyl or carboxyl groups. The 178,692 molecules then underwent virtual dansylation reactions, where compounds containing amine or hydroxyl groups were derived using DnsCl and those containing carboxyl groups were derived using DnsHz. The virtual reactions were achieved by converting the Simplified Molecular-Input Line-Entry System (SMILES) of each molecule through “SMART reaction” by RDKit (http://www.rdkit.org/) (Fig. 3a). A total of 294,647 dansylated molecules were generated through virtual reactions, the spectra of which were then predicted using Dns-MS and stored in DnsBank. As such, DnsBank contained 50,395 spectra for dansylated amines (17.10%), 110,024 spectra for dansylated hydroxyls (37.34%), 116,348 spectra for dansylated carboxyls (39.49%), and 17,880 spectra for didansylated molecules (6.07%) (Fig. 3b and Table S4).
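The composition of DnsBank reported above can be checked with simple arithmetic. The following is a minimal sketch using only the spectrum counts stated in the text; the dictionary keys and variable names are illustrative and not part of the DeepCDM code.

```python
# Spectrum counts per class, as reported for DnsBank in the text.
counts = {
    "dansylated amines": 50395,
    "dansylated hydroxyls": 110024,
    "dansylated carboxyls": 116348,
    "didansylated molecules": 17880,
}

# The four classes should add up to the full library size.
total = sum(counts.values())

# Share of each class in percent, rounded to two decimals.
shares = {name: round(100 * n / total, 2) for name, n in counts.items()}

for name, share in shares.items():
    print(f"{name}: {counts[name]} spectra ({share}%)")
print(f"total: {total} spectra")
```

The four classes sum to the 294,647 spectra of the library, and the rounded shares agree with the percentages quoted for each class.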

Fig. 3: Dansylation-specialized in silico spectral library (DnsBank) constructed by DeepCDM.

a Structural information of molecules was selected and extracted from a public chemical database, Distributed Structure-Searchable Toxicity (DSSTox), and the molecules were then virtually derived. The spectra of these virtually dansylated molecules were predicted by the dansylation-specialized spectrum prediction model (Dns-MS) for constructing DnsBank. b The spectral composition of DnsBank. 17.10% of spectra belonged to dansylated amine-containing molecules, 37.34% to dansylated hydroxyl-containing molecules, 39.49% to dansylated carboxyl molecules and 6.07% to didansylated molecules. c Spectrum-to-compound (MS2C) capability of DnsBank was benchmarked with public databases and compound identification tools using the hold-out test set. The top-k accuracy of annotations from 167 query spectra in the test set through searching in DnsBank, PubChem by SIRIUS 4, and 6 libraries by CFM-ID 4.0, including Human Metabolome Database (HMDB), MassBankJP/MassBankEU, the MassBank of North America (MoNA), and the predicted libraries generated by CFM-ID 4.0 from ChEBI, DSSTox and STOFF-IDENT. 28.14% of the test set were correctly annotated at top 1 output by DnsBank, 63.47% were in top 5, 76.05% were in top 10 and 91.02% were ranked in top 25. CFM-ID 4.0 failed to annotate any dansylated molecules in the test set, although 3 molecules were found in its libraries by manual checking. SIRIUS 4 correctly annotated 7 dansylated molecules (4.19%) in top 5 output and 8 (4.79%) in top 25. In total, 15 dansylated molecules in the test set were found in PubChem by manual checking, 8 of which were annotated by SIRIUS 4. d The top-k accuracy of the 167 query spectra in the test set for molecules containing amine, hydroxyl and carboxyl groups matched in DnsBank. Amine: 37.25% at top 1, 74.51% in top 5, 86.27% in top 10 and 92.16% in top 25. Hydroxyl: 25.68% at top 1, 60.81% in top 5, 71.62% in top 10 and 91.89% in top 25. Carboxyl: 21.43% at top 1, 54.76% in top 5, 71.43% in top 10 and 88.10% in top 25. 
Source data are provided as a Source Data file.

The spectrum-to-compound (MS2C) capability of DnsBank was evaluated against the 167 experimental spectra of dansylated molecules using WCS as the measure. DnsBank returned correct annotation rates of 63.47% (106/167) in top 5 output and 91.02% in top 25 (Fig. 3c). The highest correct annotation rate was found for dansylated amines (74.51% in top 5), followed by dansylated hydroxyls (60.81% in top 5) and carboxyls (54.76% in top 5) (Fig. 3d). By contrast, public databases (HMDB, MassBankJP/MassBankEU and MoNA) paired with CFM-ID 4.0 as a compound identification module failed to correctly annotate any dansylated molecules in the test set. Annotation also failed when in silico spectral libraries (ChEBI, DSSTox and STOFF-IDENT) generated by CFM-ID 4.0 were used, though 3 dansylated molecules were found in these libraries upon manual checking (Supplementary Data-02). We also compared the search results of DnsBank with those obtained using SIRIUS 4, which computes possible molecular fingerprints based on the MS/MS spectrum through a fragmentation tree and then searches the predicted fingerprints in chemical databases, such as PubChem37. Although SIRIUS 4 drastically widens the coverage of chemicals, it only correctly annotated 7 dansylated molecules in top 5 output and 8 in top 25. Among the 159 incorrect annotations, 7 dansylated molecules were found in PubChem but not annotated by SIRIUS 4 and 13 could not be computed by SIRIUS (Supplementary Data-02). Collectively, our results demonstrated the superior MS2C performance of DnsBank over other tools for dansylated molecules, demonstrating the power of DeepCDM for annotating CDMs.
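The top-k evaluation described above can be illustrated with a short sketch: each query spectrum is scored against every library spectrum by WCS, the candidates are ranked, and a hit is counted when the true compound appears among the first k. The function names, the dictionary representation of spectra (integer m/z mapped to intensity) and the toy library entries below are hypothetical and chosen only for illustration; they are not taken from DnsBank.

```python
import math


def wcs(query, reference):
    """Weighted cosine similarity with m/z and intensity both raised to 0.5."""
    wq = {mz: math.sqrt(mz * i) for mz, i in query.items() if i > 0}
    wl = {mz: math.sqrt(mz * i) for mz, i in reference.items() if i > 0}
    dot = sum(wq[mz] * wl[mz] for mz in wq.keys() & wl.keys())
    norm_q = math.sqrt(sum(v * v for v in wq.values()))
    norm_l = math.sqrt(sum(v * v for v in wl.values()))
    return dot / (norm_q * norm_l) if norm_q and norm_l else 0.0


def rank_candidates(query, library):
    """Return library compound names sorted by decreasing WCS to the query."""
    ranked = sorted(library.items(), key=lambda item: wcs(query, item[1]), reverse=True)
    return [name for name, _ in ranked]


def top_k_accuracy(queries, library, k):
    """Fraction of queries whose true compound is among the top-k candidates."""
    hits = sum(1 for true_name, spectrum in queries
               if true_name in rank_candidates(spectrum, library)[:k])
    return hits / len(queries)


# Toy library and queries (hypothetical peaks, for illustration only).
library = {
    "compound_A": {171: 100.0, 252: 30.0},
    "compound_B": {171: 100.0, 156: 80.0},
    "compound_C": {120: 100.0, 91: 50.0},
}
queries = [
    ("compound_A", {171: 90.0, 252: 35.0}),
    ("compound_C", {120: 80.0, 91: 60.0, 171: 5.0}),
]

print(top_k_accuracy(queries, library, k=1))
```

With this definition, the reported top-5 rate of 106 correct annotations out of 167 queries corresponds to 106/167, or about 63.47%.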

Wastewater analysis using DeepCDM-assisted non-targeted analysis

DeepCDM allows us to expand the annotation of CDMs beyond the space of known molecules, making it possible to explore novel chemical processes. To demonstrate this, we used DeepCDM to facilitate the discovery of new hazardous chemicals in authentic environmental samples.

Anthropogenic organic chemicals have improved the quality of human life, but many also pose inherent dangers to humans and the environment. The leather industry is an important contributor to pollutant emissions due to the booming global market for leather goods38. Although industrial wastewater must be treated before being discharged into the environment, a significant amount of chemicals persist and can be further transformed during treatment39. Therefore, high-throughput identification of the many organics in wastewater before and after treatment, without prior knowledge, is the first and foremost step for discovering chemicals to which humans and biota may eventually be exposed.

We explored chemicals in wastewaters of the leather industry before and after treatments by collecting and analyzing three types of samples: untreated wastewater (UW), wastewater treated by activated carbon adsorption (ACD), and wastewater treated by ACD and ozonation (ACD/ozonation) (Fig. 4a). All water samples were first dansylated to enhance the retention and ionization of acids, phenolics, and aromatic amines that are heavily involved in the production of leather goods40,41. LC-MS based non-targeted analysis was then performed, where DnsBank was employed for spectrum matching and molecule annotation. Using WCS ≥ 0.6 as a threshold42, since it provided a reasonable number of annotations for further analysis (Fig. S14), we tentatively annotated 614 dansylated molecules in UW, 365 in ACD, and 369 in ACD/ozonation with confidence level 3 (Table 1). The average annotation efficiency was determined to be 19.42% (Table S5). A total of 991 unique dansylated molecules were annotated throughout all samples (Supplementary Data-03). For comparison, SIRIUS 437 returned 34 tentative annotations (average annotation efficiency: 0.56%) and mzCloud (https://www.mzcloud.org/) paired with Compound DiscovererTM 3.3 (CD 3.3, Thermo Fisher Scientific) annotated only 7 dansylated molecules (average annotation efficiency: 0.22%) for the same set of samples (Fig. 4b and Table S4).

Fig. 4: DeepCDM-assisted environmental non-targeted analysis.

a Dansylation-specialized in silico spectral library (DnsBank) enabled confidence level 3 annotation of chemicals in liquid chromatography-mass spectrometry (LC-MS) based non-targeted analysis of leather industrial wastewater after dansylation. b Total annotation numbers of 3 wastewater samples obtained from DnsBank spectrum matching, SIRIUS 4 and Compound DiscovererTM 3.3 (CD 3.3) database searching. c Composition of tentative annotations in the untreated wastewater (UW), activated carbon adsorption (ACD) wastewater and ACD/ozonation wastewater obtained from DnsBank. d The uniform manifold approximation and projection (UMAP) of chemicals detected in the 3 wastewater samples. e Identity confirmation of 3-fluoro-5-hydroxybenzonitrile and ent-paroxol by matching tandem mass spectra with the dansylated chemical standards. f Concentration−response curves of Hep G2 cells incubated with 3-fluoro-5-hydroxybenzonitrile and ent-paroxol. Each concentration had 6 biological replicates. Data are presented as mean values ± standard deviation. Source data are provided as a Source Data file.

Table 1 Numbers of tentative annotations in 3 wastewater samples after dansylation

The lower number of annotations in wastewater samples treated by either ACD or ACD/ozonation was consistent with the decreasing chemical oxygen demand (COD) (Fig. 4c and Table 1), suggesting the effective removal of chemicals from wastewater. However, 87 tentative annotations were detected in all 3 samples, indicating the existence of persistent chemicals that are stable against ACD and ozonation treatments (Table S6). DeepCDM-assisted non-targeted analysis also revealed significant differences between treatment techniques: ACD largely overlapped with UW in uniform manifold approximation and projection (UMAP) analysis, whereas ACD/ozonation showed a significantly different profile from the other two samples (Fig. 4d). This result indicated that ozonation made a significant change to the chemical composition of wastewater, with 142 unique annotations that were likely to be new transformation products (Table S7).

Among the 991 tentative annotations, there were many known pollutants, such as per- and polyfluoroalkyl substances (PFASs) and halogenated compounds (Table S8 and Fig. S15). We also found 865 chemicals with no toxicity record. Further investigation of this group of chemicals may lead to the discovery of new hazardous pollutants. We selected 4 annotations to confirm their identities using dansylated chemical standards and to evaluate their cytotoxicity using Hep G2 cells through cell counting kit-8 (CCK-8) (Table S9). Two compounds, 3-fluoro-5-hydroxybenzonitrile and ent-paroxol, were found to be of high cellular toxicity, with 50% inhibitory concentrations (IC50) of 545.9 μM and 494.4 μM, respectively (Figs. 4e, f and S16), and thus could be new hazardous pollutants.

Discussion

Current prediction tools are not designed to predict MS/MS spectra of CDMs. Here we introduced the DeepCDM workflow to establish a specialized model for CDMs by applying transfer learning to a generic spectrum prediction model. Using dansylation and benzoylation as model systems, we demonstrated that high-quality MS/MS spectrum prediction could be achieved using a small training set of variable CDMs. Unlike existing MS/MS spectrum prediction tools, such as NEIMS and CFM-ID 4.0, that are generic for all molecules, DeepCDM aims to enhance the spectrum predictability of a more focused group of molecules that have undergone a given chemical derivatization. Dns-MS and Bz-MS are highly specialized for dansylated molecules and benzoylated molecules, respectively, and improved upon generic tools in predicting ESI-MS/MS spectra for CDMs with over 150% enhanced WCS values. The superior performance of DeepCDM is achieved through transfer learning, which converts a generic tool into a highly specialized one for CDMs. For example, although both Dns-MS and Bz-MS were built upon the architecture of NEIMS, each of the two models demonstrated high predictability for its corresponding CDMs rather than for the other group. Therefore, DeepCDM is a powerful addition to the current toolbox for MS/MS spectrum prediction. It is also worth noting that DeepCDM is not bound to NEIMS but can be adapted to any other existing prediction algorithm for addressing specific challenges. Moreover, unlike many current prediction tools that work at a fixed collision energy21,23,43, ESI-MS/MS spectra predicted using DeepCDM are collision-energy-merged, which is more compatible with the experimental conditions of LC-MS based non-targeted analysis27,28,29,30,31,32.

Because of its remarkable MS2C capability, DeepCDM facilitates the annotation of CDMs. Current compound identification tools, such as SIRIUS 4, the CFM-ID 4.0 compound identification module and CD 3.3, access databases containing limited information on CDMs. Due to the enormous differences between MS/MS spectra of CDMs and spectra of underived molecules, annotating CDMs using these compound identification tools is difficult and inefficient. By contrast, the highly scalable DeepCDM enabled the construction of a dansylation-specialized in silico spectral library containing 294,647 ESI-MS/MS spectra. We also demonstrated that DeepCDM could be readily incorporated into a non-targeted analysis workflow for high-throughput annotation of CDMs in complex environmental or biological samples.

The power of DeepCDM was further demonstrated by analyzing authentic environmental samples. A total of 991 chemicals were annotated by DeepCDM upon non-targeted analysis with an average annotation efficiency 30 times better than that of SIRIUS 4 and 80 times better than that of CD 3.3. Besides the high annotation efficiency, DeepCDM also demonstrated unbiased annotation among dansylated amines, hydroxyls, and carboxyls. By contrast, the two commercial tools failed to annotate any dansylated carboxylic acids41. CFM-ID 4.0 was not employed here because its server did not offer batch searching and its docker could not access public databases.

Once integrated into non-targeted environmental analysis, the highly efficient and unbiased nature of DeepCDM ensures comprehensive monitoring of chemical processes and high-throughput discovery of new hazardous pollutants. Here DeepCDM reveals that ACD and ozonation are effective strategies to remove organic pollutants. For instance, 2-[perfluoro(propoxy)ethoxy]acetic acid, a fluorinated alternative to polyfluorinated carboxylic acids (PFCAs)44,45, was removed by ACD and ozonation from leather industrial wastewater. As no removal technique can eliminate all chemicals, many compounds, such as the carcinogenic 2,3,3,4,4,5,5-heptafluoropentanoic acid45, remain persistent during ACD and ozonation. In further investigation of new pollutants, we annotated and confirmed the identity of ent-paroxol as a persistent chemical, and its toxicity to Hep G2 cells suggested it could be a new hazardous pollutant. In addition to persistent pollutants, DeepCDM also reveals significant numbers of transformation products upon ozonation, which could be an important source of potential pollutants. For example, 3-fluoro-5-hydroxybenzonitrile, which was detected in the ACD/ozonation wastewater and showed significant cellular toxicity, could contribute to the total toxicity of industrial effluent to humans and the environment as a toxic transformation product.

Methods

DeepCDM

First, spectra were downloaded from the public MassBank of North America (MoNA, ucdavis.edu) and filtered with the following criteria: (1) positive mode, (2) ESI ionization, (3) MoNA rank >4. Then the downloaded spectra were processed to merge all fragments produced at different collision energies for one molecule37,46. This produced a MoNA dataset containing 6790 merged electrospray ionization tandem mass spectra (ESI-MS/MS) of 6790 underived molecules, which was used to train the multilayer perceptron (MLP) of NEIMS20. In this step, the MoNA dataset was chosen because it is an open database accessible to researchers across diverse research fields. Second, 585 chemical standards were labeled by dansyl chloride (DnsCl) and/or dansyl hydrazine (DnsHz) to produce 674 dansylated molecules, whose experimental merged spectra were acquired using a HESI Q-Exactive Plus Orbitrap (Thermo Scientific, USA) at collision energies ranging from 10 to 50 eV. Of these, 507 experimental spectra of dansylated molecules were used as the training set for fine-tuning the trained MLP, with 10% for validation, and 167 spectra were held out from the training set for model evaluation (test set). The dataset was randomly split without considering the structural similarity of dansylated molecules. The MLP was fine-tuned by freezing all but the last layer in Dns-MS.
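The merging of spectra recorded at different collision energies can be sketched as follows. This is a minimal illustration under stated assumptions: the text does not specify the merge rule, so peaks are assumed here to be binned to the nearest integer m/z, summed across collision energies, and rescaled so that the base peak equals 100. The function name `merge_spectra` and the example peaks are hypothetical.

```python
from collections import defaultdict


def merge_spectra(spectra):
    """Merge several peak lists of one molecule into a single spectrum.

    Each spectrum is a list of (m/z, intensity) pairs. Assumptions, not
    taken from the source: 1 Da bins, intensities summed across spectra,
    result rescaled to a base peak of 100.
    """
    merged = defaultdict(float)
    for peaks in spectra:
        for mz, intensity in peaks:
            merged[round(mz)] += intensity  # assign the peak to its integer bin

    base_peak = max(merged.values(), default=0.0)
    if base_peak == 0.0:
        return {}
    return {mz: 100.0 * value / base_peak for mz, value in sorted(merged.items())}


# Two hypothetical scans of the same molecule at low and high collision energy.
low_energy = [(170.1, 50.0)]
high_energy = [(170.0, 50.0), (252.2, 25.0)]

print(merge_spectra([low_energy, high_energy]))
```

Other merge rules, such as keeping the maximum intensity per bin, would fit the same interface; the choice affects relative peak heights and therefore the similarity scores computed downstream.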

To adapt DeepCDM to benzoylated molecules, the MLP trained on the MoNA dataset was fine-tuned using 503 experimental spectra of benzoylated molecules, with 10% for validation. A hold-out test set containing 168 experimental spectra of benzoylated molecules was used for model evaluation. The benzoylated dataset was collected on the HESI Q-Exactive Plus Orbitrap at collision energies ranging from 10 to 50 eV, using 671 benzoylated molecules labeled by benzoyl chloride (BzCl). The dataset was randomly split without considering the structural similarity of benzoylated molecules. The MLP was also fine-tuned by freezing all but the last layer in Bz-MS.

Model evaluation

Three metrics were tested for model evaluation: weighted cosine similarity (WCS), cosine similarity and spectral entropy. WCS was calculated using Eq. (1)47:

$$\mathrm{Weighted\ Cosine\ Similarity}\left(I_{\mathrm{q}},\, I_{\mathrm{l}}\right)=\frac{\sum_{k=1}^{M_{\max}} m_{k}^{0.5}\, I_{\mathrm{q}k}^{0.5}\cdot m_{k}^{0.5}\, I_{\mathrm{l}k}^{0.5}}{\sqrt{\sum_{k=1}^{M_{\mathrm{q}}}\left(m_{k}^{0.5}\, I_{\mathrm{q}k}^{0.5}\right)^{2}}\,\sqrt{\sum_{k=1}^{M_{\mathrm{l}}}\left(m_{k}^{0.5}\, I_{\mathrm{l}k}^{0.5}\right)^{2}}}$$
(1)

Cosine similarity was calculated from Eq. (2)47:

$$\mathrm{Cosine\ Similarity}\left(I_{\mathrm{q}},\, I_{\mathrm{l}}\right)=\frac{\sum_{k=1}^{M_{\max}} m_{k}\, I_{\mathrm{q}k}^{0.5}\cdot m_{k}\, I_{\mathrm{l}k}^{0.5}}{\sqrt{\sum_{k=1}^{M_{\mathrm{q}}}\left(m_{k}\, I_{\mathrm{q}k}^{0.5}\right)^{2}}\,\sqrt{\sum_{k=1}^{M_{\mathrm{l}}}\left(m_{k}\, I_{\mathrm{l}k}^{0.5}\right)^{2}}}$$
(2)

where mk is the mass-to-charge ratio (m/z) of the fragment; Iqk and Ilk are the intensities of the fragment (m/z = k) in the query spectrum and in the reference spectrum, respectively; Mq and Ml are the largest nonzero indices of Iq and Il, respectively; and Mmax is the larger of Mq and Ml.
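
As a worked illustration of Eqs. (1) and (2), the sketch below computes both scores for spectra stored as dictionaries that map an integer m/z index k to an intensity. The function name `weighted_cosine` and the toy spectra are hypothetical. The mass and intensity exponents are passed as parameters so that the same routine covers Eq. (1) (both 0.5) and Eq. (2) (mass exponent 1, intensity exponent 0.5).

```python
import math


def weighted_cosine(query, library, mass_power=0.5, intensity_power=0.5):
    """Cosine score between two spectra given as {mz_index: intensity}."""

    def weight(spectrum):
        # Weighted peak: m_k ** mass_power * I_k ** intensity_power
        return {k: (k ** mass_power) * (i ** intensity_power)
                for k, i in spectrum.items() if i > 0}

    wq, wl = weight(query), weight(library)
    # Only fragments present in both spectra contribute to the numerator.
    dot = sum(w * wl[k] for k, w in wq.items() if k in wl)
    norm_q = math.sqrt(sum(w * w for w in wq.values()))
    norm_l = math.sqrt(sum(w * w for w in wl.values()))
    if norm_q == 0.0 or norm_l == 0.0:
        return 0.0
    return dot / (norm_q * norm_l)


query = {100: 1.0, 200: 1.0}   # toy query spectrum
reference = {100: 1.0}         # toy reference spectrum
wcs = weighted_cosine(query, reference)                  # Eq. (1)
cos = weighted_cosine(query, reference, mass_power=1.0)  # Eq. (2)
```

For these toy spectra the unmatched fragment at m/z 200 lowers the Eq. (2) score more than the Eq. (1) score, because Eq. (2) gives heavier fragments a larger weight.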

Spectral entropy S was calculated from all ion intensities Ip by Eq. (3)48:

$$S=-{\sum}_{{{\rm{p}}}}{I}_{{{\rm{p}}}} \, {{{\mathrm{ln}}}}\,{I}_{{{\rm{p}}}}$$
(3)
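
Eq. (3) can be sketched in a few lines. The sketch assumes that the intensities are first normalized so that they sum to 1, which Eq. (3) needs in order to behave as an entropy but which is not stated explicitly here. The function name `spectral_entropy` is hypothetical.

```python
import math


def spectral_entropy(intensities):
    """Shannon entropy (natural log) of a peak-intensity distribution."""
    total = sum(i for i in intensities if i > 0)
    if total == 0:
        return 0.0
    entropy = 0.0
    for i in intensities:
        if i > 0:
            p = i / total  # assumed normalization to unit sum
            entropy -= p * math.log(p)
    return entropy


single_peak = spectral_entropy([5.0])                 # one fragment only
four_equal = spectral_entropy([1.0, 1.0, 1.0, 1.0])   # four equal fragments
```

A spectrum dominated by one fragment has an entropy near zero, whereas n equally intense fragments give ln n.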

The three metrics gave consistent results when the three models were compared on the hold-out test set of 167 dansylated molecules, with WCS performing slightly better. Therefore, WCS was selected in this study to evaluate the predictability of the models.

Virtual reaction and library building

A structural dataset was downloaded from Distributed Structure-Searchable Toxicity (DSSTox, https://epa.figshare.com/articles/dataset/Chemistry_Dashboard_Data_DSSTox_SDF/5491516, accessed on March 31, 2022). The DSSTox dataset was screened for chemicals with amine, hydroxyl and carboxyl groups. The structures of the dansylated molecules were then computed with RDKit (http://www.rdkit.org/) using a Python derivatization script, “SMART reaction”, and output in Simplified Molecular Input Line Entry System (SMILES) format. The “SMART reaction” script simulated dansylation of the reactive groups using defined templates: DnsCl for amine and hydroxyl groups, DnsHz for carboxyl groups. The dansylated SMILES structures were exported as SDF files and then input to Dns-MS to predict collision-energy-merged ESI-MS/MS spectra at 1 Da resolution. The predicted spectra were compiled into DnsBank (Table S4).

Benchmarks

C2MS benchmarks between Dns-MS, Bz-MS, NEIMS and CFM-ID 4.0 (web server) were performed on the hold-out test sets. NEIMS was acquired from ref. 20, and CFM-ID 4.0 (web server, http://cfmid.wishartlab.com/predict) was used for spectrum prediction.

The MS2C benchmark was performed between DnsBank, other public spectral libraries accessed through the CFM-ID 4.0 compound identification module (http://cfmid.wishartlab.com/identify) and PubChem accessed through SIRIUS 4, on the hold-out test set containing 167 dansylated spectra. In silico spectral libraries of ChEBI, DSSTox and STOFF-IDENT predicted by CFM-ID 4.0, and public spectral libraries including HMDB, MassBank and MoNA, were used by the CFM-ID 4.0 compound identification module to annotate dansylated molecules through spectrum matching. In CFM-ID 4.0, ESI was selected as the spectrum type, the ion mode was positive and the adduct type was set to [M + H]+, with 10 ppm mass tolerance and DotProduct as the scoring function. For molecule annotation by SIRIUS 4, the parameters were set as follows: instrument, Orbitrap; database, PubChem; possible ionization, [M + H]+. CSI:FingerID was used for fingerprint prediction and structure database search.
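
To make the 10 ppm mass tolerance concrete, the following sketch shows how a relative mass error is commonly computed and compared with such a threshold. It illustrates the general definition of ppm error only, not how CFM-ID applies it internally; the function names and the m/z values are hypothetical.

```python
def ppm_error(observed_mz, theoretical_mz):
    """Signed mass error in parts per million."""
    return (observed_mz - theoretical_mz) / theoretical_mz * 1e6


def within_tolerance(observed_mz, theoretical_mz, tol_ppm=10.0):
    """True if the observed m/z lies within +/- tol_ppm of the theoretical m/z."""
    return abs(ppm_error(observed_mz, theoretical_mz)) <= tol_ppm


# At m/z 300, 10 ppm corresponds to a window of about +/- 0.003.
accepted = within_tolerance(300.002, 300.0)   # about 6.7 ppm
rejected = within_tolerance(300.004, 300.0)   # about 13.3 ppm
```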

Dansylation of chemical standards

For DnsCl reaction, amine- or hydroxyl-containing chemicals were dissolved in 50% acetonitrile to a concentration of 10 mM individually. 30 μL standard solution was mixed with 15 μL acetonitrile, 15 μL sodium carbonate/sodium bicarbonate buffer (0.5 M, pH 9.4) and 30 μL 20 mg·mL−1 DnsCl. The reaction solution was shaken at 60 °C for 60 min before quenching by 10 μL 250 mM sodium hydroxide. The excess sodium hydroxide was neutralized by 50 μL 425 mM formic acid.

For the DnsHz reaction, carboxyl-containing chemicals were dissolved in acetonitrile to 10 mM individually. 25 μL standard solution was mixed with 25 μL 50 mM 1-ethyl-3-(3-(dimethylamino)propyl)carbodiimide, 25 μL 10 mM 1-hydroxy-7-azabenzotriazole in 2-(N-morpholino)ethanesulfonic acid buffer and 25 μL 10 mg mL−1 DnsHz. The reaction solution was shaken at 25 °C for 90 min before quenching at −80 °C for 10 min.

Benzoylation of chemical standards

For benzoyl chloride (BzCl) reaction, amine- or hydroxyl-containing chemicals were dissolved in 50% acetonitrile to a concentration of 10 mM individually. 200 μL standard solution was mixed with 100 μL sodium carbonate buffer (0.1 M, pH = 9.5), 100 μL 2% BzCl and 1 μL formic acid at room temperature for 5 min.

MS analysis

MS and tandem MS (MS/MS) spectra of dansylated molecules were acquired on a HESI Q-Exactive Plus Orbitrap (Thermo Scientific, USA) in positive mode. Precursor ions were selected for MS/MS scans with ±0.25 Da mass tolerance. Ten MS/MS scans were accumulated to generate an MS/MS spectrum, with collision energies ranging from 10 to 50 eV.

Wastewater treatments

The leather industrial wastewater was first treated by activated carbon adsorption (ACD) and then by ozonation. 100 mL of leather industrial wastewater was diluted 10-fold with high-performance liquid chromatography (HPLC)-grade water, mixed with 20 g of activated carbon granules and incubated at 300 r min−1 for 24 h at room temperature. After filtration, 500 mL of ACD-treated wastewater was reacted with 500 mL of 64.29 ± 1.36 mg L−1 ozone solution at room temperature for 24 h. The chemical oxygen demand (COD) of the wastewater was measured before and after the treatments.

Sample handling

The wastewater samples were extracted with a Supelclean™ ENVI™-18 solid phase extraction cartridge (500 mg, 3 mL, Sigma–Aldrich, USA). 40 μL of eluent was divided into two aliquots: 20 μL was reacted with DnsCl and 20 μL with DnsHz. After dansylation, the samples were desalted by HPLC (Agilent 1260 Infinity II, USA) on a C18 column (Supelco, 2.1 mm × 50 mm, 5 μm) at 25 °C, with a flow rate of 1 mL min−1. Mobile phase A was 0.1% (v/v) formic acid in water and mobile phase B was 0.1% (v/v) formic acid in 95% acetonitrile. The liquid chromatography gradient was: t = 0 min, 5% B; t = 3 min, 5% B; t = 3.01 min, 95% B; t = 5 min, 95% B; t = 5.01 min, 5% B; t = 8 min, 5% B.

Non-targeted LC-MS analysis

Non-targeted analysis was performed using a Dionex Ultimate 3000 UHPLC coupled to a HESI Q-Exactive Plus Orbitrap mass spectrometer (Thermo Scientific, USA). A C18 column (Thermo Scientific, 2.1 mm × 100 mm, 1.8 μm) was used at 25 °C, with a flow rate of 300 μL min−1. Mobile phase A was 0.1% (v/v) formic acid in water and mobile phase B was 0.1% (v/v) formic acid in acetonitrile. The liquid chromatography gradient was: t = 0 min, 5% B; t = 3 min, 5% B; t = 4 min, 15% B; t = 14 min, 65% B; t = 16 min, 95% B; t = 18 min, 95% B; t = 18.01 min, 5% B; t = 20 min, 5% B. Each MS/MS spectrum was accumulated from 10 MS/MS scans, with collision energies ranging from 10 to 50 eV.

Data processing

Molecule classification based on structural features was performed using ClassyFire49. Peak lists of non-targeted data were extracted using MZmine 350. Molecule annotation at confidence level 3 was achieved by matching exact mass and MS/MS spectra against DnsBank. Annotation confidence levels followed the scheme for small-molecule identification defined by Schymanski et al. in 201414. Annotations meeting the spectral matching threshold of WCS ≥ 0.642 were exported as a csv file. Annotation efficiency, the percentage of query spectra confidently assigned to molecules by spectral matching or database searching22, was calculated using Eq. (4):

$${{{\rm{Annotation}}}} \, {{{\rm{Efficiency}}}}=\frac{{{{\rm{number}}}} \, {{{\rm{of}}}} \, {{{\rm{molecule}}}} \, {{{\rm{annotations}}}}} {{{{\rm{number}}}} \, {{{\rm{of}}}} \, {{{\rm{query}}}} {{{\rm{ \, spectra}}}}}$$
(4)
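
Eq. (4), combined with the WCS ≥ 0.642 threshold, can be sketched as below. The sketch assumes that each query spectrum contributes at most one annotation, namely its best-scoring library match; the function name `annotation_efficiency` and the example scores are hypothetical.

```python
def annotation_efficiency(best_scores, threshold=0.642):
    """Fraction of query spectra whose best library match passes the threshold.

    best_scores: best WCS per query spectrum, or None when no candidate
    with a matching exact mass was found in the library.
    """
    if not best_scores:
        return 0.0
    annotated = sum(
        1 for score in best_scores
        if score is not None and score >= threshold
    )
    return annotated / len(best_scores)


# Hypothetical best-match scores for four query spectra.
scores = [0.91, 0.64, None, 0.70]
efficiency = annotation_efficiency(scores)
```

In this toy case two of the four spectra pass the threshold, so the efficiency is 0.5; the score of 0.64 falls just below the cut-off.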

Statistics

The unpaired t-test was performed in Prism (GraphPad Software, USA) to compare spectral WCS between two groups of data. For comparisons between three or more groups, ordinary one-way ANOVA was performed. The uniform manifold approximation and projection (UMAP) analysis was performed to analyze the distances between organic profiles of 3 wastewater samples. For all analyses, a two-tailed p value < 0.05 was considered to be statistically significant.

In vitro cytotoxicity study

The cell counting kit-8 (CCK-8) assay was used to monitor the viability of the human hepatocellular cell line Hep G2 in 96-well plates after incubation with various concentrations of chemicals for 72 h. The endpoint absorbance of each well was measured 6 times at 450 nm with an Epoch 2 microplate spectrophotometer (Agilent BioTek, USA).

Implementation details

All DeepCDM models were constructed and trained in TensorFlow51. Hyperparameter optimization covered the number of hidden units and layers of the MLP. The training hyperparameters are listed in Table S2. The optimized MLP architecture, using ReLU activation, had 3 layers of 500 nodes. Extended-Connectivity Fingerprints (ECFP4) with a length of 4096 and a radius of 2 were used as the molecular input. Further details on hyperparameter configurations are included in the code repository, available on GitHub (http://github.com/ADNLab-SCU/DeepCDMs).

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.