Data Processing And Analysis


Acquisition of non-targeted mass spectrometry data produces large, complex datasets that can be tedious and difficult to manually interpret, even by an expert user. To address this challenge, software developers have created data processing and analysis tools for suspect screening and non-targeted analysis that reduce, modify, and aid in the interpretation of these complex datasets to produce meaningful information and achieve accurate sample classifications and chemical annotations/identifications. The data processing and analysis tools include individual algorithms and complete workflows that are available as proprietary software packages or as open-source software applications and scripts.

A critical, yet poorly defined, aspect of suspect screening and NTA is the terminology used to describe detections that are not yet annotated or identified. In the literature, terms such as “m/z-retention time pair”, “feature”, “detection”, “compound”, and “non-target compound” are used, often with usage differing across studies. We provide the following definitions for terms that are used in this document. We recommend that NTA researchers adopt a universal terminology framework (such as the one suggested here) to improve readability across NTA studies from different groups. Ultimately, it is most important that NTA researchers clearly define their intended meaning for any terms used.

m/z-retention time pair (mz@RT) – a unique pairing of mass-to-charge ratio and retention time (RT) values.

Feature – a set of mz@RT that is a grouping of associated MS1 components (e.g., isotopologue, adduct, and in-source product ion m/z peaks), and is represented as a tensor of observed retention time, monoisotopic mass, and intensity (e.g., peak area or peak height). Associated MS2 product ions may also be grouped with the MS1 components of a feature during HRMS data processing, depending on the software algorithms. If no groupings exist, a feature can be a single mz@RT. The term “molecular feature” may also be used.
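The two definitions above can be mirrored in a small data structure. The following Python sketch is illustrative only: the class names `MzRtPair` and `Feature`, their fields, and the example intensities are assumptions made for this example and do not correspond to any particular NTA software package.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class MzRtPair:
    """One mz@RT: a single m/z value observed at a single retention time."""
    mz: float          # mass-to-charge ratio
    rt_min: float      # retention time in minutes
    intensity: float   # peak height or peak area (arbitrary units)
    label: str = ""    # optional note, e.g. "[M-H]-" or "M+1 isotopologue"


@dataclass
class Feature:
    """A grouping of associated MS1 components (one or more mz@RT pairs)."""
    components: List[MzRtPair]
    monoisotopic_mass: float  # neutral monoisotopic mass assigned to the group

    @property
    def rt_min(self) -> float:
        # Assumption: report the RT of the most intense component.
        return max(self.components, key=lambda c: c.intensity).rt_min

    @property
    def intensity(self) -> float:
        # Assumption: report the intensity of the most intense component.
        return max(c.intensity for c in self.components)


# A feature built from two associated components (illustrative values).
grouped = Feature(
    components=[
        MzRtPair(498.9302, 15.0, 1.0e6, "[M-H]-"),
        MzRtPair(499.9336, 15.0, 9.0e4, "M+1 isotopologue"),
    ],
    monoisotopic_mass=499.9375,
)

# If no groupings exist, a feature can be a single mz@RT.
single = Feature(components=[MzRtPair(300.1000, 5.0, 2.0e5)],
                 monoisotopic_mass=301.1073)

print(grouped.rt_min, grouped.intensity, len(single.components))
```

Keeping the individual components attached to the feature preserves the evidence (adducts, isotopologues) that later annotation steps rely on.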

For the purposes of this document and the NTA study reporting tool, we define a data processing and analysis method to be any workflow that uses (high-resolution) mass spectrometry data (acquired with the intent of performing suspect screening or non-targeted analysis) and employs any number of software tools to reduce, evaluate, and interpret that data. The three most common segments of an NTA data analysis method are 1) data processing (preceded by possible prerequisite data format conversion), 2) statistical & chemometric analysis, and 3) annotation & identification. These segments are defined below, and detailed information about the methods, tools, and recommended reporting for each is provided in the respective sections.

Data format conversion refers to conversion of raw data files to usable formats for subsequent data processing that does not intentionally interpret the raw data. For example, conversion of data files from proprietary vendor format to open data formats such as mzML, mzXML, or netCDF.
Data processing encompasses all steps that transform the raw data into meaningful information, prior to annotation & identification efforts. The inputs to data processing are raw or converted data file(s), and the output is a list of features in each sample (with associated chromatography, MS, and MS/MS data for each feature) for further analysis.
Statistical & chemometric analyses are used to aid summarization, evaluation, and interpretation of the reduced (but often still highly rich and complex) data that is produced by data processing, and to provide information about trends, clusters, or other relationships between samples and/or detections.
Annotation is the attribution of one or more properties or molecular characteristics to an MS1 feature (or component thereof, such as an isotopologue or adduct), or MS/MS product ion. Specific component, feature, or product ion annotations may not provide enough evidence to confidently identify a single compound (see Figure 3.1, Annotation & Identification). Examples of annotations include designation of an observed mz@RT as a specific adduct, assignment of a molecular formula to a feature or an MS/MS product ion, and assignment of a suggested substructure to an MS/MS product ion (see Table 4.2 in Data Outputs for additional examples).
Identification is the case where the annotated components, features, and/or product ions provide enough evidence to attribute a specific compound, within a stated identification scope (or at a stated confidence level), to the detected feature (see Figure 3.1, Annotation & Identification).

Data Processing


Throughout the data processing section, we use specific terminology to indicate certain data processing steps, but also provide accompanying descriptions. Currently, there are terminology differences across both open-source and proprietary software platforms, where different software tools may use the same term to describe different steps (or vice versa). Until those discrepancies are resolved, through communication and consensus among the various software developers, NTA researchers should both take care to read software user manuals (to ensure they correctly interpret the purpose of each step in the workflow) and provide accompanying descriptions and/or settings for workflow steps that are listed in published studies (to ensure that the community can accurately interpret the data processing scheme).

Data Format Conversion

Data format conversion encompasses any steps that convert and/or re-format the data to enable subsequent data processing but do not intentionally interpret or clean the raw data. For all software tools, although data conversion steps may be grouped with data processing steps into a single data analysis method, format conversion must occur before data processing. Importantly, although data format conversions do not intentionally reduce the data (i.e., idealized data format conversions are not intended to remove any data), it is possible that data losses may occur. Thus, the NTA researcher should carefully evaluate their format conversion steps and associated settings to assess whether such data losses are occurring (although such an assessment may require evaluating multiple format conversion platforms, and proceeding through subsequent data processing and analysis steps to screen for known compounds, such as those in a QC spike).

We recommend that NTA researchers thoroughly document all conversion steps. For publications, we recommend that the main steps are noted in the main text, and all detailed steps and associated settings are provided in the supporting information. Some examples of data format conversion steps are shown below in Table 3.1:

Table 3.1 Data Format Conversion Methods
Method Step	Description
File conversion to open-source format	Changes the raw data file format (e.g., .d, .raw, .wiff, etc.) to a different file format. Third-party and open-source data analysis software tools often require open-source file formats, such as mzML, mzXML, ANDI-MS (netCDF), and ABF. Examples of open-source file converters and additional resources: mzML Schema (http://www.psidev.info/mzML) “mzML—a Community Standard for Mass Spectrometry Data” (Martens et al., 2011) Proteowizard MSConvert (http://proteowizard.sourceforge.net/) Reifycs Abf Converter (https://www.reifycs.com/AbfConverter/)
Mass spectrum centroiding	For mass spectra that are collected in profile/continuous mode, centroiding reduces the individual m/z peaks to a single peak (a centroid). This step is not necessary for mass spectra collected in centroid mode.
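As a rough illustration of the centroiding step in Table 3.1, the sketch below collapses each contiguous run of profile points above a baseline into one intensity-weighted centroid. It assumes baseline-separated peaks; the function names, the `baseline` parameter, and the example data are hypothetical, and vendor and open-source converters use more sophisticated peak-picking algorithms.

```python
from typing import List, Tuple


def _collapse(run: List[Tuple[float, float]]) -> Tuple[float, float]:
    """Return (intensity-weighted mean m/z, summed intensity) for one run."""
    total = sum(i for _, i in run)
    weighted_mz = sum(m * i for m, i in run) / total
    return weighted_mz, total


def centroid_profile(mz: List[float], intensity: List[float],
                     baseline: float = 0.0) -> List[Tuple[float, float]]:
    """Reduce profile-mode points to centroids.

    Assumption: every contiguous run of points above `baseline` is one peak.
    Overlapping peaks are NOT resolved by this simple approach.
    """
    centroids = []
    run: List[Tuple[float, float]] = []
    for m, i in zip(mz, intensity):
        if i > baseline:
            run.append((m, i))
            continue
        if run:  # the run just ended, so collapse it
            centroids.append(_collapse(run))
            run = []
    if run:  # handle a peak that reaches the end of the scan
        centroids.append(_collapse(run))
    return centroids


# Illustrative profile data: two symmetric peaks on a zero baseline.
profile_mz = [100.00 + 0.01 * k for k in range(10)]
profile_int = [0, 10, 50, 10, 0, 0, 20, 80, 20, 0]

peaks = centroid_profile(profile_mz, profile_int)
for centroid_mz, summed in peaks:
    print(f"{centroid_mz:.4f}  {summed:.0f}")
```

Because centroiding discards peak shape, it is one of the conversion steps where settings are worth recording alongside the converted files.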

Data Processing

Data processing relies on user-defined settings and thresholds to reduce the raw or converted analytical data to meaningful information, such as a list of features with relative abundance information (e.g., peak height or area) and accompanying LC data and MS & MS/MS spectra. The resulting features can be passed to downstream steps, such as statistical & chemometric analyses or annotation and identification efforts. As data is intentionally reduced during processing, NTA researchers should consider evaluating the impact of various settings through the use of QC spikes and/or QC samples. More information about assessment of NTA performance can be found in QA/QC Metrics.

We recommend that NTA researchers report all software programs/algorithms, all included method steps, and all thresholds or other settings used in their data processing workflow to make decisions from feature selection to compound identification. For publications, we recommend that the primary software programs and method steps are noted in the main text, and all other associated details are provided in the supporting information. Some examples of data processing steps are shown below in Table 3.2, followed by a non-exhaustive list of representative software platforms that are available to NTA researchers in Table 3.3. There are differences in terminology across software platforms, which remains a challenge for researchers in comparing and developing new methods. Researchers using this table will need to carefully consider the provided description of each method step, as their software platform of choice may use a different term for that specific step. Ultimately, we recommend that software vendors and NTA researchers take steps to harmonize their use of terminology across data processing software platforms.

Table 3.2 - Data Processing Steps
Note: these method steps are not exhaustive, but are meant to represent the majority of steps in a representative data processing method
Name	Description
Initial m/z (mass) detection	Selection of unique mz@RT pairs.
Retention time alignment	Modifies retention times within a single dataset based on information from representative compounds within the current dataset (either user-selected or algorithm-selected) or based on information from one or more other datasets.
Shoulder peaks filtering	Removes noise signals known as “shoulder peaks”. This step is relevant to data collected with Fourier transform MS instruments.
Signal thresholding	Removes signal (m/z values) from the data that are below a designated abundance threshold (typically based on height). The threshold can either be an absolute value or a ratio (i.e., S/N), in which case the user may also need to designate a noise threshold.
Chromatogram smoothing	Reduces the noise of a selected chromatogram.
Spectral deconvolution	Removal of undesired m/z peaks within a mass spectrum through various algorithms.
Isotopologue grouping	Grouping of unique mz@RT pairs that represent isotopologues of the same compound.
Adduct grouping	Grouping of unique mz@RT pairs (or isotopologue groups) that represent adducts of the same compound.
Between-sample alignment	Comparison of the detected features in two or more samples to determine if the same feature was detected (typically based on allowed variance or set windows for m/z and RT differences), and subsequent grouping/alignment of features that are detected in multiple samples. In some software platforms: for “grouped” or “aligned” features that are detected in multiple samples, an average or median m/z and RT may be subsequently used as the feature identifier.
Gap-filling	Detection of features that were missed during initial m/z selection and chromatogram extraction. Depending on the software platform, this may be a recursive extraction of real peaks (using slightly lower thresholds than the initial settings), or may use an algorithm to predict and generate likely peaks.
Feature filtering (m/z and RT range)	Filtering detected features based on retention time or m/z range.
Duplicate feature removal	Removing duplicate features, based on designated m/z and RT windows.
Replicate filter	Evaluating the frequency of feature detection across analytical or extraction replicates. Features that do not meet a designated replicate frequency threshold may be designated non-detect (e.g., set to zero abundance or a selected minimum abundance threshold).
Abundance thresholding and/or blank comparison	Applying an absolute peak height/area threshold to remove peaks, applying a peak height/area threshold based on a comparison to the peak height/area of the same feature detected in one or more blanks, or adjusting peak areas based on observed abundances in one or more blanks (e.g., blank subtraction).
Abundance normalization and/or scaling	Correction, normalization, or scaling of feature abundance (height or peak area), based on abundance data from isotope-labeled spiked standards, other quality control spikes, or other approaches such as z-score. Note that methods to correct the abundance of unannotated/unidentified features are established in the metabolomics literature (e.g. Sysi-Aho et al., 2007, Ejigu et al., 2013). Examples of approaches have been described by Drotleff and Lammerhofer 2019 – although their review is specific to lipidomics, some studies have applied similar approaches to environmental samples (e.g., Boysen et al. 2017). NTA researchers should carefully document the data used to generate correction, scaling, or normalization factors, as well as the process used to apply such factors.
Mass spectral library searching	Comparison of the experimental mass spectrum of a feature to a library of mass spectra of known compounds. An estimate of similarity is calculated (algorithms are often specific to the software tool, but should be described if possible).
Feature annotation and identification	Assignment of annotated property(ies) and chemical identity to a feature.
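To make the between-sample alignment step in Table 3.2 more concrete, the following sketch groups detections from different samples when they agree within an m/z tolerance (in ppm) and an RT window, then reports the mean m/z and RT as the group identifier. The greedy strategy, the function names, the tolerance defaults, and the example values are all assumptions for illustration; actual software platforms use their own algorithms and settings.

```python
from typing import Dict, List, Tuple

Detection = Tuple[float, float]            # (m/z, RT in minutes)
Member = Tuple[str, float, float]          # (sample name, m/z, RT)


def ppm_diff(mz_a: float, mz_b: float) -> float:
    """Absolute m/z difference expressed in parts per million of mz_b."""
    return abs(mz_a - mz_b) / mz_b * 1e6


def group_center(group: List[Member]) -> Tuple[float, float]:
    """Mean m/z and mean RT of a group, used as the aligned identifier."""
    n = len(group)
    return sum(m[1] for m in group) / n, sum(m[2] for m in group) / n


def align_features(samples: Dict[str, List[Detection]],
                   ppm_tol: float = 5.0,
                   rt_tol: float = 0.2) -> List[List[Member]]:
    """Greedy alignment: join the first group within both tolerances,
    otherwise start a new group."""
    groups: List[List[Member]] = []
    for sample, detections in samples.items():
        for mz, rt in detections:
            for group in groups:
                center_mz, center_rt = group_center(group)
                if ppm_diff(mz, center_mz) <= ppm_tol and abs(rt - center_rt) <= rt_tol:
                    group.append((sample, mz, rt))
                    break
            else:  # no existing group matched
                groups.append([(sample, mz, rt)])
    return groups


# Illustrative detections from two samples.
samples = {
    "sample_A": [(498.9302, 15.00), (412.9664, 11.20)],
    "sample_B": [(498.9310, 15.05), (300.1000, 5.00)],
}

groups = align_features(samples)
for group in groups:
    center_mz, center_rt = group_center(group)
    found_in = sorted({member[0] for member in group})
    print(f"{center_mz:.4f} @ {center_rt:.2f} min  detected in {found_in}")
```

The result of a greedy approach can depend on the order in which samples are processed, which is one reason the alignment algorithm and tolerances are worth reporting.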

Table 3.3 - Data Processing Software
Note: these tools are not exhaustive, but are meant to be representative of software that is available to NTA researchers. Additional non-targeted analysis software tools can be found at: Non-Targeted Analysis Software Tools
Name	Access Type	Website
ChromaTOF	LECO Corporation	https://www.lecosoftware.com https://www.leco.com/product/chromatof-software
Compound Discoverer	Thermo Fisher Scientific	https://www.thermofisher.com/us/en/home/industrial/mass-spectrometry/liquid-chromatography-mass-spectrometry-lc-ms/lc-ms-software/multi-omics-data-analysis/compound-discoverer-software.html https://planetorbitrap.com/compound-discoverer https://mycompounddiscoverer.com/
enviMass	freely available for contributors; purchased through enviBee	https://www.envibee.ch/eng/enviMass/overview.htm
enviPick	Open-source R package*	https://cran.r-project.org/package=enviPick
LabSolutions Insight Explore	Shimadzu	
MassHunter Profinder	Agilent Technologies	https://www.agilent.com/en/product/software-informatics/mass-spectrometry-software/data-analysis/mass-profiler-professional-software#howitworks
Mass Profiler Professional	Agilent Technologies	https://www.agilent.com/en/product/software-informatics/mass-spectrometry-software/data-analysis/mass-profiler-professional-software
MassHunter Qualitative Analysis	Agilent Technologies	https://www.agilent.com/en/product/software-informatics/mass-spectrometry-software/data-analysis/qualitative-analysis
MS-DIAL	open-source	http://prime.psc.riken.jp/compms/msdial/main.html
MassHunter Quantitative Analysis	Agilent Technologies	https://www.agilent.com/en/product/software-informatics/mass-spectrometry-software/data-analysis/quantitative-analysis
mzMine	open-source	http://mzmine.github.io/
nontarget	Open-source R package*	https://cran.r-project.org/package=nontarget
SCIEX OS	Sciex	https://sciex.com/products/software/sciex-os-software-x120805
SIRIUS + CSI:FingerID	open-source, java-based (GUI and CLI or Command Line version)	https://bio.informatik.uni-jena.de/software/sirius/
XCMS	open-source (online or via R script)	https://xcmsonline.scripps.edu/

*A vast number of R packages can be used to support NTA; an excellent summary was reported by Stanstrup et al. (2019).

Statistical & Chemometric Analysis


Statistical & chemometric analyses of processed data are used to aid summarization, evaluation, and interpretation of the reduced (but often still highly rich and complex) data. Basic statistical calculations and analyses (e.g., standard deviation, analysis of variance, Wilcoxon rank sum test, Chi-square test) are often used to summarize data, evaluate variability, and test basic hypotheses. Chemometric analyses (e.g., differential analysis, hierarchical clustering, dimensionality reduction) are often used to understand trends, clusters, and relationships between individual samples and/or individual features. As for data processing, we recommend that NTA researchers thoroughly document the goals of all statistical & chemometric analyses, the samples/sample groups to which such analyses were applied, and all software programs, methods, algorithms, and settings used to perform the analyses. Proposed questions to evaluate the quality of reporting about statistical outputs are provided in Data Outputs (Table 4.1), which also describes some specific details relevant to statistical methods reporting.

Examples of statistical analysis approaches are summarized in Table 3.4. Additional notes on relevant literature are provided below the table.

Table 3.4 – Statistical & Chemometric Analysis Methods
Method Type	Description	Examples of Specific Methods	Example Literature
Basic statistical analyses	Calculations and analyses that summarize the NTA data, evaluate variability, and test basic hypotheses.	● Standard deviation ● one-way analysis of variance (ANOVA) ● Wilcoxon rank sum test ● Chi-square test ● t-test
Data prioritization	Analysis of the data to isolate features that are: 1) routinely observed over multiple sample sets, 2) uniquely observed in a single sample set, 3) outliers across multiple sample sets, or 4) exhibit other trends or features of interest. Data prioritization is intended to reduce the data complexity and deem specific features a higher priority for further study. Specific statistical approaches to data prioritization (e.g., dimensionality reduction, cluster analysis, etc.) are detailed below. Prioritization based on features of interest may employ approaches such as mass defect filtering, homologous series detection (e.g., by Kendrick mass defect analysis), or isotope pattern analyses (e.g., to find specific elements, such as halogens).	See methods detailed below.	Hollender et al., 2017 Reichenbach et al., 2011
Dimensionality reduction	Analyses that identify patterns and groupings in the data by evaluating relationships between features and/or samples. Dimension reduction analyses can be used to: ● select/prioritize features ● classify features or samples ● provide a condensed view of high dimensional data, as visual detection of patterns can be useful to gain important insights from datasets.	● Principal Component Analysis (PCA) ● Partial Least Squares - Discriminant Analysis (PLS-DA) ● Non-metric Multidimensional Scaling (NMDS) Analysis ● Locally Linear Embedding (LLE) ● t-Distributed Stochastic Neighbor Embedding (t-SNE) ● Linear Discriminant Analysis (LDA)	Roweis and Saul, 2000 Reichenbach et al., 2011 Antonelli et al., 2018 Gao et al., 2019 Kalogiouri et al., 2016 Cavanna et al., 2020 Siren et al., 2019
Cluster analysis	Cluster analysis involves grouping samples by their similarities, and can be a tool used in data analysis, anomaly detection, dimensionality reduction and semi-supervised learning.	Clustering Methods: ● Hierarchical Cluster Analysis (HCA) ● k-means ● DBSCAN ● Gaussian Mixture Model (GMM) Methods to Determine Optimal Number of Clusters: ● Silhouette (precise, but computationally expensive) ● Elbow method (an “elbow” is an inflection point that may be used when comparing the number of clusters against inertia)	Cavanna et al., 2018 Gao et al., 2019
Differential analysis	In a case-control study design, differential analyses are intended to indicate/select features that distinguish the samples in the case set from those in the control set. Univariate statistical analysis can be applied to compare two or more independent (or dependent) groups. In univariate statistical analysis, parametric tests can be applied when the data is assumed to follow a normal distribution, and nonparametric tests can be applied when no assumption is made regarding the data distribution. Shape analysis may also be used to compare chromatographic and mass spectral features. Hundreds or thousands of features may be visualized using cloud plots.	Parametric tests: ● independent t test ● paired t test ● one-way ANOVA with post hoc tests ● repeated measures ANOVA Non-parametric tests: ● Mann-Whitney ● Wilcoxon signed-rank ● Kruskal-Wallis ● Friedman test Shape analysis: ● Generalized Procrustes Analysis Visualization: ● Cloud plots	Vorst et al., 2005 Katajamaa and Orešič, 2005 Gowda et al., 2014 Patti et al., 2013 Stegmann and Gomez, 2002 Gao et al., 2019
Pathways analysis (for exposomics and metabolomics)	The translation of individual features into pathway information for evaluation at the system biology/toxicology level. For example, an experiment might involve dosing in animal or cellular models, and subsequent evaluation of feature changes.	Univariate analysis to assess changes in features Cloud plots (often used in metabolomics to evaluate feature changes and aid in determining pathways).	Warth et al., 2017
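Table 3.4 mentions homologous series detection by Kendrick mass defect (KMD) analysis as one prioritization approach. The sketch below rescales m/z values so that a chosen repeating unit has an integer mass; members of one homologous series then share nearly the same KMD. The CF2 repeating unit, the sign convention (nearest-integer Kendrick mass minus Kendrick mass), and the approximate m/z values are assumptions chosen for illustration; other conventions exist.

```python
from typing import Dict

# Monoisotopic masses of the repeating units (C = 12, H = 1.00782503, F = 18.99840316).
REPEAT_UNITS: Dict[str, float] = {
    "CH2": 12.0 + 2 * 1.00782503,
    "CF2": 12.0 + 2 * 18.99840316,
}


def kendrick_mass(mz: float, unit: str = "CH2") -> float:
    """Rescale m/z so that the repeating unit has exactly its nominal mass."""
    exact = REPEAT_UNITS[unit]
    nominal = round(exact)
    return mz * nominal / exact


def kendrick_mass_defect(mz: float, unit: str = "CH2") -> float:
    """KMD as nearest-integer Kendrick mass minus Kendrick mass (one convention)."""
    km = kendrick_mass(mz, unit)
    return round(km) - km


# Approximate [M-H]- m/z values for three perfluorocarboxylic acids that
# differ by CF2, plus one unrelated m/z value for contrast.
series = {"C8": 412.96643, "C9": 462.96323, "C10": 512.96004}
unrelated = 300.1000

kmd_values = {name: kendrick_mass_defect(mz, "CF2") for name, mz in series.items()}
kmd_unrelated = kendrick_mass_defect(unrelated, "CF2")

for name, kmd in kmd_values.items():
    print(name, f"{kmd:.4f}")
print("unrelated", f"{kmd_unrelated:.4f}")
```

In practice, features would be grouped by KMD within a small tolerance and then checked for m/z spacing equal to the repeating unit and for plausible retention time trends.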

Notes on additional resources about statistical analysis:

Review Articles: There are few good reviews of statistical analysis methods in NTA beyond those specific to metabolomics. Bartel et al. (2013) provides a short primer on statistical techniques used in high-throughput metabolomics studies, covering many of the methods described in Table 3.4, and is well suited to those new to the field. Antonelli et al. (2018) provides a slightly longer resource.

Data Prioritization: Reichenbach et al. (2011) describes non-targeted cross-sample analysis workflows, focusing on GCxGC and LCxLC datasets. Reichenbach is also the primary author of GC-Image, a software package for GC chromatogram data analysis. The review covers five primary areas: visual image comparisons, datapoint feature analysis, peak feature analysis, region feature analysis and peak-region analysis.

Dimensionality Reduction: Reichenbach et al. (2011) and Antonelli et al. (2018) describe how principal component analysis (PCA) and other dimensionality reduction techniques are used in NTA and metabolomics workflows, while also outlining special considerations when working with NTA datasets. Gao et al. (2019) summarizes data analysis techniques used in food fraud detection, including PCA and partial least squares-discriminant analysis (PLS-DA), and includes references to studies where PCA was used for testing the authenticity of olive oil, milk, honey, meat, and other foods. Kalogiouri et al. (2016) reviews targeted and non-targeted LC-QTOF-MS in food authenticity, and includes recommendations for data analysis methods with numerous examples and visualizations of techniques such as PLS-DA. Other examples of specific studies include an NTA study by Cavanna et al. (2020) that examined HRMS data generated by two different laboratories in a lipidomics analysis of olive oil, and work by Siren et al. (2019) that used t-SNE, another dimensionality reduction technique, in an automated workflow for non-targeted GC-MS exploratory data analysis of rice-grain development.

Cluster Analysis: Gao et al. (2019) highlights the use of hierarchical cluster analysis (HCA) in food authenticity studies.

Differential Analysis: Gao et al. (2019) describes a method for univariate analysis in food fraud studies, by way of a conformity index (CI). CI is a measure of the number of standard deviations for each sample. Any samples that have a CI greater than the maxCI defined for that study would be considered atypical, which would raise concerns about their authenticity.
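The conformity index idea can be sketched in a few lines. The version below assumes that CI is the absolute deviation of a sample's feature abundance from the mean of a reference (authentic) sample set, expressed in units of that set's standard deviation; this reading, the function names, the `max_ci` value of 3, and the abundance values are assumptions for illustration and should be checked against Gao et al. (2019) before use.

```python
from statistics import mean, stdev
from typing import List


def conformity_index(value: float, reference: List[float]) -> float:
    """Number of reference standard deviations between `value` and the
    reference mean (assumed definition, see the note above)."""
    return abs(value - mean(reference)) / stdev(reference)


def is_atypical(value: float, reference: List[float], max_ci: float = 3.0) -> bool:
    """Flag a sample whose CI exceeds the study-defined maximum CI."""
    return conformity_index(value, reference) > max_ci


# Illustrative abundances of one feature in five authentic reference samples.
reference_abundances = [10.0, 10.5, 9.5, 10.2, 9.8]

ci_suspect = conformity_index(12.0, reference_abundances)
ci_typical = conformity_index(10.3, reference_abundances)

print(f"suspect sample CI = {ci_suspect:.2f}, atypical: {is_atypical(12.0, reference_abundances)}")
print(f"typical sample CI = {ci_typical:.2f}, atypical: {is_atypical(10.3, reference_abundances)}")
```

Whatever definition is used, the reference set, the variable(s) evaluated, and the maxCI threshold are the details a reader would need in order to reproduce the classification.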

Annotation and Identification


The goal of many NTA studies is to identify chemical compounds in complex mixtures, such as environmental waters or human serum. Data analysis methods may use a variety of approaches and tools to characterize the chemical composition of the sample, within the defined scope of the study (see: Objectives & Scope). In addition to defining the scope of the study (e.g., the chemical space considered), we recommend that NTA researchers also apply the concept of scope to chemical identifications to clearly define the limits of both the analytical approach and the evidence used to assign the identification. For example, it is often difficult to distinguish between multiple isomers of a compound by product ion (MS/MS) spectra; therefore, the scope of an identification may be limited to all possible isomers (e.g., possible enantiomers).

Below, we provide definitions of relevant terminology, followed by information about the data analysis methods used for annotation and identification, strategies and requirements to confirm structural proposals, and the use of compound databases and spectral libraries. Details about suggested reporting on annotation and identification results (including identification scope/confidence levels) are discussed in the Data Outputs section.

Annotation of a feature is the attribution of one or more properties or molecular characteristics to an MS1 feature (or component thereof, such as an isotopologue, adduct, or in-source product ion), or MS/MS product ion. Specific component, feature, or product ion annotations may not provide enough evidence to confidently identify a single compound (see Figure 3.1, Annotation & Identification). Examples of annotations include designation of an observed mz@RT as a specific adduct, assignment of a molecular formula to a feature or an MS/MS product ion, and assignment of a suggested substructure to an MS/MS product ion (see Table 4.2 in Data Outputs for additional examples).
Identification of a feature is the attribution of a specific compound, within a stated identification scope (or at a stated confidence level), to one or more detected features, when the annotated components or product ions provide enough evidence (see Figure 3.1).

Examples of Annotation & Identification:

Example #1: A feature has an exact mass of m/z 498.9302 (retention time 15 min) and an observed isotopic distribution, but there is no product ion information. Thus, a structure cannot be assigned, but the feature can be annotated with a molecular formula (C8F17HSO3), with supporting information for the molecular formula assignment from the isotopic distribution and mass error (the stated scope of the annotation).
If, for example, searching this molecular formula against an online chemical database containing metadata indicated that the chemical with this molecular formula that has the most literature sources is perfluorooctane sulfonic acid (PFOS) – then the compound could be tentatively identified as PFOS, albeit with lower confidence (i.e., the stated scope of the identification is limited to the molecular formula and supporting metadata information, without confident structural confirmation).
Example #2: A feature has enough annotated properties to attribute a specific feature to alanine, but within the design of the study the feature cannot be determined to be L-alanine or D-alanine. Therefore the feature can be labeled as identified as L/D-alanine, where the compound structure is presented without stereospecificity (stated scope of the identification).
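The mass error mentioned in Example #1 can be checked with simple arithmetic. The sketch below computes the monoisotopic mass of the proposed formula, the theoretical m/z of the deprotonated ion, and the mass error in ppm. It assumes the observed m/z corresponds to an [M-H]- ion (the example does not state the ion type); the helper names and the small element table are written for this illustration only.

```python
import re

# Monoisotopic masses (u) of the most abundant isotope of each element.
MONOISOTOPIC = {
    "C": 12.0,
    "H": 1.00782503,
    "F": 18.99840316,
    "O": 15.99491462,
    "S": 31.97207117,
}
ELECTRON_MASS = 0.00054858


def monoisotopic_mass(formula: str) -> float:
    """Sum element masses for a simple formula such as 'C8F17HSO3'
    (no parentheses, charges, or isotope labels)."""
    total = 0.0
    for element, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        total += MONOISOTOPIC[element] * (int(count) if count else 1)
    return total


def mz_deprotonated(formula: str) -> float:
    """Theoretical m/z of the singly charged [M-H]- ion: remove one hydrogen
    atom and add back one electron."""
    return monoisotopic_mass(formula) - MONOISOTOPIC["H"] + ELECTRON_MASS


def ppm_error(observed_mz: float, theoretical_mz: float) -> float:
    """Signed mass error in parts per million."""
    return (observed_mz - theoretical_mz) / theoretical_mz * 1e6


observed = 498.9302                       # value from Example #1
theoretical = mz_deprotonated("C8F17HSO3")
error = ppm_error(observed, theoretical)

print(f"theoretical [M-H]- m/z: {theoretical:.4f}")
print(f"mass error: {error:.2f} ppm")
```

A small mass error supports, but does not by itself confirm, the formula assignment; the isotopic distribution provides the additional evidence cited in the example.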

Figure 3.1. Schematic of the distinction between annotation and identification, showing the data used for each.

Data Analysis Methods for Identification

Property annotation can occur manually by an expert user or can be performed automatically using specialized software, such as those listed in Table 3.3. Data analysis methods specific to annotation and identification efforts are listed in Table 3.5.

Table 3.5 – Data Analysis Methods for Annotation and Identification
Method	Description	Example Information to Report	More information or example literature
Molecular formula prediction/assignment	Generation of a molecular formula for an observed feature, using information such as exact mass (m/z) and observed adducts, fragments, and isotopologues.	● Software platform used for formula prediction, and the algorithm used by the software ● Thresholds/settings for predicting the molecular formula ● Method for calculating a match score between predicted/observed isotopic distribution	Kind & Fiehn, 2007
Assignment of functional groups	Prediction of specific chemical functional groups in the observed compound, based on observed fragments or other experimental methodology.	● Any preparative method used to determine acid/base functional groups ● Spectroscopic (NMR/IR/UV-Vis) methods that were used to identify functional groups. ● Screening tools (e.g., Kendrick Mass Defect, homologous series searching) used to group known chemical class members with an unknown feature ● Reporting on the use of standards/known compounds to verify the efficacy of methods intended to group by chemical class	D’Agostino & Mabury, 2014
in silico fragmentation	Creation of in silico fragmentation mass spectra based on the structure of the proposed identity, followed by comparison of the generated spectra to the experimental mass spectra. Algorithms to score the quality of the match between the in silico and experimental spectra are often specific to the software platform.	● The in silico software or technique ● Size of the in silico training set ● Type of scoring ● Method/approach to validate the in silico fragmentation-based matches within the study (if used)	Blazenovic et al., 2017
Mass spectral interpretation	The annotation of compound fragment peaks to “build” the unknown compound structure or provide supporting information for the identity. Mass spectral interpretation can be performed through software algorithms, or manually by an NTA expert.	● Method (algorithm) and/or software used for mass spectral interpretation, with associated settings	Barzen-Hanson et al. 2017
Structural similarity searching	Using annotated fragment peaks, compounds of similar structure can be searched to provide possible chemical classification.	● Size of structure database and similarity estimating algorithms.	Cooper et al., 2019
Compound database searching	Searching a compound database based on precursor ion m/z (or neutral mass, formula, etc.) to find possible candidate compounds for identification. Chemical structures and compound metadata can provide supporting evidence for identification(s).	● The compound database used and summary statistics, including number of compounds and type of metadata ● Methods for use of metadata to assign compound identity (e.g., ranking of literature sources) ● Date of database access (unless database is a static product)	McEachran et al., 2017 See section on Usage of Compound Databases and Spectral Libraries
Spectral library searching	Searching a library of mass spectra (empirical or in silico) associated with specific compounds based on experimental mass spectra. Scoring of the spectral matches is critical to identification and can be conducted using a variety of mathematical approaches.	● The mass spectral library used and summary statistics, including number of compounds and spectra in the library ● Scoring algorithm used for spectral matching ● Date of library access (unless library is a static product)	Stein and Scott, 1994 See section on Usage of Compound Databases and Spectral Libraries
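As an illustration of the compound database searching row in Table 3.5, the sketch below converts an observed precursor m/z to a neutral mass and retrieves candidates within a ppm tolerance, while recording the search settings and access date for reporting. The three-entry in-memory "database", the assumed [M-H]- ion type, the function names, and the 5 ppm default are hypothetical; real searches query curated databases through their own interfaces.

```python
import datetime
from typing import Dict, List

PROTON_MASS = 1.00727646

# Hypothetical stand-in for a compound database (monoisotopic neutral masses).
COMPOUNDS: List[Dict] = [
    {"name": "PFOS", "formula": "C8HF17O3S", "mass": 499.93749},
    {"name": "PFOA", "formula": "C8HF15O2", "mass": 413.97370},
    {"name": "Caffeine", "formula": "C8H10N4O2", "mass": 194.08038},
]


def neutral_mass_from_deprotonated(mz: float) -> float:
    """Neutral monoisotopic mass for a singly charged [M-H]- ion."""
    return mz + PROTON_MASS


def search_by_mass(neutral_mass: float, ppm_tol: float = 5.0) -> List[Dict]:
    """Return candidates whose mass is within `ppm_tol`, closest match first."""
    hits = []
    for compound in COMPOUNDS:
        error = (neutral_mass - compound["mass"]) / compound["mass"] * 1e6
        if abs(error) <= ppm_tol:
            hits.append({**compound, "ppm_error": error})
    return sorted(hits, key=lambda hit: abs(hit["ppm_error"]))


observed_mz = 498.9302
query_mass = neutral_mass_from_deprotonated(observed_mz)
hits = search_by_mass(query_mass)

# Details worth keeping with the results for later reporting.
search_record = {
    "database_size": len(COMPOUNDS),
    "ion_type_assumed": "[M-H]-",
    "ppm_tolerance": 5.0,
    "access_date": datetime.date.today().isoformat(),
}

print([hit["name"] for hit in hits], search_record)
```

A mass-only search typically returns many candidates from a real database, which is why the table also asks for the metadata-based ranking method to be reported.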

Usage of Compound Databases and Spectral Libraries

A Compound Database is a structured collection of chemical substance information (e.g., chemical identifiers, intrinsic properties, structural identifiers, and retention times) in an exchangeable format and, commonly, a visually interpretable format. Databases are often used for compound-level annotation queries as well as for data compilation, organization, and management.
A Spectral Library is a repository of mass spectra (MS, MS/MS, MSⁿ) formatted for direct spectral matching to support annotation and identification. The spectral library may include association with compound-level information (e.g., chemical identifiers such as CAS number and intrinsic properties).

For appropriate usage in context, the main distinguishing factor between a library and database is the manner in which data are structured:

Databases are structured collections of information. For example, a compound database may contain chemical structure data, method information, quality control data, etc. all linked and/or related in a compound-centric manner. In contrast, libraries are spectral-centric repositories of mass spectra formatted for direct spectral matching. A spectral library may be stored in a database or database format, but a database can contain many levels of relevant data in addition to spectra and exist with or without spectra.
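
The structural distinction can be sketched with two hypothetical record types in Python. The class and field names below (`CompoundRecord`, `SpectrumRecord`, and so on) are assumptions made for illustration and do not correspond to the schema of any particular database or library.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class SpectrumRecord:
    """Spectral-library entry: the spectrum itself is the primary object."""
    precursor_mz: float
    peaks: List[Tuple[float, float]]  # (m/z, intensity) pairs
    ms_level: int = 2
    compound_id: Optional[str] = None  # optional link to compound-level information


@dataclass
class CompoundRecord:
    """Compound-database entry: everything is organized around the compound."""
    compound_id: str
    name: str
    formula: str
    retention_time_min: Optional[float] = None
    spectra: List[SpectrumRecord] = field(default_factory=list)  # may stay empty


# A database entry can exist without any spectra...
entry = CompoundRecord(compound_id="C1", name="example compound", formula="C8H10N4O2")

# ...and a library spectrum can exist with or without a compound link.
spectrum = SpectrumRecord(precursor_mz=195.0877, peaks=[(138.066, 100.0)])
entry.spectra.append(spectrum)
```

In this sketch a library would be a collection of `SpectrumRecord` objects prepared for matching, while a database would be a collection of `CompoundRecord` objects that may or may not carry spectra.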

The data contained within libraries and databases are often accessed via intermediary applications (installable software applications, web applications, etc.). These applications are distinct from the databases and libraries themselves, but are critical for accurate and efficient data access and management. There are also numerous means to calculate a spectral match when querying a spectral library with experimental data. When using spectral libraries, it is recommended to state both the intermediary application(s) and spectral match algorithm(s) used in the identification workflow. The level of curation greatly impacts the efficacy of a library or database and, most importantly, the accuracy with which its data can be applied. Accordingly, it is recommended that curation levels be transparent in libraries and databases (see Table 4.3 on Results Reporting: Compound Database or Spectral Library).
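
One of the many possible spectral match calculations is a cosine (normalized dot product) score. The Python sketch below is a simplified, unweighted version that assumes centroided spectra given as lists of (m/z, intensity) pairs and a hypothetical fragment tolerance of 0.01 m/z; the function name `cosine_score` and the greedy peak-pairing rule are choices made for this example. Production tools often add intensity or m/z weighting and more careful peak alignment, which is one reason the algorithm used should be reported.

```python
import math
from typing import List, Tuple

Spectrum = List[Tuple[float, float]]  # (m/z, intensity) pairs


def cosine_score(query: Spectrum, reference: Spectrum, mz_tol: float = 0.01) -> float:
    """Unweighted cosine similarity between two centroided spectra (0 to 1)."""
    used = set()  # indices of reference peaks that are already paired
    dot = 0.0
    for q_mz, q_int in query:
        # Greedily pair each query peak with the closest unused reference peak.
        best_index, best_diff = None, mz_tol
        for j, (r_mz, _) in enumerate(reference):
            diff = abs(q_mz - r_mz)
            if j not in used and diff <= best_diff:
                best_index, best_diff = j, diff
        if best_index is not None:
            used.add(best_index)
            dot += q_int * reference[best_index][1]
    norm_q = math.sqrt(sum(i * i for _, i in query))
    norm_r = math.sqrt(sum(i * i for _, i in reference))
    if norm_q == 0 or norm_r == 0:
        return 0.0  # an empty spectrum cannot match anything
    return dot / (norm_q * norm_r)


# Example: compare an experimental spectrum against one library spectrum.
experimental = [(110.071, 20.0), (138.066, 100.0)]
library_entry = [(110.072, 25.0), (138.066, 100.0)]
print(round(cosine_score(experimental, library_entry), 3))
```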

Many libraries and databases, both openly accessible and not, are frequently used by researchers (Table 3.6; see also: Additional Resources – Online Databases & Libraries).

Table 3.6 – Common Libraries and Databases
(updated October 2020)
Library/Database	Description	Size	Access Type	Website
MassBank	Public repository of mass spectra driven by user depositions, the first repository of its kind for small molecules	~14,000 unique compounds, ~80,000 unique spectra	Web application, GitHub for data downloads	https://massbank.eu/MassBank/
MassBank of North America (MoNA)	Public, auto-curating repository intended to focus on metadata and designed for efficient storage/querying of mass spectra	~223,000 unique compounds, ~650,000 spectral records	Web application, data downloads, API	https://mona.fiehnlab.ucdavis.edu/
CompTox Chemicals Dashboard	Public web resource of high-quality, structure-curated, open data for environmental chemistry and computational toxicology	~882,000 compounds	Web application, data downloads, API	https://comptox.epa.gov/dashboard
PubChem	Open chemistry database with chemical structures, identifiers, properties, etc.	111M compounds	Web application, data downloads, API	https://pubchem.ncbi.nlm.nih.gov/
ChemSpider	Free chemical structure database with identifiers, properties, etc.	90M chemical compounds	Web application, API	http://www.chemspider.com/
Metlin	Spectral repository and search engine for identification, local version for purchase	>1M compounds, ~850,000 with MS/MS spectra	Web application, local version	https://metlin.scripps.edu/
DrugBank	Freely accessible pharmaceutical knowledge base including chemical structures and clinical level information	~13,600 compounds	Web application	https://www.drugbank.ca/
HMDB	Freely available electronic database containing detailed information about small molecule metabolites found in the human body	~114,000 compounds	Web application	https://hmdb.ca/
Global Natural Products Social Molecular Networking (GNPS)	Web-based mass spectrometry knowledge base for community-wide organization and sharing of data	~78,403 spectra	Web application, data downloads	https://gnps.ucsd.edu/

Confirmation of Structural Proposal

A commonly accepted strategy to confirm a structural proposal (i.e., a tentative identification) is by measurement of an authentic reference standard with MS and/or MSⁿ and retention time matching. If possible, an orthogonal method should be used to support the identification. Reference standards used for the confirmation of a chemical structure should be characterized by providing sufficient evidence for the chemical identity of the standard (e.g., Certificate of Analysis (CoA)).
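
A minimal Python sketch of the comparison against a reference standard follows. It checks only precursor m/z and retention time agreement; the names `Measurement` and `matches_standard` are hypothetical, and the default tolerances (5 ppm and 0.1 min) are placeholders chosen for illustration, not acceptance criteria taken from any guideline. Fragment spectrum agreement and orthogonal evidence would need to be assessed separately.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Measurement:
    precursor_mz: float
    retention_time_min: float


def matches_standard(
    sample: Measurement,
    standard: Measurement,
    mz_tol_ppm: float = 5.0,   # placeholder tolerance, set per study
    rt_tol_min: float = 0.1,   # placeholder tolerance, set per method
) -> Dict[str, bool]:
    """Report whether the sample agrees with the standard in m/z and RT."""
    ppm = abs(sample.precursor_mz - standard.precursor_mz) / standard.precursor_mz * 1e6
    rt_diff = abs(sample.retention_time_min - standard.retention_time_min)
    return {"mz_ok": ppm <= mz_tol_ppm, "rt_ok": rt_diff <= rt_tol_min}


# Example: a tentatively identified feature versus the measured standard.
feature = Measurement(precursor_mz=195.0877, retention_time_min=3.42)
reference_standard = Measurement(precursor_mz=195.0876, retention_time_min=3.40)
print(matches_standard(feature, reference_standard))
```

Both measurements should come from the same analytical method, since retention times are only comparable under matching chromatographic conditions.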

The minimum requirements for confirming chemical structures, and the various schemes for reporting identification confidence, are discussed in detail within the scientific community. Nevertheless, regulatory entities and key scientific opinion leaders broadly agree on the basic requirements; more information can be found in the following guidelines and communications:

U.S. Food and Drug Administration, “Acceptance Criteria for Confirmation of Identity of Chemical Residues using Exact Mass Data for the FDA FVM Program” (2015): https://www.fda.gov/media/96499/download
Schymanski et al., “Identifying Small Molecules via High Resolution Mass Spectrometry: Communicating Confidence” (2014): https://doi.org/10.1021/es5002105
European Commission, “Guidance document on analytical quality control and method validation procedures for pesticide residues and analysis in food and feed” (2017): https://ec.europa.eu/food/sites/food/files/plant/docs/pesticides_mrl_guidelines_wrkdoc_2017-11813.pdf
NIST, “Metrological Tools for the Reference Materials and Reference Instruments of the NIST Material Measurement Laboratory” (2020): https://doi.org/10.6028/NIST.SP.260-136-2020

References & Other Relevant Literature

Aliakbarzadeh, G., Sereshti, H., & Parastar, H. (2016). Pattern recognition analysis of chromatographic fingerprints of Crocus sativus L. secondary metabolites towards source identification and quality control. Analytical and Bioanalytical Chemistry, 408(12), 3295-3307. doi:10.1007/s00216-016-9400-8

Antonelli, J., Claggett, B., Mir, H., Watrous, J. D., Lehmann, K. A., Huschcha, P., . . . Cheng, S. (2018). Statistical Methods and Workflow for Analyzing Human Metabolomics Data. arXiv. Retrieved from https://arxiv.org/abs/1710.03436

Bartel, J., Krumsiek, J., & Theis, F. J. (2013). Statistical methods for the analysis of high-throughput metabolomics data. Computational and Structural Biotechnology Journal, 4, e201301009. doi:10.5936/csbj.201301009

Barzen-Hanson, K.A., Roberts, S.C., Choyke, S., Oetjen, K., McAlees, A., Riddell, N., McCrindle, R., Ferguson, P.L., Higgins, C.P., & Field, J.A. (2017). Discovery of 40 Classes of Per- and Polyfluoroalkyl Substances in Historical Aqueous Film-Forming Foams (AFFFs) and AFFF-Impacted Groundwater. Environmental Science & Technology, 51(4), 2047-2057. doi:10.1021/acs.est.6b05843

Beauchamp, C. R., Camara, J. E., Carney, J., Choquette, S. J., Cole, K. D., DeRose, P. C., . . . Sieber, J. R. (2020). Metrological Tools for the Reference Materials and Reference Instruments of the NIST Material Measurement Laboratory. doi:10.6028/NIST.SP.260-136-2020

Blaženović, I., Kind, T., Torbašinović, H., Obrenović, S., Mehta, S. S., Tsugawa, H., . . . Fiehn, O. (2017). Comprehensive comparison of in silico MS/MS fragmentation tools of the CASMI contest: database boosting is needed to achieve 93% accuracy. J Cheminform, 9(1), 32. doi:10.1186/s13321-017-0219-x

Boysen, A. K., Heal, K. R., Carlson, L. T., & Ingalls, A. E. (2018). Best-Matched Internal Standard Normalization in Liquid Chromatography–Mass Spectrometry Metabolomics Applied to Environmental Samples. Analytical Chemistry, 90(2), 1363-1369. doi:10.1021/acs.analchem.7b04400

Cavanna, D., Hurkova, K., Džuman, Z., Serani, A., Serani, M., Dall’Asta, C., . . . Suman, M. (2020). A Non-Targeted High-Resolution Mass Spectrometry Study for Extra Virgin Olive Oil Adulteration with Soft Refined Oils: Preliminary Findings from Two Different Laboratories. ACS Omega, 5(38), 24169-24178. doi:10.1021/acsomega.0c00346

Cavanna, D., Righetti, L., Elliott, C., & Suman, M. (2018). The scientific challenges in moving from targeted to non-targeted mass spectrometric methods for food fraud analysis: A proposed validation workflow to bring about a harmonized approach. Trends in Food Science & Technology, 80, 223-241. doi:10.1016/j.tifs.2018.08.007

Cooper, B. T., Yan, X., Simon-Manso, Y., Tchekhovskoi, D. V., Mirokhin, Y. A., & Stein, S. E. (2019). Hybrid Search: A Method for Identifying Metabolites Absent from Tandem Mass Spectrometry Libraries. Analytical Chemistry, 91(21), 13924-13932. doi:10.1021/acs.analchem.9b03415

D’Agostino, L.A., & Mabury, S.A. (2014). Identification of Novel Fluorinated Surfactants in Aqueous Film Forming Foams and Commercial Surfactant Concentrates. Environmental Science & Technology, 48(1), 121-129. doi:10.1021/es403729e

Drotleff, B., & Lämmerhofer, M. (2019). Guidelines for Selection of Internal Standard-Based Normalization Strategies in Untargeted Lipidomic Profiling by LC-HR-MS/MS. Analytical Chemistry, 91(15), 9836-9843. doi:10.1021/acs.analchem.9b01505

Ejigu, B.A., Valkenborg, D., Baggerman, G., Vanaerschot, M., Witters, E., Dujardin, J., Burzykowski, T., & Berg, M. (2013). Evaluation of Normalization Methods to Pave the Way Towards Large-Scale LC-MS-Based Metabolomics Profiling Experiments. OMICS, 17(9), 473-485. doi:10.1089/omi.2013.0010

European Commission. (2017). Guidance document on analytical quality control and method validation procedures for pesticide residues and analysis in food and feed. (SANTE/11813/2017). https://ec.europa.eu/food/sites/default/files/plant/docs/pesticides_mrl_guidelines_wrkdoc_2017-11813.pdf

Fisher, C. M., Croley, T. R., & Knolhoff, A. M. (2021). Data processing strategies for non-targeted analysis of foods using liquid chromatography/high-resolution mass spectrometry. TrAC Trends in Analytical Chemistry, 136, 116188. doi:10.1016/j.trac.2021.116188

Gao, B., Holroyd, S. E., Moore, J. C., Laurvick, K., Gendel, S. M., & Xie, Z. (2019). Opportunities and Challenges Using Non-targeted Methods for Food Fraud Detection. Journal of Agricultural and Food Chemistry, 67(31), 8425-8430. doi:10.1021/acs.jafc.9b03085

García, R. A., Chiaia-Hernández, A. C., Lara-Martin, P. A., Loos, M., Hollender, J., Oetjen, K., . . . Field, J. A. (2019). Suspect Screening of Hydrocarbon Surfactants in AFFFs and AFFF-Contaminated Groundwater by High-Resolution Mass Spectrometry. Environmental Science & Technology, 53(14), 8068-8077. doi:10.1021/acs.est.9b01895

Gowda, H., Ivanisevic, J., Johnson, C. H., Kurczy, M. E., Benton, H. P., Rinehart, D., . . . Siuzdak, G. (2014). Interactive XCMS Online: simplifying advanced metabolomic data processing and subsequent statistical analyses. Analytical Chemistry, 86(14), 6931-6939. doi:10.1021/ac500734c

Hollender, J., Schymanski, E. L., Singer, H. P., & Ferguson, P. L. (2017). Nontarget screening with high resolution mass spectrometry in the environment: Ready to go? Environmental Science & Technology, 51(20), 11505-11512. doi:10.1021/acs.est.7b02184

Kalogiouri, N. P., Alygizakis, N. A., Aalizadeh, R., & Thomaidis, N. S. (2016). Olive oil authenticity studies by target and nontarget LC-QTOF-MS combined with advanced chemometric techniques. Analytical and Bioanalytical Chemistry, 408(28), 7955-7970. doi:10.1007/s00216-016-9891-3

Katajamaa, M., & Orešič, M. (2005). Processing methods for differential analysis of LC/MS profile data. BMC Bioinformatics, 6, 179. doi:10.1186/1471-2105-6-179

Kind, T., & Fiehn, O. (2007). Seven Golden Rules for heuristic filtering of molecular formulas obtained by accurate mass spectrometry. BMC Bioinformatics, 8, 105. doi:10.1186/1471-2105-8-105

Martens, L., Chambers, M., Sturm, M., Kessner, D., Levander, F., Shofstahl, J., . . . Deutsch, E. W. (2011). mzML–a community standard for mass spectrometry data. Molecular & Cellular Proteomics, 10(1), R110.000133. doi:10.1074/mcp.R110.000133

McEachran, A. D., Sobus, J. R., & Williams, A. J. (2017). Identifying known unknowns using the US EPA’s CompTox Chemistry Dashboard. Analytical and Bioanalytical Chemistry, 409(7), 1729-1735. doi:10.1007/s00216-016-0139-z

Patti, G. J., Tautenhahn, R., Rinehart, D., Cho, K., Shriver, L. P., Manchester, M., . . . Siuzdak, G. (2013). A view from above: cloud plots to visualize global metabolomic data. Analytical Chemistry, 85(2), 798-804. doi:10.1021/ac3029745

Rampler, E., Abiead, Y. E., Schoeny, H., Rusz, M., Hildebrand, F., Fitz, V., & Koellensperger, G. (2020). Recurrent Topics in Mass Spectrometry-Based Metabolomics and Lipidomics-Standardization, Coverage, and Throughput. Analytical Chemistry, 93(1), 519-545. doi:10.1021/acs.analchem.0c04698

Reichenbach, S. E., Tian, X., Cordero, C., & Tao, Q. (2012). Features for non-targeted cross-sample analysis with comprehensive two-dimensional chromatography. Journal of Chromatography A, 1226, 140-148. doi:10.1016/j.chroma.2011.07.046

Rochat, B. (2017). Proposed Confidence Scale and ID Score in the Identification of Known-Unknown Compounds Using High Resolution MS Data. Journal of the American Society for Mass Spectrometry, 28(4), 709-723. doi:10.1007/s13361-016-1556-0

Roweis, S. T., & Saul, L. K. (2000). Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500), 2323-2326. doi:10.1126/science.290.5500.2323

Schymanski, E. L., Jeon, J., Gulde, R., Fenner, K., Ruff, M., Singer, H. P., & Hollender, J. (2014). Identifying small molecules via high resolution mass spectrometry: Communicating confidence. Environmental Science & Technology, 48(4), 2097-2098. doi:10.1021/es5002105

Sirén, K., Fischer, U., & Vestner, J. (2019). Automated supervised learning pipeline for non-targeted GC-MS data analysis. Analytica Chimica Acta: X, 1, 100005. doi:10.1016/j.acax.2019.100005

Stanstrup, J., Broeckling, C. D., Helmus, R., Hoffmann, N., Mathé, E., Naake, T., . . . Neumann, S. (2019). The metaRbolomics Toolbox in Bioconductor and beyond. Metabolites, 9(10). doi:10.3390/metabo9100200

Stegmann, M. B., & Gomez, D. D. (2002). A Brief Introduction to Statistical Shape Analysis. Informatics and Mathematical Modelling. Retrieved from https://graphics.stanford.edu/courses/cs164-09-spring/Handouts/paper_shape_spaces_imm403.pdf

Stein, S. E., & Scott, D. R. (1994). Optimization and testing of mass spectral library search algorithms for compound identification. Journal of the American Society for Mass Spectrometry, 5(9), 859-866. doi:10.1016/1044-0305(94)87009-8

Sysi-Aho, M., Katajamaa, M., Yetukuri, L., & Orešič, M. (2007). Normalization method for metabolomics data using optimal selection of multiple internal standards. BMC Bioinformatics, 8, 93. doi:10.1186/1471-2105-8-93

U.S. Food and Drug Administration. (2015). Acceptance Criteria for Confirmation of Identity of Chemical Residues using Exact Mass Data for the FDA FVM Program. https://www.fda.gov/media/96499/download

Vorst, O., Vos, C. H. R. d., Lommen, A., Staps, R. V., Visser, R. G. F., Bino, R. J., & Hall, R. D. (2005). A non-directed approach to the differential analysis of multiple LC–MS-derived metabolic profiles. Metabolomics, 1(2), 169-180. doi:10.1007/s11306-005-4432-7

Warth, B., Spangler, S., Fang, M., Johnson, C. H., Forsberg, E. M., Granados, A., . . . Siuzdak, G. (2017). Exposome-Scale Investigations Guided by Global Metabolomics, Pathway Analysis, and Cognitive Computing. Analytical Chemistry, 89(21), 11505-11513. doi:10.1021/acs.analchem.7b02759
