Annotation (rather than full elucidation) of EI Spectra

In a typical GC-MS study, analytical chemists usually submit their electron-impact (EI) mass spectra to a software system that attempts to identify each spectrum by matching it exactly against reference spectra from a well-curated library. Such libraries are built by systematically running pure compounds under standard conditions to obtain high-quality references; the NIST 08 EI library, which contains more than 200,000 compounds, is a good example.
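As an illustration of library matching, the sketch below ranks reference spectra by cosine similarity to a query spectrum. This is an assumption made for illustration only: the text does not say which similarity measure a given library search tool uses, and the function names, compound names and peak values are hypothetical toy data, not real spectra.

```python
import math


def cosine_similarity(spec_a, spec_b):
    """Cosine similarity between two spectra given as {m/z: intensity} dicts."""
    shared = set(spec_a) & set(spec_b)
    dot = sum(spec_a[mz] * spec_b[mz] for mz in shared)
    norm_a = math.sqrt(sum(v * v for v in spec_a.values()))
    norm_b = math.sqrt(sum(v * v for v in spec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_library(query, library):
    """Return (name, score) pairs sorted from best to worst match."""
    scores = [(name, cosine_similarity(query, ref)) for name, ref in library.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy library: integer m/z values mapped to relative intensities.
library = {
    "compound_A": {43: 100.0, 58: 30.0, 71: 10.0},
    "compound_B": {77: 100.0, 105: 80.0, 51: 20.0},
}
query = {43: 95.0, 58: 35.0, 71: 8.0}

ranked = rank_library(query, library)
best_name, best_score = ranked[0]
```

A score close to 1 indicates a near-identical peak pattern, while a spectrum sharing no peaks with the query scores 0.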

If an exact match is not found, the analytical chemist is left with two choices: inspecting the nearest, imperfect hits, or applying software systems that attempt to elucidate the structure of the compound de novo from the spectrum at hand. Software systems that do this include AMDIS and Mass Frontier. Central to these tools is the concept of chemical (sub)structure identification: the goal of the analysis is always to suggest a specific structure, or to classify spectra according to a user-defined classification, typically driven by substructure constraints.

ARISTO instead attempts to match spectra directly to a formal, standardized set of annotations, without explicitly analyzing the match between substructures and spectra. This is possible because ARISTO leverages a new development in the chemical informatics community, namely the emergence of ChEBI, a formal ontology which aims to cover Chemical Entities of Biological Interest. This resource, curated by the EBI, was first published in 2008 and has been experiencing exponential growth for the last 3 years. In a sense, ARISTO uses ChEBI to analyze small-molecule spectrometry in a fashion analogous to the way bioinformatics tools use the Gene Ontology (GO) to analyze microarray data. Since ChEBI is still a very young resource (there are currently ~20,000 well-curated entries) and the exact overlap between ChEBI and the NIST library is still very small (to date, 3,000 NIST spectra have been analyzed according to the ChEBI ontology), ARISTO must be approached as an experimental system, to be used when exact spectral matches are unavailable and approximate matches do not yield an obvious conserved substructure.
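To make the GO analogy concrete, the sketch below shows how ontology annotations are commonly propagated: a compound assigned to a concept is also treated as a member of every ancestor of that concept along is_a links. This is a general ontology technique, not a description of how ARISTO is implemented, and the concept and compound identifiers are hypothetical placeholders rather than real ChEBI entries.

```python
def ancestors(concept, parents):
    """All concepts reachable from `concept` via is_a links, including itself."""
    seen = set()
    stack = [concept]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        stack.extend(parents.get(current, []))
    return seen


def propagate(direct_annotations, parents):
    """Expand each compound's direct concepts to the full set of implied concepts."""
    expanded = {}
    for compound, concepts in direct_annotations.items():
        full = set()
        for concept in concepts:
            full |= ancestors(concept, parents)
        expanded[compound] = full
    return expanded


# Hypothetical toy ontology: child concept -> list of parent concepts (is_a).
parents = {
    "concept_leaf": ["concept_mid"],
    "concept_mid": ["concept_root"],
    "concept_other": ["concept_root"],
}
direct_annotations = {
    "compound_1": ["concept_leaf"],
    "compound_2": ["concept_other"],
}

expanded = propagate(direct_annotations, parents)
```

Under this scheme, broad concepts near the root collect many member compounds, while very specific concepts collect few.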

The system returns a list of 388 annotations, corresponding to ChEBI entities (concepts) with at least 10 member compounds. For each concept, a score is generated, from which a probability of correctness is derived. The user can inspect ROC curves and precision/recall plots for the training data, as well as follow a link to the official ChEBI entry for any given concept. To assess the capabilities of the system, 32 random compounds were withheld from the training phase. A detailed investigation of all 32 results is reported in an associated publication. Users can get a sense of the system's capabilities by trying out some of these testing-phase spectra, made available under the Examples tab of the website, or by uploading the entire list, which is made available in the Batch Mode tab.
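For readers unfamiliar with precision/recall plots, the sketch below computes one point of such a plot for a single concept: given per-spectrum scores and the known membership labels, it counts how many predictions above a threshold are correct. The scores, labels and thresholds are hypothetical toy values, and the function is a generic illustration rather than ARISTO's own scoring code.

```python
def precision_recall(scores, labels, threshold):
    """Precision and recall when predicting membership for scores >= threshold."""
    tp = fp = fn = 0
    for score, is_member in zip(scores, labels):
        predicted = score >= threshold
        if predicted and is_member:
            tp += 1
        elif predicted and not is_member:
            fp += 1
        elif not predicted and is_member:
            fn += 1
    # Convention assumed here: precision is 1.0 when nothing is predicted positive.
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical toy data for one concept: score per spectrum and true membership.
scores = [0.9, 0.8, 0.4, 0.3, 0.1]
labels = [True, False, True, False, False]

strict = precision_recall(scores, labels, threshold=0.5)
lenient = precision_recall(scores, labels, threshold=0.35)
```

Sweeping the threshold over all observed scores and plotting the resulting pairs yields the full precision/recall curve for that concept; lowering the threshold typically raises recall at the cost of precision.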