Paper title: Decision tree supported substructure prediction of metabolites from GC-MS profiles
Scholar citations: 228
Pages: 12
Publication date: 2010.06
Journal: Metabolomics
Authors: Jan Hummel, Nadine Strehmel, ..., Joachim Kopka
Abstract:
Gas chromatography coupled to mass spectrometry (GC-MS) is one of the most widespread routine technologies applied to the large scale screening and discovery of novel metabolic biomarkers. However, currently the majority of mass spectral tags (MSTs) remains unidentified due to the lack of authenticated pure reference substances required for compound identification by GC-MS. Here, we accessed the information on reference compounds stored in the Golm Metabolome Database (GMD) to apply supervised machine learning approaches to the classification and identification of unidentified MSTs without relying on library searches. Non-annotated MSTs with mass spectral and retention index (RI) information together with data of already identified metabolites and reference substances have been archived in the GMD. Structural feature extraction was applied to sub-divide the metabolite space contained in the GMD and to define the prediction target classes. Decision tree (DT)-based prediction of the most frequent substructures based on mass spectral features and RI information is demonstrated to result in highly sensitive and specific detections of sub-structures contained in the compounds. The underlying set of DTs can be inspected by the user and are made available for batch processing via SOAP (Simple Object Access Protocol)-based web services. The GMD mass spectral library with the integrated DTs is freely accessible for non-commercial use at http://gmd.mpimp-golm.mpg.de/. All matching and structure search functionalities are available as SOAP-based web services. A XML + HTTP interface, which follows Representational State Transfer (REST) principles, facilitates read-only access to data base entities.
Keywords
Metabolic markers, Gas chromatography (GC), Mass spectrometry (MS), GC-MS, Mass spectral classification, Mass spectral matching, Metabolite fingerprinting, Metabolite profiling, Metabolomics, Metabonomics, Decision trees
Conclusion:
The mass spectrum and RI compendium of the GMD has been used as a training data set for a supervised machine learning approach using a DT algorithm for the classification of MSTs and the retrieval of human-interpretable classification rules. The success of updating efforts can be assessed using DT cross validation (CV) parameters, such as precision, recall, F_pr-measure, MCC, and the CV error, which have been implemented in the course of this project to compare DT performance and to reject weak prediction models. The extension towards DT analyses of those substructures which are underrepresented in the current GMD dataset appears to be an attractive goal. Finally, given the availability of multiple DTs for the prediction of one particular substructure, the application of DT forests may be worthwhile.
Introduction:
Two factors contribute to the high complexity of typical GC-(TOF)-MS experiments. Metabolites of interest need to be chemically modified. Thus, more than one single analyte per metabolite may be generated, and thorough chemical interpretations of observed analytes with respect to their mass spectral and retention index (RI) properties are required. A compound library comparison, as the most straightforward approach for identifying components from GC-MS analyses, relies on the availability of authenticated pure reference substances. The lack of chemically synthesized reference substances and of purified bio-molecules may be seen as the current bottleneck of comprehensive compound identification, as identification is only possible if the detected compound is present in the library of reference compounds. Tools utilising RI information for the matching process, adapted for the processing of large metabolite profiling experiments, recommend manually supervised compound identification. These authors tested the prediction of 10 structural characteristics and found neural networks to be superior compared to k-nearest neighbour (KNN) classification, linear discriminant analysis, or principal component models. We developed a decision tree (DT) based classifier. We focus on metabolites and substructures of metabolic origin. We chose the DT algorithm to solve the classification problem. This algorithm is employed for the recognition of patterns in mass fragmentation spectra that distinguish classes of compounds which either contain or lack a specific predefined chemical moiety.
Paper outline:
1. Introduction
2. Methods
2.1 Integration of metabolite structures into the GMD
2.2 Decision tree supported substructure prediction
2.2.1 MST subsetting and RI assignment
2.2.2 Mass spectral features
2.2.3 Decision tree generation
2.2.4 Performance measures
3. Results and discussion
3.1 Evaluation of the decision tree performance
3.1.1 Cross validation of decision trees
3.1.2 Analysis of MST feature usage
3.1.3 Application cases
3.2 Availability, GMD web site and web services
4. Concluding remarks
Excerpts from the main text:
1. Biological Problem: What biological problems have been solved in this paper?
Substructure prediction of metabolites from GC-MS profiles.
2. Main discoveries: What are the main discoveries in this paper?
Decision tree (DT)-based prediction of the most frequent substructures based on mass spectral features and RI information is demonstrated to result in highly sensitive and specific detections of sub-structures contained in the compounds.
3. ML (Machine Learning) Methods: What are the ML methods applied in this paper?
In the current GMD release, 6,187 mass spectra are available representing 2,444 analytes and 1,535 metabolites. For every considered functional group, we classify MSTs as belonging either to the class that contains the functional group or to the class that does not. Thus, we perform a binary classification.
Decision tree; 50-fold CV.
4. ML Advantages: Why are these ML methods better than the traditional methods in these biological problems?
Traditional methods: automated mass spectral deconvolution.
The DT method was applied because multiple parameter types, categorical and numerical, can be integrated, and no assumptions about numerical parameter distributions or about the nature of discriminating functions (e.g. linear, non-linear, multimodal) are required.
5. Biological Significance: What is the biological significance of these ML methods’ results?
The success of updating efforts can be assessed using DT cross validation (CV) parameters, such as precision, recall, F_pr-measure, MCC, and the CV error, which have been implemented in the course of this project to compare DT performance and to reject weak prediction models. As the DT algorithm supports this approach, we reasoned that the biologist's and GMD user’s interest lies more in the metabolite structure than in the methoxyaminated and trimethylsilylated compounds inherently required for GC-MS based metabolite profiling.
6. Prospect: What are the potential applications of these machine learning methods in biological science?
The extension towards DT analyses of those substructures which are underrepresented in the current GMD dataset appears to be an attractive goal. Finally, given the availability of multiple DTs for the prediction of one particular substructure, the application of DT forests may be worthwhile.
7. My Question (Optional)
Because an SQL Server table is limited to 1,024 columns, the predictor variables had to be pre-selected. That sounds arduous!
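Side note on the CV parameters mentioned in the conclusion (precision, recall, F-measure, MCC): for a binary "substructure present / absent" tree, these can all be derived from the four confusion-matrix counts. The sketch below is only an illustration under assumptions. It uses the standard textbook definitions (the F-measure as the harmonic mean of precision and recall), which may differ in detail from the paper's F_pr-measure; the function name `binary_cv_metrics` and the counts are hypothetical and not taken from the paper. The CV error is left out because these notes do not record how it is defined.

```python
import math


def binary_cv_metrics(tp, fp, tn, fn):
    """Precision, recall, F-measure and MCC from confusion-matrix counts.

    Standard definitions are assumed; undefined ratios fall back to 0.0.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # Harmonic mean of precision and recall (assumed F-measure definition).
    if precision + recall:
        f_measure = 2 * precision * recall / (precision + recall)
    else:
        f_measure = 0.0
    # Matthews correlation coefficient.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
        "mcc": mcc,
    }


# Hypothetical pooled CV counts for one substructure tree (not from the paper).
metrics = binary_cv_metrics(tp=40, fp=10, tn=45, fn=5)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

A weak tree would show up as a low MCC even when precision or recall alone looks acceptable, which is presumably why several measures are reported side by side.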