Paper reading (52): MetaPheno: A critical evaluation of ML in metagenome-based disease prediction

mac2025-04-16 43

Paper title: MetaPheno: A critical evaluation of deep learning and machine learning in metagenome-based disease prediction

Google Scholar citations: 3

Pages: 9

Published: 2019.03

Journal: Methods

Authors: Nathan LaPierre, Chelsea J.-T. Ju, Guangyu Zhou, Wei Wang

Highlights:

The human microbiome plays a critical role in human health and disease.

Predicting disease status from metagenomic data is increasingly important.

We review new methods for this task, focusing mostly on deep learning.

We perform an in-depth analysis of challenging type 2 diabetes and obesity datasets.

We offer perspectives on study design concerns and potential future directions.

Abstract:

The human microbiome plays a number of critical roles, impacting almost every aspect of human health and well-being. Conditions in the microbiome have been linked to a number of significant diseases. Additionally, revolutions in sequencing technology have led to a rapid increase in publicly-available sequencing data. Consequently, there have been growing efforts to predict disease status from metagenomic sequencing data, with a proliferation of new approaches in the last few years. Some of these efforts have explored utilizing a powerful form of machine learning called deep learning, which has been applied successfully in several biological domains. Here, we review some of these methods and the algorithms that they are based on, with a particular focus on deep learning methods. We also perform a deeper analysis of Type 2 Diabetes and obesity datasets that have eluded improved results, using a variety of machine learning and feature extraction methods. We conclude by offering perspectives on study design considerations that may impact results and future directions the field can take to improve results and offer more valuable conclusions. The scripts and extracted features for the analyses conducted in this paper are available via GitHub: https://github.com/nlapier2/metapheno. (Note: only 8 stars so far.)

KeyWords: Deep learning, Machine learning, Metagenomics, Phenotype prediction

Conclusion:

Disease prediction using metagenomic sequence data has shown some potential, with a particularly large amount of effort having been put into deep learning methods, but remains challenging. (Note: is this mainly about metagenomic sequence data? What about non-sequence data?)

Study methodology must remain consistent to compare different classification methods, especially when margins of difference in performance are so small. (Note: keep the study methodology consistent.)

Feature extraction is as crucial to predictive performance as the classification methods themselves, and deserves increased attention. (Note: pay more attention to feature extraction.)

Supplementing metagenomic data with human genetic data may be the best way to improve both classification performance and biological understanding, especially with hard-to-classify complex traits such as obesity and type 2 diabetes. This is because genetic and metagenomic data provide complementary information about the host and environment, respectively.
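One way to act on the point about consistent methodology is to build the cross-validation folds once and reuse them for every classifier being compared. The sketch below is only an illustration of that idea, not the procedure used in the paper; the function name `stratified_folds`, the fold count, and the seed are all assumptions made for this example.

```python
import random
from collections import defaultdict


def stratified_folds(labels, n_folds=5, seed=0):
    """Assign sample indices to folds, keeping class proportions roughly equal.

    Hypothetical helper: the fixed seed makes the split reproducible, so
    every method under comparison can be scored on identical folds.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(n_folds)]
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Deal indices round-robin so each fold gets a similar share of the class.
        for position, idx in enumerate(indices):
            folds[position % n_folds].append(idx)
    return [sorted(fold) for fold in folds]


if __name__ == "__main__":
    toy_labels = [0] * 10 + [1] * 5  # made-up case/control labels
    for fold in stratified_folds(toy_labels, n_folds=5, seed=42):
        print(fold)
```

With the folds computed once, each classifier is trained and tested on the same index lists, so small differences in scores are less likely to come from different splits.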

Introduction:

The human body is home to a highly complex and densely populated microbial ecosystem, the so-called “human microbiome”.

The advent of High Throughput Sequencing (HTS) has enabled the direct study of microbial environments, forming the rich field of metagenomics.

Key terms noted: Human Microbiome Project; two key questions about the microorganisms in a community: who is there, and what are they doing; “metagenome-wide association studies” or MGWAS.

These features can represent different aspects of the microbiome, for instance, taxonomic composition or functional profiles.

The methods for extracting the features from the raw sequence data and the methods for predicting the disease based on the features are both important to the performance of the model.

An important step forward in this effort was perhaps the first machine learning meta-analysis of publicly-available MGWAS data, which used a method called MetaPhlAn2. (Note: MetaPhlAn2 is a tool for profiling the composition of microbial communities (bacteria, archaea, eukaryotes, and viruses). It is very useful in metagenomic studies; its analysis yields species abundance information.)

We review recent methodological advancements in the prediction of disease from metagenomic data, with a particular focus on deep learning methods. These are discussed in Section 3.

In Section 4, we present the reported results on the data from the Pasolli et al. meta-analysis, as it serves as a common basis for comparison among recent methods.

In Section 5, we present an in-depth analysis of a type 2 diabetes dataset from the meta-analysis that has eluded improved results. We apply a number of machine learning methods, including an autoencoder-pretrained neural network that we developed, to the data, and also explore an alternate k-mer-based feature extraction method.

In Section 6, we offer perspectives gained from the review, including considerations for study design and interpretability, and possible avenues to improve results in the future.
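To make the idea of k-mer-based feature extraction concrete, here is a minimal sketch that turns a set of reads into a fixed-length vector of normalized k-mer frequencies. It is not the pipeline from the paper's repository; the function name `kmer_profile`, the choice of k, and the toy reads are assumptions for illustration only.

```python
from collections import Counter
from itertools import product


def kmer_profile(reads, k=3):
    """Return normalized k-mer frequencies over all reads.

    The result has one entry per possible k-mer over A/C/G/T, so every
    sample yields a feature vector of the same length (4 ** k).
    """
    alphabet = "ACGT"
    counts = Counter()
    for read in reads:
        seq = read.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            # Skip windows containing ambiguous bases such as N.
            if all(base in alphabet for base in kmer):
                counts[kmer] += 1

    vocabulary = ["".join(p) for p in product(alphabet, repeat=k)]
    total = sum(counts.values())
    if total == 0:
        return {kmer: 0.0 for kmer in vocabulary}
    return {kmer: counts[kmer] / total for kmer in vocabulary}


if __name__ == "__main__":
    toy_reads = ["ACGTACGT", "TTGACNAC"]  # made-up reads
    profile = kmer_profile(toy_reads, k=2)
    print(len(profile), round(sum(profile.values()), 6))
```

The resulting dictionary can be flattened into a feature vector per sample and passed to any classifier. Real pipelines usually also handle reverse complements and much larger k, which this sketch leaves out.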

Structure of the main text:

Excerpts from the main text:

Case 1: Machine Learning Meta-analysis of Large Metagenomic Datasets: Tools and Biological Insights

Google Scholar Citations: 111

They applied several well-known classical machine learning algorithms such as Support Vector Machines (SVMs) and Random Forests (RFs) to predict the patient’s disease status. These approaches performed well at predicting some patient diseases such as liver cirrhosis, colorectal cancer, and inflammatory bowel disease, but poorly on the others, such as type 2 diabetes and obesity.

Case 2: Using machine learning to identify major shifts in human gut microbiome protein family abundance in disease

Google Scholar Citations: 11

We use these 100 as a training set for a Random Forest classifier to determine ~5% of the KEGGs which are best at separating disease and healthy states.

Case 3: Disease Classification in Metagenomics with 2D Embeddings and Deep Learning

Google Scholar Citations: 4

Improve classification of various diseases with metagenomic data through the use of CNNs. Our results show significant improvements over the state-of-the-art algorithms (Random Forest (RF), Support Vector Machine (SVM)).

Case 4: Using convolutional neural networks to explore the microbiome (important: take another look at this paper's references)

Google Scholar Citations: 10

The proposed CNN framework is highly accurate, achieving 99.47% accuracy in an evaluation on a dataset of 1967 samples from three phenotypes.

Case 5: MetaNN: accurate classification of host phenotypes from metagenomic data using neural networks

Google Scholar Citations: 2

We propose MetaNN (i.e., classification of host phenotypes from Metagenomic data using Neural Networks), a neural network framework which utilizes a new data augmentation technique to mitigate the effects of data over-fitting.

Case 6: Fizzy: feature subset selection for metagenomics (undecided, but it looks somewhat interesting)

Google Scholar Citations: 20

We have developed a new Python command line tool, which is compatible with the widely adopted BIOM format, for microbial ecologists that implements information-theoretic subset selection methods for biological data formats.
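As a rough illustration of what information-theoretic subset selection means, the sketch below ranks discretized features by their mutual information with the class label and keeps the top few. This is only the simplest criterion of that family and is not Fizzy's command line interface or code; the function names and the toy presence/absence table are assumptions for this example.

```python
import math
from collections import Counter


def mutual_information(feature, labels):
    """Mutual information (in bits) between two equal-length discrete sequences."""
    n = len(labels)
    joint = Counter(zip(feature, labels))
    px = Counter(feature)
    py = Counter(labels)
    mi = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        # p(x, y) / (p(x) * p(y)) rewritten in terms of raw counts.
        mi += p_xy * math.log2(count * n / (px[x] * py[y]))
    return mi


def rank_features(table, labels, top_k):
    """Return the top_k feature names with the highest mutual information.

    `table` maps a feature name to its discretized values, one per sample.
    Ties are broken alphabetically so the result is deterministic.
    """
    scores = {name: mutual_information(values, labels)
              for name, values in table.items()}
    ranked = sorted(scores, key=lambda name: (-scores[name], name))
    return ranked[:top_k]


if __name__ == "__main__":
    toy_labels = [0, 0, 1, 1]  # made-up healthy/disease labels
    toy_table = {              # made-up presence/absence features
        "taxon_a": [0, 0, 1, 1],
        "taxon_b": [0, 1, 0, 1],
        "taxon_c": [1, 1, 1, 0],
    }
    print(rank_features(toy_table, toy_labels, top_k=2))
```

Ranking each feature independently ignores redundancy between features; criteria that also penalize redundancy are more involved and are not shown here.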

Case 7: Taxonomy-aware feature engineering for microbiome classification

Google Scholar Citations: 8

We here propose the first algorithm to exploit phylogenetic hierarchy (i.e. an all-encompassing taxonomy) in feature engineering for microbiota classification.
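The general idea of using the taxonomic hierarchy in feature engineering can be sketched by summing leaf-level abundances into every ancestor clade, so that a classifier sees features at several ranks at once. This is not the algorithm proposed in the cited paper; the function name, the `|`-separated lineage strings, and the toy abundances are assumptions for illustration, and the input is assumed to contain leaf clades only.

```python
from collections import defaultdict


def add_ancestor_features(leaf_abundance, separator="|"):
    """Sum leaf abundances into every ancestor clade along each lineage.

    `leaf_abundance` maps a full lineage string (leaf clades only) to an
    abundance. The result contains one feature per clade at every rank.
    """
    features = defaultdict(float)
    for lineage, abundance in leaf_abundance.items():
        ranks = lineage.split(separator)
        # Each prefix of the lineage names one clade (kingdom, phylum, ...).
        for depth in range(1, len(ranks) + 1):
            features[separator.join(ranks[:depth])] += abundance
    return dict(features)


if __name__ == "__main__":
    toy_sample = {  # made-up relative abundances
        "k__Bacteria|p__Firmicutes|g__A": 0.3,
        "k__Bacteria|p__Firmicutes|g__B": 0.2,
        "k__Bacteria|p__Bacteroidetes|g__C": 0.5,
    }
    for clade, value in sorted(add_ancestor_features(toy_sample).items()):
        print(clade, round(value, 3))
```

If the input already included rows for higher ranks, those abundances would be counted twice, so such rows would need to be filtered out first.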

Case 8: Environment dominates over host genetics in shaping human gut microbiota (not related to machine learning, but worth a look)

Google Scholar Citations: 386

Case 9: Data and Statistical Methods To Analyze the Human Microbiome

Google Scholar Citations: 2
