Statistics and Its Interface

Volume 8 (2015)

Number 4

Detecting bacterial genomes in a metagenomic sample using NGS reads

Pages: 477 – 494



Camilo Valdes (Center for Computational Sciences, University of Miami, Florida, U.S.A.)

Meghan Brennan (Department of Anesthesiology, University of Florida, Gainesville, Florida, U.S.A.)

Bertrand Clarke (Department of Statistics, University of Nebraska, Lincoln, Neb., U.S.A.)

Jennifer Clarke (Department of Food Science and Technology, University of Nebraska, Lincoln, Neb., U.S.A.)


We use a nucleotide flipping technique on whole-genome next-generation sequencing (NGS) data to test for the presence of various bacterial strains in a single metagenomic sample. Our technique is novel in that we induce artificial point mutations at the nucleotide level to define a test statistic for each genome on a given reference list. After finding a suitable nucleotide flipping rate, we use a variant of the Westfall-Young procedure to correct for multiple comparisons. When we align reads to reference genomes, we permit fractional reads, i.e., we weight the contribution of each read by one over the number of genomes to which it aligns. In a large-scale simulation, we characterize our method’s performance on “clean” data with respect to accuracy, genome lengths, and genome abundances. We then apply our technique to real data from the Human Microbiome Project (HMP) and compare our results, based on adjusted $p$-values, with the HMP findings based on abundance, as assessed by coverage. The results from the two methods overlap substantially; discrepancies can be explained by the inherent variability of the respective processing pipelines and data.
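
The fractional-read weighting described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `fractional_counts`, the input layout (a mapping from read identifiers to the genomes each read aligns to), and the toy read and genome names are all assumptions made for illustration. The only rule taken from the abstract is that each read contributes one over the number of genomes to which it aligns.

```python
from collections import defaultdict


def fractional_counts(alignments):
    """Return a fractional read count per genome.

    `alignments` (hypothetical layout) maps a read id to the genome ids
    that the read aligns to. Each read adds 1 / (number of distinct
    genomes it hits) to every genome it hits.
    """
    counts = defaultdict(float)
    for read_id, genomes in alignments.items():
        hits = set(genomes)  # count each genome at most once per read
        if not hits:
            continue  # an unaligned read contributes nothing
        weight = 1.0 / len(hits)
        for genome in hits:
            counts[genome] += weight
    return dict(counts)


# Toy data: read and genome names are invented for illustration only.
example = {
    "read1": ["genomeA"],
    "read2": ["genomeA", "genomeB"],
    "read3": ["genomeA", "genomeB", "genomeC", "genomeC"],
    "read4": [],
}
counts = fractional_counts(example)
for genome in sorted(counts):
    print(genome, round(counts[genome], 4))
```

Under this weighting every aligned read contributes a total weight of one, so the fractional counts sum to the number of aligned reads (three in the toy data), however many genomes share each read.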


metagenomics, next-generation sequencing, human microbiome project, multiple comparisons, nucleotide flipping, artificial point mutations

2010 Mathematics Subject Classification

Primary 62G10, 62P10. Secondary 62-07.

Published 19 October 2015