Statistics and Its Interface

Volume 8 (2015)

Number 2

Special Issue on Modern Bayesian Statistics (Part II)

Guest Editor: Ming-Hui Chen (University of Connecticut)

A Bayesian approach to identify genes and gene-level SNP aggregates in a genetic analysis of cancer data

Pages: 137 – 151



Francesco C. Stingo (Department of Biostatistics, MD Anderson Cancer Center, Houston, Texas, U.S.A.)

Michael D. Swartz (Department of Biostatistics, UT School of Public Health, University of Texas, Houston, Texas, U.S.A.)

Marina Vannucci (Department of Statistics, Rice University, Houston, Texas, U.S.A.)


Complex diseases, such as cancer, arise from complex etiologies consisting of multiple single-nucleotide polymorphisms (SNPs), each contributing a small amount to the overall risk of disease. Thus, many researchers have gone beyond single-SNP analysis methods, focusing instead on groups of SNPs, for example by analyzing haplotypes. More recently, pathway-based methods have been proposed that use prior biological knowledge on gene function to achieve a more powerful analysis of genome-wide association study (GWAS) data. In this paper we propose a novel Bayesian modeling framework to identify molecular biomarkers for disease prediction. Our method combines pathway-based approaches with multiple-SNP analyses of a specified region of interest. The model’s development is motivated by SNP data from a lung cancer study. In our approach we define gene-level scores based on SNP allele frequencies and use a linear modeling setting to study the association of the scores with the observed phenotype. The basic idea behind the definition of gene-level scores is to weight the SNPs within the gene according to their rarity, based on genotype frequencies expected under the Hardy-Weinberg equilibrium law. This results in scores that give more importance to unusually low frequencies, i.e., to SNPs that might indicate peculiar genetic differences between subjects belonging to different groups. An additional feature of our approach is that we incorporate information on SNP-to-SNP associations into the model. In particular, we use network priors that model the linkage disequilibrium between SNPs. For posterior inference, we design a stochastic search method that identifies significant biomarkers (genes and SNPs) for disease prediction. We assess performance on simulated data and compare results to existing approaches. We then show the ability of the proposed methodology to detect relevant genes and associated SNPs in a lung cancer dataset.
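
The rarity-weighting idea described in the abstract can be illustrated with a minimal sketch. The abstract does not give the paper's actual score formula, so the weighting used here (the negative log of the genotype frequency expected under Hardy-Weinberg equilibrium, summed over the SNPs in a gene) is an assumption chosen only for illustration, and the function names `hwe_genotype_freqs` and `gene_score` are hypothetical.

```python
import math


def hwe_genotype_freqs(p):
    """Expected genotype frequencies under Hardy-Weinberg equilibrium.

    p is the frequency of allele A (0 < p < 1); the frequency of allele a
    is q = 1 - p. Returns the expected frequencies of AA, Aa and aa.
    """
    q = 1.0 - p
    return {"AA": p * p, "Aa": 2.0 * p * q, "aa": q * q}


def gene_score(genotypes, allele_freqs):
    """Illustrative gene-level score for one subject (an assumption, not
    the paper's definition): each SNP contributes -log of the expected
    frequency of the subject's genotype, so rarer genotypes weigh more.
    """
    score = 0.0
    for genotype, p in zip(genotypes, allele_freqs):
        expected = hwe_genotype_freqs(p)[genotype]
        score += -math.log(expected)
    return score


# One gene with three SNPs; allele-A frequencies are made-up example values.
allele_freqs = [0.9, 0.8, 0.95]
common_subject = ["AA", "AA", "AA"]
rare_subject = ["aa", "AA", "Aa"]
print(gene_score(common_subject, allele_freqs))
print(gene_score(rare_subject, allele_freqs))
```

In this sketch a subject carrying genotypes that are rare under Hardy-Weinberg equilibrium receives a larger score than a subject carrying only common genotypes, which is the qualitative behavior the abstract describes.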


Bayesian variable selection, Hardy-Weinberg equilibrium law, linear models, linkage disequilibrium, Markov random field, SNP data
