Statistics and Its Interface
Volume 8 (2015)
Estimation of gene co-expression from RNA-Seq count data
Pages: 507 – 515
Gene coexpression networks are widely used to understand gene regulation and to infer gene function. The most straightforward way to construct a coexpression network is to connect gene pairs whose expression levels are highly correlated across different experimental conditions. This correlation is usually measured by Pearson’s correlation coefficient, which, however, does not directly apply to data generated by the RNA-Seq technique. RNA-Seq data are non-negative integer counts, which cannot be properly modeled by a Gaussian distribution; moreover, the counts have mean values proportional to the sequencing depths, so there are no identically distributed “replicates.” Directly normalizing the counts by the corresponding sequencing depths and then applying Pearson’s correlation coefficient can be of low efficiency. We propose a generalization of Pearson’s correlation coefficient, called iCC, that can be applied directly to RNA-Seq data. On simulated data, iCC shows higher efficiency in distinguishing coexpressed gene pairs from unrelated gene pairs. On a real dataset, iCC generates a coexpression network that appears to agree more closely with experimentally validated networks than the networks generated by other methods. More generally, iCC can be used to calculate the correlation coefficient for any two series of random variables.
Keywords: Pearson’s correlation coefficient, RNA-Seq, coexpression network, count data, robust estimate
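
The baseline that the abstract contrasts against (dividing each count by its sample's sequencing depth and then computing Pearson's correlation coefficient) can be sketched as follows. This is only an illustration of that naive approach, not of iCC, whose definition the abstract does not give. The function name `depth_normalized_pearson` and the toy counts and depths are hypothetical, chosen purely for the example.

```python
import numpy as np


def depth_normalized_pearson(counts_x, counts_y, depths):
    """Pearson correlation of two genes after dividing counts by sequencing depth.

    counts_x, counts_y: read counts for two genes, one entry per sample.
    depths: sequencing depth of each sample (same length, all positive).
    """
    counts_x = np.asarray(counts_x, dtype=float)
    counts_y = np.asarray(counts_y, dtype=float)
    depths = np.asarray(depths, dtype=float)

    if not (counts_x.shape == counts_y.shape == depths.shape):
        raise ValueError("counts_x, counts_y and depths must have the same shape")
    if np.any(depths <= 0):
        raise ValueError("sequencing depths must be positive")

    # Convert raw counts to per-read rates so samples are on a comparable scale.
    rate_x = counts_x / depths
    rate_y = counts_y / depths

    # np.corrcoef returns the 2x2 correlation matrix; the off-diagonal entry
    # is the Pearson correlation between the two normalized series.
    return float(np.corrcoef(rate_x, rate_y)[0, 1])


# Hypothetical toy data: four samples with very different sequencing depths.
depths = np.array([1.0e6, 2.0e6, 4.0e6, 8.0e6])
gene_a = np.array([10, 40, 120, 320])   # normalized rate rises across samples
gene_b = np.array([20, 80, 240, 640])   # exactly twice gene_a in every sample
gene_c = np.array([40, 60, 80, 80])     # normalized rate falls across samples

r_ab = depth_normalized_pearson(gene_a, gene_b, depths)
r_ac = depth_normalized_pearson(gene_a, gene_c, depths)
```

In this toy case the raw counts of `gene_a` and `gene_c` both grow with depth, yet their normalized rates move in opposite directions, which is why some depth adjustment is needed before correlating. The sketch treats every normalized value as equally reliable regardless of depth; this is one plausible reading of why the abstract calls the approach inefficient, though the abstract itself does not spell out the reason.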