Advances and Applications in Statistics
Volume 37, Issue 2, Pages 95 - 122
(December 2013)
EFFICIENT AND SCALABLE BAYESIAN STATISTICAL METHOD FOR IDENTIFYING CAUSAL RELATIONSHIPS FROM INTERVENTION STUDIES
Changwon Yoo and Erik M. Brilz
Abstract: To understand the physiology of genes from cells involved in a complex disease, it is necessary to learn the causal relationships between those genes. To this end, it is ideal to compare genetic experiments with complete interventions, e.g., gene knockouts, to those with no interventions. Although conducting genetic experiments with complete interventions on animal cells, e.g., mouse cells, is currently infeasible, scientists will need established statistical methods to detect causal relationships in these cases when and if the technology becomes available. The results can then be verified in wet-lab experiments.
Many current causal discovery algorithms are not guaranteed to visit every promising causal relationship during their search. To examine these additional relationships, in this article we introduce a novel extension, the Equivalence checking Local Implicit latent variable scoring method with mixture of observational and intervention data (EquLIMmix), to an existing causal Bayesian network discovery algorithm, the Local Implicit latent variable scoring Method (LIM). To avoid the problem of either missing or incorrectly predicting causal relationships, for every structure visited during LIM’s structure search, EquLIMmix also visits and scores the same structure with all directed arcs reversed. We hypothesize that the new algorithm (EquLIMmix) will improve over LIM’s ability to detect causal relationships from datasets mixing complete interventions with observational data.
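The arc-reversal step described above can be illustrated with a minimal sketch. It assumes a structure is represented as a set of (parent, child) arcs and that a scoring function is supplied by the caller; the names `reverse_all_arcs`, `score_with_reversal`, and the `score` parameter are hypothetical and do not reproduce the LIM scoring function itself.

```python
from typing import Callable, Dict, FrozenSet, Tuple

# Assumed representation: an arc is (parent, child); a structure is a set of arcs.
Arc = Tuple[str, str]
Structure = FrozenSet[Arc]


def reverse_all_arcs(structure: Structure) -> Structure:
    """Return the same structure with every directed arc reversed."""
    return frozenset((child, parent) for parent, child in structure)


def score_with_reversal(
    structure: Structure,
    score: Callable[[Structure], float],  # hypothetical stand-in for a structure score
) -> Dict[Structure, float]:
    """Score a visited structure and its fully reversed counterpart."""
    reversed_structure = reverse_all_arcs(structure)
    return {
        structure: score(structure),
        reversed_structure: score(reversed_structure),
    }
```

Under these assumptions, each structure reached by the search contributes two scored candidates instead of one, so the reversed direction of every arc is always considered.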
We use LIM and EquLIMmix to analyze simulated datasets mixing a small number of complete interventions per gene with observational data. To test both algorithms’ abilities to detect causal relationships from realistic data, we generate the datasets from a gene regulation pathway model of malignant mesothelioma formation proposed by an expert. Using the metrics of Area Under the Receiver Operating Characteristic curve (AUROC), Positive Predictive Value (PPV), Negative Predictive Value (NPV), Accuracy, and Shannon Entropy, we show that EquLIMmix exhibits clear advantages over LIM on smaller datasets and generally better performance on larger datasets. EquLIMmix therefore improves over LIM’s ability to detect causal relationships in gene networks from small (< 50) mixtures of observational and intervention data.
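For readers unfamiliar with the evaluation metrics named above, the following sketch shows their standard textbook definitions. It assumes each candidate causal relationship receives a predicted probability and a true/false label; the function names are hypothetical, and applying Shannon entropy to a single binary probability is an assumption about how the metric might be used, not a statement of the authors' procedure.

```python
import math
from typing import List, Sequence, Tuple


def confusion_counts(
    predicted: Sequence[bool], actual: Sequence[bool]
) -> Tuple[int, int, int, int]:
    """Return (tp, fp, tn, fn) for paired predictions and true labels."""
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    tn = sum(1 for p, a in zip(predicted, actual) if not p and not a)
    fn = sum(1 for p, a in zip(predicted, actual) if not p and a)
    return tp, fp, tn, fn


def ppv(tp: int, fp: int) -> float:
    """Positive predictive value: fraction of predicted positives that are true."""
    return tp / (tp + fp) if (tp + fp) else float("nan")


def npv(tn: int, fn: int) -> float:
    """Negative predictive value: fraction of predicted negatives that are true."""
    return tn / (tn + fn) if (tn + fn) else float("nan")


def accuracy(tp: int, fp: int, tn: int, fn: int) -> float:
    """Fraction of all predictions that are correct."""
    total = tp + fp + tn + fn
    return (tp + tn) / total if total else float("nan")


def binary_entropy(p: float) -> float:
    """Shannon entropy (in bits) of a binary event with probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


def auroc(scores: Sequence[float], labels: Sequence[bool]) -> float:
    """AUROC as the fraction of positive/negative pairs ranked correctly (ties count 0.5)."""
    pos: List[float] = [s for s, y in zip(scores, labels) if y]
    neg: List[float] = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        return float("nan")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

The pairwise AUROC formulation is quadratic in the number of relationships, which is acceptable for small illustrative inputs; a rank-based computation would be preferable for large ones.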
Keywords and phrases: statistical genomics, Bayesian statistics.