Exploiting the latent structure of gene expression data
Genome-scale molecular datasets are often highly structured, with many correlated observations. This structure can be traced to the underlying data-generating process: in gene expression assays, groups of genes are co-regulated through shared transcription factors and signaling pathways. In the first half of the talk, we will present a new constrained matrix decomposition approach that directly aligns a lower-dimensional representation with known biological pathways. Our method provides state-of-the-art accuracy in reconstructing known upstream variables through a biologically interpretable decomposition. In the second half of the talk, we will show how insight into the structure of gene expression can be used to derive new data representations that are optimized for specific inference tasks. We present a new method, DataRemix, a simple three-parameter transformation that can dramatically improve the utility of gene expression datasets. We show that DataRemix can outperform methods that make explicit use of dataset-specific technical factors, and can further improve datasets that have already been extensively normalized via an optimized, parameter-rich model.