Study of integrated heterogeneous data reveals prognostic power of gene expression for breast cancer survival
Studies show that thousands of genes are associated with prognosis of breast cancer. Towards utilizing available genetic data, efforts have been made to predict outcomes using gene expression data, and a number of commercial products have been developed. These products have the following shortcomings: 1) They use the Cox model for prediction. However, the RSF model has been shown to significantly outperform the Cox model. 2) Testing was not done to see if a complete set of clinical predictors could predict as well as the gene expression signatures.
We address these shortcomings. The METABRIC data set concerns 1981 breast cancer tumors. Features include 21 clinical features, expression levels for 16,384 genes, and survival. We compare the survival prediction performance of the Cox model and the RSF model using the clinical data and the gene expression data to their performance using only the clinical data. We obtain significantly better results when we used both clinical data and gene expression data for 5 year, 10 year, and 15 year survival prediction. When we replace the gene expression data by PAM50 subtype, our results are significant only for 5 year and 15 year prediction. We obtain significantly better results using the RSF model over the Cox model. Finally, our results indicate that gene expression data alone may predict long-term survival.
Our results indicate that we can obtain improved survival prediction using clinical data and gene expression data compared to prediction using only clinical data. We further conclude that we can obtain improved survival prediction using the RSF model instead of the Cox model. These results are significant because by incorporating more gene expression data with clinical features and using the RSF model, we could develop decision support systems that better utilize heterogeneous information to improve outcome prediction and decision making.