Certificate Presentation - Using Rule Based Models to Predict Disease Status and Identify Novel Risk Combinations in Chronic Pancreatitis
Abstract: Chronic pancreatitis (CP) is a complex disease with multiple overlapping etiologies, an unpredictable prognosis, and few known risk factors. New genotyping technologies have made a vast amount of data available for analysis. However, recent evidence suggests that genetic risk factors for CP are specific to an individual’s genetic background: multiple normally benign genetic variations become disease-causing when present simultaneously in an individual, which makes novel gene discovery difficult. This project tests the viability of machine learning in a large CP cohort with two goals: first, to predict an individual’s risk for CP, and second, to identify novel combinations of risk factors in the development and progression of CP. We performed two analyses using a Bayesian rule learner (BRL) algorithm with identical parameters but different input data, drawn from initial candidate gene analyses and phenotyping from the North American Pancreatitis Study (NAPS2) group. The first analysis compared all pancreatitis cases (n=911) to unrelated healthy controls (n=459) using 124 data elements, returning 531 rule sets of 39 selected data elements with 95.7% sensitivity and 37% specificity. The second analysis compared heavy alcohol drinking adults with CP (n=193), recurrent acute pancreatitis (RAP, n=92), and no pancreatic disease (n=110) using 152 data elements, returning 3326 rule sets of 29 selected data elements with 79.8% sensitivity and 73% specificity. Although not yet useful in a clinical setting, the balanced accuracies of the rule sets (66.4% general, 76.4% alcohol) were higher than those of standard genetic screening (53.3% general, 52.5% alcohol). In addition to improved preliminary prediction statistics, the generated rule sets included physiologically relevant combinations of genetic and environmental factors.
The results confirm previous reports that heavy drinking and smoking are linked in CP development: among heavy alcohol users, current smoking status was a strong predictive element for CP or RAP, whereas non- or ex-smoking status was associated with healthy heavy-drinking controls. These associations were not observed in the general cohorts. A novel combination of data elements included the known risk factor SPINK1 N34S: heterozygous status appeared as a data element in conjunction with five other genes, all involved in mediating the inflammatory process. This report is a proof of principle for the utility of rule learner algorithms in predicting CP and identifying novel genetic relationships.
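
The balanced accuracies quoted for the rule sets can be checked against the reported sensitivity and specificity. The sketch below assumes the standard definition of balanced accuracy, the arithmetic mean of sensitivity and specificity. The abstract does not state the formula, but the reported figures are consistent with it. The function name `balanced_accuracy` is illustrative and not taken from the study.

```python
def balanced_accuracy(sensitivity: float, specificity: float) -> float:
    """Arithmetic mean of sensitivity and specificity (both in percent).

    Assumption: this is the standard definition; the abstract does not
    spell out how its balanced accuracies were computed.
    """
    return (sensitivity + specificity) / 2.0


# Sensitivity and specificity as reported for the two BRL analyses.
general = balanced_accuracy(95.7, 37.0)  # all cases vs. healthy controls
alcohol = balanced_accuracy(79.8, 73.0)  # heavy-drinking cohort

print(f"general: {general:.2f}%")
print(f"alcohol: {alcohol:.2f}%")
```

Under this definition the two analyses give about 66.35% and 76.40%, in line with the 66.4% and 76.4% reported for the rule sets. Because it weights both error types equally, balanced accuracy does not reward the first analysis for its high sensitivity alone, which is useful here since cases outnumber controls in both cohorts.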