Learning of classification models often relies on data that are labeled/annotated by a human expert. In general, more expertise and time the labeling process requires, more costly it is to label the data. In addition, there may be constraints on how many data instances one expert can feasibly label. Our goal is to find ways of reducing the number of labels and at the same time preserve or improve the quality of the models based on such labels. In this talk, I present two solutions we have developed to address the above problems.
Biomarkers are objectively measured characteristics which are commonly used across a range of scientific disciplines for diagnosis, prognosis, and prediction, and potentially as surrogate measures for the actual clinical outcome. The utility of biomarker studies, however, is typically limited to evaluating associations rather than causal relationships.
Lung cancer is the leading cause of cancer death in the United States. Cigarette smoking causes 85% of lung cancer deaths, however only about 15% of smokers will develop lung cancer in their lifetime. Genetic variations can modify the effect of the exposure to cigarette smoke. Our lab studies genetic variations in enzymes that detoxify potent carcinogens from cigarette smoke (including nitrosamines such as NNK and NNAL). We have identified a whole gene deletion polymorphism in a carcinogen metabolizing gene (UGT2B17) that is associated with decreased carcinogen metab
Learning health care systems (LHCS) propose to advance health care by taking advantage of increasing efficiencies in data collection and analysis and information dissemination. A major premise of LHCS is broad and continuous access to patient data, extracted at the point of care, and stored and used primarily in de-identified form.
Now more than ever, electronic health records (EHRs) are generated in large quantities and diverse content. This explosion of information has naturally enabled powerful data analyses to potentially improve healthcare.
The main approaches for learning Bayesian networks can be classified as constraint-based, score-based or hybrid methods. Although high-dimensional consistency results are available for the constraint-based PC algorithm, such results have been lacking for score-based and hybrid methods, and most hybrid methods are not even proved to be consistent in the classical setting where the number of variables remains fixed.