TSRC, Tob. Sci. Res. Conf., 2019, 73, abstr. 048 (also presented at CORESTA SSPT2019)

A blood-based smoking-related gene expression signature using a machine learning approach

LIU GANG M.; PRASAD G.L.
RAI Services Company, Winston-Salem, NC, USA

Smoking is a leading risk factor for multiple forms of cancer, chronic obstructive pulmonary disease, and cardiovascular disease. At present, there is limited understanding of how changes in gene expression profiles in blood or other tissues can be used to predict smoking status. In this study, we investigated whether a machine learning approach could provide an unbiased method to predict smoking status using microarray expression profiles obtained from blood. Among the multiple feature selection and classification methods tested, the best predictive model for smoking status was produced by a Support Vector Machine (SVM) combined with Recursive Feature Elimination (RFE). The 16-gene signature from our machine learning model included not only three previously reported genes (LRRN3, SASH1, and GPR15), but also several newly identified genes, including GZMM, which has been reported to be associated with lung adenocarcinoma. In addition, this gene signature has been validated in seven independent, publicly available gene expression datasets. In summary, we show that machine learning analysis of expression profiling datasets from blood is useful in ascertaining smoking status and in developing novel Biomarkers of Potential Harm.
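
The SVM-RFE combination named above can be illustrated with a minimal sketch in Python using scikit-learn. This is not the authors' pipeline: the abstract does not specify the kernel, preprocessing, elimination step size, or cross-validation scheme, so the linear kernel, standardization, 10% elimination step, and 5-fold cross-validation below are all assumptions. The data are synthetic stand-ins for a samples-by-genes expression matrix; only the signature size of 16 is taken from the abstract.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for a samples-by-genes expression matrix.
# These are NOT the study's data; y plays the role of smoking status (0/1).
X, y = make_classification(
    n_samples=120,
    n_features=200,
    n_informative=16,
    n_redundant=0,
    random_state=0,
)

# RFE repeatedly fits the estimator and discards the lowest-ranked features.
# A linear SVM is assumed here because it exposes coefficients for ranking.
selector = RFE(
    estimator=SVC(kernel="linear", C=1.0),
    n_features_to_select=16,  # signature size reported in the abstract
    step=0.1,                 # assumption: drop 10% of features per round
)

# Keeping selection inside the pipeline means it is refit within each
# cross-validation fold, so held-out samples never influence which
# features are chosen.
model = make_pipeline(StandardScaler(), selector, SVC(kernel="linear", C=1.0))

scores = cross_val_score(model, X, y, cv=5)
print("Mean cross-validated accuracy:", scores.mean())

# Fit on all samples to read off the indices of the retained features.
model.fit(X, y)
selected = np.flatnonzero(model.named_steps["rfe"].support_)
print("Selected feature indices:", selected)
```

With real data, the columns of X would be probe or gene identifiers, and the selected indices would map back to gene names. Validation on independent datasets, as described above, would mean applying the fitted model to expression matrices that played no part in training.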