Holly Cui
by Holly Cui

Tags

  • Data Wrangling
  • Machine Learning
  • Statistical Analysis
  • Statistical Computing

This project was completed with Yihan Shi, Medy Mu, and Edna Zhang for the Fall 2023 section of STA 325: Machine Learning & Data Mining at Duke University. The updated version was reviewed and revised by me.

About

Medical and healthcare services are one of the inelastic demand in the modern society. However, personal medical costs have been rising rapidly and becoming a major public health issue. In 2020, health expenditure constitutes of 19.7% of US GDP. Understanding the factors affecting health outcomes and costs as well as predicting future expenditure is essential to many parties involved in the ecosystem of healthcare. These knowledge can aid risk management and Value at Risk (VaR) based pricing strategies for medical insurance companies, drive more informed decision-making process for healthcare policy makers, and educate individual about healthy lifestyle choices. According to various research studies, factors that influence personal medical care costs include obesity (Cawley et al., 2021), ageing (Alemayehu et al., 2004), smoking (Hall et al., 2016), etc. Specifially on smoking habit, in the issue of PLOS Medicine, Lightwood and Glantz have estimated how much, on average, a 1% reduction in smoking prevalence in a US state was associated with reduced health costs in that state a year later (Lightwood, 2016). Moreover, research about health care costs shows wide variations in spending across the U.S., with spending more than three times higher in some regions than others (KFF Family Foundation, 2009). Patients in higher cost areas were not necessarily receiving better care; rather, the cost variations were explained by the availability and volume of services used by similar patients. More interesting factors of influence are found in recent year, such as single women spend nearly twice as much of their income on health insurance as single men (Festa, 2022), suggesting gender, potential marital status, and number of dependents as important considerations when deciding health insurance cost.

In this report, we seek to explore the gathered influencing factors, such as demographics, personal habits, and household situations, and their potential interactions to predict the individual medical cost billed by health insurance annually. Conducting various types of regression fit and model selection, we attempt to interpret the relative influence of each factors based on the model output, controlling for other factors. Using more complex models like generalized additive model (GAM), we can extrapolate new data beyond the scope of the known observations and answer question like “what is the likely range of the medical costs of 60-year-old females living in southwest US who smokes with high BMI?” These results will help insurance companies classify their clients and pricing and policy-makers to attend to and potentially intervene certain high-risk populations.

Data

We used a public medical insurance dataset that was published in the book Machine Learning with R by Brett Lantz in 2013 and later updated on Kaggle in 2017. The dataset consists of 1338 observations and covers 7 variables, including the amount of medical cost billed by health insurance and potentially important factors related to the charges. There are no missing data in any of these columns.