Machine Learning Project

Health Insurance Premium Predictor

Machine learning–based premium estimation with real-world risk segmentation

Domain / Function

Healthcare Analytics & Predictive Modeling

Project Overview

Developed a machine learning system to accurately predict health insurance premiums based on customer demographics, lifestyle factors, and medical history. The solution addresses non-linear relationships that traditional pricing methods fail to capture.

The project follows an end-to-end ML pipeline including data cleaning, feature engineering, model training, error analysis, and deployment via a Streamlit web application for real-time predictions.

Key Features

  • High-accuracy premium prediction (R² ≈ 0.98)
  • Advanced feature engineering using normalized risk scores
  • Age-based model segmentation for improved accuracy
  • Error analysis with residual and percentage deviation tracking
  • Model retraining using additional genetic risk features
  • Interactive Streamlit application for real-time use
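The normalized risk score mentioned above could be built roughly as follows. This is a minimal sketch, not the project's actual scoring scheme: the condition names, the weights in RISK_WEIGHTS, the "&"-separated history format, and the helper names are all assumptions made for illustration. It uses plain Python so it runs without dependencies; in the project itself the same steps would more likely be done with Pandas.

```python
# Hypothetical per-condition weights; the source does not list the real values.
RISK_WEIGHTS = {
    "none": 0,
    "thyroid": 5,
    "high blood pressure": 6,
    "diabetes": 6,
    "heart disease": 8,
}


def raw_risk_score(medical_history: str) -> int:
    """Sum the weights of every condition in an '&'-separated history string.

    Unknown conditions contribute 0 (an assumption of this sketch).
    """
    conditions = [c.strip().lower() for c in medical_history.split("&")]
    return sum(RISK_WEIGHTS.get(c, 0) for c in conditions)


def normalize_scores(scores):
    """Min-max scale scores to [0, 1]; return all zeros if every score is equal."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


histories = ["none", "diabetes", "diabetes & heart disease"]
raw = [raw_risk_score(h) for h in histories]
normalized = normalize_scores(raw)
```

Scaling to the 0 to 1 range keeps the score comparable with other scaled inputs, which matters more for the linear models than for tree-based ones such as XGBoost.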

Project Details

Multiple models, including Linear Regression, Ridge Regression, and XGBoost, were trained and compared. Initial error analysis revealed high deviation in younger age groups, which led to segmentation-based retraining.
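The error analysis described above can be sketched as a per-segment breakdown of percentage deviation. The 10% threshold for an "extreme" error, the age cutoff of 25, the segment labels, and the toy numbers are all assumptions for illustration; none of them come from the project's data.

```python
from collections import defaultdict

EXTREME_PCT = 10.0  # hypothetical threshold for an "extreme" error, in percent
AGE_CUTOFF = 25     # hypothetical boundary between the two age segments


def percentage_errors(actual, predicted):
    """Signed percentage deviation of each prediction from the actual premium.

    Assumes every actual premium is non-zero.
    """
    return [(p - a) / a * 100.0 for a, p in zip(actual, predicted)]


def extreme_share_by_segment(ages, actual, predicted):
    """Percentage of predictions per age segment whose error exceeds the threshold."""
    counts = defaultdict(lambda: [0, 0])  # segment -> [extreme count, total count]
    for age, pct in zip(ages, percentage_errors(actual, predicted)):
        segment = "young" if age <= AGE_CUTOFF else "rest"
        counts[segment][1] += 1
        if abs(pct) > EXTREME_PCT:
            counts[segment][0] += 1
    return {seg: extreme / total * 100.0 for seg, (extreme, total) in counts.items()}


# Toy data, made up for this example.
ages = [20, 22, 24, 40, 55, 60]
actual = [5000, 6000, 5500, 15000, 20000, 25000]
predicted = [6500, 6100, 4000, 15200, 19500, 25500]

shares = extreme_share_by_segment(ages, actual, predicted)
```

A breakdown like this is what would show one segment carrying most of the large errors, which is the kind of signal that motivates training a separate model for that segment.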

After a genetic risk feature was introduced, extreme prediction errors fell from 73% to 2%, yielding a reliable, explainable, production-ready solution.
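With segmented models, each prediction request has to be routed to the model trained for that customer's age group. The sketch below shows one possible way to do that. The cutoff of 25, the feature names, the choice to feed genetic risk only to the younger segment, and the two stub models with made-up coefficients are assumptions; in practice the stubs would be replaced by the trained estimators.

```python
from typing import Callable, Dict

Features = Dict[str, float]
Model = Callable[[Features], float]

AGE_CUTOFF = 25  # hypothetical boundary between the two age segments


def make_router(young_model: Model, rest_model: Model, cutoff: int = AGE_CUTOFF) -> Model:
    """Return a predict function that picks a model based on the customer's age."""
    def predict(features: Features) -> float:
        model = young_model if features["age"] <= cutoff else rest_model
        return model(features)
    return predict


# Stand-in models with made-up coefficients, used only to make the sketch runnable.
def young_stub(features: Features) -> float:
    return 3000.0 + 500.0 * features.get("genetic_risk", 0.0)


def rest_stub(features: Features) -> float:
    return 8000.0 + 200.0 * features["age"]


predict_premium = make_router(young_stub, rest_stub)
young_quote = predict_premium({"age": 22, "genetic_risk": 3})
older_quote = predict_premium({"age": 40})
```

Keeping the routing in one function means the web application only calls predict_premium and does not need to know how many segment models exist.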

Technologies Used

Python, Pandas, Scikit-learn, XGBoost, Streamlit, Matplotlib