CardioPredict AI

Explainable Heart Disease Prediction System


Chuka University · CDAM Research Team

About This System

CardioPredict AI is a clinical decision-support platform for early detection of coronary artery disease using the Cleveland Heart Disease dataset. Three machine learning models — Logistic Regression, K-Nearest Neighbours, and Support Vector Machine — are compared on predictive accuracy, discrimination (AUC), and F1 score.

Explainable AI (SHAP) methods provide patient-level transparency, showing which clinical features drive each prediction and supporting, not replacing, clinician judgement.

  • 13 clinical predictors including ECG, cholesterol, and chest pain type
  • 3 ML algorithms with 5-fold cross-validated hyperparameter tuning
  • SHAP explainability for global and patient-level feature attribution
  • Risk stratification into four clinical action tiers

Risk Category Framework

Low Risk: 0.00 - 0.25
Mild Risk: 0.26 - 0.50
Moderate Risk: 0.51 - 0.75
High Risk: 0.76 - 1.00
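
A minimal sketch of how a predicted probability could be mapped to these four tiers. It is written in Python for illustration only and is not the system's own code. The function name `risk_tier` is hypothetical, and the sketch assumes each tier's upper bound is inclusive, so a value that falls between two printed ranges (for example 0.255) lands in the higher tier.

```python
def risk_tier(probability: float) -> str:
    """Map a predicted probability to one of the four risk tiers.

    Assumption: upper bounds are inclusive (<= 0.25 is Low Risk, and so on).
    """
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must lie in [0, 1]")
    if probability <= 0.25:
        return "Low Risk"
    if probability <= 0.50:
        return "Mild Risk"
    if probability <= 0.75:
        return "Moderate Risk"
    return "High Risk"
```

Calling `risk_tier(0.62)` under these assumptions returns "Moderate Risk".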

Clinical Feature Dictionary

Model Performance Leaderboard

Models ranked by ROC-AUC on held-out test set.

Patient Input


👤 Demographics
🫀 Symptoms
Vitals & Labs
📈 ECG & Stress Test
🔬 Angiography

Risk Probability Gauge

Clinical Recommendations

SHAP Feature Attribution (Patient-Level)

🔴 Red bars increase risk | 🟢 Green bars decrease risk

Model Metrics (Test Set)

Comparative Metrics Chart

Radar / Spider Chart

Confusion Matrices

ROC Curves — All Models

Model Ranking by AUC

Interpreting ROC Curves

What is a ROC Curve? A Receiver Operating Characteristic (ROC) curve plots True Positive Rate (Sensitivity) vs. False Positive Rate (1-Specificity) at all classification thresholds.

AUC (Area Under the Curve) measures overall discriminative ability: AUC = 1.0 indicates perfect separation of cases from non-cases, while AUC = 0.5 is no better than random guessing.

Clinical Relevance: High sensitivity ensures fewer missed cases (false negatives), while high specificity reduces unnecessary referrals (false positives).
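
The definitions above can be sketched in a few lines. This is an illustrative Python example, not the pROC-based code the system uses; `roc_points` and `auc` are hypothetical helper names. It sweeps every distinct score as a threshold, records (False Positive Rate, True Positive Rate) pairs, and integrates them with the trapezoidal rule. Labels are assumed to be 1 for disease and 0 for no disease, with at least one of each.

```python
def roc_points(labels, scores):
    """Return (FPR, TPR) pairs, one per distinct threshold, starting at (0, 0)."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    # Lowering the threshold step by step classifies more patients as positive.
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the ROC points by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area
```

A model whose scores rank every diseased patient above every healthy one yields an area of 1.0, and a model that gives everyone the same score yields 0.5.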

SHAP Settings


SHAP (SHapley Additive exPlanations) assigns each feature a contribution value showing how much it moved the prediction away from the baseline.

🔴 Positive SHAP: feature increases heart disease risk

🟢 Negative SHAP: feature decreases heart disease risk

Mean Absolute SHAP Values
Features ranked by average absolute contribution to predictions.

SHAP Summary Beeswarm Plot
Each point is a patient. Red = high feature value, Blue = low feature value.

Bars show each feature's SHAP contribution for this individual patient. The baseline, E[f(x)], plus the sum of all SHAP values equals the predicted probability.
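
The additivity property can be checked with simple arithmetic. The Python sketch below is illustrative only: the baseline, the feature names, and the SHAP values are hypothetical numbers chosen for the example, not outputs of the system.

```python
def reconstruct_prediction(baseline, shap_values):
    """Baseline plus the sum of all SHAP values gives the predicted probability."""
    return baseline + sum(shap_values.values())


def waterfall_order(shap_values):
    """Order features by absolute contribution, largest first, as bars are drawn."""
    return sorted(shap_values, key=lambda name: abs(shap_values[name]), reverse=True)


# Hypothetical values for one patient, for illustration only.
baseline = 0.46
shap_values = {"chest_pain_type": 0.18, "max_heart_rate": -0.07, "oldpeak": 0.11}
predicted = reconstruct_prediction(baseline, shap_values)  # 0.46 + 0.22
```

Here two features push the probability up and one pulls it down, for a net shift of 0.22 above the baseline.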

The dependence plot shows how the SHAP value changes across the range of the selected feature. Steep gradients indicate regions of high influence.

Dataset Explorer

Distribution Explorer

Correlation Heatmap

CardioPredict AI Clinical Assistant

Hi, I'm Afya, an AI chatbot designed to answer your questions about cardiovascular disease. Ask about patient predictions, model performance, SHAP values, or cardiovascular risk factors.


Dataset Information

Source: Cleveland Heart Disease Dataset — UCI Machine Learning Repository (Detrano et al., 1989)

Sample Size: 303 patients; after NA removal ≈ 297 complete cases

Features: 13 clinical predictors spanning demographics, symptoms, vitals, ECG, and angiography

Target Variable: Binary — presence (Yes) or absence (No) of angiographic heart disease

Preprocessing Pipeline

  1. Missing Values: Removed rows with NA; median imputation applied in training pipeline via caret preProcess
  2. Outlier Handling: Extreme values retained as clinically plausible; Oldpeak and Chol are standardised by z-score scaling, which rescales but does not clip (winsorise) extreme values
  3. Feature Encoding: Categorical variables factored; binary variables kept as 0/1 factors
  4. Scaling: Z-score standardisation (center + scale) applied within caret pipeline to all numeric features
  5. Train / Test Split: 80% training (stratified), 20% test; reproducible via set.seed(2024)
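
Steps 4 and 5 can be sketched as follows. This is an illustrative Python version, not the caret pipeline itself; `stratified_split`, `fit_scaler`, and `apply_scaler` are hypothetical helper names. The seed 2024 is reused only to echo the text and will not reproduce the partition that R's `set.seed(2024)` gives. The scaler is fitted on training values only and then applied to both sets, which keeps test-set information out of the preprocessing.

```python
import random
from statistics import mean, stdev


def stratified_split(labels, train_fraction=0.8, seed=2024):
    """Split row indices so each class keeps roughly the same train/test ratio."""
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        cut = round(train_fraction * len(idx))
        train_idx += idx[:cut]
        test_idx += idx[cut:]
    return sorted(train_idx), sorted(test_idx)


def fit_scaler(train_values):
    """Learn centre and scale (sample standard deviation) from training data."""
    return mean(train_values), stdev(train_values)


def apply_scaler(values, centre, scale):
    """Z-score standardisation using previously fitted parameters."""
    return [(v - centre) / scale for v in values]
```

Extreme values pass through `apply_scaler` unchanged in rank and relative distance; nothing is clipped.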

Machine Learning Models

📐 Logistic Regression

Models the log-odds of the outcome as a linear combination of predictors. Produces probability outputs and interpretable coefficients. Strengths: fast, interpretable, no distributional assumptions on the predictors. Limitations: linear decision boundary may miss complex interactions.
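
As a sketch of the relationship between log-odds and probability, the Python function below applies the logistic (sigmoid) transform to a linear combination of predictors. It is illustrative only; the function name and any coefficients passed to it are hypothetical, not the fitted model.

```python
import math


def predict_probability(intercept, coefficients, features):
    """Convert a linear log-odds score into a probability with the logistic function."""
    log_odds = intercept + sum(b * x for b, x in zip(coefficients, features))
    return 1.0 / (1.0 + math.exp(-log_odds))
```

A log-odds of 0 corresponds to a probability of 0.5, and each coefficient b multiplies the odds by exp(b) per unit increase in its predictor.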


🔍 K-Nearest Neighbours (KNN)

Classifies by majority vote of the k most similar training cases in feature space. Non-parametric; can capture non-linear boundaries. k tuned over {5,7,9,11,15}. Strengths: no training phase, flexible boundary. Limitations: sensitive to irrelevant features and scale; slow at inference.
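
The majority-vote rule can be sketched in a few lines of Python. This is an illustration under simple assumptions (Euclidean distance, features already scaled), not the implementation the system uses; `knn_predict` is a hypothetical name.

```python
import math
from collections import Counter


def knn_predict(train_x, train_y, query, k=5):
    """Return the most common label among the k training rows nearest to query."""
    # Pair each training row's distance to the query with its label, nearest first.
    neighbours = sorted(
        (math.dist(x, query), y) for x, y in zip(train_x, train_y)
    )
    votes = Counter(label for _, label in neighbours[:k])
    return votes.most_common(1)[0][0]
```

Every prediction scans the full training set, which is why inference is slow, and an unscaled feature with a large range would dominate the distance.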


⚙️ Support Vector Machine (SVM)

Finds the maximum-margin hyperplane separating classes; radial basis kernel (RBF) used to handle non-linear boundaries. Strengths: robust in high dimensions, effective with small samples. Limitations: computationally intensive; probabilities via Platt scaling.

Hyperparameter Tuning

Method: Grid Search over defined parameter grids via caret::train()

Validation: 5-fold cross-validation; folds stratified by outcome

Optimisation Metric: ROC-AUC (two-class summary); maximises discriminative ability

KNN: k ∈ {5, 7, 9, 11, 15} — optimal k selected by CV-AUC

SVM: C and σ tuned by random search with tuneLength = 5 candidate combinations

Selection: Model with highest mean CV-AUC selected as final model; evaluated on held-out test set
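
The validation and selection logic can be sketched as below. This Python example is illustrative and does not reproduce `caret::train()`; `stratified_folds` and `select_best` are hypothetical helpers, and the per-fold scores passed to `select_best` are assumed to be AUC values computed elsewhere.

```python
import random
from statistics import mean


def stratified_folds(labels, n_folds=5, seed=2024):
    """Assign each row a fold number so every fold has a similar class balance."""
    rng = random.Random(seed)
    fold_of = [0] * len(labels)
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        # Deal the shuffled rows of this class round-robin across the folds.
        for position, i in enumerate(idx):
            fold_of[i] = position % n_folds
    return fold_of


def select_best(cv_scores):
    """Pick the candidate with the highest mean cross-validated score.

    cv_scores maps each candidate (e.g. a value of k) to its per-fold scores.
    """
    return max(cv_scores, key=lambda candidate: mean(cv_scores[candidate]))
```

The chosen candidate is then refitted on the full training set and scored once on the held-out test set.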

Explainable AI — SHAP

Framework: SHapley Additive exPlanations (Lundberg & Lee, 2017)

Shapley Values: From cooperative game theory. Each feature's contribution is its average marginal contribution across all possible feature orderings, which is equivalent to a weighted average over all coalitions of the other features

Custom SHAP Sampler: Monte Carlo approximation of SHAP values using 30 simulations per row; implemented in base R (permutation sampling per Štrumbelj & Kononenko, 2014) so it works model-agnostically via a prediction function wrapper, with no external SHAP package dependency
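
One common way to organise permutation sampling is sketched below in Python. It is an illustration of the general technique, not the system's base-R sampler, whose details may differ; `sample_shap` is a hypothetical name and the default of 30 simulations simply echoes the text. Each simulation draws a random feature ordering and a random background row, then switches features to the patient's values one at a time and credits each feature with the resulting change in the model output.

```python
import random


def sample_shap(predict, x, background, n_simulations=30, seed=0):
    """Monte Carlo estimate of Shapley values for one row x.

    predict: function taking a list of feature values and returning a number.
    background: list of reference rows drawn from the training data.
    """
    rng = random.Random(seed)
    n_features = len(x)
    totals = [0.0] * n_features
    for _ in range(n_simulations):
        order = list(range(n_features))
        rng.shuffle(order)                    # random feature ordering
        current = list(rng.choice(background))  # start from a background row
        previous_output = predict(current)
        for j in order:
            current[j] = x[j]                 # switch feature j to the patient's value
            output = predict(current)
            totals[j] += output - previous_output  # marginal contribution of j
            previous_output = output
    return [total / n_simulations for total in totals]
```

Because each simulation walks from a background row to the patient's row, the estimates always sum to the prediction for x minus the average prediction over the sampled background rows. Only `predict` touches the model, which is what makes the approach model-agnostic.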

Global Importance: Mean |SHAP| per feature — identifies clinically dominant predictors
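
Global importance can be computed from a matrix of per-patient SHAP values as sketched here. The Python function is illustrative and `global_importance` is a hypothetical name; it assumes one row per patient and one column per feature.

```python
def global_importance(shap_matrix, feature_names):
    """Rank features by mean absolute SHAP value, largest first."""
    n_rows = len(shap_matrix)
    scores = {
        name: sum(abs(row[j]) for row in shap_matrix) / n_rows
        for j, name in enumerate(feature_names)
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

Taking the absolute value first means a feature that raises risk for some patients and lowers it for others still ranks as influential.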

Local Explanations: Per-patient SHAP waterfall shows which features pushed this individual's probability above or below the population baseline

Dependence Plots: Feature value vs. SHAP reveals non-linear effects and threshold behaviour

Technical Stack

🔵 R ≥ 4.2

shiny + bs4Dash — UI framework

📦 caret — ML training & CV

📦 e1071 + class — SVM & KNN

📦 pROC — ROC / AUC computation

📦 Custom SHAP sampler — base-R Monte Carlo Shapley approximation

ggplot2 + plotly — visualisation

📦 DT + reactable — interactive tables

📦 jsonlite — metrics & ROC serialisation

⚠️ Clinical Disclaimer

Predictions produced by this system are intended for research, educational, and decision-support purposes only. They should not replace professional clinical judgement, medical diagnosis, or treatment decisions. Always consult a qualified healthcare professional for individual patient management.

Contact & Citation

📧 CDAM Research Team, Chuka University

📅 Year: 2026


Dataset: Detrano R et al. (1989). International application of a new probability algorithm for the diagnosis of coronary artery disease. Am J Cardiol, 64(5), 304-310.