CardioPredict AI

Explainable Heart Disease Prediction System

Chuka University · CDAM Research Team

About This System

CardioPredict AI is a clinical decision-support platform for early detection of coronary artery disease using the Cleveland Heart Disease dataset. Three machine learning models — Logistic Regression, K-Nearest Neighbours, and Support Vector Machine — are compared on predictive accuracy, discrimination (AUC), and F1 score.

Explainable AI (SHAP) methods provide patient-level transparency, showing which clinical features drive each prediction. They support, rather than replace, clinician judgement.

13 clinical predictors including ECG, cholesterol, and chest pain type
3 ML algorithms with 5-fold cross-validated hyperparameter tuning
SHAP explainability for global and patient-level feature attribution
Risk stratification into four clinical action tiers

Risk Category Framework

Low Risk: 0.00 - 0.25
Mild Risk: 0.26 - 0.50
Moderate Risk: 0.51 - 0.75
High Risk: 0.76 - 1.00
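
The tier boundaries above can be sketched as a small lookup function. This is an illustrative Python sketch, not the app's R code. The function name `risk_tier` is hypothetical, and treating each upper bound as inclusive (so that a value such as 0.255, which falls between the printed ranges, lands in the higher tier) is an assumption.

```python
def risk_tier(probability: float) -> str:
    """Map a predicted probability to one of the four risk tiers."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must lie in [0, 1]")
    # Assumption: upper bounds are inclusive, so values between the
    # printed ranges (e.g. 0.255) are assigned to the higher tier.
    if probability <= 0.25:
        return "Low Risk"
    if probability <= 0.50:
        return "Mild Risk"
    if probability <= 0.75:
        return "Moderate Risk"
    return "High Risk"
```

For example, `risk_tier(0.62)` returns `"Moderate Risk"`.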

Clinical Feature Dictionary

Model Performance Leaderboard

Models ranked by ROC-AUC on held-out test set.

Patient Input

Select Model

👤 Demographics

Age (years)

Sex

🫀 Symptoms

Chest Pain Type

Exercise-Induced Angina

Vitals & Labs

Resting BP (mm Hg)

Cholesterol (mg/dl)

Fasting Blood Sugar > 120 mg/dl

📈 ECG & Stress Test

Resting ECG

Maximum Heart Rate

ST Depression (Oldpeak)

ST Slope

🔬 Angiography

Major Vessels (0–3)

Thalassemia

Risk Probability Gauge

Clinical Recommendations

SHAP Feature Attribution (Patient-Level)

🔴 Red bars increase risk | 🟢 Green bars decrease risk

Model Metrics (Test Set)

Comparative Metrics Chart

Radar / Spider Chart

Confusion Matrices

ROC Curves — All Models

Download Plot

Model Ranking by AUC

Interpreting ROC Curves

What is a ROC Curve? A Receiver Operating Characteristic (ROC) curve plots True Positive Rate (Sensitivity) vs. False Positive Rate (1-Specificity) at all classification thresholds.

AUC (Area Under the Curve) measures overall discriminative ability. An AUC of 1.0 indicates perfect discrimination; an AUC of 0.5 is no better than chance.

Clinical Relevance: High sensitivity ensures fewer missed cases (false negatives), while high specificity reduces unnecessary referrals (false positives).
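
To make the ROC and AUC definitions concrete, here is a minimal Python sketch that builds the curve by sweeping the threshold over the sorted scores and integrates it with the trapezoid rule. It is not the app's implementation (the stack lists pROC for that); the function names `roc_points` and `auc` and the toy data are hypothetical.

```python
def roc_points(labels, scores):
    """Return (false positive rate, true positive rate) pairs.

    labels: 1 = disease present, 0 = absent. scores: predicted probabilities.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative case")
    # Lowering the threshold one case at a time, highest score first
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        if label == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last case in a group of tied scores
        is_last = i == len(ranked) - 1
        if is_last or ranked[i + 1][0] != score:
            points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoid rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Toy data (hypothetical): three diseased and three healthy patients
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
print(auc(roc_points(labels, scores)))
```

In the toy data, eight of the nine diseased/healthy pairs are ranked correctly, so the AUC is 8/9.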

SHAP Settings

Select Model

SHAP (SHapley Additive exPlanations) assigns each feature a contribution value showing how much it moved the prediction away from the baseline.

🔴 Positive SHAP: feature increases heart disease risk

🟢 Negative SHAP: feature decreases heart disease risk

Mean Absolute SHAP Values

Features ranked by average absolute contribution to predictions.

SHAP Summary Beeswarm Plot

Each point is a patient. Red = high feature value, Blue = low feature value.

Select Patient (Test Set Index)

Bars show each feature's SHAP contribution for this individual patient. Baseline (E[f(x)]) + all SHAP values = predicted probability.

Select Feature

Shows how SHAP value changes across the range of the selected feature. Steep gradients indicate regions of high influence.

Dataset Explorer

Filter by Outcome

Download CSV

Distribution Explorer

Select Feature

Correlation Heatmap

CardioPredict AI Clinical Assistant

Hi, I'm Afya, an AI chatbot designed to answer your questions about cardiovascular disease. Ask about patient predictions, model performance, SHAP values, or cardiovascular risk factors.

Response:

Dataset Information

Source: Cleveland Heart Disease Dataset — UCI Machine Learning Repository (Detrano et al., 1989)

Sample Size: 303 patients; 297 complete cases after removing rows with missing values

Features: 13 clinical predictors spanning demographics, symptoms, vitals, ECG, and angiography

Target Variable: Binary — presence (Yes) or absence (No) of angiographic heart disease

Preprocessing Pipeline

Missing Values: Removed rows with NA; median imputation applied in training pipeline via caret preProcess
Outlier Handling: Extreme values retained as clinically plausible; Oldpeak and Chol are rescaled by standardisation, which does not cap or winsorise extreme values
Feature Encoding: Categorical variables factored; binary variables kept as 0/1 factors
Scaling: Z-score standardisation (center + scale) applied within caret pipeline to all numeric features
Train / Test Split: 80% training (stratified), 20% test; reproducible via set.seed(2024)
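
The split and scaling steps above can be sketched as follows. This is an illustrative Python sketch, not the caret pipeline itself. The helper names are hypothetical, the seed only mirrors `set.seed(2024)` in spirit (Python's generator will not reproduce R's partition), and estimating the mean and standard deviation on training rows only, then reusing them on test rows, is an assumption about how the pipeline avoids leakage.

```python
import random
import statistics


def stratified_split(labels, train_fraction=0.8, seed=2024):
    """Return (train_indices, test_indices), preserving class proportions."""
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        cut = round(train_fraction * len(idx))
        train_idx += idx[:cut]
        test_idx += idx[cut:]
    return sorted(train_idx), sorted(test_idx)


def fit_zscore(train_values):
    """Estimate centre and scale from training rows only."""
    mean = statistics.mean(train_values)
    sd = statistics.stdev(train_values)  # sample SD (n - 1), as in R's sd()
    if sd == 0:
        raise ValueError("constant feature cannot be standardised")
    return mean, sd


def apply_zscore(values, mean, sd):
    """Standardise any rows (train or test) with the training statistics."""
    return [(v - mean) / sd for v in values]
```

A typical use would call `fit_zscore` on the training column and then `apply_zscore` on both the training and the test column with the same `mean` and `sd`.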

Machine Learning Models

📐 Logistic Regression

Models log-odds of the outcome as a linear combination of predictors. Produces calibrated probability outputs and interpretable coefficients. Strengths: fast, interpretable, minimal assumptions on features. Limitations: linear decision boundary may miss complex interactions.
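
As a minimal illustration of the log-odds formulation, the Python sketch below turns a linear combination of predictors into a probability with the logistic function. The function name and the coefficient values in the usage note are hypothetical, not fitted values from this system.

```python
import math


def predict_probability(intercept, coefficients, features):
    """Logistic regression: probability = 1 / (1 + exp(-log_odds))."""
    log_odds = intercept + sum(b * x for b, x in zip(coefficients, features))
    # Two algebraically equal forms, chosen to avoid overflow in exp()
    if log_odds >= 0:
        return 1.0 / (1.0 + math.exp(-log_odds))
    odds = math.exp(log_odds)
    return odds / (1.0 + odds)
```

With log-odds of ln(3), that is odds of 3 to 1, the function returns 0.75; with log-odds of 0 it returns 0.5.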

🔍 K-Nearest Neighbours (KNN)

Classifies by majority vote of the k most similar training cases in feature space. Non-parametric; can capture non-linear boundaries. k tuned over {5,7,9,11,15}. Strengths: no training phase, flexible boundary. Limitations: sensitive to irrelevant features and scale; slow at inference.
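
The majority-vote rule can be sketched in a few lines of Python. This is an illustration under assumptions, not the app's code: Euclidean distance is assumed, features are assumed to be already standardised, and the function name and toy data are hypothetical.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, query, k=5):
    """Classify `query` by majority vote among its k nearest training rows."""
    # Distance from the query to every training row (all work happens here,
    # which is why KNN has no training phase but is slow at inference)
    distances = sorted(
        (math.dist(row, query), label) for row, label in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Toy data (hypothetical): two well-separated groups
train_X = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
train_y = ["No", "No", "No", "Yes", "Yes", "Yes"]
print(knn_predict(train_X, train_y, (0.5, 0.5), k=3))
```

An odd k, as in the tuning grid above, avoids tied votes in a two-class problem.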

⚙️ Support Vector Machine (SVM)

Finds the maximum-margin hyperplane separating classes; radial basis kernel (RBF) used to handle non-linear boundaries. Strengths: robust in high dimensions, effective with small samples. Limitations: computationally intensive; probabilities via Platt scaling.
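
The pieces named above (RBF kernel, decision value, Platt scaling) can be sketched as follows. This Python sketch assumes the kernel parametrisation exp(-σ‖x − z‖²), which matches the σ notation used in this document but is not confirmed by it. The support vectors, dual coefficients, bias, and Platt parameters would come from a fitted model; all names here are hypothetical.

```python
import math


def rbf_kernel(x, z, sigma):
    """RBF similarity: exp(-sigma * squared Euclidean distance)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sigma * squared_distance)


def decision_value(support_vectors, dual_coefs, bias, x, sigma):
    """Signed score f(x); its sign gives the predicted class."""
    return bias + sum(
        coef * rbf_kernel(sv, x, sigma)
        for sv, coef in zip(support_vectors, dual_coefs)
    )


def platt_probability(f, a, b):
    """Platt scaling: map a decision value to a probability.

    a and b are fitted on held-out decision values; a is typically negative
    so that larger decision values give higher probabilities.
    """
    return 1.0 / (1.0 + math.exp(a * f + b))
```

Because Platt scaling is a separate fitted step, SVM probabilities are a post-hoc calibration of the decision value rather than a direct model output.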

Hyperparameter Tuning

Method: Grid Search over defined parameter grids via caret::train()

Validation: 5-fold cross-validation; folds stratified by outcome

Optimisation Metric: ROC-AUC (two-class summary); maximises discriminative ability

KNN: k ∈ {5, 7, 9, 11, 15} — optimal k selected by CV-AUC

SVM: C and σ tuned via random search over 5 candidate combinations (tuneLength = 5)

Selection: Model with highest mean CV-AUC selected as final model; evaluated on held-out test set
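
The tuning procedure above can be sketched generically: build stratified folds, score every candidate on each held-out fold, and keep the candidate with the highest mean score. This is a Python illustration, not `caret::train()`. The function names are hypothetical, and `score_fold` stands in for "fit on the training indices, return AUC on the validation indices".

```python
import random
from statistics import mean


def stratified_folds(labels, n_folds=5, seed=2024):
    """Deal each class's shuffled indices round-robin into n_folds folds."""
    rng = random.Random(seed)
    folds = [[] for _ in range(n_folds)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for position, i in enumerate(idx):
            folds[position % n_folds].append(i)
    return folds


def grid_search_cv(candidates, labels, score_fold, n_folds=5, seed=2024):
    """Return (best_candidate, mean_score_per_candidate).

    score_fold(candidate, train_idx, valid_idx) must return a score where
    higher is better, e.g. the validation AUC.
    """
    folds = stratified_folds(labels, n_folds, seed)
    mean_scores = {}
    for candidate in candidates:
        fold_scores = []
        for held_out in folds:
            held = set(held_out)
            train_idx = [i for i in range(len(labels)) if i not in held]
            fold_scores.append(score_fold(candidate, train_idx, held_out))
        mean_scores[candidate] = mean(fold_scores)
    best = max(mean_scores, key=mean_scores.get)
    return best, mean_scores
```

For KNN, `candidates` would be the grid `[5, 7, 9, 11, 15]`. Reusing the same folds for every candidate keeps the comparison between candidates fair.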

Explainable AI — SHAP

Framework: SHapley Additive exPlanations (Lundberg & Lee, 2017)

Shapley Values: From cooperative game theory. Each feature's contribution is its marginal contribution averaged over all possible feature orderings, which is equivalent to a weighted average over all feature coalitions

Custom SHAP Sampler: Monte Carlo approximation of SHAP values using 30 simulations per row, implemented in base R with permutation sampling (Štrumbelj & Kononenko, 2014). It is model-agnostic, working through a prediction-function wrapper, and has no external SHAP package dependency

Global Importance: Mean |SHAP| per feature — identifies clinically dominant predictors

Local Explanations: Per-patient SHAP waterfall shows which features pushed this individual's probability above or below the population baseline

Dependence Plots: Feature value vs. SHAP reveals non-linear effects and threshold behaviour
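
The permutation-sampling idea behind the sampler can be sketched as below. This is a Python illustration of the general Štrumbelj and Kononenko scheme, not the app's base-R implementation. The function name, the argument layout, and the toy linear model are hypothetical; only the default of 30 simulations is taken from the text.

```python
import random


def shapley_sample(predict, x, background, feature, n_sim=30, rng=None):
    """Monte Carlo estimate of one feature's Shapley value for instance x.

    predict: function mapping a feature list to a probability.
    background: list of reference rows (e.g. training data).
    """
    rng = rng or random.Random(0)
    n_features = len(x)
    total = 0.0
    for _ in range(n_sim):
        # Random feature ordering and random reference patient
        order = list(range(n_features))
        rng.shuffle(order)
        z = rng.choice(background)
        before = set(order[: order.index(feature)])
        # Features preceding `feature` in the ordering (and the feature
        # itself) come from x; every other feature comes from z
        with_j = [
            x[i] if (i in before or i == feature) else z[i]
            for i in range(n_features)
        ]
        # Same hybrid row, but `feature` is also taken from z
        without_j = list(with_j)
        without_j[feature] = z[feature]
        total += predict(with_j) - predict(without_j)
    return total / n_sim


# Toy linear model (hypothetical coefficients) and a one-row background
def model(v):
    return 0.2 + 0.1 * v[0] - 0.05 * v[1]

x = [3.0, 4.0]
background = [[1.0, 2.0]]
phi = [shapley_sample(model, x, background, j) for j in range(2)]
baseline = model(background[0])
print(baseline + sum(phi), model(x))  # the two values agree for this toy case
```

For this additive toy model the estimates are exact, so the baseline plus the SHAP values reproduces the prediction. For non-linear models with a larger background set, the sum matches only up to Monte Carlo error, which shrinks as `n_sim` grows.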

Technical Stack

🔵 R ≥ 4.2

shiny + bs4Dash — UI framework

📦 caret — ML training & CV

📦 e1071 + class — SVM & KNN

📦 pROC — ROC / AUC computation

📦 Custom SHAP sampler — base-R Monte Carlo Shapley approximation

ggplot2 + plotly — visualisation

📦 DT + reactable — interactive tables

📦 jsonlite — metrics & ROC serialisation

⚠️ Clinical Disclaimer

Predictions produced by this system are intended for research, educational, and decision-support purposes only. They should not replace professional clinical judgement, medical diagnosis, or treatment decisions. Always consult a qualified healthcare professional for individual patient management.

Contact & Citation

📧 CDAM Research Team, Chuka University

📅 Year: 2026

Dataset: Detrano R et al. (1989). International application of a new probability algorithm for the diagnosis of coronary artery disease. Am J Cardiol, 64(5), 304-310.