CardioPredict AI
Explainable Heart Disease Prediction System
Chuka University · CDAM Research Team
About This System
CardioPredict AI is a clinical decision-support platform for early detection of coronary artery disease using the Cleveland Heart Disease dataset. Three machine learning models — Logistic Regression, K-Nearest Neighbours, and Support Vector Machine — are compared on predictive accuracy, discrimination (AUC), and F1 score.
Explainable AI (SHAP) methods provide patient-level transparency by showing which clinical features drive each prediction. They support, rather than replace, clinician judgement.
- 13 clinical predictors including ECG, cholesterol, and chest pain type
- 3 ML algorithms with 5-fold cross-validated hyperparameter tuning
- SHAP explainability for global and patient-level feature attribution
- Risk stratification into four clinical action tiers
Risk Category Framework
Clinical Feature Dictionary
Model Performance Leaderboard
Models ranked by ROC-AUC on held-out test set.
Patient Input
Risk Probability Gauge
Clinical Recommendations
SHAP Feature Attribution (Patient-Level)
🔴 Red bars increase risk | 🟢 Green bars decrease risk
Model Metrics (Test Set)
Comparative Metrics Chart
Radar / Spider Chart
Confusion Matrices
ROC Curves — All Models
Model Ranking by AUC
Interpreting ROC Curves
What is a ROC Curve? A Receiver Operating Characteristic (ROC) curve plots True Positive Rate (Sensitivity) vs. False Positive Rate (1-Specificity) at all classification thresholds.
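AUC (Area Under the Curve) measures overall discriminative ability: the probability that a randomly chosen patient with disease receives a higher predicted risk than a randomly chosen patient without it. An AUC of 1.0 is perfect discrimination; an AUC of 0.5 is no better than random guessing.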
Clinical Relevance: High sensitivity ensures fewer missed cases (false negatives), while high specificity reduces unnecessary referrals (false positives).
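As a rough illustration of the definitions above, the following Python sketch builds ROC points by sweeping the threshold over every distinct score and integrates them with the trapezoidal rule. It is not the app's pROC-based code; the labels, scores, and the function names `roc_points` and `auc` are assumptions made up for this example.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) pairs, one per distinct threshold, starting at (0, 0)."""
    pos = sum(y_true)
    neg = len(y_true) - pos
    points = [(0.0, 0.0)]
    # Lower the threshold one distinct score at a time, strictest first.
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= t and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= t and y == 0)
        points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Trapezoidal area under a list of (FPR, TPR) points."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Hypothetical labels (1 = disease) and predicted probabilities, illustration only.
y_true = [1, 1, 0, 1, 0, 0, 1, 0]
scores = [0.92, 0.81, 0.67, 0.60, 0.45, 0.38, 0.30, 0.12]

points = roc_points(y_true, scores)
roc_auc = auc(points)

# Sensitivity and specificity at one fixed threshold (0.5).
tp = sum(1 for y, s in zip(y_true, scores) if s >= 0.5 and y == 1)
tn = sum(1 for y, s in zip(y_true, scores) if s < 0.5 and y == 0)
sensitivity = tp / sum(y_true)
specificity = tn / (len(y_true) - sum(y_true))
```

Each point on the curve corresponds to one threshold, so choosing an operating threshold is a choice between missed cases and unnecessary referrals; the AUC summarises the whole curve without committing to one.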
SHAP Settings
SHAP (SHapley Additive exPlanations) assigns each feature a contribution value showing how much it moved the prediction away from the baseline.
🔴 Positive SHAP: feature increases heart disease risk
🟢 Negative SHAP: feature decreases heart disease risk
Bars show each feature's SHAP contribution for this individual patient. The baseline E[f(x)] plus the sum of all SHAP values equals the predicted probability.
The dependence plot shows how the SHAP value changes across the range of the selected feature. Steep gradients indicate regions of high influence.
CardioPredict AI Clinical Assistant
Hi, I'm Afya, an AI chatbot designed to answer your questions about cardiovascular disease. Ask about patient predictions, model performance, SHAP values, or cardiovascular risk factors.
Dataset Information
Source: Cleveland Heart Disease Dataset — UCI Machine Learning Repository (Detrano et al., 1989)
Sample Size: 303 patients; 297 complete cases after removing rows with missing values
Features: 13 clinical predictors spanning demographics, symptoms, vitals, ECG, and angiography
Target Variable: Binary — presence (Yes) or absence (No) of angiographic heart disease
Preprocessing Pipeline
- Missing Values: Removed rows with NA; median imputation applied in training pipeline via caret preProcess
- Outlier Handling: Extreme values retained as clinically plausible; no explicit winsorisation is applied. Oldpeak and Chol are standardised, which rescales them but does not cap outliers
- Feature Encoding: Categorical variables factored; binary variables kept as 0/1 factors
- Scaling: Z-score standardisation (center + scale) applied within caret pipeline to all numeric features
- Train / Test Split: 80% training (stratified), 20% test; reproducible via set.seed(2024)
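The split and scaling steps listed above can be sketched in a few lines. This is a Python illustration only, not the app's caret code: the cholesterol values, labels, and helper names (`stratified_split`, `fit_scaler`, `apply_scaler`) are invented for the example, and a Python seed of 2024 will not reproduce the same rows as R's set.seed(2024).

```python
import random
import statistics


def stratified_split(labels, train_frac=0.8, seed=2024):
    """Return (train_idx, test_idx), taking train_frac of each class for training."""
    rng = random.Random(seed)
    train, test = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        cut = round(train_frac * len(idx))
        train += idx[:cut]
        test += idx[cut:]
    return sorted(train), sorted(test)


def fit_scaler(values):
    """Learn the centre and scale from training values only."""
    return statistics.mean(values), statistics.stdev(values)


def apply_scaler(values, mean, sd):
    """Z-score values with previously learned statistics."""
    return [(v - mean) / sd for v in values]


# Hypothetical cholesterol values (mg/dl) and outcomes, illustration only.
chol = [233, 286, 229, 250, 204, 236, 268, 354, 254, 203,
        192, 294, 256, 263, 199, 168, 229, 239, 275, 564]
label = [0, 1, 1, 0, 0, 0, 1, 0, 1, 1,
         0, 0, 1, 0, 0, 0, 1, 0, 1, 0]

train_idx, test_idx = stratified_split(label)
mean, sd = fit_scaler([chol[i] for i in train_idx])
train_z = apply_scaler([chol[i] for i in train_idx], mean, sd)
test_z = apply_scaler([chol[i] for i in test_idx], mean, sd)  # reuses training statistics
```

Fitting the scaler on the training rows and reusing it on the test rows keeps test information out of preprocessing. Standardising changes the units but not the shape of the distribution, so an extreme value stays extreme on the z-score scale.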
Machine Learning Models
📐 Logistic Regression
Models log-odds of the outcome as a linear combination of predictors. Produces calibrated probability outputs and interpretable coefficients. Strengths: fast, interpretable, minimal assumptions on features. Limitations: linear decision boundary may miss complex interactions.
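To make the log-odds formulation concrete, here is a minimal Python sketch that turns a linear combination of predictors into a probability. The coefficients, intercept, and patient values are hypothetical numbers on standardised predictors, not fitted values from this system.

```python
import math


def predict_probability(features, coefficients, intercept):
    """Map the linear predictor (log-odds) to a probability with the logistic function."""
    log_odds = intercept + sum(coefficients[name] * value
                               for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-log_odds))


# Hypothetical coefficients; illustration only.
coefficients = {"age": 0.30, "chol": 0.15, "oldpeak": 0.80}
intercept = -0.20
patient = {"age": 1.0, "chol": 0.0, "oldpeak": 0.5}

p = predict_probability(patient, coefficients, intercept)

# exp(coefficient) is the odds ratio for a one-unit increase in that predictor.
odds_ratio_oldpeak = math.exp(coefficients["oldpeak"])
```

The interpretability mentioned above comes from the last line: each coefficient exponentiates to an odds ratio that holds regardless of the other predictors' values, which is also why the decision boundary is linear.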
🔍 K-Nearest Neighbours (KNN)
Classifies by majority vote of the k most similar training cases in feature space. Non-parametric; can capture non-linear boundaries. k tuned over {5,7,9,11,15}. Strengths: no training phase, flexible boundary. Limitations: sensitive to irrelevant features and scale; slow at inference.
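The majority-vote rule can be sketched as follows in Python, assuming Euclidean distance on already standardised features. The training points and the name `knn_predict` are made up for the example.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, query, k=5):
    """Classify `query` by majority vote among its k nearest training points."""
    by_distance = sorted(zip(train_X, train_y),
                         key=lambda pair: math.dist(pair[0], query))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Hypothetical standardised two-feature training cases, illustration only.
train_X = [(-1.2, -0.8), (-0.9, -1.1), (-1.0, -0.3), (-0.4, -0.9),
           (1.1, 0.9), (0.8, 1.2), (1.3, 0.4), (0.5, 1.0), (0.2, -0.1)]
train_y = ["No", "No", "No", "No", "Yes", "Yes", "Yes", "Yes", "Yes"]

high_risk = knn_predict(train_X, train_y, (1.0, 0.8), k=5)
low_risk = knn_predict(train_X, train_y, (-1.0, -0.9), k=5)
```

All the work happens at prediction time, when every training case is ranked by distance, which is where the inference cost comes from. An odd k avoids tied votes in a two-class problem, and because raw distances are used, unscaled features with large units would dominate the vote.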
⚙️ Support Vector Machine (SVM)
Finds the maximum-margin hyperplane separating the classes; a radial basis function (RBF) kernel is used to handle non-linear boundaries. Strengths: robust in high dimensions, effective with small samples. Limitations: computationally intensive; class probabilities are not native and are obtained via Platt scaling.
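For readers unfamiliar with kernels, the Python sketch below shows an RBF kernel and how a fitted kernel SVM scores a new case. It assumes the parametrisation exp(-σ‖x − z‖²), in which σ multiplies the squared distance; the support vectors, dual coefficients, and bias are hypothetical and do not come from the trained model.

```python
import math


def rbf_kernel(x, z, sigma):
    """Similarity in (0, 1]: equals 1 when x == z and decays with squared distance."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sigma * squared_distance)


def decision_value(x, support_vectors, dual_coefs, bias, sigma):
    """Signed score of a kernel SVM; the sign gives the predicted class."""
    return sum(c * rbf_kernel(sv, x, sigma)
               for sv, c in zip(support_vectors, dual_coefs)) + bias


# Hypothetical fitted quantities, illustration only.
support_vectors = [(1.0, 1.0), (-1.0, -1.0)]
dual_coefs = [1.0, -1.0]   # positive for the disease class, negative otherwise
bias = 0.0
sigma = 0.5

score_a = decision_value((0.8, 0.9), support_vectors, dual_coefs, bias, sigma)
score_b = decision_value((-0.8, -0.9), support_vectors, dual_coefs, bias, sigma)
```

The raw score is a distance-like quantity rather than a probability, which is why a separate calibration step is needed before it can be shown as a risk percentage.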
Hyperparameter Tuning
Method: Hyperparameter search via caret::train(), using an explicit grid for KNN and a tuneLength = 5 search for SVM
Validation: 5-fold cross-validation; folds stratified by outcome
Optimisation Metric: ROC-AUC (two-class summary); maximises discriminative ability
KNN: k ∈ {5, 7, 9, 11, 15} — optimal k selected by CV-AUC
SVM: C and σ tuned via tuneLength = 5 random search
Selection: Model with highest mean CV-AUC selected as final model; evaluated on held-out test set
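The tuning procedure described above can be sketched end to end for the KNN grid. The Python code below is an illustration under stated assumptions, not a reproduction of caret::train(): the data are synthetic, the helper names are invented, and the AUC is computed by direct pairwise comparison.

```python
import math
import random


def stratified_folds(labels, n_folds=5, seed=2024):
    """Assign each row to a fold so that folds keep the overall class balance."""
    rng = random.Random(seed)
    fold_of = [0] * len(labels)
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for position, i in enumerate(idx):
            fold_of[i] = position % n_folds
    return fold_of


def knn_probability(train_X, train_y, query, k):
    """Share of positive labels among the k nearest training points."""
    nearest = sorted(zip(train_X, train_y),
                     key=lambda p: math.dist(p[0], query))[:k]
    return sum(y for _, y in nearest) / k


def auc(y_true, scores):
    """Chance that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def cv_auc(X, y, k, fold_of, n_folds=5):
    """Mean validation AUC of a k-NN classifier across the folds."""
    fold_scores = []
    for f in range(n_folds):
        train = [i for i in range(len(y)) if fold_of[i] != f]
        valid = [i for i in range(len(y)) if fold_of[i] == f]
        train_X, train_y = [X[i] for i in train], [y[i] for i in train]
        probs = [knn_probability(train_X, train_y, X[j], k) for j in valid]
        fold_scores.append(auc([y[j] for j in valid], probs))
    return sum(fold_scores) / n_folds


# Synthetic two-class data, illustration only.
rng = random.Random(0)
X = ([(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) for _ in range(60)]
     + [(rng.gauss(1.5, 1.0), rng.gauss(1.5, 1.0)) for _ in range(60)])
y = [0] * 60 + [1] * 60

fold_of = stratified_folds(y)
grid = [5, 7, 9, 11, 15]
results = {k: cv_auc(X, y, k, fold_of) for k in grid}
best_k = max(results, key=results.get)
```

Every candidate k is scored on the same folds, so differences in mean AUC reflect the parameter rather than the partition. The held-out test set plays no part in this loop and is used only once, after the final model is chosen.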
Explainable AI — SHAP
Framework: SHapley Additive exPlanations (Lundberg & Lee, 2017)
Shapley Values: From cooperative game theory. Each feature's contribution is its average marginal contribution across all possible feature orderings (equivalently, a weighted average over all coalitions of the other features)
Custom SHAP Sampler: Monte Carlo approximation of SHAP values using 30 simulations per row, implemented in base R with permutation sampling (Štrumbelj & Kononenko, 2014). It is model-agnostic, requiring only a prediction function wrapper, and has no external SHAP package dependency
Global Importance: Mean |SHAP| per feature — identifies clinically dominant predictors
Local Explanations: Per-patient SHAP waterfall shows which features pushed this individual's probability above or below the population baseline
Dependence Plots: Feature value vs. SHAP reveals non-linear effects and threshold behaviour
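A permutation-sampling Shapley estimator of the kind described above can be sketched as follows. This Python version is an illustration, not the app's base-R sampler: the model, background rows, and patient are hypothetical, and it makes one specific design choice (a single background row and feature ordering per simulation, shared by all features) that the source does not confirm.

```python
import math
import random


def shapley_sample(predict, x, background, n_sim=30, seed=2024):
    """Monte Carlo Shapley estimate for one row x.

    Each simulation draws a background row z and a random feature ordering,
    then switches features from z's values to x's values one at a time in
    that order. The change in prediction at each switch is credited to the
    feature that was switched.
    """
    rng = random.Random(seed)
    features = list(x)
    phi = {name: 0.0 for name in features}
    baseline = 0.0
    for _ in range(n_sim):
        z = rng.choice(background)
        order = features[:]
        rng.shuffle(order)
        current = dict(z)
        previous = predict(current)
        baseline += previous
        for name in order:
            current[name] = x[name]
            value = predict(current)
            phi[name] += value - previous
            previous = value
    return {name: total / n_sim for name, total in phi.items()}, baseline / n_sim


def predict(row):
    """Hypothetical risk model with an interaction term; it ignores chol."""
    log_odds = (-0.2 + 0.3 * row["age"] + 0.8 * row["oldpeak"]
                + 0.4 * row["age"] * row["oldpeak"])
    return 1.0 / (1.0 + math.exp(-log_odds))


# Hypothetical standardised background rows and patient, illustration only.
background = [{"age": a, "oldpeak": o, "chol": c}
              for a, o, c in [(-1.0, -0.5, 0.2), (0.0, 0.0, -0.3),
                              (0.5, 1.0, 1.1), (1.2, -0.8, 0.0)]]
patient = {"age": 1.0, "oldpeak": 1.5, "chol": 0.4}

phi, baseline = shapley_sample(predict, patient, background)
reconstructed = baseline + sum(phi.values())
```

Because every simulation walks from a background row all the way to the patient, the per-feature changes telescope, and the estimated baseline plus the attributions reproduces the patient's prediction up to floating-point error. The baseline here is the mean prediction over the sampled background rows, itself a Monte Carlo estimate of E[f(x)]. A feature the model never uses, such as chol in this toy model, receives an attribution of exactly zero.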
Technical Stack
🔵 R ≥ 4.2
shiny + bs4Dash — UI framework
📦 caret — ML training & CV
📦 e1071 + class — SVM & KNN
📦 pROC — ROC / AUC computation
📦 Custom SHAP sampler — base-R Monte Carlo Shapley approximation
ggplot2 + plotly — visualisation
📦 DT + reactable — interactive tables
📦 jsonlite — metrics & ROC serialisation
⚠️ Clinical Disclaimer
Predictions produced by this system are intended for research, educational, and decision-support purposes only. They should not replace professional clinical judgement, medical diagnosis, or treatment decisions. Always consult a qualified healthcare professional for individual patient management.
Contact & Citation
📧 CDAM Research Team, Chuka University
📅 Year: 2026
Dataset: Detrano R et al. (1989). International application of a new probability algorithm for the diagnosis of coronary artery disease. Am J Cardiol, 64(5), 304-310.