Statistical Analysis of Heart Diseases Risk Factors Using Logistic Regression: Evidence from the Cleveland Dataset

Umar Usman

Department of Statistics, Usmanu Danfodiyo University, Sokoto, Nigeria.

A. A. Nurudeen

Department of Statistics, Usmanu Danfodiyo University, Sokoto, Nigeria.

M. A. Balarabe *

Department of Statistics, Usmanu Danfodiyo University, Sokoto, Nigeria.

*Author to whom correspondence should be addressed.


Abstract

In low- and middle-income countries, including Nigeria, the burden of cardiovascular diseases (CVDs) is rising rapidly because of urbanisation, changing dietary patterns, physical inactivity and limited access to advanced diagnostic facilities. This study aimed to identify and quantify the major risk factors for heart disease using logistic regression and to evaluate the predictive performance of the model on the Cleveland Heart Disease Dataset. The dataset (n = 303 patients, 14 attributes) is secondary data originating from Cleveland Hospital and was obtained through the UCI Machine Learning Repository. Missing values were handled by median imputation, categorical variables were label-encoded, and continuous variables were standardised. Logistic regression was fitted after a 70/30 train-test split. Model performance was assessed using accuracy, ROC-AUC and a confusion matrix, and odds ratios were used to quantify the effect of each predictor. The analysis was performed in Python (scikit-learn). The logistic regression model achieved an accuracy of 84.62% and a ROC-AUC of 0.9046, indicating excellent discriminative ability. The strongest predictors were the number of major vessels coloured by fluoroscopy (ca), thalassemia type (thal), exercise-induced angina (exang), chest pain type (cp) and serum cholesterol. The odds ratios showed that each additional blocked vessel multiplied the odds of heart disease by more than 2.5. Logistic regression therefore provides an interpretable and clinically useful approach to heart disease risk prediction. The identified risk factors align with established medical knowledge, which supports the validity of the model. Its transparency makes logistic regression valuable in resource-constrained settings where explainable models are preferred over black-box algorithms.
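The workflow described in the abstract (median imputation, standardisation, a 70/30 split, logistic regression, then accuracy, ROC-AUC, a confusion matrix and odds ratios) might be sketched in scikit-learn roughly as follows. This is an illustrative sketch, not the authors' code. The data are synthetic random numbers standing in for the 303 x 14 Cleveland table, so the printed metrics bear no relation to the reported 84.62% and 0.9046. The helper names `make_synthetic_data` and `fit_and_evaluate`, the stratified split, the random seed and `max_iter=1000` are all assumptions made for the example. Label encoding is omitted because the synthetic features are already numeric.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def make_synthetic_data(n=303, n_features=13, seed=0):
    """Hypothetical stand-in for the Cleveland data (13 predictors + 1 target)."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, n_features))
    true_coef = rng.normal(size=n_features)
    prob = 1.0 / (1.0 + np.exp(-(X @ true_coef)))
    y = (rng.random(n) < prob).astype(int)
    # Blank out about 2% of entries so the imputation step has work to do.
    X[rng.random(X.shape) < 0.02] = np.nan
    return X, y


def fit_and_evaluate(X, y, seed=0):
    """Fit the pipeline on 70% of the rows and score it on the remaining 30%."""
    # Stratifying on y is an assumption; the abstract only states a 70/30 split.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.30, stratify=y, random_state=seed
    )
    model = make_pipeline(
        SimpleImputer(strategy="median"),   # median imputation
        StandardScaler(),                   # standardise predictors
        LogisticRegression(max_iter=1000),  # default L2-regularised fit
    )
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    prob = model.predict_proba(X_test)[:, 1]
    coef = model.named_steps["logisticregression"].coef_.ravel()
    return {
        "accuracy": accuracy_score(y_test, pred),
        "roc_auc": roc_auc_score(y_test, prob),
        "confusion_matrix": confusion_matrix(y_test, pred),
        "odds_ratios": np.exp(coef),  # odds ratio = exp(coefficient)
    }


results = fit_and_evaluate(*make_synthetic_data())
print(results["accuracy"], results["roc_auc"])
print(results["confusion_matrix"])
print(results["odds_ratios"])
```

Because the predictors are standardised inside the pipeline, each odds ratio in this sketch refers to a one-standard-deviation increase in the corresponding feature. A per-unit interpretation, such as "each additional vessel", would need the coefficient rescaled to the original units or a model fitted on the unstandardised variable.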

Keywords: Heart disease, cardiovascular risk, logistic regression, Cleveland dataset, odds ratio, predictive modelling, ROC-AUC, risk factors, clinical decision support, model interpretability


How to Cite

Usman, Umar, A. A. Nurudeen, and M. A. Balarabe. 2026. “Statistical Analysis of Heart Diseases Risk Factors Using Logistic Regression: Evidence from the Cleveland Dataset”. Asian Journal of Probability and Statistics 28 (7):60-70. https://doi.org/10.9734/ajpas/2026/v28i7918.
