A Variable-Wise Hybrid Imputation Framework Using kNN, MissForest and PMM for Enhancing HIV Survey Data Quality in Kenya

Mino Subby Mackenzie *

Department of Mathematics, Physics and Computing, Moi University, Kenya.

Gregory Kerich

Department of Mathematics, Physics and Computing, Moi University, Kenya.

Robert Too

Department of Mathematics, Physics and Computing, Moi University, Kenya.

*Author to whom correspondence should be addressed.


Abstract

Missing data remains a critical challenge in large-scale public health datasets, particularly in HIV surveillance, where incomplete observations can bias estimates and weaken decision-making. This study proposes a variable-wise hybrid imputation framework that integrates k-Nearest Neighbors (kNN), MissForest, and a Modified Predictive Mean Matching (Modified PMM) method under the Missing at Random (MAR) assumption. The method employs a composite scoring function to dynamically select the optimal donor for each missing observation by combining structural similarity, predictive alignment, and model-based deviation. The framework was applied to HIV survey data (imbalance ratio 8.65:1) and evaluated against the individual imputation methods using regression, classification, and distributional imputation-quality metrics. The Hybrid approach achieved superior imputation accuracy, with the lowest RMSE (0.4297) and MAE (0.3623). It also demonstrated improved classification performance, achieving the highest accuracy (73.43%), specificity (0.7376), balanced accuracy (0.7201), and F1-score (0.3291). McNemar’s test confirmed statistically significant improvements over Modified PMM (p = 0.041), kNN (p = 0.020), and MissForest (p = 0.045). The Hybrid method further exhibited improved probability calibration, with a lower Expected Calibration Error (ECE = 0.2940). Precision-Recall analysis confirmed the Hybrid framework as the best-performing method under class imbalance, achieving the highest Area Under the Precision-Recall Curve (AUPRC = 0.3709), corresponding to a 4.01× lift over the random classifier baseline. An ablation study confirmed that the full three-component hybrid outperforms all two-component subsets and the equal-weights configuration, establishing that the performance gains arise from the composite design rather than from any single constituent. These findings highlight the effectiveness of adaptive, observation-level donor selection in improving imputation and downstream predictive performance under class imbalance.
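The composite donor-selection idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the scoring formula, so the function name `select_donor`, the three component definitions (Euclidean covariate distance for structural similarity, absolute gap between predicted means for predictive alignment, and absolute gap between a donor's observed value and a model-based estimate for model-based deviation), the min-max scaling, and the equal default weights are all assumptions made for illustration.

```python
import numpy as np


def _minmax(v):
    """Scale a vector to [0, 1]; a constant vector maps to all zeros."""
    v = np.asarray(v, dtype=float)
    span = v.max() - v.min()
    return np.zeros_like(v) if span == 0 else (v - v.min()) / span


def select_donor(x_miss, X_donors, y_donors, pred_miss, pred_donors,
                 model_est, weights=(1 / 3, 1 / 3, 1 / 3)):
    """Pick one donor for a single missing value (hypothetical scoring rule).

    x_miss      : covariates of the row with the missing value
    X_donors    : covariates of rows where the variable is observed
    y_donors    : observed values of the variable for those rows
    pred_miss   : predicted mean for the incomplete row (PMM-style)
    pred_donors : predicted means for the donor rows (PMM-style)
    model_est   : model-based estimate for the incomplete row
                  (standing in for a MissForest-style prediction)
    weights     : assumed weights for the three components
    """
    X_donors = np.asarray(X_donors, dtype=float)
    y_donors = np.asarray(y_donors, dtype=float)
    pred_donors = np.asarray(pred_donors, dtype=float)

    # Structural similarity: distance in covariate space (kNN-style).
    structural = np.linalg.norm(X_donors - np.asarray(x_miss, dtype=float), axis=1)
    # Predictive alignment: gap between predicted means (PMM-style).
    alignment = np.abs(pred_donors - pred_miss)
    # Model-based deviation: donor value versus the model-based estimate.
    deviation = np.abs(y_donors - model_est)

    w1, w2, w3 = weights
    score = (w1 * _minmax(structural)
             + w2 * _minmax(alignment)
             + w3 * _minmax(deviation))

    best = int(np.argmin(score))  # lowest composite score wins
    return best, float(y_donors[best])


# Toy usage with made-up numbers (not data from the study).
idx, value = select_donor(
    x_miss=[0.9, 1.1],
    X_donors=[[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]],
    y_donors=[1.0, 2.0, 9.0],
    pred_miss=2.0,
    pred_donors=[1.2, 2.1, 8.5],
    model_est=2.2,
)
```

In this toy case the second donor is closest on all three components, so its observed value is the one borrowed. Changing `weights` shifts the balance between the three criteria; the equal weights used here are only a placeholder.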

Keywords: Missing data, hybrid imputation, k-nearest neighbors, MissForest, predictive mean matching, HIV surveillance, missing at random, classification performance, calibration, epidemiological data


How to Cite

Mackenzie, Mino Subby, Gregory Kerich, and Robert Too. 2026. “A Variable-Wise Hybrid Imputation Framework Using kNN, MissForest and PMM for Enhancing HIV Survey Data Quality in Kenya.” Asian Journal of Probability and Statistics 28 (5):87-104. https://doi.org/10.9734/ajpas/2026/v28i5897.
