Diabetes prediction and validation model using ML classification algorithms

Subhrapratim Nath, Indrajit Das, Pradyut Nath, Sumagna Dey, Dyuti Mohapatra


Diabetes is now a worldwide concern that can critically disrupt an individual’s normal lifestyle and everyday activities. Insufficient insulin and high glucose levels in the body can lead to a diagnosis of diabetes. Apart from the medical factors, several additional non-medical factors in an individual’s daily life, such as hypertension, heredity, daily activity, smoking habits and body mass index, may play a part in triggering diabetes. Several medical studies report that, for women, the number of pregnancies or any kind of heart issue can also trigger diabetes. This paper aims to predict the most critical factor contributing to diabetes in an individual by using classification and predictive analysis algorithms. Five well-known machine learning classification algorithms are used; a filtering scheme based on a 75% threshold accuracy rate is applied, followed by verification with the AUROC metric, aiming for a low error rate and high prediction accuracy. Additionally, the model uses ensemble learning to make predictions, and the proposed scheme is validated against the PIMA Indian Diabetes dataset.
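The workflow summarised in the abstract (five classifiers, a 75% accuracy filter, AUROC verification, then an ensemble) might be sketched roughly as follows. This is a minimal illustration using scikit-learn, not the authors' implementation. The synthetic data is only a stand-in with the same shape as the PIMA Indian Diabetes dataset (768 rows, 8 features). The hyperparameters, the 75/25 split, the soft-voting ensemble, and the function names `candidate_models` and `filter_and_ensemble` are all assumptions made for this example. For brevity the same held-out split is used for both filtering and verification.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

ACCURACY_THRESHOLD = 0.75  # the 75% filter named in the abstract


def candidate_models():
    """The five classifier families listed in the keywords (settings are assumed)."""
    return {
        "logistic_regression": make_pipeline(
            StandardScaler(), LogisticRegression(max_iter=1000)
        ),
        "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
        "svm": make_pipeline(StandardScaler(), SVC(probability=True, random_state=0)),
        "naive_bayes": GaussianNB(),
        "knn": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
    }


def filter_and_ensemble(X, y, threshold=ACCURACY_THRESHOLD):
    """Fit each model, keep those meeting the accuracy threshold, report AUROC,
    and combine the kept models in a soft-voting ensemble (an assumed choice)."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, stratify=y, random_state=0
    )
    scores, kept = {}, []
    for name, model in candidate_models().items():
        model.fit(X_train, y_train)
        acc = accuracy_score(y_test, model.predict(X_test))
        auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
        scores[name] = (acc, auc)
        if acc >= threshold:  # filtering step
            kept.append((name, model))

    ensemble_scores = None
    if kept:  # an ensemble needs at least one surviving model
        ensemble = VotingClassifier(estimators=kept, voting="soft")
        ensemble.fit(X_train, y_train)
        ensemble_scores = (
            accuracy_score(y_test, ensemble.predict(X_test)),
            roc_auc_score(y_test, ensemble.predict_proba(X_test)[:, 1]),
        )
    return scores, [name for name, _ in kept], ensemble_scores


# Synthetic stand-in data; replace with the real PIMA features and labels.
X, y = make_classification(
    n_samples=768, n_features=8, n_informative=4, n_redundant=0,
    class_sep=1.5, random_state=0,
)
scores, kept_names, ensemble_scores = filter_and_ensemble(X, y)
for name, (acc, auc) in scores.items():
    print(f"{name}: accuracy={acc:.3f}, AUROC={auc:.3f}")
print("kept:", kept_names, "ensemble (accuracy, AUROC):", ensemble_scores)
```

With the real dataset, `X` would hold the eight clinical attributes and `y` the diabetes outcome label. Which models survive the filter depends on the data and settings.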


logistic regression; random forest algorithm; support vector machine; naïve Bayes; KNN; AUROC; ensemble learning


DOI: https://doi.org/10.26483/ijarcs.v11i5.6654



Copyright (c) 2020 International Journal of Advanced Research in Computer Science