IJFANS International Journal of Food and Nutritional Sciences

ISSN Print 2319-1775, Online 2320-7876

A Machine Learning Approach for Diabetes Prediction in Women


Afshan Hashmi1, Md Tabrez Nafis2,*, Sameena Naaz3, Imran Hussain4

Abstract

Diabetes is a chronic disease whose prevalence has grown rapidly in the recent past. Trends suggest that the number of patients will double in the near future, which is a serious concern that needs to be addressed as early as possible. The disease is especially serious because it can lead to several other conditions, such as hypertension, kidney failure, blindness, and limb amputation. Early prediction of diabetes is therefore essential to protect patients from further damage, and machine learning can be a useful tool for this purpose. In this study, we used the Pima Indians Diabetes dataset, dropped the highly correlated feature, and filled missing values using KNN imputation. The interquartile range (IQR) was used to remove outliers, adaptive synthetic sampling (ADASYN) was used for class balancing, and a min-max scaler was used to normalize the dataset. Eight machine learning algorithms were applied: support vector classifier, logistic regression, naïve Bayes, decision tree, extreme gradient boosting (XGBoost), k-nearest neighbors, linear discriminant analysis, and random forest. These algorithms were compared on several performance metrics: accuracy, precision, recall, F1-score, and the area under the ROC curve (AUC-ROC). Linear discriminant analysis and extreme gradient boosting performed best in terms of accuracy, followed by random forest, logistic regression, k-nearest neighbors, support vector classifier, and naïve Bayes; the decision tree performed poorly. The effect of oversampling on the results was also analyzed: oversampling improved the precision and F1-score of every algorithm except the decision tree. Performance could be further improved by using a larger dataset with few or no missing values, or a dataset with additional features such as lifestyle and calorie intake.
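As an illustration only, three of the steps named in the abstract (IQR-based outlier removal, min-max scaling, and the accuracy, precision, recall, and F1 metrics) can be sketched in plain Python. This is a minimal sketch under stated assumptions, not the authors' code: the 1.5 x IQR fence multiplier, the function names, and the toy values are all hypothetical, since the abstract does not give these details. KNN imputation, ADASYN oversampling, and the eight classifiers are omitted here; in practice they would normally come from libraries such as scikit-learn and imbalanced-learn.

```python
from statistics import quantiles


def iqr_bounds(values, k=1.5):
    """Return (lower, upper) fences for one feature.

    k=1.5 is the conventional Tukey multiplier; it is an assumption here.
    """
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def drop_outlier_rows(rows, k=1.5):
    """Keep only rows whose every feature lies inside that feature's fences."""
    columns = list(zip(*rows))
    bounds = [iqr_bounds(col, k) for col in columns]
    return [
        row
        for row in rows
        if all(lo <= x <= hi for x, (lo, hi) in zip(row, bounds))
    ]


def min_max_scale(rows):
    """Rescale each feature to [0, 1]; a constant feature maps to 0.0."""
    columns = list(zip(*rows))
    mins = [min(col) for col in columns]
    spans = [max(col) - min(col) for col in columns]
    return [
        [(x - lo) / span if span else 0.0 for x, lo, span in zip(row, mins, spans)]
        for row in rows
    ]


def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall, and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical toy rows of (glucose, BMI); the last glucose value is implausible.
rows = [
    [85.0, 22.0],
    [90.0, 24.0],
    [100.0, 26.0],
    [110.0, 28.0],
    [120.0, 30.0],
    [500.0, 27.0],
]
cleaned = drop_outlier_rows(rows)
scaled = min_max_scale(cleaned)

# Hypothetical labels and predictions, used only to exercise the metrics.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
metrics = classification_metrics(y_true, y_pred)

print(len(rows), "rows before filtering,", len(cleaned), "after")
print(metrics)
```

In this sketch the fences are computed once on the unfiltered data, and scaling is applied after outlier removal, so that an extreme value does not compress the remaining values into a narrow part of the [0, 1] range. In a real experiment the scaler would be fitted on the training split only, and oversampling would likewise be applied only to training data, to avoid leaking information into the test set.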
