Researcher: Sheena Phillip, University of the Witwatersrand, Johannesburg
Supervisor: Dr Wilbert Chagwiza, University of the Witwatersrand, Johannesburg

Fraud by health care providers is a growing concern worldwide, as billions of dollars are lost each year. Medicare publicly released health provider data to encourage the development of models that detect fraud. The aim of this research is to train a model on the Medicare dataset, determine how accurately it can predict fraud, and identify the top 5 features that contribute most to fraud detection. It also investigates the impact that explicit features such as Provider, BeneID and ClaimID have on the accuracy of the model. Four datasets were combined into a single comprehensive dataset, which was then used to train an XGBoost model. The model achieved an accuracy of 0.98 and a recall of 0.97. A model trained on the same dataset with the explicit features excluded achieved an accuracy of 0.85 and a recall of 0.71. This is comparatively poor performance, a drop of 13 percentage points in accuracy. Regardless of which feature space was used, the top 5 features captured details about the doctor and the location of the hospital.
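
As a reading aid for the two metrics reported above, the following minimal sketch shows how accuracy and recall are computed from confusion-matrix counts, and how the gap between the two reported accuracies is obtained. The counts (`tp`, `fp`, `tn`, `fn`) and the helper names are hypothetical and chosen only for illustration; they are not taken from the study, and no XGBoost model is trained here.

```python
def accuracy(tp, fp, tn, fn):
    """Fraction of all cases classified correctly."""
    return (tp + tn) / (tp + fp + tn + fn)


def recall(tp, fn):
    """Fraction of truly fraudulent cases that the model flags."""
    return tp / (tp + fn)


# Hypothetical confusion-matrix counts (illustration only, not study data):
# true positives, false positives, true negatives, false negatives.
tp, fp, tn, fn = 71, 50, 850, 29

acc = accuracy(tp, fp, tn, fn)  # (71 + 850) / 1000
rec = recall(tp, fn)            # 71 / (71 + 29)

# Accuracies reported in the abstract: all features vs. explicit features excluded.
acc_full, acc_reduced = 0.98, 0.85

# Absolute difference, expressed in percentage points.
drop_points = round((acc_full - acc_reduced) * 100)

print(f"accuracy={acc:.3f} recall={rec:.2f} drop={drop_points} points")
```

Recall is reported alongside accuracy because, in fraud detection, fraudulent cases are usually a minority: a model can reach high accuracy while still missing many of them, and recall makes that visible.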