Predicting Unemployment Insurance Improper Payment Rates in the United States with Machine Learning


This project predicts the improper payment rate for unemployment insurance benefits from the overpayment rate, underpayment rate, and fraud rate. The data was accessed from the US Department of Labor website. For each state in the United States, the dataset shows the amount paid in benefits, the amounts overpaid and underpaid, the overpayment and underpayment rates, the improper payment rate, the amount overpaid excluding work search, and the fraud rate, covering the third quarter of 2020 through the first quarter of 2021. Since the focus of the project is predicting improper payment rates, the dataset was cleaned so that the payment rates serve as the feature variables and the states serve as the observations. The response variable, the improper payment rate, is numerical and is predicted from the overpayment rate, underpayment rate, and fraud rate, so supervised machine learning regression techniques were used.

The necessary libraries, including Pandas, NumPy, Scikit-learn, and Matplotlib, were loaded in a Python environment in Jupyter Notebook for this analysis. The dataset was preprocessed, and exploratory data analysis was carried out with visualizations. A scatter plot was used to display the improper payment rate for each state.

The scatter plot shows Virginia with the highest rate of unemployment benefit payment inaccuracies at 0.47, followed closely by Tennessee at 0.456. Hawaii has the lowest rate at 0.04, followed by Kentucky at 0.06. A bar plot was used to visualize the improper payment and fraud rates for each state.

The bar plot shows Kansas with the highest unemployment payment fraud rate, 0.3, alongside an improper payment rate of 0.33. Rhode Island has the next highest fraud rate, 0.17, also with an elevated improper payment rate of 0.27. New Hampshire has the lowest fraud rate at 0.002, with an improper payment rate of 0.19. Across states, the fraud rate and the improper payment rate tend to rise together, which indicates a direct relationship between the two variables.
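One way to put a number on that relationship is a Pearson correlation between the two columns. The sketch below uses only the three state values quoted in this post, so the result is an illustration of the calculation rather than an estimate for the full dataset; the column names are assumptions, not the names used in the Department of Labor file.

```python
import pandas as pd

# Only the three states quoted in the post; the full dataset is needed
# for a meaningful estimate. Column names are hypothetical.
quoted = pd.DataFrame(
    {
        "fraud_rate": [0.30, 0.17, 0.002],
        "improper_payment_rate": [0.33, 0.27, 0.19],
    },
    index=["Kansas", "Rhode Island", "New Hampshire"],
)

# Pearson correlation between the two rate columns.
r = quoted["fraud_rate"].corr(quoted["improper_payment_rate"])
print(f"Pearson r on the three quoted states: {r:.3f}")
```

With the cleaned dataset loaded, the same `.corr()` call on the full columns would give a more reliable figure.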

For the predictive analysis, the dataset was split 80/20 into training and test sets. Linear, K-Nearest Neighbors (KNN), and Random Forest regression models were fitted and their performance compared. The linear regression model predicted the test data with very good accuracy. The actual and predicted values for the KNN and Random Forest models were compared visually using Matplotlib.
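The workflow described above can be sketched roughly as follows. This is a minimal illustration, not the notebook behind this post: the data is synthetic, the column names are hypothetical, and the assumption that the improper payment rate is close to the sum of the overpayment and underpayment rates is made only to give the synthetic data some structure. The model settings (`n_neighbors=5`, `n_estimators=100`, `random_state=42`) are also assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in for the cleaned dataset (NOT the Department of Labor data).
rng = np.random.default_rng(0)
n_rows = 50
df = pd.DataFrame(
    {
        "overpayment_rate": rng.uniform(0.02, 0.40, n_rows),
        "underpayment_rate": rng.uniform(0.00, 0.05, n_rows),
        "fraud_rate": rng.uniform(0.00, 0.30, n_rows),
    }
)
# Assumed relationship, used only to generate illustrative targets.
df["improper_payment_rate"] = (
    df["overpayment_rate"] + df["underpayment_rate"] + rng.normal(0, 0.01, n_rows)
)

X = df[["overpayment_rate", "underpayment_rate", "fraud_rate"]]
y = df["improper_payment_rate"]

# 80/20 split for training and testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

models = {
    "linear": LinearRegression(),
    "knn": KNeighborsRegressor(n_neighbors=5),
    "random_forest": RandomForestRegressor(n_estimators=100, random_state=42),
}

# Fit each model on the training set and score it on the held-out test set.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    scores[name] = {
        "r2": r2_score(y_test, pred),
        "mse": mean_squared_error(y_test, pred),
    }
    print(f"{name}: R2={scores[name]['r2']:.3f}, MSE={scores[name]['mse']:.5f}")
```

Scoring all three models on the same held-out rows keeps the comparison fair, since each model sees identical training and test data.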



The plot above compares the actual and predicted values for the KNN model. The model yielded an R² score of 0.86 and an MSE of 0.35. The GridSearchCV hyperparameter tuning tool was then used to improve the accuracy of the model, as shown in the plot below.
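A tuning step of this kind might look like the sketch below. The search grid, the 5-fold cross-validation, and the synthetic data are all assumptions, since the post does not list the values that were tried. The sketch tunes on the training split only and keeps the test split for the final score.

```python
import numpy as np
from sklearn.metrics import r2_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in data (NOT the Department of Labor figures).
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 0.4, size=(50, 3))  # overpayment, underpayment, fraud rates
y = X[:, 0] + X[:, 1] + rng.normal(0, 0.01, 50)  # assumed, for illustration only

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Hypothetical grid; the values actually searched are not given in the post.
param_grid = {"n_neighbors": [2, 3, 5, 7], "weights": ["uniform", "distance"]}

# Cross-validated search over the grid, scored by R2 on the training folds.
search = GridSearchCV(KNeighborsRegressor(), param_grid, cv=5, scoring="r2")
search.fit(X_train, y_train)

# GridSearchCV refits the best estimator by default, so it can predict directly.
tuned_r2 = r2_score(y_test, search.predict(X_test))
print("Best parameters:", search.best_params_)
print(f"Test R2 after tuning: {tuned_r2:.3f}")
```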


A comparison of the actual and predicted values of the data using the random forest regression technique was tabulated and plotted.
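A table of this kind can be built with a small Pandas DataFrame. The sketch below again uses synthetic data and assumed settings (`n_estimators=100`, `random_state=42`), so the numbers it prints are not the results reported in this post.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (NOT the Department of Labor figures).
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 0.4, size=(50, 3))  # overpayment, underpayment, fraud rates
y = X[:, 0] + X[:, 1] + rng.normal(0, 0.01, 50)  # assumed, for illustration only

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

forest = RandomForestRegressor(n_estimators=100, random_state=42)
forest.fit(X_train, y_train)

# Side-by-side table of actual and predicted values on the test set.
comparison = pd.DataFrame({"actual": y_test, "predicted": forest.predict(X_test)})
comparison["abs_error"] = (comparison["predicted"] - comparison["actual"]).abs()
print(comparison.round(3))
```

Calling `comparison[["actual", "predicted"]].plot()` on this table would draw the two curves on one axis for a visual check.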





The curves for the two models show that both regression techniques predicted the test data with good accuracy, as did the linear regression model. The small differences between the actual and predicted values indicate that the models predict new observations about as well as they fit the training data. It can therefore be inferred that the unemployment insurance improper payment rate can be predicted from the overpayment, underpayment, and fraud rates associated with unemployment insurance benefit claims.







