Predicting 6-Figure Salaries with Kaggle's 2018 Survey Data
Objective
Kaggle is a forum where data scientists and other developers can participate in data science contests, write and share code, and host datasets. They recently posted the raw results of their 2018 Machine Learning and Data Science Survey. If you are interested in the data, you can find it here.
Because Kaggle's user-base is so broad (data scientists, analysts, DBAs, sales, CXOs, etc.), I thought it would be interesting to build a classification/prediction model to identify which combination of variables predicts a six-figure salary for someone with a data career. For anyone who may be pursuing a data career, this analysis could help in determining a career path.
In this article, I will focus less on my process and more on the results. If you are interested in the process and the R code used to prepare, analyze, visualize, and model the data, you can find it here. Between the narrative and comments in the code, it's a pretty good tutorial. If you are only interested in the results... read on.
Scope
I chose to limit the dataset to respondents from the US who were not students, were not unemployed, and had at least a Bachelor's degree. The reason for limiting to the US was to eliminate country-specific economic situations that I may not understand. I also wanted to maximize the utility for someone pursuing a career in the US. Students and unemployed respondents were outside the scope of this project because they do not report a salary from a data career. I originally included all degree levels (including none), but there were very few observations outside of bachelor's, master's, and doctoral degrees, and they did not contribute to the research or model in a meaningful way. The final dataset contained 2,696 observations.
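The scope rules above amount to a simple row filter. The sketch below is in Python rather than the R used for the actual analysis, and the field names (country, role, degree), the exact label strings, and the sample rows are all assumptions made up for illustration; the real survey columns are named differently.

```python
# Hypothetical field names and labels; the real survey columns differ.
ALLOWED_DEGREES = {"Bachelor's degree", "Master's degree", "Doctoral degree"}
EXCLUDED_ROLES = {"Student", "Not employed"}


def in_scope(row):
    """Return True if a respondent matches the scope described above."""
    return (
        row["country"] == "United States of America"
        and row["role"] not in EXCLUDED_ROLES
        and row["degree"] in ALLOWED_DEGREES
    )


# Made-up rows, only to exercise the filter.
respondents = [
    {"country": "United States of America", "role": "Data Scientist", "degree": "Master's degree"},
    {"country": "United States of America", "role": "Student", "degree": "Bachelor's degree"},
    {"country": "India", "role": "Data Analyst", "degree": "Bachelor's degree"},
    {"country": "United States of America", "role": "Not employed", "degree": "Doctoral degree"},
    {"country": "United States of America", "role": "Software Engineer", "degree": "Professional degree"},
    {"country": "United States of America", "role": "Statistician", "degree": "Doctoral degree"},
]

scoped = [r for r in respondents if in_scope(r)]
print(len(scoped), "of", len(respondents), "rows are in scope")
```

Keeping the rules in one small function makes it easy to loosen a single condition later, for example to bring back other degree levels.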
Salaries
Before jumping into the classification/prediction model, we should first look at the overall distribution of salaries in the US, and then at how that splits between those who do and don't earn six figures. In the first chart you can see the data is fairly normally distributed, with the peak at the $100-150k mark. A plurality of the respondents fall between $100k and $200k.
Now let's roll up the salaries that are above and below $100k to see how the data is distributed between those earning six figures and those not. In the figure below, you can see it's about a 50/50 split. This makes for a good classification challenge: can a model improve upon a coin flip?
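The roll-up described above turns each salary bin into a yes/no label. Here is a minimal Python sketch of that step, assuming the bins are text labels shaped like "100-125,000" and "500,000+"; that label format, the function names, and the sample bins are assumptions for illustration, not the code behind this article.

```python
def lower_bound_usd(salary_bin):
    """Return the lower bound of a salary bin in dollars (assumed label format)."""
    if salary_bin.endswith("+"):
        # Open-ended top bin, e.g. "500,000+": the number is already in dollars.
        return int(salary_bin[:-1].replace(",", ""))
    # Range bin, e.g. "100-125,000": the lower bound is written in thousands.
    low, _high = salary_bin.split("-")
    return int(low) * 1000


def is_six_figure(salary_bin):
    """True when the whole bin sits at or above $100k."""
    return lower_bound_usd(salary_bin) >= 100_000


# Made-up sample of bins, only to show the roll-up.
sample_bins = ["0-10,000", "90-100,000", "100-125,000", "150-200,000", "500,000+", "60-70,000"]
labels = [is_six_figure(b) for b in sample_bins]
share = sum(labels) / len(labels)
print(f"Share earning six figures in the sample: {share:.0%}")
```

Using the lower bound of each bin works here because no bin in the assumed format straddles the $100k line.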
Decision Tree
Below is the decision tree plot representing the predicted probability of earning a six-figure salary. This is based on Role, Education, and Experience. Of the 50+ variables in the original dataset, these three features produced the strongest model. The trained model's accuracy was ~70% when evaluated against the test dataset. Much better than a coin flip! The tree plot is below, followed by a more detailed explanation.
The probability of earning six figures is...
22% when Role is: Chief Officer or Consultant or Data Engineer or Data Journalist or Data Scientist or DBA/DB Eng. or Developer Advocate or Manager or Other or Principal Investigator or Prod/Proj Mgr or Salesperson or Software Engineer
& Experience is: 0-1 or 1-2 or 3-4
& Education is: Bachelor’s degree
22% when Role is: Business Analyst or Data Analyst or Marketing Analyst or Research Assistant or Research Scientist or Statistician
48% when Role is: Chief Officer or Consultant or Data Engineer or Data Journalist or Data Scientist or DBA/DB Eng. or Developer Advocate or Manager or Other or Principal Investigator or Prod/Proj Mgr or Salesperson or Software Engineer
& Experience is: 0-1 or 1-2 or 3-4
& Education is: Master’s degree
71% when Role is: Chief Officer or Consultant or Data Engineer or Data Journalist or Data Scientist or DBA/DB Eng. or Developer Advocate or Manager or Other or Principal Investigator or Prod/Proj Mgr or Salesperson or Software Engineer
& Experience is: 0-1 or 1-2 or 3-4
& Education is: Doctoral degree
73% when Role is: Chief Officer or Consultant or Data Engineer or Data Journalist or Data Scientist or DBA/DB Eng. or Developer Advocate or Manager or Other or Principal Investigator or Prod/Proj Mgr or Salesperson or Software Engineer
& Experience is: 2-3 or 4-5 or 5-10 or 10-15 or 15-20 or 20-25 or 25-30 or 30 +
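The rules listed above can also be written as a small lookup function. This Python sketch only restates the probabilities from the list; it is not the fitted model itself, and it assumes that any role outside the analyst/researcher group belongs to the other branch of the tree. The function and constant names are made up for illustration.

```python
# Roles that the tree assigns a 22% probability regardless of other inputs.
ANALYST_ROLES = {
    "Business Analyst", "Data Analyst", "Marketing Analyst",
    "Research Assistant", "Research Scientist", "Statistician",
}
# Experience bands (in years) on the lower-probability side of the split.
EARLY_EXPERIENCE = {"0-1", "1-2", "3-4"}
# For early-career respondents in the remaining roles, education decides.
EARLY_BY_DEGREE = {
    "Bachelor's degree": 0.22,
    "Master's degree": 0.48,
    "Doctoral degree": 0.71,
}


def six_figure_probability(role, experience, education):
    """Return the leaf probability from the rules listed above."""
    if role in ANALYST_ROLES:
        return 0.22
    # Assumption: every other role falls in the second role group.
    if experience not in EARLY_EXPERIENCE:
        return 0.73
    return EARLY_BY_DEGREE[education]


print(six_figure_probability("Data Scientist", "1-2", "Master's degree"))
print(six_figure_probability("Software Engineer", "5-10", "Bachelor's degree"))
print(six_figure_probability("Data Analyst", "10-15", "Doctoral degree"))
```

Reading the function top to bottom follows the tree's splits in order: role first, then experience, then education.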
Visualizing the Significant Model Variables
Conclusion
In the data visualizations above, we can see that the model reflects the story told by the data. In the Experience and Role chart, data scientists and software engineers stand out as doing well. In the Experience and Degree chart, we can see that with a Bachelor's degree, experience really matters: about three-quarters of the "yes" bar is filled with more experienced respondents. What we see in the visualizations is borne out in the decision tree model.
The model was successful in improving our ability to predict a six-figure salary beyond a coin flip. While 70% accuracy doesn't guarantee a result (no model does), the decision tree can be instructive in carving out a career path in the data field. For the purpose and scope of this project, this model has utility in terms of prediction and providing insights into the survey data.
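To make the coin-flip comparison concrete, accuracy is the share of test rows where the prediction matches the actual label, and a natural baseline is to always guess the most common label seen in training. The Python sketch below uses made-up labels chosen to mirror the roughly 50% versus 70% figures; they are not the survey data or the model's real predictions, and the function names are hypothetical.

```python
from collections import Counter


def accuracy(actual, predicted):
    """Share of positions where the prediction matches the actual label."""
    matches = sum(a == p for a, p in zip(actual, predicted))
    return matches / len(actual)


def majority_baseline(train_labels, n):
    """Predict the most common training label for all n test rows."""
    most_common = Counter(train_labels).most_common(1)[0][0]
    return [most_common] * n


# Made-up labels (1 = six figures, 0 = not), for illustration only.
train_labels = [1, 0, 1, 0, 1]
actual = [1, 1, 0, 0, 1, 0, 1, 0, 1, 0]
model_predictions = [1, 1, 0, 1, 1, 0, 0, 0, 1, 1]

baseline_predictions = majority_baseline(train_labels, len(actual))
print("baseline accuracy:", accuracy(actual, baseline_predictions))
print("model accuracy:", accuracy(actual, model_predictions))
```

With classes split about evenly, the baseline lands near 50%, which is why a model at ~70% counts as a real improvement.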