Introduction to Predictive Analytics for Fantasy Sports
As the popularity of fantasy sports has exploded over the last decade, so has the availability of resources to help you draft your team. Back in the old days, there were a handful of magazines and websites that provided fantasy players with better information than you could get from ESPN. Those days are gone. Today, everyone in your league has access to every website and article that you do. We're all sharing basically the same information. Good and bad fantasy teams now are separated by two factors: luck and effort.
I can't help you with better luck, but if you are willing to put in a little extra effort, read on... This article outlines a process that will give you insights into your league and player value that you won't find in any magazine, website, or fantasy podcast. The approach outlined in this article applies predictive analytics to your league's historical results to identify which variables (stats) are most likely to contribute to wins. Don't worry, Excel does most of the heavy lifting.
Because every league in every sport differs, I will outline the approach I took for my fantasy baseball draft in 2017. To apply the process to generically would be difficult to understand. Using my data will allow for specific instruction that should be easy enough to translate to your league.
What You'll Need
A summary of your league results. Between 3 and 10 years should do the trick. I used 9 years of results. You want to have enough data account for as much variation as possible, but want to make sure that if you go back several years, that the results are still relevant (same format, scoring, etc.). For each year, the summary will need to include total wins (and ties) and summary stats for each category (Football: rushing yards, passing yards, etc., Basketball: steals, points, FT%, etc., Baseball: HR, BA, ERA, etc.).
Microsoft Excel. You will need to add the Data Analysis plug in to Excel. Click here for instructions on adding the Data Analysis plugin. If you don't have Excel, Google Sheets is free, pretty much the same, and has a regression plugin as well. I haven't used it, but I'm sure it's swell.
Working knowledge of Excel. I'll walk through each step, but this is not an Excel tutorial. If Excel is foreign to you, this could be difficult.
A basic understanding of multiple regression. Even if this is your first time hearing the term, I will explain it enough in this article such that everything should make sense. If it doesn't, or you want to learn more, Google it.
Process Overview
The predictive modeling technique covered here is called multiple regression. Multiple regression is used to test the relationship between several independent variables and a dependent variable. In our case, the independent variables are our stats (HRs, ERA, SBs, etc.). Our independent variable is Wins + Ties. To put it simply, multiple regression will tell us which stats have a statistically significant effect on winning. The output of this technique is a formula that will allow us to plug in our projections to produce a predicted win total.
About My League
10 teams compete in head-to-head weekly match ups in the following 12 categories:
Batting
Runs scored (R)
Home Runs (HR)
Runs Batted In (RBI)
Stolen Bases (SB)
Batting Average (BA)
On Base + Slugging Percentage (OPS)
Pitching
Strikeouts (K)
Wins (W)
Saves (SV)
Earned Run Average (ERA)
Walks+Hits/Innings Pitched (WHIP)
Strikeouts/Walks (K/BB)
Scoring is simple. A win is awarded for each category. So, if Team A plays Team B and wins every batting category and loses every pitching category, both teams will be 6 and 6 at the end of that week. Ties are counted as such, so if a team wins 3 categories, loses 3 categories, and ties 6 categories, that team would end the week 3 - 3 - 6. The league plays for 22 weeks, so the best possible record 264 - 0 - 0.
To give you an idea of how this shakes out, below are the average results for each place in the standings (1st - 10th) in the 9 seasons represented in this project.
Note: Ties are multiplied by 0.5 and added to wins. So 100 wins and 100 ties would equal 150 wins + ties.
Analyzing the Variables
Step 1: Paste or download league data into Excel.
Most sites will allow you to copy-and-paste or download your league results as a CSV or Excel file. The only thing I had to add to the data above was a calculation for WinsPlusTies, which is wins + (ties * .5).
Step 2: Use the Regression function included in the Data Analysis plugin to analyze each variable.
Here is the Regression dialogue:
In the above dialogue, the Input Y Range is the target dependent variable (wins + ties). The X Range contains all of the stats data to be tested. I checked "Labels" because my data had column labels. I chose "Constant is Zero" so that the Wins + Ties estimate is not negative. I selected a confidence level of 95%, which is typical for most research. Lastly, I chose for my output to load into a new worksheet.
Step 3: Review results and refine the model.
In the results above, the diagnostic measures that are used to determine the model's value are highlighted. In the top area, the Adjusted R-sq tells us that this model accounts for 98% of the variation in the data. That is ideal. A low Adjusted R-sq does not mean your model is worthless -- just that it may be less reliable. Generally, anything over 90% represents a model that should be reliable, assuming the overall model and individual variables are statistically significant (we'll get to that).
The Standard Error represents the +/- of the model's prediction. This model has a SE of 9 Wins + Ties. So, if a prediction is 160 Wins + Ties, between 151 and 169 Wins + Ties should be expected.
In the next section, Significance is highlighted. This is the overall statistical significance of the the model. Anything less than or equal to 0.05 is considered significant. In this case, the model is well below 0.05.
In the bottom section, each variable is evaluated. The two important figures here are the Coefficients (we'll deal with those later) and the P-value. The P-value tells us whether each variable has a statistically significant impact on Wins + Ties. Again, anything less than or equal to 0.05 is statistically significant.
Quick digression: We chose to use 0.05 to measure significance. This gives us 95% confidence that our the values produced by our model would be within our confidence intervals in 95 out of 100 samples. While 0.05 is mostly used, 0.10 is used sometimes. Which you choose is a matter of how confident do you need to be in your model. Because this is a pretty low stakes endeavor, we could comfortably accept 0.10 for 90% confidence.
The significant variables are highlighted in green (Wins and Saves) and everything else in red. Does this mean only Wins and Saves matter???... No. Building a regression model is an iterative process. More powerful statistical packages provide multiple ways to do this, but in Excel, and for our purposes, the easiest method is to simply take out one variable at a time (starting with the highest P-value) until only significant variables remain. Below are the final results after several iterations.
The final model is pretty strong. The Adjusted R-sq is 98%, the overall model's significance is well under 0.05, five variables have P-values under 0.05 and one (K/BB) is under 0.10. I chose to keep K/BB because I felt the overall strength of the model is better with it than without it. This is a good time to note that predictive modeling is a lot science, but also some art as well. It is important to incorporate subject matter expertise wherever it makes sense. You may choose to include or exclude a variable even if the diagnostics say otherwise. The important thing is that you test and continually refine your model based on how it performs in the real world -- not in Excel.
Final Regression Formula
In order to build our final regression formula, we use the Coefficients for each variable. Our formula is as follows:
Predicted Wins + Ties = (R * 0.05) + (OPS * 105.77) + (W * 0.47) + (SV * 0.23) + (ERA * -20.39) + (K/BB * 4.95)
With this formula, we can plug in our stats and get a predicted Wins + Ties total. For example, if a player is expected to score 100 runs, that run total would equate to 5 Wins + Ties (100 * 0.05).
Interpreting the Model
Before jumping into application, let's walk through some insights that can be gleaned from this process and the final model. Let's start with Home Runs. Even though HRs were not included in the final model, several iterations included HRs. In every iteration, HRs had a negative effect on Wins+Ties. If you look at the model that included every variable, you'll see that each HR was worth -0.07 Wins+Ties. This is obviously counter intuitive. More HRs should result in more wins. My interpretation of this is that because many volume home run hitters do poorly in most of the other batting categories. So, even though HRs were not included in the final model, it was strategically helpful in drafting my team because I knew that if I could get HR hitters that do well in other categories, I would have an advantage.
Using the model projections, I was also able to determine that pitching adds more predictable value than batting. I went into this project with the assumption that closers (pitchers that get Saves) were of limited value. Closers are actually among the most valuable players in my league. In terms of batting stats, OPS and Runs are the only reliable predictors of wins. This does not mean the other categories do not matter, it just means that they are not reliable predictors.
Keep in mind, the insights from my data may or may not apply to your league. Your data may show something very different. Differences in scoring and the make-up of the teams from league to league will make each league's results unique to some degree.
Applying the Model
In order to apply the model to drafting my team, I loaded the projections for the top 300 players into Excel and applied the regression formula. This produced a predicted Wins+Ties total for each player. I then used those predictions to calculate every players' value (in terms of Wins+Ties) above or below the top average of the top 150 players. I did this separately for pitchers and batters to allow the two to be compared. The last thing I did was calculate the drop off from one player in the rankings to the next. This is useful in deciding to draft between two positions. For example, if it is my pick and I need a shortstop and second basemen, I can look at the drop off from the best player at each position to the next best player, and take the player where the drop-off is most significant.
Testing the Model
In order to test the model, it must be applied to real life data that was not included in the training of the model. I used my league data from 2017 to test my model. Here are the results.
In the table above, the first column shows the actual wins for the 2017 season. The second column shows the predicted totals, and the last column shows the error for each prediction. The Standard Error for this model was 9. On average, the predictions' error was 8 (Yay model!!!).
For 6 out of 10 teams, the model was almost perfect. My team finished 3rd and outperformed the model by 16 Wins+Ties. This had a lot to do with HRs. I had the #1 HR hitter and 3 of the top 10. All 3 players produced well in other categories and therefore did not negatively impact my win total like the data suggests HR hitters should.
Final Thoughts
More than simply producing a formula for predicting wins, applying predictive analytics to your fantasy league provides insights that are otherwise invisible. The process and the resulting model gives you a look "under the hood" to see how your league works. Not only do you learn which stats matter, you learn the degree to which they matter. As we've seen in my example, even data that doesn't make it into the final model can be eye opening and offer a strategic advantage. To me, more than the predictive model itself, the actual process of gathering the data and iterating through various models is the real value in this process.