Sharp-eyed readers of the Wages of Wins may remember Philip Maymin from a few years ago as one of the authors of this blog post on early foul trouble which used Wins Produced numbers to evaluate when and whether a coach should bench players because of fouls. Today he is writing about a new way to project college performance to the pros using machine learning techniques, and how much money teams have effectively lost by failing to do so over the past decade. More generally, he points out that analytics is perhaps poised to take off from being a secondary piece in decision making to being an actual source of revenue for a team.
Every year, some late-drafted players turn out to be stellar. Of course, in hindsight, every team should have drafted, for example, Paul Millsap. But was there any way of knowing at the time that he should have been picked?
Projecting from college to NBA performance is a notoriously difficult and noisy task. The prospects are young. They may not have even finished growing. And they often don’t blossom until their second or third year. Traditional approaches such as regression analysis are too constrained, too linear, and have difficulties with outliers.
A modern approach to such problems relies on machine learning techniques. Machine learning refers to a class of tools in which the computer automatically learns from examples. It can be flexible, non-linear, and appropriately sensitive to outliers. An example of a machine learning technique is neural networks. These are inspired by the way the neurons in our brains work, where neurons only fire after they have been activated enough times by other neurons.
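As a toy illustration of the "learning from examples" idea — not the actual model discussed in this post — a single artificial neuron can be trained by gradient descent in a few lines. Everything below is invented: the stats are randomly generated and the "productive player" labels come from a made-up rule, so this is only a sketch of how a neuron learns to fire.

```python
import math
import random

random.seed(0)

# Hypothetical training data: (points per game, rebounds per game) with a
# 1/0 label for "productive NBA player" -- all invented for illustration.
stats = [(random.uniform(0, 30), random.uniform(0, 15)) for _ in range(200)]
data = [((p, r), 1 if p + 2 * r > 25 else 0) for p, r in stats]

def sigmoid(z):
    z = max(-60.0, min(60.0, z))  # clamp to avoid math.exp overflow
    return 1.0 / (1.0 + math.exp(-z))

w1, w2, b = 0.0, 0.0, 0.0  # the neuron's weights and bias, initially zero

for _ in range(200):  # gradient-descent passes over the examples
    for (p, r), y in data:
        out = sigmoid(w1 * p + w2 * r + b)  # the neuron "fires" near 1.0
        err = out - y                       # compare to the true label
        w1 -= 0.01 * err * p
        w2 -= 0.01 * err * r
        b -= 0.01 * err

accuracy = sum(
    (sigmoid(w1 * p + w2 * r + b) > 0.5) == (y == 1) for (p, r), y in data
) / len(data)
print(accuracy)  # close to 1.0 on this separable toy data
```

After enough passes, the neuron's weighted sum reliably crosses its activation threshold for the productive players and stays below it for the rest — the same intuition, scaled up to many neurons and layers, underlies the networks described above.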
What if we let a machine learning algorithm loose on past NCAA data and future NBA performance to see if the computer can find a relationship?
There are two ways to do this. One is to use a classification method that sorts future players into categories such as All-Star, starter, bench player, or replacement-level player. Supervised learning techniques such as support vector machines can then work out which historical characteristics distinguish the players who later end up in each class.
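To make the classification framing concrete, here is a minimal sketch of the supervised setup. A support vector machine takes real machinery to implement, so a simple nearest-centroid classifier stands in for it here; the college stat lines and career tiers are all invented for illustration.

```python
from statistics import mean

# Hypothetical (college points, college rebounds) -> eventual career tier
train = [
    ((28, 10), "All-Star"), ((26, 11), "All-Star"),
    ((20, 7), "starter"),   ((19, 8), "starter"),
    ((14, 5), "bench"),     ((13, 4), "bench"),
    ((8, 2), "replacement"), ((7, 3), "replacement"),
]

# One centroid (average stat line) per tier
centroids = {
    tier: (
        mean(p for (p, _), t in train if t == tier),
        mean(r for (_, r), t in train if t == tier),
    )
    for tier in {t for _, t in train}
}

def classify(player):
    """Assign a prospect to the tier with the nearest centroid."""
    p, r = player
    return min(
        centroids,
        key=lambda t: (p - centroids[t][0]) ** 2 + (r - centroids[t][1]) ** 2,
    )

print(classify((27, 9)))  # All-Star
print(classify((9, 3)))   # replacement
```

The supervised structure — labeled historical examples in, a predicted class for a new prospect out — is the same whether the classifier inside is this toy or a full SVM.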
The other way to find a relationship requires having a number for each player representing their subsequent NBA performance. This number should represent the player’s true and total contribution to his team. Do we have such a number?
Surely, Wins Produced is a natural candidate for this value. It reflects the number of wins that the player contributed to his team that year. It is consistent, stable, and does not, for example, reward inefficient shooting.
Why Wins Produced (WP) instead of Wins Produced per 48 Minutes (WP48)? The biggest reason is that WP48 is by definition noisy for players with a low number of minutes, particularly with variability from year to year.
How many years of production should we look for in a player? Even LeBron James was not exceptionally productive in his rookie year. Blake Griffin did not even play his first year due to injury. Yet neither one should be labeled a bust!
A natural cutoff is three years in the league. By the end of that time, teams have to begin making long-term decisions about whether to keep the player or not. It is also enough time for players to at least begin to develop.
So our measure of a prospect’s future NBA performance will be his average Wins Produced over his first three seasons in the league. For players drafted since 2012, who have not yet had the opportunity to play three seasons, we will average over only the years they could have played so far. For players who missed a year, such as Blake Griffin, let’s conservatively assign zero wins produced for those years, since we cannot know whether the absence was due to injury or to a lack of talent and playing time.
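That target variable can be written as a small function. Only the averaging rule comes from the text; the Wins Produced numbers in the examples are invented.

```python
def draft_target(wins_by_season, window=3):
    """Average Wins Produced over a player's first `window` NBA seasons.

    A season the player missed entirely (None) counts as zero wins; a
    shorter list means the season simply hasn't happened yet, so recent
    draftees are averaged over only the seasons played so far.
    """
    played = wins_by_season[:window]
    padded = [(w if w is not None else 0.0) for w in played]
    return sum(padded) / max(len(played), 1)

# A Griffin-like case: a missed rookie year counted as zero, then production
print(draft_target([None, 10.2, 12.5]))  # (0 + 10.2 + 12.5) / 3

# A recent draftee with only two seasons available so far
print(draft_target([4.0, 6.0]))          # (4 + 6) / 2 = 5.0
```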
What historical college data should we use? There is a wealth of information on NCAA players over the past ten years, including basic and advanced stats, splits depending on location, outcome, and opponent strength, combine measurements, and more. Machine learning techniques do not suffer as much from the usual problem of traditional regressions when there are too many variables, so we can just throw the whole kitchen sink in there.
Once the algorithm runs, we will know its sorted list of prospects for each year. For each team, we can compare their actual draft choices with who they should have picked instead.
And what do we find? The figure below shows the overall results. The horizontal axis shows the production of every team’s actual draft choice. The vertical axis shows the production they could have had instead. Each dot represents the draft order, from the overall number one pick to the last pick. If the model would have tended to help teams draft better, then we should see more dots to the left and above the 45-degree line. If teams are doing better without the model, we should see more dots to the right and below the 45-degree line.
Only one dot is substantially below the 45-degree line: the number one overall pick. In other words, teams did better with the top choice than the model did. But this is primarily due to LeBron James and Dwight Howard being picked there in 2003 and 2004, two players with no college history from which the model could project. The model thus works at a disadvantage: it can only recommend good NCAA players, while teams can pick NCAA players as well as high school and foreign players. Nevertheless, despite this handicap, the model still outperforms substantially.
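A scatter like this can also be summarized numerically by counting points on each side of the 45-degree line. The (actual, model) pairs below are invented placeholders, not the study's results — they only show the bookkeeping.

```python
# Each pick slot contributes a point (actual wins, model wins); points
# above the 45-degree line favor the model.
points = [(5.0, 4.8), (3.1, 4.0), (2.2, 3.5), (1.0, 2.4), (0.5, 0.6)]

model_better = sum(1 for actual, model in points if model > actual)
team_better = sum(1 for actual, model in points if actual > model)
print(model_better, team_better)  # 4 1
```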
Even among the top five picks, the model was almost as good as the actual team choices. Team choices generated an average of 3.97 wins per year over each of the first three years, while the equivalent model choices would have generated only 3.85 per year. How much is that difference of 0.12 wins per year worth? Using the usual calculation that values a marginal win at about $1.65 million, drafting the top five picks according to the model would have cost teams $1.65 million * 0.12 * 3 = about $600,000.
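Spelling out that arithmetic, using only the figures already quoted above:

```python
WIN_VALUE = 1.65e6                   # dollars per marginal win
gap_per_year = 3.97 - 3.85           # wins per year, teams minus model
cost = WIN_VALUE * gap_per_year * 3  # over the first three seasons
print(round(cost))                   # 594000, i.e. about $600,000
```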
So for the first five picks, it is approximately a tie, with the teams doing a little better. But for every other group of picks, the model outperforms substantially. In picks six through fifteen, teams could have had nearly $7 million of extra production per draft pick. And amazingly, the same gap holds for the second half of the first round. The first half of the second round, picks 31 through 45, would have brought $4.5 million more production per pick if teams had chosen according to the model. And in the second half of the second round, picks 46 through 61, the teams’ actual choices and the model are again roughly in line, this time with the model holding a slight lead.
How would each team have done? We can go through all of the picks made by each team over the past decade and compare them to the best model player still available at that slot. Then take the difference in win production over the subsequent three years and convert to dollars. The table below lists these results for each team.
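The per-team accounting just described can be sketched in a few lines. The per-pick win numbers here are invented; the conversion rule — win difference over three seasons, times the marginal value of a win — is the one from the text.

```python
WIN_VALUE = 1.65e6  # dollars per marginal win

# For each of a team's picks: (actual pick's avg WP/yr,
# best-available model pick's avg WP/yr) -- invented numbers
picks = [(2.0, 4.5), (1.0, 1.8), (0.2, 0.2)]

# Win gap per pick, over three seasons, converted to dollars
lost_value = sum((model - actual) * 3 * WIN_VALUE for actual, model in picks)
print(f"${lost_value:,.0f}")  # $16,335,000
```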
Memphis gave up the most value relative to what it could have had. Across its 24 draft picks over the past 11 years, the Grizzlies missed out on more than $200 million worth of value. Golden State, Toronto, and Phoenix also gave up nearly $200 million each in lost profits.
At the other extreme, Chicago drafted better than the model, on average. Compared to the model picks, the Bulls saved $85 million. New Orleans also beat this model, by about $17 million.
It is worth noting, however, that this model is very conservative. It simply recommends the best player available according to the machine learning projection. For example, if you are drafting tenth, and player A is expected to be taken next, and player B may go undrafted, but B appears to be slightly more valuable than A, the model would recommend taking player B. This is the conservative basis for the figures and tables presented here. A more realistic model would take into account the optionality of possibly being able to acquire a later draft pick to also take the second player. Adjusting for this optionality, the Bulls and Pelicans would have lost money too.
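The conservative best-available rule is simple to state in code: at every slot, take the highest-projected player still on the board, with no attempt to game who might fall to a later pick. Player names and projections below are placeholders.

```python
# Projected three-year average Wins Produced for hypothetical prospects
board = {"A": 5.1, "B": 5.3, "C": 2.0, "D": 1.1}

def best_available(board, taken):
    """Return the highest-projected player not yet drafted."""
    remaining = {p: wp for p, wp in board.items() if p not in taken}
    return max(remaining, key=remaining.get)

taken = set()
order = []
for _ in range(3):  # simulate three consecutive picks
    pick = best_available(board, taken)
    order.append(pick)
    taken.add(pick)
print(order)  # ['B', 'A', 'C']
```

A less conservative rule would also weigh each player's likelihood of still being available later, which is the optionality adjustment mentioned above.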
Now, it is clear that drafting according to this model of machine learning was not an option for most teams ten years ago. Some of the techniques had not even been developed yet. So it is not fair to blame a front office for failing to use a tool that was not available.
However, such tools are available now. One caveat: the numbers in the table cannot be added across teams. It is not the case that if every team began drafting according to such a model, they would all add hundreds of millions of dollars of value; if everybody drafts according to the model, there are no additional gains. On the contrary, the gains from using the model are concentrated among only those teams that use it.
For example, if Memphis had been the only team to use this model, it would have received almost a quarter of a billion dollars in extra production over the past decade.
Among many teams, analytics is often viewed as just another input. In fact, if done properly, it seems as if analytics can be a substantial revenue source for a team. Neglecting it could be costing your favorite team millions of dollars in lost production each year.
It is possible that basketball analytics is at a crossroads similar to where Wall Street was a few decades ago. Gut feelings and intuition were highly prized, both culturally and monetarily, until a few visionary firms took a more quantitative and analytical approach, eventually leading to a huge influx of Ph.D.s and quants. And those quants became ever more specialized, with ever-increasing salaries, because in a competitive environment the second-best model is almost as bad as no model at all. At the time, it was as unthinkable on Wall Street as it would be today in sports for the highest-paid employee to be a number-cruncher rather than a former athlete. Today, in finance, it is the norm. What will tomorrow bring for sports?
– Philip Maymin
Philip Maymin is an Assistant Professor of Finance and Risk Engineering at the NYU School of Engineering. He has authored or co-authored several basketball analytics papers on topics such as early foul trouble, free throw shooting trajectories, team chemistry, and acceleration in the NBA. He is a founding co-editor-in-chief of the Journal of Sports Analytics, a new journal which is now soliciting general submissions of practical sports analytics research. And he has been an analytics consultant for several teams.
In finance, Philip has published more than 20 articles in the fields of behavioral finance, algorithmic finance, portfolio management, and risk management, as well as a textbook on derivatives pricing. He is also the founding managing editor of the journal Algorithmic Finance. He has been a portfolio manager at Long-Term Capital Management, Ellington Management Group, and his own hedge fund, Maymin Capital Management. He has also been a policy scholar for a free market think tank, a Justice of the Peace, a Congressional candidate, a columnist, a sportswriter, and an award-winning journalist. He was a finalist for the 2010 Bastiat Prize for Online Journalism. He holds a Ph.D. in Finance from the University of Chicago, a Master’s in Applied Mathematics from Harvard University, and a Bachelor’s in Computer Science from Harvard University. He also holds a J.D. and is an attorney-at-law admitted to practice in California.
His home page is here and the most up-to-date version of his working paper on profiting from machine learning in the NBA draft can be found and downloaded here. It also includes year-by-year draft and comparison tables for each team.