# Modeling Win Probability for a College Basketball Game: A Guest Post From Brian Burke

*Today’s Guest Post is from Brian Burke. Readers may know Brian from Advanced NFL Stats, a site that provides some of the very best statistical analysis of the NFL. With football season over, Brian has turned his attention to college basketball. He already has a web site set up for college basketball [http://wp.advancednflstats.com/bball]. And this post introduces a very interesting approach to the analysis of this sport. Before getting to the post I want to thank Brian for writing this for The Wages of Wins Journal. Again, I think everyone will find this to be quite interesting.*

Although I usually stick to football research, I’ve recently dipped my toe into studying basketball. I’ve built an in-game win probability (WP) model for NCAA basketball. Basically, it takes the score and time remaining from any moment of a game and estimates the chances that each team will win. Although others have developed WP models for basketball before, I’ve gone a step further and created a web site with a live feed that graphs the WP for every game in real-time [http://wp.advancednflstats.com/bball].

In football, WP estimates are very useful. Football is a game of strategy decisions such as ‘kick a field goal or go for the first down’, or ‘punt from your end zone or accept an intentional safety’. WP can tell you which decision is usually best and can identify when coaches are making big mistakes. It can also tell you which plays were truly important in any game. Sure, that incredible acrobatic 20-yard reception will make it on SportsCenter, but the real ‘play of the game’ was that otherwise unremarkable 5-yard straight-ahead run on 3rd and 4 in the 4th quarter that let the winners burn another 3 minutes off the clock.

To be honest, once I started the WP project for NFL games, it just got to be plain fun. When the season ended I looked around for another sport to model and decided on college basketball.

The WP modeling technique I use is sometimes called an ‘empirical matrix.’ I took a set of play-by-play data from recent years of NCAA regular season games [1,782 games from the past 3 years, 360 thousand in-game observations in all] and divided it up by home team lead and by time remaining. For each combination, I simply observed the proportion of times that the home team went on to win the game. Table One presents these observations.

**Table One: Home team winning with lead at different points in a college basketball game**
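As a rough sketch of the empirical-matrix idea (not the code behind the actual model), the tally can be written in a few lines of Python. The function name `empirical_wp_matrix`, the tuple layout, and the sample observations are all assumptions made up for illustration; none of it is real game data.

```python
from collections import defaultdict

def empirical_wp_matrix(observations):
    """Tally home-team win rates by (minutes remaining, home lead).

    observations: iterable of (minutes_remaining, home_lead, home_won)
    tuples, one per in-game observation (hypothetical layout).
    Returns {(minutes_remaining, home_lead): (win_fraction, sample_size)}.
    """
    wins = defaultdict(int)
    counts = defaultdict(int)
    for minutes_remaining, home_lead, home_won in observations:
        cell = (minutes_remaining, home_lead)
        counts[cell] += 1
        if home_won:
            wins[cell] += 1
    return {cell: (wins[cell] / counts[cell], counts[cell]) for cell in counts}

# Made-up observations, purely illustrative.
sample = [
    (10, 6, True), (10, 6, True), (10, 6, False), (10, 6, True),
    (10, -3, False), (10, -3, True),
    (19, 17, True),
]
matrix = empirical_wp_matrix(sample)
for cell in sorted(matrix):
    win_fraction, n = matrix[cell]
    print(cell, round(win_fraction, 2), "n =", n)
```

Returning the sample size next to each proportion makes the thin cells easy to spot, such as a 17-point lead with 19 minutes to play.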

With enough data, that’s almost all you have to do. But because of limited sample size in many of the cells (there may not be many combinations of 17 point leads with 19 minutes to play), the results will be somewhat noisy. To reduce the noise, I used logistic regression. For each minute of time remaining, I ran a regression using the current score difference to predict win probability for the home team. Graph One illustrates an example of the raw, unsmoothed data and the resulting regression estimate:

**Graph One: Home Team Win Probability.**
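To make the smoothing step concrete, here is a minimal sketch of a one-variable logistic regression fitted separately for each minute of time remaining. It is an illustration under stated assumptions, not the model's actual code: the fitting routine (Newton's method written by hand), the names `fit_logistic` and `fit_by_minute`, and the sample data are all invented for the example.

```python
import math
from collections import defaultdict

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_logistic(leads, outcomes, iterations=25):
    """Fit P(home win) = sigmoid(a + b * lead) by Newton's method."""
    a, b = 0.0, 0.0
    for _ in range(iterations):
        g_a = g_b = 0.0          # gradient of the log-likelihood
        h_aa = h_ab = h_bb = 0.0  # information matrix (negative Hessian)
        for x, y in zip(leads, outcomes):
            p = sigmoid(a + b * x)
            w = p * (1.0 - p)
            g_a += y - p
            g_b += (y - p) * x
            h_aa += w
            h_ab += w * x
            h_bb += w * x * x
        det = h_aa * h_bb - h_ab * h_ab
        if det < 1e-12:  # degenerate bucket (e.g. a single lead value)
            break
        # Newton step: solve the 2x2 system H * delta = g.
        a += (h_bb * g_a - h_ab * g_b) / det
        b += (h_aa * g_b - h_ab * g_a) / det
    return a, b

def fit_by_minute(observations):
    """observations: (minutes_remaining, home_lead, home_won) tuples."""
    buckets = defaultdict(lambda: ([], []))
    for minute, lead, won in observations:
        buckets[minute][0].append(lead)
        buckets[minute][1].append(1 if won else 0)
    return {m: fit_logistic(xs, ys) for m, (xs, ys) in buckets.items()}

# Made-up data for a single minute bucket (not real game results).
leads = [-8, -6, -4, -2, -2, 0, 0, 2, 2, 4, 6, 8, -4, 4]
wins = [0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0]
models = fit_by_minute([(10, x, y) for x, y in zip(leads, wins)])
a, b = models[10]
print("intercept:", round(a, 3), "slope per point:", round(b, 3))
print("WP with a 6-point lead:", round(sigmoid(a + b * 6), 3))
```

A fitted curve replaces the noisy cell-by-cell proportions with two numbers per minute, an intercept and a slope, so sparse cells borrow strength from their neighbors. Note that a bucket where the leads perfectly separate wins from losses has no finite fit, so a real implementation would want a library routine or some regularization.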

Here is what a typical game’s WP timeline looks like. This is the recent Villanova-Notre Dame game:

**Graph Two: Villanova vs. Notre Dame**

One thing I’ve already noticed that’s interesting about basketball is that the win probability equation is the same for nearly the entire game. In other words, a 6-point lead for the home team in the first 10 minutes of the game yields the same WP of 0.86 as a 6-point lead with 10 minutes to go in the 2nd half.

This surprised me. I would have expected any given lead to be more decisive as the game went on, gradually becoming more and more insurmountable. In the graph I cited above, the “slope” of the curve would theoretically get steeper and steeper as the game goes on. When I went to make a graph of selected times in the game to show how the curve steepens, I could only see a single curve. I thought I had made some kind of error in Excel, but the curves were just superimposed. Not until the final couple of minutes do the curves become very steep, when ultimately a 1-point lead with zero seconds remaining is as decisive as a 10-point lead.
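The steepening can be pictured with a logistic curve whose slope grows as time runs out. The sketch below is only an illustration: the slope values are hypothetical rather than fitted coefficients, and the intercept is set to zero, which ignores any home-court edge. The 0.3 slope was picked so that a 6-point lead lands near the 0.86 quoted above.

```python
import math

def home_wp(lead, slope, intercept=0.0):
    """Logistic win probability for the home team given its lead."""
    return 1.0 / (1.0 + math.exp(-(intercept + slope * lead)))

# Hypothetical slopes, chosen only to show the shape of the effect.
scenarios = [
    ("most of the game", 0.3),
    ("final minute", 1.5),
    ("final seconds", 10.0),
]
for label, slope in scenarios:
    probs = [round(home_wp(lead, slope), 3) for lead in (1, 3, 6, 10)]
    print(f"{label:>16}: leads 1/3/6/10 -> {probs}")
```

With a very large slope the curve approaches a step function, which is the sense in which a 1-point lead at the buzzer is as decisive as a 10-point lead.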

Although basketball doesn’t have the same strategy elements as football, there are some interesting potential applications of WP in Dr. Naismith’s creation–when to start fouling, when to slow down the game, the value of simply possessing the ball, or how much the ref’s bogus call really swung the game.

I can’t answer all those questions just yet, and I should probably leave that stuff to the basketball experts. But I’ve already learned a lot, particularly from a comparative-sports perspective. Just as learning a foreign language helps you understand your own language more thoroughly, truly understanding a sport means understanding how it differs from other sports. I’d like to improve the model in some ways, particularly with respect to non-continuous considerations in the crucial final minutes. For example, a 4-point lead is more than 33% better than a 3-point lead because the game is essentially out of reach of a single possession. I might also like to include factors such as penalty bonuses or time outs remaining.

I should also note that the model is generic. Even if my team, Navy, were playing at Duke, my model would yield the same WP estimate as for any two other teams. There are ways to factor in team strength, but a generic model is a good baseline for now.

I thought I’d share this with the hard-core basketball stat-heads out there, and I figured this would be a good place. Thanks to Dave for allowing me to post here. And yes…I’ll probably have an NBA version up and running in time for the playoffs.

Brian Burke