The Perils of Data Collection and Data Analysis

More data is better, right? The answer is a bit tricky. Consider Malcolm Gladwell’s account of Cook County Hospital and how it addressed the problem of diagnosing chest pain:

They instructed their doctors to gather less information on their patients: they encouraged them to zero in on just a few critical pieces of information about patients suffering from chest pain–like blood pressure and the ECG–while ignoring everything else, like the patient’s age and weight and medical history. And what happened? Cook County is now one of the best places in the United States at diagnosing chest pain.

Gathering the data is only part of the problem. Finding meaning in the data and knowing which data to use is arguably the harder part. Consider the NBA draft. Teams look at all sorts of things: where the player went to school, their height, how well their team did in the NCAA tournament. However, most of this data doesn’t translate well to the NBA. Stepping back and narrowing in on a few of the players’ college stats is actually much more practical for predicting NBA success.

Sports is a land rife with data. The NBA in particular has seen a data explosion in recent years. However, it’s important to check what the data is actually telling us. The more dangerous problem is being lulled into a false sense of security just by having more data.

What Can the Data Tell You?

Synergy Sports is a basketball stats site that provides play-by-play data analysis with the ability to actually see the NBA footage of each play. It also tries to tackle a popular problem in basketball: defense. Being able to evaluate whether players are good or bad on defense seems like a very important problem. Synergy seems to have given us hope, as it does indeed assign defensive values to players.

When I analyze the data, though, I see a slightly different story. What if the lesson from Synergy’s data isn’t that individual defense can be measured with play-by-play data? What if the lesson is in fact that team defense matters a lot? By Synergy’s own admission, not all defensive plays can be attributed to an individual player. Here’s a quote from Synergy’s FAQ:

What happened to the other play types? – We do not attach an individual defender on offensive rebounds, cuts or transition plays as these are team defense concepts and fault/credit usually cannot be attributed to one person.

Synergy has 11 categories for defensive plays: Isolation, Pick and Roll - Ball Handler, Post-Up, Pick and Roll - Roll Man, Spot-Up, Off Screen, Hand Off, Cut, Offensive Rebound, Transition, and All Other Plays.

This is much like the doctors at Cook County collecting lots of data. What happens when we zoom in a little though? Four of the categories are not attributed to individual players. This matters! I took a sampling of the best, median and worst defenses in the league* and here’s how the data shakes out.

Team                   Percent of Plays Not Assigned to Players
Boston Celtics         32%
New Orleans Hornets    33%
Charlotte Bobcats      36%
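
For anyone who wants to repeat this tally for another team, here is a minimal Python sketch of the calculation. The function name and the dictionary layout (play type mapped to a play count) are assumptions for illustration, not Synergy's export format. The four team-level categories are the three named in the FAQ quote plus "All Other Plays".

```python
# Play types Synergy's FAQ says are not attached to an individual defender,
# plus "All Other Plays", which this post also counts as unassigned.
TEAM_LEVEL_PLAY_TYPES = {"Cut", "Offensive Rebound", "Transition", "All Other Plays"}


def unassigned_share(plays_by_type):
    """Return the fraction of defensive plays not credited to one defender.

    plays_by_type maps a play-type name to a play count (hypothetical layout).
    """
    total = sum(plays_by_type.values())
    if total == 0:
        raise ValueError("no plays to summarize")
    team_level = sum(
        count
        for play_type, count in plays_by_type.items()
        if play_type in TEAM_LEVEL_PLAY_TYPES
    )
    return team_level / total
```

Calling unassigned_share with one team's eleven category counts should give a fraction comparable to the percentages in the table.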

Almost a third of plays on defense (including the ominous “All Other Plays”, which don’t have video playback) can’t be assigned to an individual. There’s one other aspect of this that’s important too. Synergy assigns a Points Per Play (PPP) value to each type of defensive play. In essence, how much does the opponent score in each scenario? A higher PPP on defense is worse. Here’s a breakdown, again using our best, middle and worst defenses.

Team                   PPP (Individual Plays)   PPP (Team Plays)
Boston Celtics         0.79                     0.98
New Orleans Hornets    0.85                     1.02
Charlotte Bobcats      0.89                     1.06

We can see that the plays assigned at the team level are the ones that hurt the most on defense: for all three teams, opponents score noticeably more per play on team plays than on individual plays. It’s possible that the silver bullet for defense is not finding out which individual players are good at defense. Rather, finding out how to improve team defense may be the best strategy. Ty Willihnganz has suggested that the coach and defensive scheme will dictate a team’s defensive success or failure.
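
To get a rough sense of how much the team plays weigh, the two tables can be combined. The sketch below is a back-of-the-envelope calculation in Python, not something Synergy reports: it assumes the blended PPP is just the play-share-weighted average of the two PPP columns, and it uses the rounded percentages from the first table, so the results are approximate. The function and variable names are made up for illustration.

```python
# (share of plays not assigned to a player, PPP on individual plays, PPP on team plays)
# Values are copied from the two tables in this post; the shares are rounded.
TEAMS = {
    "Boston Celtics": (0.32, 0.79, 0.98),
    "New Orleans Hornets": (0.33, 0.85, 1.02),
    "Charlotte Bobcats": (0.36, 0.89, 1.06),
}


def team_play_contribution(team_share, ppp_individual, ppp_team):
    """Return (blended PPP, fraction of points allowed that came on team plays)."""
    individual_points = (1 - team_share) * ppp_individual  # points per play from individual plays
    team_points = team_share * ppp_team                    # points per play from team plays
    blended = individual_points + team_points
    return blended, team_points / blended


for name, row in TEAMS.items():
    blended, team_fraction = team_play_contribution(*row)
    print(f"{name}: blended PPP {blended:.2f}, {team_fraction:.0%} of points on team plays")
```

Under these assumptions, team plays make up about a third of the plays but roughly 37 to 40 percent of the points allowed.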

An Abundance of Data With a Side of Overconfidence

Gladwell recounts another good tale in Blink, this one from a study by Stuart Oskamp. The gist is that a group of psychologists were given data on a patient and asked to make judgments. The kicker was that they also had to say how confident they were. They were then given more data and asked to repeat the same judgments and confidence ratings. As they got more data, their judgments got a little better but their confidence skyrocketed! As a fan noticing the recent explosion of basketball data, this has me a little concerned.

Synergy provides the ability to zoom in to the play-by-play level. The teams I showed above each had between seven thousand and nine thousand defensive plays for the 2011-2012 season. As I mentioned, there is an issue in how many of these plays can actually be assigned to a player. Additionally, in digging through the data I noticed a few cases where the row and column totals didn’t quite add up to the breakdowns. (I have e-mailed Synergy for clarification on this.) The point is that, by its own admission, the Synergy data isn’t perfect. A few passes over the data confirm this.
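
A quick sanity check for totals that don't match their breakdowns might look like the Python sketch below. The function name, the tolerance parameter, and the dictionary layout are all hypothetical; this is one simple way to flag a mismatch, not a description of how Synergy's tables are built.

```python
def total_mismatch(breakdown, reported_total, tolerance=0):
    """Return reported_total minus the sum of the breakdown.

    Returns 0 when the difference is within the allowed tolerance,
    so any nonzero result marks a row or column worth a second look.
    """
    difference = reported_total - sum(breakdown.values())
    return 0 if abs(difference) <= tolerance else difference
```

A nonzero result doesn't say which number is wrong, only that the total and its parts disagree.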

However, I’ve already seen many people using Synergy data to show why players are good or bad at defense. Synergy also has the benefit of letting any subscriber view all of the plays on video. At hundreds of plays per player and thousands of plays per team, it’s unlikely anyone has the time to actually watch a significant portion. Rather, having access to more data may very well provide the kind of confidence that in fact reduces the desire to do more in-depth analysis!

Summing Up

Data is powerful, make no mistake. As a basketball stats fan, you’re living at the perfect time. A decade ago, sites like Synergy Sports, Basketball-Reference and Hoopdata didn’t exist. However, collecting data is just one step in the process. Data collection used to be a daunting task. Now, the amount of data collected is massive. The real challenge now is data analysis. It may seem that I’m bashing Synergy. I want to be clear, though: I’m a huge fan! Synergy is an amazing tool and easily worth the cost. (They’re having a sale through the end of this week, check it out!)

However, whenever we get data, it’s important to ask “What can it tell us?” and just as important to verify the answer. This post shouldn’t be seen as being against sites like Synergy. Rather, I want to make the distinction between data collection and data analysis clear. That way, when you use sites like Synergy, you can focus on what the data tells you and not just on the fact that there is a large amount of it.


*Synergy has a great interface for searching data, but not the best for downloading it. As a result I had to manually enter the teams, which is why I limited my sample.
