yoco :: College Basketball
(a sports weblog) news and commentary on men's college basketball and the ncaa tournament

yoco :: College Basketball has a new home at http://www.yocohoops.com.

Tuesday, December 14, 2004

even more stats

First, my favorite stat of the week: this blog has finished an impressive 4th in the voting for best sports blog on the net; it's the only blog dedicated to basketball that made the list. A nice holiday present for Yoni when he returns.

On to the promised theory of individual PPP (points per possession). I'm going to throw some technical specifics in parentheses in case someone with a statistics or econometrics background is reading, but it should all make sense even if you skim or skip the math.

The Goal: Last time I talked about how any composite stat that uses the standard inputs -- points, assists, rebounds, turnovers, etc. -- has the same inherent biases as those inputs. So my basic idea to determine a player's contributions is to directly observe the PPP a team records on offense and defense when that player is on the floor, then use a statistical technique to separate his contributions from those of his teammates, relying on the fact that he plays with a variety of lineups. This is a little like a generalized plus/minus statistic, which averages out teammates' effects. One nifty feature is that we'll measure offense and defense on the same scale, which means that if this works, we could compare (or just subtract one from the other) offensive PPP and defensive PPP by player and judge both overall contribution and whether a player is relatively more valuable on O or D.

Let's start with the key assumption: every player has a unique contribution to his team's PPP on offense and defense -- let's call these stats OPPP and DPPP, with a high OPPP being good, and a high DPPP being bad. That is, regardless of who else is on the floor, a player has some true contribution to his team's chance of scoring or preventing a score. In order to calculate the expected OPPP or DPPP at any given time, we then simply add the individual stats for the 5 players in the game. So, making these numbers up, if J.J. Redick has an OPPP of +0.3 and a DPPP of +0.2, and Daniel Ewing has an OPPP of +0.2 and a DPPP of +0.1, that would mean substituting Ewing for Redick would decrease the expected points scored on each Duke possession by a tenth of a point, but would also decrease the expected points scored on each opposing possession by a tenth of a point. Later we can relax the assumption that individual PPP's are independent, which will be equivalent to testing if certain players make each other better or play well together.
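To make the additive assumption concrete, here is a small Python sketch of the Redick/Ewing substitution. The Redick and Ewing numbers are the made-up ones from the paragraph above; the four teammates (T1 to T4) and their ratings are purely hypothetical placeholders so that each lineup has five players.

```python
# Made-up ratings: Redick and Ewing come from the example in the text;
# T1..T4 and their values are hypothetical placeholders.
oppp = {"Redick": 0.3, "Ewing": 0.2, "T1": 0.25, "T2": 0.2, "T3": 0.15, "T4": 0.2}
dppp = {"Redick": 0.2, "Ewing": 0.1, "T1": 0.2, "T2": 0.25, "T3": 0.2, "T4": 0.15}

def lineup_ppp(ratings, lineup):
    """Expected PPP for five players under the additive assumption."""
    assert len(lineup) == 5
    return sum(ratings[p] for p in lineup)

with_redick = ["Redick", "T1", "T2", "T3", "T4"]
with_ewing = ["Ewing", "T1", "T2", "T3", "T4"]

# The teammates cancel out, so only the swapped players' ratings matter.
off_change = lineup_ppp(oppp, with_ewing) - lineup_ppp(oppp, with_redick)
def_change = lineup_ppp(dppp, with_ewing) - lineup_ppp(dppp, with_redick)
print(f"offense: {off_change:+.2f} PPP, opponents: {def_change:+.2f} PPP")
```

Because the model is a plain sum, the change from any substitution depends only on the two players swapped, not on who else is on the floor. That is exactly the assumption an interaction term would later relax.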

Estimation: So how do we estimate OPPP and DPPP? Here's where we use the key advantage basketball has over baseball in calculating stats: basketball has frequent in-game substitution, with teams employing perhaps a dozen unique five-player combinations over the course of a game, and many more over the course of a season. This means we can observe how a player performs with a variety of teammates, and estimate his individual contribution based on the differences in PPP between those different lineups. It would be silly (because of the more individual nature of the game) and impossible (because of the relatively few in-game substitutions) to do something like this for baseball; it would entail observing how many runs per inning a team scored or allowed when a player was in the game, then indirectly estimating his contribution. But basketball's more frequent substitutions make this approach feasible, at least in theory.

(Math: An 8-man rotation allows for 56 possible five-player combinations, a 9-man for 126, and a 10-man for 252. Obviously, most possible combinations will never be employed, like playing two centers and three forwards. But if even a quarter of the combinations play together, that suggests roughly somewhere between 15 and 60 unique lineups over the course of a season, depending on the coach's strategy. To statistically identify contributions from some number of players, a rule of thumb is that we'll need data on about twice as many unique lineups as players to get good estimates. Also, we're in trouble if some pair of players is always on the floor at the same time, because then we have no way to tell their two contributions apart.)
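The counting above is easy to check. This is a quick sketch using Python's standard `math.comb`; the function name `lineup_budget` and the default one-quarter share are just illustrative choices taken from the paragraph, not anything established.

```python
from math import comb

def lineup_budget(rotation_size, share_used=0.25):
    """Count possible lineups and compare to the two-per-player rule of thumb."""
    possible = comb(rotation_size, 5)   # distinct five-player lineups
    used = possible * share_used        # lineups that actually see the floor
    needed = 2 * rotation_size          # rule of thumb: two lineups per player
    return possible, used, needed

for n in (8, 9, 10):
    possible, used, needed = lineup_budget(n)
    print(f"{n}-man rotation: {possible} possible, {used:.1f} used, {needed} needed")
```

Under these assumptions an 8-man rotation comes in slightly under the rule of thumb (14 lineups used against 16 needed), while 9- and 10-man rotations clear it comfortably.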

Data Needed and Methods: The data we need are, for each 5-man lineup employed by a team: 1) the number of minutes that lineup plays together, 2) OPPP for that lineup, and 3) DPPP for that lineup. And that's it. Using a regression, we can calculate the "best-fit" individual PPP statistics on offense and defense, i.e., the individual PPP's that best explain all of the different lineup PPP's. We would use a method that emphasizes getting the right number for lineups that play together more, and thus for players who play more minutes -- we care more about getting a precise estimate for Redick than for Lee Melchionni.

(Math: What I'm talking about here is a weighted least squares regression. The equation would have lineup OPPP or DPPP as the dependent variable and an indicator variable for each player as the explanatory variables, with no intercept term for now. Including an intercept would give us a sort of PPP above replacement player, though we would then need to drop one player's indicator (or impose some other constraint), since the indicators in every lineup sum to exactly 5 and would otherwise be perfectly collinear with the intercept. For any lineup, five player-indicators will be 1, and the rest will be 0. The coefficients on these indicators are our estimates of individual PPP. In fitting the data, we weight each lineup's squared error by its minutes played, so that more importance is placed on matching observed PPP for the starting lineup than for the garbage-time team.)
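For anyone who wants to see the regression spelled out, here is a minimal sketch in Python with NumPy. Everything in it is an assumption for illustration: the seven players A through G, their "true" ratings, and the minutes are all made up, and the lineup PPP values are generated without noise so we can check that the fit recovers the ratings we started from.

```python
import itertools
import numpy as np

def fit_player_ppp(lineups, lineup_ppp, minutes, players):
    """Weighted least squares: one 0/1 indicator per player, no intercept."""
    index = {p: j for j, p in enumerate(players)}
    X = np.zeros((len(lineups), len(players)))
    for i, lineup in enumerate(lineups):
        for p in lineup:
            X[i, index[p]] = 1.0
    y = np.asarray(lineup_ppp, dtype=float)
    # Minimizing sum(w * residual**2) is the same as ordinary least squares
    # after scaling each row of X and y by sqrt(w).
    sw = np.sqrt(np.asarray(minutes, dtype=float))
    coef, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
    return dict(zip(players, coef))

# Synthetic check: 7 hypothetical players, every 5-man combination used once.
players = list("ABCDEFG")
true_oppp = dict(zip(players, [0.30, 0.20, 0.25, 0.15, 0.20, 0.10, 0.22]))
lineups = list(itertools.combinations(players, 5))
observed = [sum(true_oppp[p] for p in lu) for lu in lineups]  # noise-free
minutes = np.linspace(5, 100, len(lineups))                   # made-up weights

estimates = fit_player_ppp(lineups, observed, minutes, players)
for p in players:
    print(p, round(estimates[p], 3), true_oppp[p])
```

With real data the lineup PPP values would be noisy, so the estimates would only approximate each player's contribution, and the minutes weighting would start to matter. The same function works for DPPP by passing defensive lineup PPP instead.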

Extensions: Above I mention the possibility of testing whether players' PPP's are truly independent of one another. We can pretty easily test this for a pair of players (Math: This fits right into our setup: just include an interaction term -- the product of the indicators -- for a pair of players, and test if its coefficient is significantly different from 0). We couldn't test every pair of players, because that's too many variables to identify, but we could test specific pairs, e.g., Joey and Steven Graham, if we suspected they had an effect on each other. Also, we could make some effort to adjust for the strength of the opponents on the floor, for example by controlling for the opposing team's average PPP in our data.
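Here is a rough sketch of the interaction test in Python with NumPy. The data are synthetic: seven hypothetical players, a made-up synergy of 0.05 PPP when A and B share the floor, and made-up noise and minutes. The standard errors use the textbook weighted least squares formula, which assumes each lineup's error variance is inversely proportional to its weight; that is an assumption on my part, not something established above.

```python
import itertools
import numpy as np

def fit_with_interaction(lineups, y, weights, players, pair):
    """WLS with player indicators plus one interaction column for `pair`.

    Returns (coefficients, standard_errors); the last entry of each
    belongs to the interaction term.
    """
    index = {p: j for j, p in enumerate(players)}
    X = np.zeros((len(lineups), len(players) + 1))
    for i, lineup in enumerate(lineups):
        for p in lineup:
            X[i, index[p]] = 1.0
        # Product of the two indicators: 1 only when both are on the floor.
        X[i, -1] = float(pair[0] in lineup and pair[1] in lineup)
    sw = np.sqrt(np.asarray(weights, dtype=float))
    Xw, yw = X * sw[:, None], np.asarray(y, dtype=float) * sw
    coef, *_ = np.linalg.lstsq(Xw, yw, rcond=None)
    resid = yw - Xw @ coef
    dof = X.shape[0] - X.shape[1]               # lineups minus parameters
    sigma2 = resid @ resid / dof
    cov = sigma2 * np.linalg.inv(Xw.T @ Xw)
    return coef, np.sqrt(np.diag(cov))

# Synthetic data: all values below are invented for illustration.
rng = np.random.default_rng(0)
players = list("ABCDEFG")
ratings = dict(zip(players, [0.30, 0.20, 0.25, 0.15, 0.20, 0.10, 0.22]))
lineups = list(itertools.combinations(players, 5))
synergy = 0.05
clean = np.array([sum(ratings[p] for p in lu) + synergy * ("A" in lu and "B" in lu)
                  for lu in lineups])
minutes = np.linspace(5, 100, len(lineups))
noisy = clean + rng.normal(0.0, 0.02, size=len(lineups))

coef, se = fit_with_interaction(lineups, noisy, minutes, players, ("A", "B"))
print("interaction estimate:", coef[-1], "t-statistic:", coef[-1] / se[-1])
```

A t-statistic well away from zero would suggest the pair plays better (or worse) together than their individual numbers predict. Each extra pair costs one more parameter, which is why only a few hand-picked pairs can be tested with a season's worth of lineups.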

Wrapup: This idea requires a lot of observed data to get good estimates, and is definitely the sort of thing that would be applied to an entire season, rather than a single game. The more data, the more precise the estimates we get. Calculating career PPP would be especially effective, as over the course of a multi-year career a player would play with hundreds of lineups. But all the play-by-play data we'd need from each game to calculate lineup and individual PPP is:
1) Substitution: who replaces whom, and when
2) Points: for and against, and when they are scored
3) Possession: defensive rebounds, made field goals (or terminal free throws), and turnovers to determine the number of possessions for each team, and when they occur
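The three ingredients above can be turned into per-lineup totals with a single pass over the play-by-play. The sketch below uses a hypothetical event format that I made up for illustration (it does not correspond to any real data feed), with times in seconds from the start of the game and player names A through F as placeholders.

```python
from collections import defaultdict

def lineup_stats(starters, events):
    """Accumulate per-lineup totals from a time-ordered event list.

    Hypothetical event schema (not from any real feed):
      (t, "sub", player_out, player_in)
      (t, "points", side, n)     side is "us" or "them"
      (t, "poss_end", side)      a possession by `side` just ended
      (t, "end")                 end of the period or game
    """
    stats = defaultdict(lambda: {"sec": 0, "pts_for": 0, "pts_against": 0,
                                 "poss_for": 0, "poss_against": 0})
    lineup, last_t = frozenset(starters), 0
    for event in events:
        t, kind = event[0], event[1]
        stats[lineup]["sec"] += t - last_t   # credit elapsed time to current five
        last_t = t
        if kind == "sub":
            lineup = (lineup - {event[2]}) | {event[3]}
        elif kind == "points":
            stats[lineup]["pts_for" if event[2] == "us" else "pts_against"] += event[3]
        elif kind == "poss_end":
            stats[lineup]["poss_for" if event[2] == "us" else "poss_against"] += 1
    return stats

def ppp(points, possessions):
    """Points per possession, or None if the lineup had no possessions."""
    return points / possessions if possessions else None

events = [
    (20, "points", "us", 2), (20, "poss_end", "us"),
    (45, "poss_end", "them"),                  # e.g. a defensive rebound
    (60, "points", "us", 3), (60, "poss_end", "us"),
    (70, "sub", "E", "F"),
    (90, "points", "them", 2), (90, "poss_end", "them"),
    (110, "poss_end", "us"),                   # e.g. a turnover
    (120, "end"),
]
stats = lineup_stats(["A", "B", "C", "D", "E"], events)
for lineup, s in stats.items():
    print(sorted(lineup), s["sec"],
          ppp(s["pts_for"], s["poss_for"]),
          ppp(s["pts_against"], s["poss_against"]))
```

The seconds (or minutes) per lineup become the regression weights, and the two PPP values per lineup become the dependent variables for the offensive and defensive fits.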


This is just an idea right now, and I'd be curious to hear whether it appeals to you as a different way to estimate individual player value, what problems or benefits you see that I missed, or whether it makes sense at all.

EDIT: A friend with whom I'd been discussing this topic just found this link, which implements something very similar on NBA stats. On the one hand, it's cool to know that something similar to what I came up with independently is doable and gives meaningful results; on the other hand, it's a little sad to find out that it's not original (the setup is actually somewhat more advanced, which makes this feel a little bit like thinking you invented the wheel, only to see someone drive past you in a car), though I don't think it's been done on college stats, nor is there a test for whether particular players play well together. The funny thing is I actually corresponded briefly with the author when I was thinking of writing my senior thesis on the game theory of the NBA luxury tax, but I somehow hadn't seen this study before.