Sunday, July 24, 2011
My issue with regression equations
Patriot captures it right here:
Building your metric around a run estimator does not necessarily restrict you to simply plugging in the numbers in the appropriate place. Suppose you wanted to construct a metric based on batted ball types, strikeouts, and walks. One way to go about it would be to simply go through and estimate singles, doubles, triples, homers, and outs in play based on the percentage of each batted ball type that wind up as each. So, you would end up with equations that might look something like this:
Singles = .057FB + .217GB + .516LD + .017PU
However, if you believe that you have gleaned some other insights into the relationship between events that could improve your metric (such as strikeout pitchers having lower HR/FB rates), you could still build that into your formula for estimated home runs, and plug those into the run estimator. It’s more difficult than running a regression, and a more delicate balancing act (at least in terms of developing the formula), but it allows you to stay grounded in a model that estimates runs by taking a first step of, well, estimating runs.
He’s saying this (or if he’s not saying it, then that’s how I am reading it, and, in any case, it’s how I think it):
1. You start with a working model of how runs are created. This is the beauty of something like BaseRuns, because it works so darn well… GIVEN its inputs. If you know the number of hits, HR, walks, outs, then we have a fantastically great estimate as to how many runs are expected to be scored.
2. If you don’t know the inputs, estimate the inputs… but don’t change the actual run scoring model. So, again, if you happen to not have the number of doubles, but can estimate the number of doubles that this pitcher either gave up, deserved to give up, or was expected to give up, and it’s based on his batted ball distribution profile, and/or the number of HR he gave up, and/or his SO/BB ratio, then estimate the doubles in that manner… but do NOT touch the run scoring model.
Once you have the estimates of all your inputs, then you can plug them into an established working model.
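To make the two steps concrete, here is a minimal Python sketch. The singles equation is the one Patriot gives above. The BaseRuns coefficients are one commonly cited simple version and are my assumption here, not something specified in this post. The function names, the Events container, and every number in the usage lines are hypothetical and made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Events:
    """Estimated (or actual) event counts fed to the run model."""
    singles: float
    doubles: float
    triples: float
    home_runs: float
    walks: float
    outs: float  # batting outs, i.e. AB - H


def estimate_singles(fb, gb, ld, pu):
    """Step 2: estimate an input from batted-ball counts.

    Coefficients are from the equation quoted above.
    """
    return 0.057 * fb + 0.217 * gb + 0.516 * ld + 0.017 * pu


def base_runs(ev):
    """Step 1: the run model, left untouched no matter how inputs were obtained.

    Assumed form: runs = A * B / (B + C) + D, using one commonly cited
    simple set of coefficients for B (an assumption, not from the post).
    """
    hits = ev.singles + ev.doubles + ev.triples + ev.home_runs
    total_bases = (ev.singles + 2 * ev.doubles
                   + 3 * ev.triples + 4 * ev.home_runs)
    a = hits + ev.walks - ev.home_runs          # baserunners
    b = (1.4 * total_bases - 0.6 * hits
         - 3 * ev.home_runs + 0.1 * ev.walks) * 1.02  # advancement
    c = ev.outs                                 # outs
    d = ev.home_runs                            # automatic runs
    return a * b / (b + c) + d


# Made-up batted-ball counts; only singles are estimated here. The other
# inputs are placeholders standing in for their own estimating equations.
est = Events(
    singles=estimate_singles(fb=100, gb=200, ld=80, pu=20),
    doubles=30, triples=3, home_runs=20, walks=50, outs=420,
)
print(round(base_runs(est), 1))
```

The point of the structure is that any new insight changes only the estimating functions; base_runs stays the same.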
Even something like FIP is basically a regression equation, because it doesn’t adhere to an actual run scoring model. Of course, there is a tradeoff between complexity and realism. A linear equation is used at the expense of a real baseball run scoring model because it’s easier to compute or understand. But, if you’ve got a complex linear equation, or even a complex multiplicative equation, or some other form of equation, then you’ve got the worst of both worlds: neither easy to understand nor faithful to how runs actually score.
This is why I like FIP or wOBA: they are such simple metrics that their strengths and limitations are readily apparent.
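For reference, FIP in its common linear form fits in one line, which is why its limits are easy to see. This is a sketch: the default constant of 3.10 is a placeholder assumption (the real constant is set per league and season), and some versions add HBP to the walks term.

```python
def fip(hr, bb, so, ip, constant=3.10):
    """FIP in its common linear form.

    ip must be true decimal innings (200 1/3 innings is 200.333, not 200.1).
    The default constant is an assumed placeholder, not a league value.
    """
    if ip <= 0:
        raise ValueError("ip must be positive")
    return (13 * hr + 3 * bb - 2 * so) / ip + constant


# Made-up season line for illustration.
print(fip(hr=20, bb=50, so=200, ip=200))
```

Every event carries a fixed weight regardless of context, so there is no interaction between events. That is the limitation, and it is visible at a glance.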
So, ANY pitcher metric that is not grounded in BaseRuns is immediately set up for a limitation. The bigger your limitation, the simpler your metric must be.
SIERA is a good example of a metric that is too complex for its own good. The insights and benefits of SIERA are hidden inside its complexity. But, if Matt were to follow Patriot’s lead here, and compute estimates for events (1B, 2B, 3B, HR, BB, SO) based on his findings about how things interact, then we would have a very helpful metric.
So, that’s my recommendation as to how you can really advance the cause: keep the logic of baseball intact if you insist on complexity.

