THE BOOK--Playing The Percentages In Baseball

Monday, November 23, 2009

Testing the run modelers - why?

By Tangotiger, 12:28 PM

Eric Walker, one of the original sabermetricians working in MLB, takes a look at various run modelers.  He finds, unsurprisingly, that XR leads the way.  But you don’t best-fit a model to a sample and then test it against that same sample.

John Jarvis, for example, did the exact same thing several years ago.  Jarvis, however, tested one extra linear equation: simply a linear regression equation, with no logical underpinnings to how runs are created.  Whatever it spits out, it spits out.  And if you look at Table 2, you see the result: a double is worth just 0.15 runs more than a single.  A SB has almost no run value.  It’s ridiculous and it’s stupid (and I’m being kind).  It bears no resemblance to how runs are created.  We know exactly how runs are created.  And the difference between a double and single is around 0.30 runs.

So, what does XR do?  It has a 0.22 run gap.  Why does it do that?  I presume because it split the difference between the best-fit (0.15) and the true (0.30) in order to get at the “accuracy” and something reasonable.

Finally, the minimum requirement to test these things, at the very least, is for all of them to use the same parameters.  You can’t have one include reaching on error, while the other doesn’t.

And which one wins?  Well, Eric’s own TOP, which, probably, has been best-fitted to the data.  Now, it’s possible that Eric’s method does have logical underpinnings like David’s BaseRuns.  And I’m sure someone out there like Patriot or Colin will be tempted to see what comes of Eric’s method using the plus1 method, or other ways.

***

Finally, he quoted someone as saying:

Wikipedia cites Tom Tango as stating that BaseRuns models the reality of the run-scoring process significantly better than any other run estimator. (We shall see.)

Except he didn’t check that.  Just because you get a better best-fit does NOT mean you have a better estimator.  Correlation is not causation.  (God I hate regression so much, when it’s used as the end-product.) Indeed, he does nothing to refute the claim.

Also, he says:

R = A*B/(B+C) + D

Jolly good luck deciphering that without extrinsic information. On further examination of the associated text, it turned out that what was meant was—

R = (A x [B / { B + C } ]) + D
...
(In principle, there is an implied order of precedence for arithmetic operations such that parentheses are often not needed, but not only do few people know it—I’d have to look it up—but there is never any guarantee that the writer of a given equation knows it either, or even knows that it exists.)

Well, I learnt it in Grade 7 I believe.  Multiplication and division precede addition and subtraction.  If Eric wants to admit that he doesn’t know the precedence rules, that’s fine.  But to presuppose that the general population doesn’t know ("few people know it") is silly and is made only to support his argument.  That the BaseRuns equation HAS parentheses automatically makes that the first part to evaluate.  And then, REGARDLESS of the multiplication/addition precedence, whether you follow it left-to-right or follow the known precedence rules, you still get the same result.  Indeed, only if you believed that the addition takes precedence would you be wrong.
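The precedence point can be checked mechanically.  Here is a minimal Python sketch, using made-up placeholder values for A, B, C, and D (they are not BaseRuns components from any real team).  The conventional reading, the fully delimited version, and a naive left-to-right pass all agree; only an addition-first reading (one possible interpretation, assumed here) gives a different number.

```python
# Hypothetical placeholder values, chosen only to exercise the arithmetic.
A, B, C, D = 12.0, 5.0, 3.0, 2.0

# The formula as originally written, relying on standard precedence.
standard = A * B / (B + C) + D

# The fully delimited version from the quoted article.
delimited = (A * (B / (B + C))) + D

# A naive left-to-right reading, after resolving the parentheses first.
denominator = B + C
left_to_right = A * B
left_to_right = left_to_right / denominator
left_to_right = left_to_right + D

# The one reading that differs: letting the trailing addition bind first.
# (This is an assumed interpretation of "addition takes precedence".)
addition_first = A * B / ((B + C) + D)

print(standard, delimited, left_to_right, addition_first)
```

The first three values are identical, and the fourth is not.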

I believe his whole text regarding that was unnecessary, and simply was a pet peeve of his.  And my pet peeve is describing your own pet peeve as something that is larger than what it actually is.


#1    Patriot      (see all posts) 2009/11/23 (Mon) @ 13:12

I looked at Walker’s formula before, briefly, when it was posted here, and have linked the thread.  Here’s what I said about it at the time:

Thanks for pointing that out Victor; I have visited the HBH site several times over the last decade (ugh, literally) and never seen that page where they actually give the formula before.

Taking a cursory look at it, it is a RC-type model: A*B/C.  The reason it looks confusing is that the A and B components each have a regression equation attached, but they seem to be:

A = .917(H + W + HB + E - CS) + 4.703
B = .922(TB + .4(W + HB) + .7SH + .8SB) + .007
C = PA

Regardless of the RMSE, it does not appear to be unique conceptually, although I assume he developed it independently (and perhaps prior to) James publishing RC, which is impressive (although not relevant to its usefulness today)

This was based on the version published on his site, which appears to have different coefficients than what is published in his article.  Nonetheless, it is clearly an A*B/C model and will be subject to the same breakdowns as RC (it may well do a better job of avoiding those issues over wider range, but they will inevitably pop up at some point).

I really don’t understand the pet peeve about parentheses and brackets either.  R = (A x [B / { B + C } ]) + D is the form that’s going to confuse me at first glance.  Frankly, I’m shocked that someone as bright as Eric who has worked with math would have to look up order of operations.


#2    rluzinski      (see all posts) 2009/11/23 (Mon) @ 13:15

Mrs. Glosser’s “Math Goodies” is the first google hit for “order of operation”:

http://www.mathgoodies.com/LESSONS/VOL7/order_operations.html

It’s fair to assume that the majority of people reading about run estimators have taken pre-algebra.


#3          (see all posts) 2009/11/23 (Mon) @ 13:34

Behold.  PEMDAS


#4    Rob      (see all posts) 2009/11/23 (Mon) @ 13:38

Who doesn’t know the saying, “Please Excuse My Dear Aunt Sally”?


#5    Patriot      (see all posts) 2009/11/23 (Mon) @ 13:46

I did a +1 analysis on Walker’s “TOP, no error data” formula with the 2009 AL.  I ignored interference.  Of course, I can’t be sure if I’m applying Walker’s formula correctly, but I get a total estimate of 11073 runs for the league, which seems reasonable as the actual total was 10938.  Here are the +1 (actually partial differentiation) values I got:

.525 S, .673 D, 1.186 T, 1.484 HR, .337 W+HB, .034 SH, -.107 AB-H+SF, .246 SB, -.370 CS
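For readers who have not seen a +1 analysis, here is a rough Python sketch of the idea: compute the estimate for a stat line, add one event, and take the difference.  It uses the older site-version coefficients quoted in comment #1, not the "TOP, no error data" formula from the article, so the numbers it prints should not be expected to match the values above.  The team line is invented for illustration, and PA = AB + W + HB + SH + SF is an assumption.

```python
def top_site_version(s):
    """A*B/C estimate using the coefficients quoted in comment #1."""
    hits = s["1B"] + s["2B"] + s["3B"] + s["HR"]
    tb = s["1B"] + 2 * s["2B"] + 3 * s["3B"] + 4 * s["HR"]
    # Assumption: PA is defined as AB + W + HB + SH + SF.
    pa = s["AB"] + s["W"] + s["HB"] + s["SH"] + s["SF"]
    a = 0.917 * (hits + s["W"] + s["HB"] + s["E"] - s["CS"]) + 4.703
    b = 0.922 * (tb + 0.4 * (s["W"] + s["HB"]) + 0.7 * s["SH"] + 0.8 * s["SB"]) + 0.007
    return a * b / pa


def plus_one(estimator, stats, changes):
    """Change in estimated runs when the given events are added to the line."""
    bumped = dict(stats)
    for key, delta in changes.items():
        bumped[key] += delta
    return estimator(bumped) - estimator(stats)


# Hypothetical team-season line (not real data).
team = {"AB": 5600, "1B": 1000, "2B": 300, "3B": 30, "HR": 170, "W": 550,
        "HB": 50, "E": 60, "SB": 100, "CS": 40, "SH": 40, "SF": 45}

# Each hit also adds an at-bat; a plain out adds only the at-bat.
values = {
    "S": plus_one(top_site_version, team, {"1B": 1, "AB": 1}),
    "D": plus_one(top_site_version, team, {"2B": 1, "AB": 1}),
    "T": plus_one(top_site_version, team, {"3B": 1, "AB": 1}),
    "HR": plus_one(top_site_version, team, {"HR": 1, "AB": 1}),
    "out": plus_one(top_site_version, team, {"AB": 1}),
}

for event, value in values.items():
    print(f"{event}: {value:+.3f}")
```

The same `plus_one` helper works with any estimator that takes a stat line, which is what makes the method a convenient way to compare the implied event values of different formulas.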


#6    Patriot      (see all posts) 2009/11/23 (Mon) @ 13:56

Another point here: Walker’s own formulas use actual error data, which most of the others do not consider explicitly.  There’s nothing wrong with that, of course, except that when you do an accuracy test, some formulas are working with additional inputs.

Also, the BsR equation used in the test does not include SH or SF.  Not a big deal, but worth noting.


#7    Tangotiger      (see all posts) 2009/11/23 (Mon) @ 14:24

”.525 S, .673 D,”

Well, there you go.  Eric’s equation shows a gap in singles and doubles of 0.15 runs (.148 actually).

No wonder his is so accurate.

What Eric did here is exactly the kind of sh!t I’ve been railing against for several years: making your point through mathematical gymnastics.

Indeed, we have no idea why Eric’s equation came out as well as it did until Patriot did the simplest of tests.

This is all I argue for when someone presents seemingly “proof-positive” kind of results.  This is true of Bill James’ latest version of RC (which Patriot shows overweights the singles), or Eric’s equation here, or the Hakes/Sauer model.  In all cases, we are talking about creating an equation to fit the sample, and then using that equation as if correlation=causation.  It is irrelevant, completely irrelevant, if I find a dataset where the run value of the triple exceeds that of the HR, regardless of its standard error.  It is meaningless because the logical underpinnings have to hold that the run value of the HR must exceed that of the triple in all instances.

What happens if these parameters are ALSO standing in for other undeclared parameters?  I don’t know why the double comes out so low, but clearly the double shows some sort of team bias in the aggregating of the data.  Because you do NOT see this phenomenon at the game or inning level.  It is only when you aggregate the data at the team-seasonal level that the bias of the double rears its ugly head.

And in NO WAY will this bias carry down to the game, inning, or player level.  And yet, all we really care about with these equations are at the game, inning, and player level.

Talk about, yet again, extrapolating an equation beyond its data set.

Instead of running that kind of test to highlight the problems, these guys skip it and declare that their equations somehow give some hidden insight.


#8    weskelton      (see all posts) 2009/11/23 (Mon) @ 15:04

I know James has continued to tweak RC over the years in various outlets, but I don’t remember seeing this one.  Here he’s finally stepped away from weighting the hits in the B component by their TB values, i.e. 1B=1, 2B=2, 3B=3, HR=4.  Anyone know when and where he introduced this?  I’d be curious to read his comments on this “upgrade”.

“Technical" Runs Created, 3rd Version:

RC = (H + BB + HB - GDP - CS) x (BaseWeights + [0.29 * {BB - IBB + HB}] + [0.492 * {SB + SH + SF}] - [0.04 * SO]) / PA

BaseWeights = [1.125 * 1B] + [1.69 * 2B] + [3.02 * 3B] + [3.73 * HR]

PA = AB + BB + HB + SH + SF
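As a sketch of how the formula above reads once the brackets are unwound, here is a small Python function.  The coefficients are copied from the three lines above; the stat line is a made-up team season, used only to show that the result lands in a plausible range.

```python
def technical_rc(s):
    """Runs Created per the 'Technical RC, 3rd Version' formula quoted above."""
    hits = s["1B"] + s["2B"] + s["3B"] + s["HR"]
    base_weights = (1.125 * s["1B"] + 1.69 * s["2B"]
                    + 3.02 * s["3B"] + 3.73 * s["HR"])
    pa = s["AB"] + s["BB"] + s["HB"] + s["SH"] + s["SF"]
    on_base = hits + s["BB"] + s["HB"] - s["GDP"] - s["CS"]
    advancement = (base_weights
                   + 0.29 * (s["BB"] - s["IBB"] + s["HB"])
                   + 0.492 * (s["SB"] + s["SH"] + s["SF"])
                   - 0.04 * s["SO"])
    return on_base * advancement / pa


# Hypothetical team-season line (not real data).
team = {"AB": 5600, "1B": 1000, "2B": 300, "3B": 30, "HR": 170, "BB": 550,
        "IBB": 40, "HB": 50, "GDP": 130, "CS": 40, "SB": 100, "SH": 40,
        "SF": 45, "SO": 1100}

print(round(technical_rc(team), 1))
```

Note that the hit weights (1.125, 1.69, 3.02, 3.73) are no longer the 1/2/3/4 total-base weights, which is the change asked about above.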


#9    Patriot      (see all posts) 2009/11/23 (Mon) @ 15:09

It was introduced in one of the Bill James Handbooks probably around three years ago.  I believe he said that he’d found the accuracy had declined in the high-offense era, and that was the impetus for the changes.


#10    weskelton      (see all posts) 2009/11/23 (Mon) @ 15:32

Thanks Patriot.  I’ll have to dig out the ones I have and see if I can find it.


#11          (see all posts) 2009/11/23 (Mon) @ 17:04

I talked to a statistician and baseball fan about using regression to estimate weights and he said it will not work well at all.  Regression assumes that every object has a different value.  The less likely for there to be a value (triples), the less accurate.  He said it would be best if games with triples were analyzed together, but even then there are few games with multiple triples.  The triples are acting more like a yes/no (raining, in a dome, turf) in the equation than a counter.  The more he thought about it, triples and their rarity will cause problems with any regression analysis.


#12    Tangotiger      (see all posts) 2009/11/23 (Mon) @ 17:10

It’s not working well with doubles either.


#13    dave smyth      (see all posts) 2009/11/23 (Mon) @ 19:31

Sad....just sad.....


#14    Eric Walker      (see all posts) 2009/11/23 (Mon) @ 22:49

Let me try to clarify a few points.

First, I do not claim that the TOP is an exact model of run scoring.  The weights for runner-advance factors are quite empirical--yes, “best-fitted” to the data--and possibly other values could be used and work as well (and that in itself should be grounds for some careful thought about “modelling reality”).  My only claim is exactly what is set forth in the article: that running the various formulae listed--which are what I could find around the web, just as I found them (which is why I spent so much text space reproducing them, so no one could wonder what, say, “XR” I meant)--against the largest database possible (from 1955 on) produced the results shown.

Let me also point out that I expressly included a variation of TOP that does not use actual error data (though it is publicly available), but merely estimates it, and the difference in accuracy was quite minor.

As to operator priority, I had dropped those comments from what I thought was going to be the final draft (but, obviously, was not).  But it’s not a question of my ignorance or lack of it, it is one of what it is or is not sound policy to assume about whether others know the precedence order; my point is that including express delimiters removes all doubt.

Nor do I or have I ever claimed that the TOP is unique conceptually, only that it works, and was derived independently many and many a year ago.  (I have in hand a document written in 1980 for a major-league club; it uses a somewhat raw but recognizable ancestor of TOP, which is, after all, a simple multiplicative formula of the type described in the article.) Every so often, I re-visit and fine-tune some of the coefficients, because the database of years grows; but there have been no drastic changes for, probably (I’m not good at record-keeping), decades.

To my perhaps simple mind, the idea is not to win some hypothetical prize, it is to create a tool with which we can reliably go from generally available count stats to the consequent runs.  The various tools tested all work pretty well, especially when we recall, as I emphasized, that 0.14%--which is larger than the differences between many--is on average one run a team a season.


#15    Tangotiger      (see all posts) 2009/11/23 (Mon) @ 23:23

Eric, why not run a linear best-fit of all the parameters you want, like Jarvis did in Table 2 of his article.

Are you agreeing or disagreeing with the statement that the better the fit, regardless of the logical underpinnings, the better the equation?


#16    Eric Walker      (see all posts) 2009/11/24 (Tue) @ 00:21

Neither, really (at least not in the article): I am saying that the better the fit the more useful the equation.  It is, in essence, a black box into which we dump raw count stats, turn a crank, and get out the runs total to which it is most likely that those stats will have led.

My personal preference is, obviously, for an equation that we can feel really is modelling reality, because one can then use it to at least tentatively test out various hypotheses.  But the relative ease with which often very different coefficients can give, in otherwise-similar formulations, remarkably similar end results means, to me, that we have to take extraordinary care in deciding whether--and why--a given equation is “really modelling” reality.

Earnshaw Cook, almost half a century ago, got something like 3.5% average accuracy; we have spent 45 years knocking 1% off that.  How much more do we really know?  Part of my point there was that there is, again in my opinion, too much prayer being offered at the altar of micro-accuracy.  I wasn’t bragging on TOP, I was showing that with sufficient diligence any plugger, in which role I see myself, can grind out ever more microscopically accurate predictors, which may or may not signify any material advance in understanding.

But they are useful when we want to know from its count stats how a team seems to have performed.  (And that includes on defense as well as on offense.)


#17    Tangotiger      (see all posts) 2009/11/24 (Tue) @ 01:17

The run value of the double is much closer to the single than the triple, in the TOP equation.

As long as you accept that as what the TOP equation is doing, then if you want to call that more “useful”, then so be it.


#18    Eric Walker      (see all posts) 2009/11/24 (Tue) @ 07:00

Well, I guess one could turn the question around: do you think that equations with lower accuracy than some others can be “better” than those others?  What does make this or that equation “better” or “worse” than another?

Right now, I’m tied up on some other projects, including wins equations and fielding (aside from, on occasion, trying to live a life), but sometime this winter I hope to re-visit run estimation ab origine and see what comes out for coefficients.  The present TOP equation is, as I noted before, just a re-tuned lineal descendant of something I’ve been using for 30 years or so.


#19    Tangotiger      (see all posts) 2009/11/24 (Tue) @ 09:02

Yes, definitely something with worse accuracy can be better.  The Jarvis data proved it.  I would never use an equation where the implied run value of a double is just .15 runs more than a single.  That makes zero sense.

The chance of scoring from 1B is around .26 and from 2B it’s .43.  To say that the difference in getting on base AND moving runners over is just .15, when we KNOW that the difference in just getting on base is .17, is frankly nothing short of embarrassing.

This is why regressions have no place in the final understanding of baseball.


#20    Tangotiger      (see all posts) 2009/11/24 (Tue) @ 10:25

“But it’s not a question of my ignorance or lack of it, it is one of what it is or is not sound policy to assume about whether others know the precedence order; my point is that including express delimiters removes all doubt.”

It does remove all doubt.  But, personally, looking at an equation with that many parens is more confusing, because now I am forced to follow those parens to make sure I got the order correct. The fewer parens, the better.

And, as I stated, if you had no idea of the precedence rules, you read it left-to-right (after handling the parens).  If you did have a Grade 7 education, you follow the precedence rules, which, in this case, is STILL the same left-to-right.

The only way to make a mistake is if you did the equation right-to-left.  If the equation said:
D+A*B/(B+C)
then you have a point.  Otherwise, with
A*B/(B+C)+D
it is impossible to make a mistake here.

Indeed, why did you even introduce the extra level of parens?  Why not:
[A*B/(B+C)]+D

Regardless, this is purely a question of “pet peeve” and therefore you cannot take to task that someone else doesn’t live up to your personal standards because they followed the conventional norm of precedence rules learned in Grade 7.


#21    Patriot      (see all posts) 2009/11/24 (Tue) @ 10:45

“Well, I guess one could turn the question around: do you think that equations with lower accuracy than some others can be “better” than those others?  What does make this or that equation “better” or “worse” than another?”

Even if one accepts the premise that equations with better accuracy are “better”, the question of what the standard of accuracy should be is unanswered.  Accuracy when used with team seasonal data?  With team game data?  With team inning data?  In extreme situations? 

All of the traditional accuracy studies (and admittedly, many of my own) look at team season data, because it is readily available if nothing else.  But if we really want to evaluate a model of run scoring, most of the regulars here would agree that there are more telling tests than seasonal accuracy.


#22    Tangotiger      (see all posts) 2009/11/24 (Tue) @ 11:21

Right.

And don’t forget, you don’t get to extrapolate what happens on the team seasonal level to individual players, necessarily.


#23    cdm      (see all posts) 2009/11/24 (Tue) @ 16:27

Woah, someone got up on the wrong side of the (wherever tigers sleep). 

For the life of me, I can’t really figure out why anyone is arguing about this or about the Hakes and Sauer business. 

If you take BaseRuns and TOP, and eliminate the terms that are not included in the other metric (e.g., SF), and run through a little algebra, you’ll find that they are almost exactly identical.  They are both based on the general algorithm:

(1 - Prob. Out) * (Value of not getting out)

You can estimate them both using precisely the same methods (e.g., regression). The only differences lie in (1) the use of different terms (e.g., SF, GDP); (2) slightly different ways of attributing value (e.g., discounting HRs by OBP or not); and (3) slightly different ways of operationalizing value.

Clearly, regression is not the problem here. The method for estimating the terms is not at fault for any differences in the weights associated with each event. I only spent 15 minutes looking at the equations, but with 15 more, someone could clearly isolate the terms that differ and relate that to a generative/theoretical model of run creation.  No need for all this nastiness.


#24    Rally      (see all posts) 2009/11/24 (Tue) @ 17:27

I put the TOP into a spreadsheet to see what I get.  I tried a good, random typical batter’s line and got 113 runs.  Even without putting the same numbers into BaseRuns, I have no doubt the results are similar.  I get problems though when I try things at the micro level.

1 HR in one PA, how many runs?  TOP says 3.9; BaseRuns would say 1.  Maybe not relevant though, since no outs are considered.

1 HR followed by 3 outs (inning level data):  TOP passes this test, a result that rounds to 1.  Baseruns, of course, still says one.

How about 3 HR and 3 outs?  Baseruns will tell you 3 score, TOP says 5.8.

TOP, RC, and others may work at the season level, but if you are looking at the inning level, only baseruns should be used.  At the game level?  At the pitcher level?  I suspect that the smaller the sample size, the less you should consider anything other than baseruns.
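The micro-level tests above are easy to reproduce in outline.  The Python sketch below makes two substitutions that should be read as assumptions: BaseRuns is one commonly published simple version (not necessarily the one used in the article's test), and because the TOP coefficients from the article are not given in this thread, Bill James's basic RC, (H + BB) * TB / (AB + BB), stands in for the A*B/C family.  The TOP figures quoted above (3.9 and 5.8) will therefore not be matched exactly.

```python
def base_runs(ab, h, tb, hr, bb=0):
    """One commonly published simple BaseRuns form: A*B/(B+C) + D."""
    a = h + bb - hr                                    # baserunners
    b = (1.4 * tb - 0.6 * h - 3 * hr + 0.1 * bb) * 1.02  # advancement
    c = ab - h                                         # outs
    d = hr                                             # automatic runs
    return a * b / (b + c) + d


def basic_rc(ab, h, tb, bb=0):
    """Basic Runs Created, standing in here for the A*B/C family."""
    return (h + bb) * tb / (ab + bb)


# (label, AB, H, TB, HR) for the three home-run-only scenarios.
scenarios = [
    ("1 HR in one PA", 1, 1, 4, 1),
    ("1 HR followed by 3 outs", 4, 1, 4, 1),
    ("3 HR and 3 outs", 6, 3, 12, 3),
]

for label, ab, h, tb, hr in scenarios:
    print(f"{label}: BsR={base_runs(ab, h, tb, hr):.2f}  "
          f"RC={basic_rc(ab, h, tb):.2f}")
```

With only home runs in the line, the baserunner term in BaseRuns is zero, so the estimate collapses to the home run count regardless of the advancement coefficients chosen.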

Jeff Z/#11:

That objection to regression applies to using it with game-level data, but not at the team-season level, where you’re always going to have at least some of every event.


#25    Patriot      (see all posts) 2009/11/24 (Tue) @ 17:54

Some people see nastiness anywhere there is disagreement, apparently.

There are two major differences between the RC/TOP model and BsR.  They are:

1) the treatment of the home run as an automatic run
2) the capping of the score rate at 100%

In both cases, the BsR model provides a clear theoretical upgrade.
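The second difference is just a property of the functional form, and a few lines of Python show it.  In this sketch the advancement values are arbitrary, and the same denominator (3 outs) is used for both forms purely so that only the shape of the formula differs; that choice is an illustration assumption, not how any particular estimator defines its terms.

```python
def capped_score_rate(advancement, outs):
    """BsR-style rate B/(B+C): always below 1 for positive outs."""
    return advancement / (advancement + outs)


def uncapped_score_rate(advancement, denominator):
    """A*B/C-style rate B/C: grows without bound as advancement rises."""
    return advancement / denominator


OUTS = 3
rows = [(b, capped_score_rate(b, OUTS), uncapped_score_rate(b, OUTS))
        for b in (1, 3, 6, 12)]

for b, capped, uncapped in rows:
    print(f"advancement={b:>2}  B/(B+C)={capped:.2f}  B/C={uncapped:.2f}")
```

Once the uncapped rate passes 1, the formula is implying that more baserunners score than reached base, which is the breakdown being described.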


#26    Eric Walker      (see all posts) 2009/11/27 (Fri) @ 22:27

I hope everyone had a good Thanksgiving.

It is, I trust, obvious that all run-estimation methods necessarily rely on what one might call the “normal interactions” of the game--that is, on the various odds and ends of improbability averaging out.  It is quite possible to conceive an inning with three triples, two walks, and zero runs scored; it would be unusual, but--unless perhaps in a playoff game--scarcely remembered a week later.  But what formula would give that result?

It is, I also trust, obvious that what works at the season level must work at the game level, and that what works at the game level must work at the inning level.  But, equally obvious, is that at each narrower level the spread of error will be ever wider because there is correspondingly less opportunity for the unlikely things to average out.  (Far beneath we hear the plangent sound of the words “sample size” chiming.)

Just for fun, I picked, quite at random, one game to try out, namely the Giants’ final 2009 game; I have not yet tried any others, so this is scarcely a picked cherry.  The results were:

Team     Actual   TOP   BR    XR
=================================
Giants      4      5     5     6
Padres      3      3     2     2
Now that may be the only ball game ever played within the known universe for which that sort of result obtains; maybe sometime soon I’ll play with some more game data, just for fun.  But it is, I think, at least suggestive of the possibility that none of these formulae “blow up” or disintegrate or whatever at the game level.  And, just perhaps, that the TOP is not at any disadvantage.

(I have also since discovered that TOP accuracy can be further improved by including the Eb (or “ROE") in with singles--and adjusting the coefficients accordingly--but I am still playing with that.)


#27    Patriot      (see all posts) 2009/11/28 (Sat) @ 12:18

“It is, I also trust, obvious that what works at the season level must work at the game level, and that what works at the game level must work at the inning level.”

I think everyone agrees that what works with *typical* team seasons will work with *typical* team games, at least to some reasonable standard of accuracy.  (IMO, though, it’s a mistake to assume that something like a .15 run difference between the single and the double that improves accuracy on the season level will necessarily do the same on a lower level).

The issue comes when these formulas, which have been tuned and tested on typical team season data, are put in use for extreme situations.  Very high or low scoring games.  Games with lots of home runs.  Individual players.

Since RC/TOP do not cap baserunner scoring percentage at 100%, or treat home runs separately, it is not surprising that they have trouble in high OBA games, or high HR games.

None of this is to say that any model is perfect--most notably, Base Runs doesn’t cap left on base per inning at three (although neither does any other formula), and it has a problem with the triple at very high levels of offense. 

Eric, I would be very interested in seeing what kind of formula you could come up with if you credit one guaranteed run for each homer and cap baserunner scoring at 100%.  Even if it winds up just being BsR with slightly different coefficients, it would be interesting to see what an “independent” and intelligent analyst like yourself could cook up.


#28    Eric Walker      (see all posts) 2009/11/28 (Sat) @ 20:59

Actually, I had already begun playing with some of that, though I am not bearing down during this holiday period.  I did a revision in two unrelated ways: I added Eb as if they were singles, and I did move home runs out of net OB then add them back at the end.  Regrettably, I did both at once, so I can’t (yet) say which did what--but the overall result was to lower the average error by almost one run, which I find remarkable at these levels.

(I should note that the coefficients do not come from regression analysis.  I am a much simpler fellow, and use brute-force trial-and-error, albeit with some idea of where to start trying from.)

My initial observation is that moving home runs out had the effect mentioned, of greatly blowing up the coefficient for triples.  Obviously, that is because the TOP and most or all multiplicative formulae treat “on base” values as unweighted: a single and a triple are just “runners on base”, while in fact a triple is a base runner much more likely to score than a single.  The difference is made up, in a fashion, by a boost in the runner-advance weight of extra-base hits, so when the HR is removed, it is of course the triple (and to a lesser extent the double) that get “pumped up”.

I have long felt that the most useful and reliable methodology would be a well-designed simulator, which ought to be more possible nowadays (with the advent of massive, detailed stats), which can then be used to meaningfully study possible variations in everything from tactics to batting orders.

But it’s not a project I am anxious to dive into, having but one life to live.

Capping scoring and LOB is something else that I have not as yet looked closely at.  I have generally been content to deal with data in the realistic ranges, but yes, at the inning level it is possible (though, I suspect, unlikely) that bizarre results can obtain.  I’ll try to spend some thought on that.


#29    Tangotiger      (see all posts) 2009/11/28 (Sat) @ 21:15

"It is, I also trust, obvious that what works at the season level must work at the game level, and that what works at the game level must work at the inning level.”

No, it is not obvious.  Indeed, it is actually not even true.

***

“I have long felt that the most useful and reliable methodology would be a well-designed simulator, which ought to be more possible nowadays (with the advent of massive, detailed stats), which can then be used to meaningfully study possible variations in everything from tactics to batting orders. “

I have done exactly that.  And it is for this reason I can so forcefully claim that the gap in runs between a single and double is .30 runs, not .15 or anything else.


#30    Patriot      (see all posts) 2009/11/28 (Sat) @ 22:58

Great, I look forward to seeing what you come up with.

One thing I’ll note for the others is that removing the home run without capping the score rate at 100% is what Eric Van does in his Contextual Runs.  His structure is:

Baserunners*advancement/outs + home runs


#31    Eric Walker      (see all posts) 2009/11/29 (Sun) @ 22:25

I continue to play around with variations, but in the interim I would be interested in seeing the exact full runs-scored equation that includes in it the various weights the simulator has generated; we can then see how that runs against the same 55-year database that the others were run against.  (I assume that it is something other than the exact BaseRuns equation.)

Also, I am unclear on what a “gap” signifies there: is that saying a team for which we convert in its seasonal stats one single to a double will add 0.3 run to its projected annual total?

As to “No, it is not obvious.” My reasoning, if we can call it that, is that run-scoring is atomic, and the atom is the inning.  No stats accumulated in some one inning affect run scoring in any other inning: each 3 outs is a closed set.  The results for a game are the amalgamated results of 9 (or thereabouts) innings, and the results for a season are the amalgamated results for 1458--or thereabouts--innings.  Aside from the expected variation in error sizes owing to sample size, what works best ("on average”, as always) for an individual inning should be what works best for any larger amalgamation; but if that be so, then what works best for any larger amalgamation ought correspondingly be what works best at all sizes down to the individual inning.

What is the error in reasoning there?


#32    Tangotiger      (see all posts) 2009/11/29 (Sun) @ 23:27

The amalgamation is NOT done randomly, but based on batting team.  That introduces bias.

Indeed, someone at Hardball Times (was it Colin Wyers?) looked at the inning-level linear weights and came up with numbers that match what we got from Palmer.


#33    Tangotiger      (see all posts) 2009/11/29 (Sun) @ 23:30

Looks like I did it:

http://www.insidethebook.com/ee/index.php/site/comments/inning_level_linear_weights/

Now, let’s take care of all that non-randomness with regression.  Taking the 38,830 3-out innings of 2008 only, I get the following coefficients for the regression (r=.875):
+0.36 BB
+0.51 1B
+0.53 Error
+0.78 2B
+1.02 3B
+1.42 HR

Now, those numbers look VERY NICE.  They are pretty much what we expected, give or take .03 runs.

However, and this is why we don’t want to be a slave to the regression, the coefficient for the hit batter is +.26 runs, and for the IBB, it’s +.43 runs.  The IBB is especially ridiculous, since they are given out with 1 or 2 outs, and so, don’t have as many opportunities to score.  The standard error is .013 runs, meaning that we are 95% sure the run value is between .40 and .45 runs.  Like I said: ridiculous.  You have to use the regression as a tool, and not be a slave to it.  You must be tempered by your baseball senses.
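The arithmetic in this quoted passage can be sketched in a few lines of Python.  The coefficients are the ones listed above; the list as quoted shows no intercept or out coefficient, so the sum computed here is only the contribution of the listed events, not a complete run estimate.  The interval uses the usual normal approximation of plus or minus 1.96 standard errors, which is an assumption about how the 95% range was obtained.

```python
# Inning-level regression coefficients as quoted above (2008, 3-out innings).
INNING_WEIGHTS = {"BB": 0.36, "1B": 0.51, "Error": 0.53,
                  "2B": 0.78, "3B": 1.02, "HR": 1.42}


def weighted_sum(events):
    """Sum of coefficient * count for the listed events only."""
    return sum(INNING_WEIGHTS[name] * count for name, count in events.items())


def approx_95_interval(coefficient, standard_error):
    """Normal-approximation 95% interval: coefficient +/- 1.96 * SE."""
    margin = 1.96 * standard_error
    return coefficient - margin, coefficient + margin


# Gap between the double and the single in this regression.
single_double_gap = INNING_WEIGHTS["2B"] - INNING_WEIGHTS["1B"]

# Contribution of one single plus one triple.
single_plus_triple = weighted_sum({"1B": 1, "3B": 1})

# Interval for the IBB coefficient (+.43, standard error .013).
ibb_low, ibb_high = approx_95_interval(0.43, 0.013)

print(f"2B - 1B gap: {single_double_gap:.2f}")
print(f"1B + 3B: {single_plus_triple:.2f}")
print(f"IBB interval: {ibb_low:.3f} to {ibb_high:.3f}")
```

The interval agrees with the quoted range to within rounding of the reported coefficient, and the single-to-double gap here is much closer to .30 than to .15.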


#34    Kincaid      (see all posts) 2009/11/29 (Sun) @ 23:39

You run into problems using only seasonal run scoring for teams because it relies on measuring outcomes within the normal range of outcomes.  The observed range of outcomes among all team-seasons is a lot tighter than it is on an inning or game level, so when you apply the findings of a best fit analysis of seasonal run scoring to the game or inning level, you have to extrapolate it well outside of your sample to much more extreme outcomes.  A 6 home run or a zero walk game will happen, but you’ll never get that outcome looking at only full seasons, so that kind of analysis doesn’t tell you anything about how runs are created in those environments.  Using seasons basically just looks at the very center of the range of outcomes and then extrapolates that small range of outcomes to include all the more extreme outcomes that occur regularly on the game or inning level.

Say you want to use your findings to look at how much Albert Pujols or Juan Pierre are worth.  You are never going to find a data point on the seasonal level that comes close to simulating either of those players’ production, so if you try to use such a system to evaluate players (which is often the point of these things), you have to extrapolate the results of the best fit analysis far outside your sample, and you have no idea how well it holds up when you go that far from the range of your sample.  You are just hoping it still works in extreme situations.  By looking at the inning level or even the game level, you get a lot more data points that cover a much wider range of outcomes.  You will get data points that extend to cover Albert Pujols-like environments and Juan Pierre-like environments.  So if you are going to do a best-fit kind of analysis and use it for anything other than looking at team run-scoring over full seasons, doing it at an inning level or a game level will do better because you are not excluding extreme data points by looking only at the range that teams settle into over a full season, and you don’t have to extrapolate the results so far outside the sample to be able to use them.


#35    Eric Walker      (see all posts) 2009/11/30 (Mon) @ 00:35

I am confused.  Surely the quotation is not meant to suggest that runs scored is simply the sum of the stats listed, each times the coefficient shown for it?  That is so far off when applied to the 55-year database that it cannot be intended to mean that.  What I had mentioned was some equation that would deliver actual runs scored.  I imagine I’m being obtuse here, but what would that equation be?

Which returns me to the other line of argument: if an equation works well at the inning level, surely it must work as well--relative to others--at higher levels.  That is, surely the most accurate inning-level equation must then necessarily be the most accurate equation at any level comprising multiple innings.

But if that is so, how can it not be that the most accurate equation at any higher level is not also the most accurate, on average, at the inning level?  To assume otherwise violates the earlier proposition.  Yet how could it not be so?  That is the part I seem not to get.

Nor should it matter if the amalgamated innings are all from one team in one season, or are selected at random from any and every team over the past 55 years.  Had I an inning-by-inning database, I’d love to verify all this, but I do not. I don’t even have game-by-game stats, though when I picked one game at random, nothing went wild anywhere (and, I am now almost reluctant to note, TOP beat BR and XR).

Now, as I say, I’m not sure I’m properly getting the import of some of the cited material, but it looks as if it is saying that if we have, in some one inning, a single and a triple, then the average runs in such an inning will be, depending on season, 1.53 (0.51 + 1.02) or 1.58 (averaging 1.55 over the three seasons shown).  I mention that because I happen to have recently checked that case; I found that (rounded off to two places) BaseRuns predicts 1.24 runs and TOP predicts 1.57 runs.  Considering that 1.55 average, what might those results signify?


#36    Kincaid      (see all posts) 2009/11/30 (Mon) @ 01:14

If you’re just doing a best-fit analysis of a sample, then the premise that the most accurate result at the inning level must also be the most accurate at the season level is false.  All such an analysis does is take a sample of data, pick out a set of variables that are likely important, and see what formula matches most closely with that particular sample.  You’ll get different results when you use different samples, especially when the samples are wildly different, such as using one that includes a lot of 6 run innings and 0 run innings (the inning level) compared to one that forces all environments into a relatively tight range.  There is nothing to guarantee that what best fits the data in a tight range of outcomes will also best fit a different set of data that contains a lot more extreme data points.  In fact, if all you are doing is finding a best fit to a set of seasonal data and then testing the accuracy against that same data, you are always going to find that to be more accurate than the formula that works best at the inning level, because that is what a best fit analysis is.  It’s the best fit you can find for the data you are testing, which means if you did it right, the results of a different analysis using a similar formula but different coefficients are only going to be worse.

You also have the issue that none of the variables are independent of each other, especially at the season level, and a best fit on the data might not be a powerful enough tool to properly separate the values of each individual event.  You can end up with results where you can change the coefficients around with minimal effect on the results.  You might find that a best fit with doubles set at .15 runs better than a single is only a marginally better fit to the data than one that sets the value of a double at .3 runs better than a single and adjusts the other coefficients accordingly.  You could find that the gain is so small that you can’t really tell which one is truly a more accurate representation of how each event affects run scoring without also creating a logical framework to work around, so just because a best fit analysis fits the data it was best fit to better than a system that starts with a logical framework does not mean that the best fit does a better job at assigning values to each event.  It just means that, given those data points, it worked a little bit better, but will not necessarily work better when extrapolated outside of those data points.


#37    Eric Walker      (see all posts) 2009/11/30 (Mon) @ 08:04

The point about whether fairly large shifts in coefficients can result in fairly similar degrees of accuracy is well taken, and one I have mentioned in this connection.  From experience, I don’t think it applies here, but I have been planning to look more closely at it anyway.

But while it is necessarily so that a best fit fits best, it is not as if the fit was determined by examination of all-MLB stats or, worse, all-time (or all-some-long-period) all-MLB stats: it was gotten over 1138 individual team-seasons.

The acid test, of course, is how well does it actually work when actually used at granular levels, such as the individual game or even the individual inning?  And how, at those levels, does it compare with others that attempt such measuring?

So far, all I can set forth is a spot check or two, but at least in those isolated cases, it worked quite well, and better than such others as I tried.  I plan to do some further spot checks, and will put some up when I get them done.  Even if, unlike what I have so far, they show glaring discrepancies.  (Or, if anyone else has a game-stats database and wants to run some trials, email me and I’ll send the updated formula I am now using, which is too messy to post here, unless someone expressly wants to see it.)


#38    Tangotiger      (see all posts) 2009/11/30 (Mon) @ 08:54

Surely the quotation is not meant to suggest that runs scored is simply the sum of the stats listed, each times the coefficient shown for it?

The point of the quotation is to make people go to the link and read the article.  In no way can you infer what I meant by the quote if you do not give me the courtesy of reading that thread.

So, please, before you say anything else, read the thread.


#39    Colin Wyers      (see all posts) 2009/11/30 (Mon) @ 12:31

Eric,

Can we quit talking about hypotheticals when it comes to testing run estimators at the inning level? If you can lay your hands on a copy of the ‘09 Hardball Times Annual, you can see tests done using BaseRuns, Runs Created and Extrapolated Runs at the individual inning level.

Despite being roughly equivalent in performance at the team-season level, BsR performs much, much better at the inning level - RMSE of 0.583 for RC, versus 0.425 for BsR (and 0.495 for XR, which is rather staggering when you consider the advantages a dynamic run estimator has over a linear one when you use extreme run environments).

The fact is, we have looked at the relationship between run estimators at the inning level, versus at the team level. There is not a perfect correspondence. A run estimator can be reasonably accurate at the team level without being very good for smaller atomic units of run creation.
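
For readers who want to reproduce this kind of comparison, here is a minimal sketch of the RMSE calculation in Python.  The function name and the toy numbers are illustrative assumptions, not Colin's code or data; a real test would feed in one estimate and one actual run total per inning.

```python
import math

def rmse(predicted, actual):
    """Root mean squared error between estimated and actual runs."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("need two equal-length, non-empty sequences")
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))

# Made-up toy innings, only to show the call; not real baseball data.
toy_actual = [0, 0, 1, 3]
toy_estimate = [0.5, 0.5, 1.0, 2.0]
toy_rmse = rmse(toy_estimate, toy_actual)
```

Because the errors are squared before averaging, a few badly missed big innings weigh heavily, which is one reason inning-level tests can separate estimators that look alike on team-season totals.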


#40    Eric Walker      (see all posts) 2009/11/30 (Mon) @ 22:11

Just a progress note: as I proceed to do some per-game evaluations, after 4 games (which is 8 samples), scattering over seasons, months, leagues, and teams, here are the results so far:

Date      Org  Vs. Runs  TOP  Err   BR  Err   XR  Err
20091004  SFO  sdp    4    5   -1    6   -2    5   -1
20091004  SDP  sfo    3    3    0    2    1    2    1
19550411  BAL  wsh    5    3    2    3    2    2    3
19550411  WSH  bal   12    7    5    7    5    7    5
19600419  CHC  stl    2    3   -1    3   -1    3   -1
19600419  STL  chc    5    6   -1    7   -2    6   -1
19650504  CHW  det   10   10    0    7    3    7    3
19650504  DET  chw    6    6    0    4    2    4    2

  TOP Avg. Error Size: 1.25
       TOP Cum. Error: 4

BaseR Avg. Error Size: 2.25
     BaseR Cum. Error: 8

   XR Avg. Error Size: 2.125
        XR Cum. Error: 11
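
The summary figures above can be re-derived from the eight rows of the table.  The sketch below is only a check of the arithmetic, in Python; the variable and function names are made up for illustration, and "error" is taken as actual runs minus estimated runs, as in the table's Err columns.

```python
# (date, team, actual runs, TOP, BR, XR), transcribed from the table above
rows = [
    ("20091004", "SFO", 4, 5, 6, 5),
    ("20091004", "SDP", 3, 3, 2, 2),
    ("19550411", "BAL", 5, 3, 3, 2),
    ("19550411", "WSH", 12, 7, 7, 7),
    ("19600419", "CHC", 2, 3, 3, 3),
    ("19600419", "STL", 5, 6, 7, 6),
    ("19650504", "CHW", 10, 10, 7, 7),
    ("19650504", "DET", 6, 6, 4, 4),
]

def summarize(rows, column):
    """Return (average absolute error, cumulative signed error) for one estimator."""
    errors = [row[2] - row[column] for row in rows]  # actual minus estimate
    avg_size = sum(abs(e) for e in errors) / len(errors)
    cumulative = sum(errors)
    return avg_size, cumulative

summary = {name: summarize(rows, col)
           for name, col in (("TOP", 3), ("BR", 4), ("XR", 5))}
```

With eight rows, a single game (WSH scoring 12 against estimates of 7) accounts for 5 of the error in every column, which shows how much one data point can move averages at this sample size.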

I want to end up with a dozen games (24 samples), one every five years, both leagues equally represented, 2 from each month of the year, no repeated teams.  I’ll post again when I’m done (later tonight--Monday--I hope).

Note: no cherry-picking: just a by-golly selection of a game from the given season (1955 and every 5 thereafter), for the wanted league (alternating), from somewhere at the start or middle of the wanted month (progressing from April), no teams repeating.

But it’s not hard to see what the developing trend is.

Why are so many people upset by this?


#41    Mike Fast      (see all posts) 2009/11/30 (Mon) @ 22:27

Eric, people are “upset” because you are ignoring so much of the work that has already been done in the field, e.g., Colin’s article in post #39.

If TOP were a better run modeler, people here would be rushing to embrace it.

Also, Tango has been beating the drum lately about how regression gets misapplied by so many people to answering baseball questions.  Your run modeling analysis just happens to be the latest target, but he’s shown the problem with several other people’s articles, too.  I wouldn’t take that personally or assume that he’s “upset” about TOP.  To the extent that he’s upset, and Tango can correct me if I’m wrong, it’s that people are using respected baseball platforms to do bad baseball analysis, and most readers are not going to be sophisticated enough to realize that “best fit” to the data doesn’t always mean best answer to the question.


#42    Tangotiger      (see all posts) 2009/12/01 (Tue) @ 00:53

Eric: did you read my article?  I used 38,830 3-out innings in 2008.  38 THOUSAND.  And I repeated it for 2007 and 2006.  And Colin did an extensive look as well.  I don’t know what you are trying to do by giving us a few games here and there.


#43    Eric Walker      (see all posts) 2009/12/01 (Tue) @ 02:56

Mike: I would be delighted if Colin, or anyone, could and would email me a set of individual-inning data, the bigger the better.  I would dearly like to try all these out on a large and representative set.  (And I don’t see how anyone could be expected to “rush” to TOP since I have not ever gone out of my way to ballyhoo it, and in fact am still working on revisions even 30 years after.) All I want to do is try them and let the best win, so that we can see why whatever is “best” is best.  So far, no one has--so far as I can see here--actually tried TOP.

I did, at least at the game level.  I say right here that it is a very small sample size, 24 game lines, though I did my best to scatter them around in every way possible: different years (even eras), different leagues, different teams, high-scoring, low-scoring, zero-scoring, even different months of the season.  This is what I got, in summary:

  TOP Avg. Error Size: 1.5
       TOP Cum. Error: 0

BaseR Avg. Error Size: 1.83333333333
     BaseR Cum. Error: -4

   XR Avg. Error Size: 1.875
        XR Cum. Error: -11

TOP "beaten": 3/24

Owing to the sample size, that is all very, very far from probative--but is it stretching to call it suggestive, and maybe more than just suggestive?  The full, gory details can be seen on a temporary web page I put up--not an official part of the site--at:

http://highboskage.com/formula-notes.php

Tangotiger: I wish I had that database.  But why not try TOP for yourself against it?  (Or, if you are feeling generous, email it to me if it is not proprietary.) The exact equation is on the page I just cited.

Also, there seems to be some confusion between the coefficients I am showing and the actual net results.  I suspect there is, in fact, little difference between the actual results from using TOP and the event values you got from your work.  As I said in an earlier post, when I try an inning with one single and one triple, I get a runs-scored value strikingly similar to what (if I am reading it aright) you set forth in an earlier post (and the web page it referenced).  And it is more similar than what BR gives.

Finally, I don’t know why anyone thinks that TOP is some variant of a linear-weights formula, but unless I am mis-reading, that seems to be what several posters believe of it.  It is not a linear-weights method, and it was not constructed by the use of regression.


#44    Kincaid      (see all posts) 2009/12/01 (Tue) @ 04:24

I think the objections regarding regression are more just regarding constructing a best-fit of the data to determine the coefficients, which is basically the idea of a regression model.  From what you said earlier about using trial and error to tweak the coefficients, it sounded like you were basically doing a manual best-fit to the data, albeit without going through the actual process of running a regression.  The objections people have with regression models are still going to apply if your coefficients were tweaked by trial and error to work best with the data you were working with.


#45    Kincaid      (see all posts) 2009/12/01 (Tue) @ 04:46

Eric, I tried to email a CSV file of inning by inning data to the address listed on your website, but it was rejected as too large.  Do you have another address with a larger size limit, or know what the size limit is for your email?  The file I tried to send had 2000-2008 data and was just under 20 MB, but I can cut down the years to shrink the file size if needed.


#46          (see all posts) 2009/12/01 (Tue) @ 10:57

It does seem fairly clear which way the wind blows.

In the great descent from aggregations of from 1386 to 1458 innings (more or less) to aggregations of a mere 8, 9, and 10 innings, the relative strength of TOP for predicting runs from the raw data is unaltered: it remains best. The next step will be, as and when data become available, to try these methods on a reasonable number of individual innings. But in the light of what we have just seen, it would take a daring soul to place a sizeable wager against TOP.

All this after testing a sample of 24 games.  I’m not saying that TOP won’t test as more accurate over a larger sample (although it’s not where I would put my money), but to place such confidence in a 24 game test sample is kind of shocking.

Kincaid, if I may be so uncouth as to stick my hand out, would you be willing to send me your inning-level data (email above, w/ 25 MB limit I believe)?


#47    Tangotiger      (see all posts) 2009/12/01 (Tue) @ 12:07

I can generate whatever anyone wants.  I presume Kincaid did the same thing.  You can try zipping it, and emailing it to me, and I can post it on my site if you want.

***

If the question being asked is a count of the traditional categories by inning, I can do that too.

***

Eric: the main objection is that you are doing what Colin has already done, except Colin has done it for thousands of innings.  So, it’s a step backwards.  If Colin were to provide his dataset, then we can all be looking at the same data.

And, I can’t even tell if you read my link.  Did you read my article or not?


#48    Greg Rybarczyk      (see all posts) 2009/12/01 (Tue) @ 13:51

Scratching my head here wondering if I am understanding the objective of all this work.  Are we just trying to tweak (or confirm) the value coefficients for different game events?  Or is there more to it? 

What are the delivered benefits of a (hypothetical) perfect run estimator?


#49    Mike Fast      (see all posts) 2009/12/01 (Tue) @ 15:21

Greg, the goal here is a model that accurately explains how runs score in baseball, ideally a model that could be used accurately across many environments and sample sizes--for innings, games, hitters, pitchers, teams, leagues, etc.

Then you can take such a model and derive linear weights for various events appropriate for any given situation.  Of course you know the value of such linear weights--you can value fielding events in a defensive evaluation system like UZR, you can value pitches and say who has the best curveball or changeup in MLB, you can say which baserunner gained or lost the most runs on the bases for his team, etc.


#50    Tangotiger      (see all posts) 2009/12/01 (Tue) @ 15:31

Objective?  The thread title ends with “why?”. I don’t know what the objective is frankly. 

Eric is making some claims that are invalid (what works with one should work with the other).  And he is unaware of current research by Colin and myself among others.

If anything good is to come of this thread, it’s that people will be brought up to speed.


#51    Mike Fast      (see all posts) 2009/12/01 (Tue) @ 16:42

Yeah, if the question is why Eric did his work and wrote his article the way he did, I can’t answer that. 

I thought Greg’s question was a more general one about the purpose of work by Patriot and others on the development of improved run scoring models and the measurement and comparison of these models by Tango and Colin and others.  That effort is one of the main underlying foundations of modern sabermetrics, IMO.


#52    Mike Fast      (see all posts) 2009/12/01 (Tue) @ 16:47

Of course, as soon as I hit submit on #51, I realized I should clarify something.  I realize that the run scoring models used today have their roots in work done decades ago.  I’m not suggesting all the important work was done by the few guys I mentioned.  What I am arguing is that much of the best work of modern sabermetrics within the last decade has keyed on the widespread understanding and use of linear weights, after many of us grew up seeing them besmirched by Bill James.


#53    Rally      (see all posts) 2009/12/01 (Tue) @ 17:37

Eric,

The database used is freely available from retrosheet.  To use the data, Colin has written some good tutorials on setting up a retrosheet database on Hardball Times.  Or, pick up Joe Adler’s book Baseball Hacks. 

If you don’t have the technical expertise to go that route, fine, but don’t expect people who have looked at tens of thousands of innings to care what a few games here and there say, whether you are choosing them randomly or not.


#54    terpsfan101      (see all posts) 2009/12/01 (Tue) @ 21:46

I tested Eric’s TOP no error data formula using gamelogs from Retrosheet 1993-2009 (79140 team batting lines). I also tested the Baseruns equation that he has listed in his article. I get an average error of 1.3 runs per game with TOP, and an average error of 1.23 runs per game with Baseruns. Of course you could reduce the average error even more by using a more finely tuned Baseruns formula.


#55    terpsfan101      (see all posts) 2009/12/01 (Tue) @ 22:41

Applying Eric’s TOP to team seasonal data from 1955-2008 (excluding 1981 and 1994), I get an average error of 22.2 runs per team season. With Baseruns, I get an average error of 18.8 runs per team season.


#56    Eric Walker      (see all posts) 2009/12/01 (Tue) @ 22:44

Kincaid: First, thank you for trying to send the DB.  I have looked at my host’s email control page and find no reference to file size, but I suppose there is indeed a limit.  I also appreciate Tangotiger’s offer to accept and post it.  If the file was raw, not zipped or otherwise compressed, perhaps a compressed version might be small enough for emailing.  Or you could post it or pass it on to Tangotiger.  Or, easiest of all, you might just ftp it; the ftp address would be--

ftp.highboskage.com/incoming

--and anonymous ftp login works (but let me know if you go that route, because I do not normally check that inbox).

Patriot: re “to place such confidence in a 24 game test sample is kind of shocking.” You mean like placing confidence in the A. C. Nielsen company’s ability to measure the viewing tastes of 300 million Americans with a sample of a few hundred households?  The folk hereabouts are world-class experts in probability: what do you reckon the odds are that that sample of 24 games is in fact materially UNrepresentative of the whole, regardless of the size of that whole?  That a larger or different sample would give notably different results?  I could expand the size arbitrarily, but transcribing game-data lines from box scores takes a finite time per game, and I feel no overpowering urge to go on with it, because I feel that the point is made; but if someone already has a DB of game lines (not inning lines), I’d be glad to run against that.

I feel that somehow the point is getting lost here.  As Mike points out, we want a formulation that is convincingly accurate because that then gives us some confidence that that formulation indeed tells us how the events of the game interact to produce runs, so that we can then make further investigations, using that formulation, of everything from individual batters’ contributions to tactics evaluation.

The ancient Greeks, famed for their intellectual prowess, employed almost exclusively the deductive approach to knowledge, debating and drawing from axiomatic bases various conclusions.  That approach led to such deep insights as Aristotle--"master of them that know"--announcing that smoke rises because the heavens are its natural place, and that men and women have different numbers of teeth.

Less elegant and refined but strikingly more successful has been the age of science and the inductive method, wherein one actually goes into the laboratory and does experiments, a concept the Greeks apparently found repellent (even Hero never did anything practical with his idea).

Given various equations, each of which purports to depict the results of a certain process based on certain component elements of that process, the only meaningful thing anyone can say about the validity of any of those equations, absolutely or relative one to another, is “I have tried them against X amount and type of data and these are the results.” Results are facts; anything else is theory, which is to say handwaving.  Facts, as Willy Ley once said, have sharp corners while theories tend to have tender skins.

The fact is that the TOP equation at the team-season level is clearly the most accurate: not by a huge margin, but not by a trifling one.  Another fact is that when applied to a profoundly varied random selection of 24 game lines, it was again the most accurate.  The odds that such a sample is materially unrepresentative I leave to the house to calculate.

I am not merely willing but anxious to test TOP against a substantial DB of individual-inning results.  I will be more than surprised if the results relative to those given by other formulations are not robust, but that’s why they race horses.  But the crux is that only such results can carry weight in any discussion of relative merits.

None of this is to make me out some genius.  The methodology is good old Earnshaw Cook run through some fairly obvious minor improvements followed by some tedious mere plodding through trials to arrive at coefficients.  Get ‘em on; knock ‘em in.  Not rocket science.

The original article I wrote was not designed to exalt TOP; its thrust, and conclusion, is that quite a number of fairly varied approaches all come up with results that are pretty good; the clear implication is that this is a reasonably mature art now, and that analysts have a good handle on how the game works.  (I was, in fact, a little surprised that TOP did so well; my personal games-won formulation only comes in second; one day I’ll write those up, too.)

Moreover, as I keep mentioning, when the data is fed in and the crank turned, nothing seems to be coming out that runs strongly counter to the results others have gotten.  (Of course, they couldn’t, else the results would not be accurate.) The individual coefficients in the equation are essentially immaterial: the results delivered are what is not.

I again thank Kincaid, and look forward to somehow soon acquiring that DB.  (And please let me know if it is to be considered proprietary.)


#57    terpsfan101      (see all posts) 2009/12/01 (Tue) @ 23:10

“...if someone already has a DB of game lines (not inning lines), I’d be glad to run against that.”

Eric, see post #54. I just tested TOP against nearly 80000 team batting lines from 1993-2009.


#58    Nick Steiner      (see all posts) 2009/12/02 (Wed) @ 01:32

If I’m following correctly, TOP has been the best formula to predict team runs at a seasonal level.  Given that it has been ‘tweaked’ to do just that, Tango, et al, believe that testing it against games or individual innings will show how it compares to other run estimators across a more dispersed set of environments.  And Eric agrees that such a test is better than a seasonal test as well.

Eric then tests a sample of 24 games, and sees that TOP is also best at a game level.  However, 24 games is a very small sample, and doesn’t come close to proving which estimator is better.  Colin, in last year’s THT Annual, tested it against ~30,000 innings, and found that Base Runs was the best.  Terpsfan tested it against ~80,000 games, and found that Baseruns is better.

What exactly is the controversy here?  I think everyone agrees that:

80,000 games/ 30,000 innings > 24 games > ~60 years of seasonal data.  Right? 

Don’t Terpsfan’s and Colin’s tests pretty much seal the deal on this one?


#59    Patriot      (see all posts) 2009/12/02 (Wed) @ 01:37

Patriot: re “to place such confidence in a 24 game test sample is kind of shocking.” You mean like placing confidence in the A. C. Nielsen company’s ability to measure the viewing tastes of 300 million Americans with a sample of a few hundred households?  The folk hereabouts are world-class experts in probability: what do you reckon the odds are that that sample of 24 games is in fact materially UNrepresentative of the whole, regardless of the size of that whole?

As for myself, I make no claim to being any sort of expert in probability.  But it seems clear to me that testing 24 games is an order of magnitude short of surveying a few hundred households.

One might be able to get a decent handle on a political race by sampling 500 people, but there’s still a margin of error there, and a significant one (4.4%, I believe).  What you are doing by citing the results of the 24 games and saying things like “It does seem fairly clear which way the wind blows” is brushing that margin of error aside and concluding that the candidate that leads 51.5-47.5 in the poll should start putting together a transition team.  I don’t see how results on the magnitude of what you are reporting for those 24 games DON’T fall within the margin of error.
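
The polling figure can be checked with the usual normal-approximation formula for a proportion.  This Python sketch covers only the poll analogy (a simple random sample, a proportion near 50%, 95% confidence, all of which are assumptions); it is not a significance test of the run estimators themselves.

```python
import math

def margin_of_error(n, p=0.5, z=1.96):
    """Approximate 95% margin of error for a sample proportion."""
    return z * math.sqrt(p * (1 - p) / n)

moe_500 = margin_of_error(500)  # about 0.044, i.e. roughly 4.4 points
moe_24 = margin_of_error(24)    # about 0.20, i.e. roughly 20 points
```

The margin shrinks with the square root of the sample size, so going from 24 observations to 500 cuts it by a factor of about 4.6.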

In any event, it really doesn’t come as a surprise that TOP as you have reformulated it (taking the HR out, and with a construct of baserunners*advancement/PA + HR) would challenge the accuracy of BsR.  The model still does not cap baserunner scoring %, but even with the majority of normal innings that may not be an issue.  The only construction difference between the new TOP and BsR is the use of PA in the denominator of the scoring percentage rather than (advancement + outs).
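
To make the construction difference concrete, here is a hedged Python sketch of the two scoring-rate denominators described above.  The advancement factor is passed in as a plain number because neither formula's actual weights are given in this thread; the function names and the extreme inning are hypothetical.

```python
def scoring_rate_bsr(advancement, outs):
    """BsR-style share of baserunners who score: B / (B + C)."""
    return advancement / (advancement + outs)

def scoring_rate_pa(advancement, plate_appearances):
    """PA-denominator share, as in the reformulated TOP described above: B / PA."""
    return advancement / plate_appearances

def estimate_runs(baserunners, rate, home_runs):
    """Baserunners times scoring rate, plus one run per home run."""
    return baserunners * rate + home_runs

# Hypothetical extreme inning (made-up values, not fitted to anything).
baserunners, advancement, outs, home_runs = 6, 12.0, 3, 1
plate_appearances = baserunners + outs + home_runs  # assumes no outs on the bases

bsr_runs = estimate_runs(baserunners, scoring_rate_bsr(advancement, outs), home_runs)
pa_runs = estimate_runs(baserunners, scoring_rate_pa(advancement, plate_appearances), home_runs)
```

With these inputs the B / (B + C) form stays below 1, so the estimate cannot exceed baserunners plus home runs (7), while the B / PA form gives a rate of 1.2 and an estimate above 7.  That is the missing cap on baserunner scoring percentage; in ordinary innings, where advancement is small relative to PA, the two forms behave much more alike.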


#60    Patriot      (see all posts) 2009/12/02 (Wed) @ 01:38

The fact is that the TOP equation at the team-season level is clearly the most accurate: not by a huge margin, but not by a trifling one.

...

The individual coefficients in the equation are essentially immaterial: the results delivered are what is not.

There are a number of issues with season aggregate data and the associated accuracy studies (which is not an accusation against you as I and countless others have undertaken essentially the same exercise and are subject to the same pitfalls).  Do the formulas consider the same inputs?  Were all the formulas “tuned” against the same data?  Are the formulas being tested against a dataset independent of the one against which they were tuned?

Even when all of those issues are addressed satisfactorily, to evaluate a model of run scoring in general, rather than a specific implementation of it, it is necessary to look at other things, like (cont)


#61    Patriot      (see all posts) 2009/12/02 (Wed) @ 01:45

How do the intrinsic event values produced by the model match up against the evidence we can get through other techniques (like empirical linear weights, simulators, and Markov models)?  How well do they match with the logical constraints of run scoring (i.e. 1 run for each home run, Runs less than or equal to (Baserunners + homers), LOB/inning less than or equal 3, Runs greater than or equal to 0)?  And how can the model be effectively applied to evaluating the performance of individual players should that be a desired implementation of the formula (which it often is at some point)? 
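
The logical constraints listed above are easy to turn into an automatic check on any estimator's output for a single inning.  The Python sketch below is one possible reading of them: the function name is made up, "baserunners" excludes home runs, and the left-on-base check assumes no runners are retired on the bases, which is a simplification.

```python
def constraint_violations(est_runs, baserunners, home_runs, complete_inning=True):
    """Return the names of the logical constraints an estimate breaks."""
    violations = []
    if est_runs < 0:
        violations.append("runs < 0")
    if est_runs < home_runs:
        violations.append("fewer than 1 run per home run")
    if est_runs > baserunners + home_runs:
        violations.append("runs > baserunners + homers")
    # Runners who neither scored nor (by assumption) made an out are left on base.
    left_on_base = baserunners - (est_runs - home_runs)
    if complete_inning and left_on_base > 3:
        violations.append("LOB > 3")
    return violations
```

For example, an estimate of 1.0 run for an inning with 6 baserunners and 1 home run implies 6 men left on base and is flagged, while 2.5 runs from 3 baserunners and 1 home run passes every check.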

As you note, all of these formulas work on the team seasonal level, and wouldn’t be seriously proposed if they didn’t.  And so in attempting to ascertain which model is the most useful, or the best theoretically, or the most practical, or whatever the objective may be, the seasonal accuracy tests are a small part of the big picture.  I think that is what many of the other commenters here are saying as well.


#62    Tangotiger      (see all posts) 2009/12/02 (Wed) @ 08:09

"The fact is that the TOP equation at the team-season level is clearly the most accurate: not by a huge margin, but not by a trifling one. “

I can get BaseRuns to be much better fit to the seasonal data by simply altering the weights such that the doubles one goes down.  Basically, you are best-fitting the data to your sample and then testing it against THE SAME SAMPLE!!  Well, gee, no wonder the best of all the equations is the one that Jarvis produces with a run value of a double that is .15 runs more than a single.

Eric: do you get this part?

***

And terps just said:

“Eric, see post #54. I just tested TOP against nearly 80000 team batting lines from 1993-2009.”

***

Can we agree on one thing: if we are going to test various metrics, then ALL of the metrics need to have been “best-fitted” against the SAME sample (and using IDENTICAL parameters), and then tested against OUT OF SAMPLE?  Can we agree on that?
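
A minimal Python sketch of that protocol follows: tune on one sample, then measure error only on data the tuning never saw.  The one-coefficient fit is a stand-in for whatever tuning step a given estimator uses (the names are illustrative assumptions), and the numbers are made-up toy values, not baseball data.

```python
def fit_scale(xs, ys):
    """Least-squares k for ys ~ k * xs (stand-in for any tuning step)."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

def mean_abs_error(predicted, actual):
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)

def out_of_sample_error(raw_estimates, actual_runs, is_training):
    """Fit on the rows flagged as training; score on the remaining rows."""
    rows = list(zip(raw_estimates, actual_runs, is_training))
    train = [(x, y) for x, y, flag in rows if flag]
    held_out = [(x, y) for x, y, flag in rows if not flag]
    k = fit_scale([x for x, _ in train], [y for _, y in train])
    predictions = [k * x for x, _ in held_out]
    return k, mean_abs_error(predictions, [y for _, y in held_out])

# Toy values: the first three rows are used for tuning, the last is held out.
k, held_out_error = out_of_sample_error([1, 2, 3, 4], [2, 4, 6, 9],
                                        [True, True, True, False])
```

Here the fit is perfect on the three tuning rows (k = 2.0, zero in-sample error) yet misses the held-out row by a full run, which is the gap an in-sample test cannot reveal.  For a fair comparison, every estimator would go through the same split with the same inputs.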

Otherwise, this discussion is going nowhere.


#63    Rally      (see all posts) 2009/12/02 (Wed) @ 13:14

Eric, I see no point in discussing the accuracy of a sample when people like terpsfan101 in #55 have already tested the equations against the entire population.

Sampling is done when a census is impractical, but if someone took up the monumental cost of obtaining a full census of weekly TV watching, and it disagreed with Nielsen’s results for that week, I know which one I’d go with.


#64    Eric Walker      (see all posts) 2009/12/02 (Wed) @ 23:38

Adam was kind enough to send me on a database of individual innings, covering the period 2000 - 2008 inclusive, and I have now run the three equations--TOP, BR, and XR--against it.  In the cases of TOP and BR, I optimized the coefficients for that particular period, which made small improvements for each.

The nay-sayers will doubtless be glad to hear that at this level, BR was better--by an average of a whole 0.00944114 of a run per inning.  Decisive.

The very short-form results were:

Innings: 389,042

 BR Average Error: 0.232548157    -------
TOP Average Error: 0.241989297    +0.00944114
 XR Average Error: 0.261624709    +0.02907655

I am working on a detailed write-up of all this, which I will put up on the web page cited earlier in this thread, and will post a note here when that is done.

My own feeling is that, had I a larger database (1955 - 2009, for example), it might have been closer or even reversed, because it looks to me (subjective impression) that the TOP coefficients vary less with period than does the key coefficient for BR.  That is the nominal 1.1 multiplier for B, of which Smyth said “If you want to tailor a version to a particular dataset (such as 1993-2004, or the 1975 AL), all you have to do is determine the overall B multiplier,” which I did (1.05069).

I also optimized BR for the same period TOP was initially optimized for (1955 - 2009), getting 1.0889 for that coefficient, and re-ran the original test.  The updated results don’t change much--

TOP Average Error Pct.: 2.32054013159
 XR Average Error Pct.: 2.53012140594
 BR Average Error Pct.: 2.68385581113
--which is another reason I suspect that inning-level results over a broader database would slightly degrade BR’s results.  By that, I mean it seems somewhat more sensitive to differences in play environment, and the longer period, 55 years, encompasses three different levels of scoring (1955-1976, 1977-1992, 1994-2009—1993 was a “blended” year, probably owing to the new ball being introduced gradually).

Anyway, I cannot see how anyone can say that one or another of these three formulations is obviously better or worse than another, or that any of them “break down” at the atomic inning level.

(I also wish I had an extensive game-level DB.)


#65    Eric Walker      (see all posts) 2009/12/03 (Thu) @ 01:41

I have now updated that page, as promised; you can get all the gory details there.  I will doubtless add more some other time, perhaps tomorrow, but I am exhausted.

(But I again ask if anyone has either a by-innings DB including more or other than 2000 - 2008, or a by-games DB of any sort, I would be deeply gratified if it--or they--could be passed to me.)


#66    Eric Walker      (see all posts) 2009/12/03 (Thu) @ 02:13

Somehow, in my fatigue (donated blood earlier, maybe that’s it) I missed terpsfan101’s two posts.

Re “I tested Eric’s TOP no error data formula using gamelogs from Retrosheet 1993-2009 (79140 team batting lines). I also tested the Baseruns equation that he has listed in his article. I get an average error of 1.3 runs per game with TOP, and an average error of 1.23 runs per game with Baseruns. Of course you could reduce the average error even more by using a more finely tuned Baseruns formula.”

Is it possible to pass the DB along?  I’d like to try it myself and see if we are doing different things.  I ask especially owing to:

“Applying Eric’s TOP to team seasonal data from 1955-2008 (excluding 1981 and 1994), I get an average error of 22.2 runs per team season. With Baseruns, I get an average error of 18.8 runs per team season.”—because there I have the data, and those aren’t the results I get.

For BaseRuns, using optimized coefficient:

Cumulative Error: 331
Per-TmYr Error: 0.243024963289
Average Error Size: 18.6292217327
Average Error Pct.: 2.68385581113
Standard Deviation: 23.39814483
Negative: 50.1%
Zero: 1.4%
Positive: 48.5%

For TOP, errors estimated:

Cumulative Error: -2602
Per-TmYr Error: -1.91042584435
Average Error Size: 18.3142437592
Average Error Pct.: 2.63831600555
Standard Deviation: 23.3107619739
Negative: 52.4%
Zero: 2.1%
Positive: 45.5%

In short, 22.2 and 18.8 versus 18.3 and 18.6; in fact, just to see if the strike-shortened years (and 2009) make any material difference, I re-ran without those; I get 18.5 for TOP and 18.8 for BaseR.  Clearly, one of us is doing something wrong, and the difference is major.

I hate to lengthen this post, but here are my PHP code snippets for each (the rest of the scripts are identical):

TOP
---

    // Estimate Errors:
    $bip=($ab-$hits)+$sh+$sf;
    $eb=$erate*$bip;


    // Calculate Runs:
    $obraw=$hits+$bb+$hb+$ci+$eb;
    $ob2=$obraw-$hr-$cs;
    $rlob=(($Kob*$ob2)+$hr);  // calculated, not true, raw
    $rlob=round($rlob);  // calculated, not true, rounded
    
    $wtb=($K1*($sgl+$eb))+($K2*$dbl)+($K3*$tpl)+($K4*$hr);
    $bbhb=$bb+$hb;
    $factor=$wtb+($Kbbhb*$bbhb)+($Ksh*$sh)+($Ksb*$sb);
    $factor=$factor/$truepa;
    $block=$factor*$Kslope;
    $multiplier=$block+$Kb;
    $proj=(($rlob-$hr)*$multiplier)+$hr;
    $proj=round($proj);

where:
  $erate=0.0176734746015;
  $Kob = 0.907527925021;
  
  $Kbbhb=1;
  $K1=2.38782;    // 1.00
  $K2=3.37;       // 1.41
  $K3=6.09;       // 2.55
  $K4=3.7704;     // 1.58
  $Ksb=1.52;
  $Ksh= -0.4859;
  $Kslope=0.499377343455;
  $Kb= -0.0600151086521;

BaseRuns:
---------

    $parta=$hits+$bb+$hb-$hr-(0.5*$ibb);
    $partb=$Kbr*((1.4*$tb)-(0.6*$hits)-(3.0*$hr)+(0.1*($bb-$ibb+$hb))+(0.9*($sb-$cs-$gdp)));
    $partc=$ab-$hits+$cs+$gdp;
    $partd=$hr;
    $proj=($parta*($partb/($partb+$partc)))+$partd;
    $proj=round($proj);

where:
  $Kbr=1.0889;


#67    dave smyth      (see all posts) 2009/12/03 (Thu) @ 18:57

I think the most telling point was from Patriot way back in post #1.

Essentially, TOP is RC, in the same or similar way that Base-Out % is Total Avg and XR is ERP. The underlying structure is the same, implying that the underlying concept is the same. The only difference is in the details. A few things that TOP does, such as the constants, apparently result in its showing more accuracy than RC in the usual tests. But IMO these modifications to RC also make TOP somewhat ‘uglier’ than RC (no offense, Eric). It’s up to the user to decide what to make of that tradeoff.

At some point Eric should have realized that TOP is really a dressed-up version of RC and acknowledged that (maybe he has, but I didn’t see it in the linked article). And it doesn’t really matter that TOP was apparently developed independently of, and contemporaneously with, RC. For reasons we all understand, RC became the brand-name for this particular construction.

And it’s true that there are also similarities between RC and BsR. That’s because I started out by trying to improve the RC model. But IMO, the changes I made changed everything, as those who have worked with BsR are aware. Enough so that BsR should be considered an independent entity, instead of a knock-off of RC. If someone believes otherwise, that’s fine with me. It’s quite true that I might have never developed BsR without RC being there to start from. So, it depends on where someone chooses to draw the line between original concept vs refinement of existing concept.


#68    Tangotiger      (see all posts) 2009/12/03 (Thu) @ 20:22

I would characterize BsR as being inspired by RC, but nothing more.  It does two things that put it in the genius category:

1. it separates out the HR; as we know, especially at the game level, RC can’t handle innings with multiple HR (3+), while BsR handles it cleanly.  Proof is here:
http://www.tangotiger.net/rc3.html

And since I know that some people don’t bother clicking links, here’s the relevant data:

Runs Scored, breakdown by HR hit

HRclass        n      R    BsR   LWTS     RC
      0   33,068   3.08   3.06   3.79   3.03
      1   23,117   4.62   4.62   4.44   4.66
      2    9,218   6.12   6.12   5.00   6.41
      3    2,838   7.65   7.65   5.62   8.37
      4      687   9.03   9.00   6.07  10.29
      5      146  10.55  10.49   6.73  12.45
      6       40  12.33  12.32   7.52  15.35
      7        9  16.22  14.32   8.34  18.27
      8        2  14.00  15.87   8.58  22.52
     10        1  18.00  18.30   9.51  27.03

This, by itself would have been enough to differentiate BsR from RC.  But BsR goes further:

2. It uses the B/(B+C) construction, which makes the score rate a true rate (bounded by 0 and 1).  Imagine that: a run estimator that says that you can’t score more runs than there are runners on base, and you can’t score fewer than 0 runs.

As shockingly simple as that statement is, Linear Weights and RC fail to satisfy this condition.  Proof?  READ THE ARTICLE I linked to above.
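The bounding argument can be sketched with a deliberately extreme case. This is only an illustration: the inning below is hypothetical, the Runs Created form is the basic (H + W) * TB / (AB + W) version, and the B value passed to the Base Runs skeleton is a placeholder rather than anyone's fitted weight.

```python
def basic_runs_created(h, w, tb, ab):
    """Basic Runs Created: (H + W) * TB / (AB + W)."""
    return (h + w) * tb / (ab + w)


def base_runs(a, b, c, d):
    """Base Runs skeleton: A * B / (B + C) + D."""
    return a * (b / (b + c)) + d


# Hypothetical extreme inning: five batters, five home runs, no outs yet.
h, w, tb, ab, hr = 5, 0, 20, 5, 5

rc = basic_runs_created(h, w, tb, ab)

# A = baserunners excluding homers, C = outs, D = homers.
# The B value is a placeholder; with A = 0 it cannot affect the result.
bsr = base_runs(a=h + w - hr, b=7.0, c=ab - h, d=hr)

print(f"actual runs: 5  RC: {rc:.1f}  BsR: {bsr:.1f}")
```

Here the basic RC form credits 20 runs to five batters, while the B/(B+C) form cannot exceed A + D because the rate stays between 0 and 1.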

***

I’ll also say as the biggest LWTS fan then, now, and forever, I was totally against even seeing David’s formula to begin with.  I finally relented, and boy was I glad I did.  Such a simple little construction that is rooted in logic (a HUGE plus for me), and works in extreme game scenarios, and works in normal game scenarios.  There’s nothing not to like about it.

BsR is not my baby, but I’ll champion it as much as I do LWTS.


#69    Eric Walker      (see all posts) 2009/12/03 (Thu) @ 20:51

As I said in the original article that started all this brouhaha, there really are only two basic approaches, multiplicative and linear, and members of each family will necessarily bear strong resemblances to one another, whether derived from others or independently arrived at.

The multiplicative approach, which TOP, RC, BsR, and others share, is loosely probabilistic, and reckons the chances for a batter becoming a run as the independent chances for his getting on base (and not then being thrown out there) and then for being advanced around the bases, the result naturally being a multiplication of those chances.  All the rest is detail.  Speaking of an “original” approach to reckoning run-scoring is like speaking of an “original” equation for the force of gravitation.

But the processes of run-scoring are less clear than those of gravitational attraction, and probabilistic to boot, so formulations varying in detail remain “competitive”.  As I first wrote many a year ago, the chief difference from one multiplicative formulation to another is the reckoning of the elements involved in advancing runners, and of their relative significances.  And the only test I know of for such formulations is the extent to which they can correctly describe the phenomena they purport to describe.

Most every multiplicative formulation will have a leading term that is the probability of reaching base (or number of men who did, same principle); some will allow in that term for the chances of subsequently being thrown out on the bases, others will let that come out in the wash, so to speak, by some multiplier constant elsewhere.

When it comes to the advance factors, one of the things about the multiplicative approach is that the coefficients for certain of them will not represent merely their relative weight in advancing runners; that is because in the on-base portion, all reaching-base events are treated equally, but obviously a man who reaches base by a triple will be much easier to advance to a run than one who reaches on a single--thus, the coefficient for triples in the advance factor will be artificially blown up to allow for that as well as the actual advance value of runners in general.

The thrust of the comments here, as I take it, is that a run-predictor, to be thought likely to be describing the actual processes of scoring, needs to work well at the atomic level of scoring, the inning.  It is hard to imagine a formulation that works well at that level not being a likely description of how runs score, as it is impossible to imagine one that does not work well at that level being such a description.

Over some 389,042 innings, both BaseRuns and TOP quite accurately predict runs scored per inning, with a difference between them of about .009 run/inning.  I would be interested in applying both over longer periods than 9 seasons’ worth, but I suspect there would be no earth-shattering changes.  What all that may signify is open to discussion, but so far as I can honestly see, both formulations seem to well describe the processes of run-scoring at all levels, from the atomic through the seasonal.

My understanding of the BsR methodology comes mainly from this page:

http://gosu02.tripod.com/id108.html

(I have a few quibbles with what it says there.  For example, if we know how many outs the team made, we always know the exact number of base runners thrown out on the bases, because PA=R+LOB+Outs, so R+LOB=PA-Outs, and R+LOB is exactly the net number of base runners, so we have an exact “A” factor without need for empirical measures.  But on the whole, it is lucid and reasonable.)

Now I realize that that is not Mr. Smyth speaking.  But I abstract this remark from there:

An important note here is that the use of B/(B+C) is not an inevitable one. Any formula that accurately estimates the percentage of baserunners that will score could be used. However, the basic B/(B+C) model developed by Smyth is the most accurate currently known. It may well be possible to improve the accuracy, but it would probably involve a much more confusing or expansive formula. The important point is that B/(B+C) is used because it has been empirically shown to work.

OK, TOP uses a somewhat--not drastically--different model, not (I feel) notably more confusing or expansive. (Though what is wrong with “expansive” is unclear to me.) Indeed, I could do more, fooling with IBB vs. UBB and SO, for example.  But just as is, it is perfectly competitive with BR.  So, again, why all this angst?


#70    Zach      (see all posts) 2009/12/03 (Thu) @ 21:15

Eric/64: Although it seems very small, a .00944 runs-per-inning difference is roughly 13.8 runs per season per team, about 1.3 wins.
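The scaling behind that figure can be reproduced in a few lines. This is a sketch of the arithmetic only: the nine-inning game, the 162-game schedule, and the 10-runs-per-win rule of thumb are assumptions not stated in the thread, and the calculation does not show that per-inning average errors actually accumulate this way over a season.

```python
GAP_PER_INNING = 0.00944114   # TOP minus BR average error, from comment #64
INNINGS_PER_GAME = 9          # assumption: regulation-length games only
GAMES_PER_SEASON = 162        # assumption: full modern schedule
RUNS_PER_WIN = 10.0           # assumption: common rule of thumb, not from the thread

runs_per_season = GAP_PER_INNING * INNINGS_PER_GAME * GAMES_PER_SEASON
wins_per_season = runs_per_season / RUNS_PER_WIN

print(f"runs per team-season: {runs_per_season:.1f}")
print(f"wins per team-season: {wins_per_season:.1f}")
```

Under these assumptions the result lands in the neighborhood of the 13.8 runs and roughly 1.3 wins quoted above.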


#71    Patriot      (see all posts) 2009/12/04 (Fri) @ 00:18

“OK, TOP uses a somewhat--not drastically--different model, not (I feel) notably more confusing or expansive. (Though what is wrong with ‘expansive’ is unclear to me.) Indeed, I could do more, fooling with IBB vs. UBB and SO, for example.  But just as is, it is perfectly competitive with BR.  So, again, why all this angst?”

The TOP formula that you used in the article just a week or two ago did not include any special treatment of the home run--it was an A*B/C model just like Runs Created.  Now you’ve split the homer out, which means that you are incorporating one of the two main structural advantages of David’s Base Runs model (the other being capping the score rate at 100%).

So what we have here is a newly proposed model, that doesn’t cap score rate at 100%, and has not (as of yet, at least) presented a compelling reason to supplant the existing model.  I’m not sure what kind of response you’re looking for, exactly.


#72    Eric Walker      (see all posts) 2009/12/04 (Fri) @ 06:43

“Although it seems very small, a .00944 runs-per-inning difference is roughly 13.8 runs per season per team, about 1.3 wins.”

That’s true, but in reality at the team-season level TOP actually works better than BsR.

“The TOP formula that you used in the article just a week or two ago did not include any special treatment of the home run”

True.  As I said somewhere far upthread, I was and am still occasionally poking around with it.  I made two changes at the same time (which, in retrospect, was unwise): I added opponents’ errors into advance events on the same basis as a single, and I separated out the home run.  The results were a hair better--not drastically--and I have never gone back to see which change did what.

That is not to diss the idea of pulling out the home run for special treatment: whoever first thought of that, whether Mr. Smyth or another, gets points.  It probably makes more difference at the inning level anyway.

But as to the main “structural advantages”, what do we make of the previously quoted remark that the crux is the method of estimating the percentage of baserunners that will score?  Is that a misrepresentation?  As to capping, applying simple, brute-force capping (runs cut off at the number of R+LOB or boosted to prevent more than 3 LOB/inning) made about .001 of a run per inning difference, which scarcely seems crucial.  (And I did not include that: run “capped”, TOP is more like .008 of a run behind.)

Is there a compelling reason to “supplant” BsR?  Hell, no, and I’d like to see where someone thinks I said that there is.  My point, yet again, is that there are two basic methodologies--multiplicative and linear--and in the better versions of each no dramatic differences in performance.  I did and do say that the TOP method seems closely comparable in accuracy to BsR (and XR), a hair worse at the inning level, rather better at the team-season level, and with suggestions that it is better at the game level.  (Game DB, anyone?)

The only “response” I’m looking for is to not have TOP flippantly dismissed with snarky remarks as not belonging in the same room with the fair-haired, blue-eyed Princess.  I didn’t start this thread.  I have immense respect for Tom Tango, and when people email me asking how and where to find out more about analysis, my unfailing answer to them has been, is, and will continue to be “Buy The Book.” But apparently my casual citation of his support of BsR--“models the reality of the run-scoring process significantly better than any other run estimator”--followed by “we shall see”, got right up his nose.  Well, at the team-season level, BsR doesn’t.  At the game level we can’t be sure (or I can’t, anyway, with such data as I have), but a random sampling suggests that it doesn’t.  At the inning level it does, at least for a restricted era, but minutely.

While I have what Nero Wolfe calls “a robust ego”, it is still no great pleasure to read screed after screed implying that one is a dunce, and a mean-spirited one at that, when the hard numbers fail to support at least the first part of that characterization.  And, I repeat, it didn’t start with me.


#73    Tangotiger      (see all posts) 2009/12/04 (Fri) @ 07:32

Any equation will work just as well as the other, when it comes to typical baseball innings, games, seasons, teams, players, and whatnot.  That’s because there is not such great variation.  And in the cases where there is great variation, the frequency of those is so little as to not make a big difference.  A great run estimator gets an average error of 18 runs per team while a bad one that has no basis in logic other than “bigger is better” will get you 25 runs.  Big deal, frankly.  Just use OBP and SLG and be done with it.

But, each and every one exhibits SOME bias of some kind.  All the multiplicative ones that don’t pull out the HR show a bias toward giving too much run impact to the HR.  So, if you have a focus on games where lots of HR are hit, you will get a bad estimate.  This has been proven and cited above.

So, if you have two similar overall results and one shows bias and the other doesn’t, which do you take?

***

Furthermore, it is bad form to best-fit an equation to a sample, and then test it on that same sample.  This is what I keep saying.  You cannot take team-level seasons for TOP and take play-by-play for BsR, get them each best fitted for those samples, and then test both at the team-level seasons that TOP used.  This is wrong and bad.  But, this is exactly what Eric did.

***

So, my issues are: poor testing, and no attempt to figure out where the biases lie in the equations.

As a proof, change the coefficient for the triple to just about anything (from the value of the double to the value of the HR, while recentering the formula so it matches the league mean).  How much difference does it make?  This is my point about the biases and average error.  The average error will barely move (maybe by 1 or 2 runs?), but now there’s tremendous bias.  But, you can’t see it in the equation.

The test to see the bias is simply to “add 1” to a team total, see what the resultant difference in runs is, and that tells you the run impact of the event.
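A minimal sketch of that “add 1” test, using the BaseRuns formula Eric posted in #66 (with his 1.0889 multiplier) ported to Python. The team line is hypothetical, invented only to have something to bump, and the rounding step in the PHP version is dropped here because it would swallow differences smaller than a run.

```python
def base_runs(t, k_br=1.0889):
    """Python port of the BaseRuns snippet in #66, without the final round()."""
    a = t["h"] + t["bb"] + t["hb"] - t["hr"] - 0.5 * t["ibb"]
    b = k_br * (1.4 * t["tb"] - 0.6 * t["h"] - 3.0 * t["hr"]
                + 0.1 * (t["bb"] - t["ibb"] + t["hb"])
                + 0.9 * (t["sb"] - t["cs"] - t["gdp"]))
    c = t["ab"] - t["h"] + t["cs"] + t["gdp"]
    d = t["hr"]
    return a * b / (b + c) + d


# What one extra event of each kind adds to the raw team totals.
EVENTS = {
    "walk":   {"bb": 1},
    "single": {"ab": 1, "h": 1, "tb": 1},
    "double": {"ab": 1, "h": 1, "tb": 2},
    "triple": {"ab": 1, "h": 1, "tb": 3},
    "hr":     {"ab": 1, "h": 1, "tb": 4, "hr": 1},
}


def plus_one_values(team):
    """Run value of each event: estimate with one more of it, minus baseline."""
    baseline = base_runs(team)
    values = {}
    for name, bump in EVENTS.items():
        bumped = dict(team)
        for stat, amount in bump.items():
            bumped[stat] += amount
        values[name] = base_runs(bumped) - baseline
    return values


# Hypothetical team-season totals (not taken from any real team).
TEAM = {"ab": 5500, "h": 1450, "tb": 2300, "hr": 170, "bb": 550,
        "ibb": 40, "hb": 50, "sb": 100, "cs": 40, "gdp": 125}

values = plus_one_values(TEAM)
for name, value in values.items():
    print(f"{name:>6}: {value:+.3f}")
print(f"double minus single: {values['double'] - values['single']:.3f}")
```

The same loop works for any estimator that takes a team line, so swapping in a different formula shows its implied run values directly.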

And if the formula says that the run value of a double minus a single is .15, this is wrong.  100% totally wrong.  How do I know this?  Because the chance of a guy scoring from 1b is around .26 and it’s around .43 from 2b.  Just the getting on base portion is .17 of a difference.  And what about the moving runners over?

So, my “angst” is with continually looking at the average and standard errors as if we’ll learn something.  We learn almost nothing here.  Let’s get past this average error business.  Let’s get to logic.

I’ll ask a simple question: on average, how many runs is a double worth more than a single?

Question 2: And what does TOP say?

Answer those questions, this thread moves forward, and my angst is gone.

If not, then we’re stuck in quicksand.  And, I’m not happy being in quicksand.

***

Thanks for the kind words!  I believe I’ve said some nice things about the Sinister book by Eric.  You should republish it (Lulu.com, wowio.com).


#74    Tangotiger      (see all posts) 2009/12/04 (Fri) @ 10:21

Btw, kudos to Eric for bearing with me.  I know I can be tough, but if you continue to grant me some latitude, the payoff will be there.

***

I remember Howard Stern taking a call from a new listener, and she said she couldn’t stand him, that he was terrible. And Howard agreed. 

Paraphrasing, he said:
“Do me a favor.  Listen to me for 2 weeks.  You will be disgusted, you will hate me.  You will think I’m a terrible person.  And by the end, you will be laughing, and you will tune in forever.”

Howard is great.  Once you give in, the light turns on. 

***

That’s all I ask, Eric.  Answer my questions, follow me a little bit, and then, if you finally get what I get, all the better.  If not, then at least you gave me your best effort, and we can each go on our own merry way.


#75    Patriot      (see all posts) 2009/12/04 (Fri) @ 12:17

I ran Eric’s new version through the +1 methodology, as best as I could (these are for the 2009 AL).  This was for the one based on team seasonal data, although I don’t think using the inning version would change things much as the coefficients are very similar:

.57 S, .71 D, 1.10 T, 1.42 HR, .37 W+HB, -.19 SH, -.116 AB-H+SF, .22 SB, -.35 CS, .69 E


#76    Tangotiger      (see all posts) 2009/12/04 (Fri) @ 14:46

”.57 S, .71 D, “

Eric: what Patriot is saying here is that the implied run value of the single is .57 according to TOP and it’s .71 for the double.  That’s a difference of .14 runs.  Does this not bother you?

(IIRC, Bill James’ latest version of RC has a ridiculously high run value for the single as well, presumably because he best-fitted against more recent team-level seasonal data.)

And I don’t know how the error’s run value can almost match that of a double either.  But that’s what TOP is saying.


#77    Patriot      (see all posts) 2009/12/04 (Fri) @ 15:28

Tango, the ROE value is a product of a very clumsy and careless presentation on my end.  I didn’t split everything into mutually exclusive categories.  The .69 value is right, but I still had it counted in the AB-H value.  So the error is actually .69 + -.12 = .57--the same as the single.  Eric gives each of them a 2.39 weight in the advancement factor. 

I should have caught that.


#78    Tangotiger      (see all posts) 2009/12/04 (Fri) @ 15:47

Great, thanks.


#79    terpsfan101      (see all posts) 2009/12/04 (Fri) @ 16:07

Eric, I just checked my work and I don’t believe there are any errors in the formulas I tested. The version of TOP I used was the TOP no error data formula posted in the baseballanalyst article. The .2 runs difference we get from Baseruns could be due to the fact that I used the sum of individual stats grouped by team, rather than stats from the Teams table of the Baseball Databank database.

To get the error for each game/team-season, I took the absolute value of estimated runs minus actual runs scored. To get the average error, I divided the sum of errors by the number of games/team-seasons.

Eric, if you would like a games database, download the gamelogs from Retrosheet, and I will find you the link to a database shell that was posted by Brian Cartwright some time ago.


#80    terpsfan101      (see all posts) 2009/12/04 (Fri) @ 16:19

Eric, here is a database shell for the Retrosheet gamelogs:

http://www.mediafire.com/?nkoyt2ljmmz

I also recommend that you download this program which merges multiple text files into a single text file. This way you don’t have to import the gamelogs one year at a time.

http://bluefive.pair.com/txtcollector.htm


#81    dave smyth      (see all posts) 2009/12/04 (Fri) @ 20:24

Tango/68....

Thanks for the kind words. ...


#82    Eric Walker      (see all posts) 2009/12/04 (Fri) @ 23:15

Just a brief note to thank the several people who have posted since my last.  At this moment, I haven’t time to do anything about either answering questions or using the DBs, but I hope to this weekend, or maybe even late tonight, and look forward to it.  (But I’ve got to get my Christmas shopping done.) I just didn’t want to leave the impression that I am not grateful for them.


#83    Brian Cartwright      (see all posts) 2009/12/05 (Sat) @ 03:14

terps, that’s what command line ‘for loops’ are for

put this into a batch file

for %%f in (53 54 55 56...) do call cwevent…

then you will have 50+ text files

copy retro*.csv all.csv

will also do the trick...then import the ‘all.csv’ or whatever you care to name it into the db
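For anyone not on Windows, the same merge step can be sketched in Python. This is a rough equivalent under stated assumptions: the `merge_files` helper and the tiny demo files are hypothetical, and only the `retro*.csv` and `all.csv` names come from the tip above.

```python
import tempfile
from pathlib import Path


def merge_files(source_dir, pattern, output_name):
    """Concatenate every file matching pattern into one output file."""
    source_dir = Path(source_dir)
    output_path = source_dir / output_name
    # Collect inputs first, and never read the output file back into itself.
    inputs = sorted(p for p in source_dir.glob(pattern) if p != output_path)
    with output_path.open("wb") as out:
        for path in inputs:
            data = path.read_bytes()
            out.write(data)
            # Keep the last record of one file from running into the next.
            if data and not data.endswith(b"\n"):
                out.write(b"\n")
    return [p.name for p in inputs]


# Demo on two made-up files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    for year, line in (("53", b"game-1953\n"), ("54", b"game-1954")):
        (Path(tmp) / f"retro{year}.csv").write_bytes(line)
    merged_names = merge_files(tmp, "retro*.csv", "all.csv")
    merged_bytes = (Path(tmp) / "all.csv").read_bytes()

print(merged_names)
print(merged_bytes)
```

Copying bytes rather than decoded text avoids having to guess the encoding of the source files.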


#84    terpsfan101      (see all posts) 2009/12/06 (Sun) @ 05:55

Thanks for the tip, Brian. I’ll try it out next time I process Retrosheet files. I barely know how to use the command line. I got lazy when Windows 95 came out, and have pretty much forgotten all the DOS I learned as a kid.


#85    Eric Walker      (see all posts) 2009/12/06 (Sun) @ 23:05

Oops, sorry: meant to mention that the linked software is Windoze stuff, and I run on Ubuntu Linux.  Gaze into the crystal: I see Retrosheet downloads and processing in my future. . . .


#86    Brian Cartwright      (see all posts) 2009/12/06 (Sun) @ 23:21

The power of Windows is that I can have multiple command prompts open at the same time!


#87    Eric Walker      (see all posts) 2009/12/07 (Mon) @ 19:20

Hmm.  I thought I had put up a substantial post yesterday evening, but checking back, I don’t see it.  Might it still somehow be pending, or otherwise recoverable?  I can re-make it, but that will take me a while.


#88    Eric Walker      (see all posts) 2009/12/07 (Mon) @ 19:28

I meant to add: if anyone wants the exact team-seasons database I’m using, it’s available, compressed in ZIP format, at:

http://highboskage.com/From1955.ByTeam.zip

Or uncompressed as:

http://highboskage.com/From1955.ByTeam


#89    Tangotiger      (see all posts) 2009/12/08 (Tue) @ 00:32

Eric,

Sometimes, rarely, this software eats up posts.  Invariably, it’s because of links. 

There’s nothing in the queue.  I remind people to copy/paste before posting.


#90    Eric Walker      (see all posts) 2009/12/11 (Fri) @ 08:00

I finally found a little time for some more work, so here are results.

I combined two goals here.  First, as I was asked for “+1” values, I have supplied them.  Second, as there was criticism of testing a data batch with coefficients developed on that batch, I undertook two other runs: first, I optimized coefficients (for both BsR and TOP) using data from 1955 - 2004, then used those coefficients to run against data from 2005-2009; then I sliced off the other end, getting the coefficients for data from 1960 - 2009 and using them on data from 1955 - 1959.  Thus, we have two runs over 5-year blocks using coefficients not derived from those blocks or any period containing those blocks, so there is no question of custom-fitting coefficients to particular data.  The coefficients were garnered over 50-year spans, including three different baseballs and shifting prevailing emphases on small-ball and big-inning strategies, yet applied to 5-year spans lacking such diversity (a worst-case approach).

Here are the overall 1955 - 2009 results, plus those from those two separated runs:


Test run on 1955 - 2009 using coefficients optimized for 1955 - 2009
====================================================================
Method  Error Pct.  Error Size        "+1" values
        quantized     runs       bb    1b    2b    3b    hr
-----------------------------------------------------------
 TOP     2.308       15.933    0.36  0.55  0.68  1.06  1.41
 BsR     2.684       18.629    0.33  0.49  0.80  1.11  1.45


Test run on 2005 - 2009 using coefficients optimized for 1955 - 2004
====================================================================
Method  Error Pct.  Error Size        "+1" values
        quantized     runs       bb    1b    2b    3b    hr
-----------------------------------------------------------
 TOP     2.364       17.907    0.37  0.55  0.70  1.09  1.42
 BsR     3.411       25.527    0.35  0.50  0.81  1.12  1.44


Test run on 1955 - 1959 using coefficients optimized for 1960 - 2009
====================================================================
Method  Error Pct.  Error Size        "+1" values
        quantized     runs       bb    1b    2b    3b    hr
-----------------------------------------------------------
 TOP     2.352       15.813    0.36  0.55  0.68  1.07  1.42
 BsR     2.674       18.150    0.33  0.49  0.81  1.13  1.46

As to the so-called “+1” values: they are, in engineering terms, reasonably close.  The ratios from one equation to the other are closer to 1 than to 2, much less an order of magnitude or some such.  If there is some a priori way of knowing which set is “righter” than another--besides the obvious differences in accuracy in the formulae generating them--I have not seen it here.  The numbers I have seen elsewhere for such values are all single-season results, which vary at least somewhat even between successive seasons (and in some cases, different analysts report different numbers for the same season), so how they might compare to 55-year numbers is unclear (at least to me).

As before, whether or not the stats are in the period optimized for, there is a consistent difference in accuracy.

But another criticism levelled was that supposedly “forced” team-season success will break down at the game and inning level because the coefficients don’t “really” represent baseball interactions.  Here are the results for the 389,042 innings of the period from 2000 through 2008, using the 55-year coefficients in both formulae:

  TOP runs error: 0.282
  BsR runs error: 0.305

So, at the inning level as at the team-season level, the difference in accuracy carries through, and in roughly the same proportion.


#91    Tangotiger      (see all posts) 2009/12/11 (Fri) @ 11:09

Eric, good job.  This is exactly what was needed to move the discussion forward.  Thanks for your efforts.

First off, can you publish the actual equations from the first table you did?  A run value of 1.45 for the HR for BaseRuns is not “optimized”.  I presume you limited yourself to TB, and H, and BB, and didn’t separate the TB into D, T, HR as the reason.  The run value of the HR needs to be pretty much on 1.40.

Secondly, can you also post your exact dataset that you used?  You can email it to me, and I can post it on my site if you like.  Or post it on your site.  This is so that we can all be working off the same dataset.

Finally, are you prepared to accept that the run value of the single is .55 runs and the double is .68 runs as being logical?  Or, are you saying it is obviously illogical, but you are accepting that it’s telling us something more that requires you to keep it that low?


#92    terpsfan101      (see all posts) 2009/12/11 (Fri) @ 19:14

Eric is probably using the version of baseruns that he posted in his Baseball Analyst article. I think he is also reconciling the B factor to equal runs scored. That is why he is getting a LW value of 1.45 runs for the HR from 1955-2009. Here are the +1 values I get from 1955-2009 using Baseruns with an unreconciled B factor and a reconciled B factor:

Unreconciled  Reconciled  Event
        0.47        0.49  1B
        0.77        0.80  2B
        1.07        1.12  3B
        1.43        1.45  HR
        0.31        0.33  NIBB+HBP
        0.15        0.16  IBB
        -.09        -.09  OUT
        0.19        0.20  SB
        -.28        -.29  CS+GIDP
      910993      955780  BsR
      955780      955780  R

Eric, I suggest you change .9*(SB-CS-GIDP) to .9*(SB+SH-CS) in the b factor, so both your TOP equation and Baseruns equation use the same variables. Then change the c factor to (AB-H+SF+SH+CS). Finally, to be fair, you need to use the TOP equation that doesn’t use ROE data, since Baseruns isn’t using any ROE data in this case.


#93    dave smyth      (see all posts) 2009/12/11 (Fri) @ 20:21

Here’s how I suggest that the best comparison should be constructed.
1) For both metrics, use a version which includes only the major categories of BB, 1B, 2B, 3B, HR, and (AB-H). Including dependent outcomes like GDP or SF only serves to ‘fuzzy-up’ the direct comparison of the formula structures.
2) For the BsR B factor, find the best weights for each element by regression, and do the same for the advancement part of TOP
3) This may be redundant after #2, but make sure both formulas are precisely tuned to the same dataset.

I am not saying that regression is the best way to construct a final version, of course. But if you are comparing a formula which is free to vary the advancement weights according to a best fit vs a formula which has to hold the HR at 1.4, etc., then the former has an unfair advantage in the testing which has nothing to do with the intrinsic quality of the metrics.

IOW, make sure you are comparing apples to apples, just for testing purposes.


#94    dave smyth      (see all posts) 2009/12/11 (Fri) @ 20:31

And maybe construct the values based on the even years 1955-2009, and then test on the odd years. And similarly for tests using games or innings.
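A minimal sketch of the split described above, assuming team-season rows are held as dictionaries with a `year` key. The function name and row layout are hypothetical, chosen only for illustration.

```python
def split_by_year_parity(rows, fit_on="even"):
    """Split rows into (fit, test) sets by the parity of row['year']."""
    if fit_on not in ("even", "odd"):
        raise ValueError("fit_on must be 'even' or 'odd'")
    want = 0 if fit_on == "even" else 1
    fit = [r for r in rows if r["year"] % 2 == want]
    test = [r for r in rows if r["year"] % 2 != want]
    return fit, test


# Hypothetical rows: one placeholder per season, 1955 through 2009.
rows = [{"year": y} for y in range(1955, 2010)]
fit, test = split_by_year_parity(rows, fit_on="even")
print(len(fit), len(test))
```

Coefficients would be derived from `fit` only and the error measured on `test` only, so the fitted weights never see the seasons they are judged on.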


#95    Patriot      (see all posts) 2009/12/11 (Fri) @ 21:08

I constructed a test along the lines of what David suggested: I limited each model to S, D, T, HR, W, and (AB-H).  I calculated the B factor for each equation that would make runs = estimated runs for each team, and regressed that against S, D, T, HR, and W for odd years in the 1955-2008 period.  I then tested the equations against the even years in 1955-2008.  I only did team seasonal data, as I don’t have a comprehensive game and inning database handy at the moment.

I realize that Eric makes some adjustments to the TOP factors, even with limited data, that have not been replicated here, so consider this a test of a generic baserunner*advancement/PA + home run model.
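The "B factor that would make runs = estimated runs" can be obtained by solving each model's equation for B. The sketch below assumes the forms R = A*B/(B+C) + HR for BsR and R = A*B/C + HR for the ratio-style models (with HR set to 0 for Runs Created, which has no separate home run term). The function names and sample numbers are hypothetical.

```python
def implied_b_bsr(runs, a, c, hr):
    """Solve R = A*B/(B+C) + HR for B."""
    n = runs - hr                 # runs scored by baserunners other than the HR hitter
    return n * c / (a - n)


def implied_b_ratio(runs, a, c, hr=0.0):
    """Solve R = A*B/C + HR for B (pseudo-TOP with hr given, RC with hr=0)."""
    return (runs - hr) * c / a


# Invented team-season values, for illustration only.
a, c, hr, runs = 1860, 4030, 160, 760
b = implied_b_bsr(runs, a, c, hr)
print(b, a * b / (b + c) + hr)    # second value reproduces `runs`
```

The implied B for each team can then be regressed on S, D, T, HR, and W with ordinary least squares to get the per-event weights.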

For the BsR model, we have:

A = H + W - HR
C = AB - H
B = 0.866 S + 1.93 D + 3.81 T + 1.97 HR + 0.174 W
RMSE = 22.84
AAE = 18.97

For the pseudo-TOP model, we have:

A = H + W - HR
C = AB + W
B = 0.899 S + 1.66 D + 2.93 T + 1.68 HR + 0.432 W
RMSE = 24.33
AAE = 18.97

I also ran a Runs Created model, just for the heck of it:
A = H + W
C = AB + W
B = 0.981 S + 1.50 D + 2.83 T + 3.69 HR + 0.385 W
RMSE = 24.38
AAE = 19.14

What’s interesting here, I think, is that all of the models are close in terms of average error, but BsR has a little separation when RMSE is the standard. 
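For readers unfamiliar with the two error measures, here is a small sketch of why they can disagree: AAE weights every miss equally, while RMSE penalizes large misses more heavily. The error lists are invented for illustration.

```python
import math


def rmse(errors):
    """Root mean squared error of a list of (estimate - actual) values."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))


def aae(errors):
    """Average absolute error of a list of (estimate - actual) values."""
    return sum(abs(e) for e in errors) / len(errors)


steady = [10, -10, 10, -10]   # always off by ten runs
spiky = [0, 0, 0, 40]         # usually exact, occasionally far off
print(aae(steady), rmse(steady))
print(aae(spiky), rmse(spiky))
```

Two models with the same AAE can therefore differ in RMSE when one of them produces a few larger misses.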

I’ll post the +1 values if I get a chance.


#96    Patriot      (see all posts) 2009/12/11 (Fri) @ 21:21

These are the +1 weights based on the 1955-2008 totals, listed as (S, D, T, HR, W, O):

RC: .56, .73, 1.16, 1.44, .37, -.12
pseudo-TOP: .49, .72, 1.10, 1.41, .35, -.09
BsR: .50, .72, 1.11, 1.41, .36, -.10

TOP and BsR are pretty much in agreement on this count.
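A sketch of the +1 method, assuming it means: evaluate the model on a set of totals, add one of a given event, re-evaluate, and take the difference. The B weights are the BsR ones from post #95; the totals are an invented team-season, not the 1955-2008 totals used above, so the printed values are only illustrative.

```python
B_WEIGHTS = {"S": 0.866, "D": 1.93, "T": 3.81, "HR": 1.97, "W": 0.174}


def bsr(t):
    """BsR limited to S, D, T, HR, W and outs O = AB - H."""
    hits = t["S"] + t["D"] + t["T"] + t["HR"]
    a = hits + t["W"] - t["HR"]
    b = sum(B_WEIGHTS[k] * t[k] for k in B_WEIGHTS)
    c = t["O"]
    return a * b / (b + c) + t["HR"]


def plus_one(model, totals):
    """Marginal run value of one extra event of each type."""
    base = model(totals)
    values = {}
    for event in totals:
        bumped = dict(totals)
        bumped[event] += 1
        values[event] = model(bumped) - base
    return values


# Hypothetical totals (invented numbers).
totals = {"S": 1000, "D": 280, "T": 30, "HR": 160, "W": 550, "O": 4030}
for event, value in plus_one(bsr, totals).items():
    print(event, round(value, 2))
```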


#97    Tangotiger      (see all posts) 2009/12/11 (Fri) @ 21:33

Right, the best-fitted TOP and the best-fitted BsR both give the same average error and both give the same (implied) run values to each event. Good job.

So, if you want the best-fit BsR that gives you a ridiculous value for the double, use what Patriot is showing.  If you want a less well-fit BsR that is more logical, use what we’ve always been using.


#98    dave smyth      (see all posts) 2009/12/11 (Fri) @ 22:09

Yes, good work from Patriot. The purpose of that approach is to try to separate the fitting procedures from the formula structure. It appears that, on the MLB team level, there isn’t really much difference between BsR and TOP. What about games or innings? We know that at some point there has to be a big difference, due to TOP not adhering to the same constraints that BsR does. But where is that point? Is it at or before the average inning level, which is the most ‘natural’ place to care about this distinction? Or is it only in some small subset of the most extreme innings?

It would be nice to know the level at which the theoretical advantages of BsR actually make a difference.


#99    Eric Walker      (see all posts) 2009/12/11 (Fri) @ 23:41

1. As the holiday season moves along, I find less and less time daily for further work.  But I will try to use that odd-year/even-year split sometime soon, as it’s not hard to implement (though mildly tedious to get the optimized coefficients).

2. The team-seasons database is available both raw and zipped:

http://highboskage.com/From1955.ByTeam
http://highboskage.com/From1955.ByTeam.zip

3. BsR was “optimized” only by finding the B multiplier (usually stated as 1.10) that gave the lowest average percentage error over the span being optimized.  The values turned out to be:

1955 - 2009 (inclusive): 1.0889
1955 - 2004      "     : 1.094826
1960 - 2009      "     : 1.086594

As with all coefficient optimizing I do, the test projections are quantized, and the optimizing stops when further significant digits cease affecting accuracy (owing to that quantization).

I’m confident that the accuracy of BsR would be improved were I to try optimizing all the numerical coefficients, and if time ever allows, I will try doing just that.  I didn’t before because of the published statement that “If you want to tailor a version to a particular dataset . . . all you have to do is determine the overall B multiplier.”

4. Here are the exact formulae I am using:

BsR:
  PartA = hits+bb+hb-hr-(0.5*ibb)
  PartB = Kbr*((1.4*tb)-(0.6*hits)-(3.0*hr)+(0.1*(bb-ibb+hb))+(0.9*(sb-cs-gdp)))
  PartC = ab-hits+cs+gdp
  PartD = hr
  Runs = (PartA*(PartB/(PartB+PartC)))+PartD
   (where Kbr is the optimized coefficient)

TOP:
  NetOB = pa-outs
  WeightedTB = (K1*(1b+Eb))+(K2*2b)+(K3*3b)+(K4*hr)
  FreePasses = ubb + hb + ci
  AdvFactor = WeightedTB+(Kbbhb*FreePasses)+(Ksh*sh)+(Ksb*sb)+(Kibb*ibb)
  AdvRate = AdvFactor/pa
  Advance = AdvRate+Kb  [slope m forced to 1]
  Runs = ((NetOB-hr)*Advance)+hr
   where ubb is bb-ibb; outs is 3x opponents' IP; and the K's are all optimized; for 1955-2009, they are:
  K1 = 1.1942
  K2 = 1.6615
  K3 = 3.024
  K4 = 1.8862
  Kbbhb = 0.4954
  Ksb = 0.751
  Ksh = -0.239
  Kibb = 0.042
  Kb = -0.0554283124681
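For anyone wanting to replicate the two formulae, here is a direct transcription into Python as a sketch. The coefficients are the ones quoted above for 1955-2009; the dictionary keys and the sample team line are hypothetical (the numbers are invented, not real data).

```python
KBR = 1.0889
TOP_K = {"K1": 1.1942, "K2": 1.6615, "K3": 3.024, "K4": 1.8862,
         "Kbbhb": 0.4954, "Ksb": 0.751, "Ksh": -0.239, "Kibb": 0.042,
         "Kb": -0.0554283124681}


def bsr_runs(s, kbr=KBR):
    """BsR as written in the post, with only the B multiplier tuned."""
    hits = s["1b"] + s["2b"] + s["3b"] + s["hr"]
    tb = s["1b"] + 2 * s["2b"] + 3 * s["3b"] + 4 * s["hr"]
    part_a = hits + s["bb"] + s["hb"] - s["hr"] - 0.5 * s["ibb"]
    part_b = kbr * (1.4 * tb - 0.6 * hits - 3.0 * s["hr"]
                    + 0.1 * (s["bb"] - s["ibb"] + s["hb"])
                    + 0.9 * (s["sb"] - s["cs"] - s["gdp"]))
    part_c = s["ab"] - hits + s["cs"] + s["gdp"]
    return part_a * part_b / (part_b + part_c) + s["hr"]


def top_runs(s, k=TOP_K):
    """TOP as written in the post (slope forced to 1, intercept Kb)."""
    net_ob = s["pa"] - s["outs"]
    weighted_tb = (k["K1"] * (s["1b"] + s["eb"]) + k["K2"] * s["2b"]
                   + k["K3"] * s["3b"] + k["K4"] * s["hr"])
    free_passes = (s["bb"] - s["ibb"]) + s["hb"] + s["ci"]
    adv_factor = (weighted_tb + k["Kbbhb"] * free_passes + k["Ksh"] * s["sh"]
                  + k["Ksb"] * s["sb"] + k["Kibb"] * s["ibb"])
    advance = adv_factor / s["pa"] + k["Kb"]
    return (net_ob - s["hr"]) * advance + s["hr"]


# Hypothetical team-season line (invented numbers).
team = {"ab": 5500, "pa": 6200, "outs": 4350, "1b": 1000, "eb": 60,
        "2b": 280, "3b": 30, "hr": 160, "bb": 550, "ibb": 40, "hb": 50,
        "ci": 1, "sh": 50, "sb": 100, "cs": 45, "gdp": 125}
print(round(bsr_runs(team), 1), round(top_runs(team), 1))
```

Note that the two functions do not consume the same inputs (TOP uses Eb, CI, and SH; this BsR uses CS and GDP), which is the mismatch raised elsewhere in the thread.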

5. It’s a lot easier to read strings of numbers if they’re posted between “pre” tags.

6. I think the innings-level question is largely answered by the results I quoted above.  I’d like to test the full set from 1955 - 2009, but haven’t time now to go about acquiring such data from RetroSheet.  (Unless someone has a more complete innings DB than the nine years I was kindly given; but I think 9 years is a lot.)

7. As to “are you prepared to accept that the run value of the single is .55 runs and the double is .68 runs as being logical?  Or, are you saying it is obviously illogical, but you are accepting that it’s telling us something more that requires you to keep it that low?”

Those numbers are byproducts: neither inputs nor sought outputs.  Lacking any definitive method (that I know of) to say what is or isn’t “logical” for the relative sizes of such things, I am willing to accept that what falls out of the most accurate formula is what it is.

More generally, it seems to me that the point of mathematical analysis is to find out what comes from what and what leads to what.  When we start putting aside the numbers we get as “illogical”, we are back to Al Campanis and the days of “I know what I know and if the numbers disagree, I trust my gut.” Logic, to me, is that if byproduct X seems more “logical” than byproduct Y, how come X isn’t what falls out of the most accurate equation when it is used?

We’ve now, I hope and believe, settled the issue of coefficients coming from the data being analyzed and of formulae “blowing up” at the inning level (they don’t).  I don’t see that +1 values that do not align with intuition are any particular obstacle; intuition does not suggest that a few pounds of metal in a device that can be carried in the back of a pickup truck can erase a fair-sized city from the face of the Earth, but that’s what the reality is.

8. And last, at least for now, I suspect that a fully optimized BsR will be very close to TOP in final accuracies; my posts have not been to suggest some great advantage of TOP over BsR, but to meet the various objections made to the effect that the two don’t even belong on the same field.

In the end, as I have said before, reality is what it is, and attempts to model it must, in the end, converge if they are accurate.  A better multiplicative formula would have the form:

((1b+bb+hb+Eb+ci)xAdvFactor1)
+(2bxAdvFactor2)
+(3bxAdvFactor3)
+(hrxAdvFactor4)
where we know a priori that AdvFactor4 is just 1.  But, lacking--for now, till I take time to dig it out--numbers for the fraction of doubles-hitters who eventually score, and ditto for triples-hitters, we make do with lumping all base runners into an OB factor, making up the difference by extra weights for xbh in the AdvanceFactor numbers.

9. Happy Holidays to all.


#100    Tangotiger      (see all posts) 2009/12/11 (Fri) @ 23:53

"Lacking any definitive method”

Are you suggesting that the method as detailed in The Book is, if not definitive, at least highly compelling?

***

Finally, can we agree that a simulator will give us the definitive method?  Here’s one:
http://www.tangotiger.net/markov.html

It has some limitations, as noted, but those limitations, once accounted for, will not drop the run value of the double from .81 to .68!


#101    Tangotiger      (see all posts) 2009/12/12 (Sat) @ 01:10

I didn’t before because of the published statement that “If you want to tailor a version to a particular dataset . . . all you have to do is determine the overall B multiplier.”

To “calibrate” not to “best-fit”.  To best-fit, you need to do it by variable.

***

As others have noted, you must use the same parameters.  You can’t use reaching on error in one equation and not the other.  SH is in one and not the other.

***

Anyway, this version of TOP does one of the two unique things BsR does, and that is to break the HR away:

Runs = ((NetOB-hr)*Advance)+hr

So, TOP has already adopted that part of BsR.

The other part of BsR, where David does B/(B+outs), Eric does B/PA

Truth be told, that’s going to be pretty darn similar at the team level.  We’re not going to learn much here.

Eric’s already taken the big leap toward BsR, with the HR.  And, it’s always the argument as to how to construct the Advance Rate.


#102    Patriot      (see all posts) 2009/12/12 (Sat) @ 01:41

For the sake of completeness wrt the various models, I went back and tested baserunners*advancement/outs + HR, which is Eric Van’s model in his Contextual Runs:

A = H + W - HR
C = AB - H
B = 0.712 S + 0.970 D + 2.00 T + 1.07 HR + 0.176 W
RMSE = 25.48
AAE = 19.85

+1 values: .63, .74, 1.20, 1.48, .39, -.14

I don’t recall CR looking this bad when I’d looked at it before, so take it FWIW.


#103    Tangotiger      (see all posts) 2009/12/12 (Sat) @ 02:02

Ok, so the three advanced metrics, BaseRuns, TOP, and Contextual Runs, all have one thing in common: they strip the HR away.

Good.

Now, the next part, the “advance rate” is done in 3 ways:
a. something divided by (something plus outs)
b. something divided by PA
c. something divided by outs

I’m sure they can all get optimized so they are all equivalent for team-level data.  The question becomes which one is more flexible across environments ranging from Bob Gibson to Jose Lima and everything in between.

I have to believe that “a”, which forces the bounds at 0 and 1, is the more appealing.
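A small numeric sketch of the three constructions, using the same B value in all three purely for illustration (in practice each form would carry its own fitted weights). The function names and the sample inning are hypothetical.

```python
def rate_a(b, outs):
    """Form a: something / (something + outs). Stays in [0, 1) for b, outs >= 0."""
    return b / (b + outs)


def rate_b(b, pa):
    """Form b: something / PA. Not bounded above by 1."""
    return b / pa


def rate_c(b, outs):
    """Form c: something / outs. Not bounded above by 1."""
    return b / outs


# An extreme invented inning: 9 PA, 3 outs, and a large advancement total.
b, pa, outs = 10.0, 9, 3
print(rate_a(b, outs), rate_b(b, pa), rate_c(b, outs))
```

Only form a keeps the implied scoring rate below 1 no matter how large B gets; the other two rely on the inputs staying in a normal range.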


#104    Eric Walker      (see all posts) 2009/12/13 (Sun) @ 22:27

I have now tuned all the BsR coefficients for maximum accuracy over the dataset from 1955 through 2009.  (As a reminder, that is as far back as one can go, because some stats, such as IBB, weren’t kept before 1955.)

I was somewhat surprised at how little improvement there was.  First, here is the equation used, showing the coefficients:

PartA = H + BB + HB - HR - (Kibb * IBB)
PartB = Kbr * ([Ktb*TB] - [Kh*H] - [Khr*HR] +
      [Kbbhb*{BB - IBB + HB}]+[Kq*{SB - CS - GDP}])
PartC = AB - H + CS + GDP
PartD = HR
Proj = (PartA * [PartB / {PartB + PartC}]) + PartD

The final values obtained for the coefficients were:

  Kbr = 1.080903  [nominally 1.1]
  Kibb = 0.1752  [nominally 0.5]
  Ktb = 1.39304  [nominally 1.4]
  Kh = 0.59812  [nominally 0.6]
  Khr = 3.3475  [nominally 3]
  Kbbhb = 0.16497  [nominally 0.1]
  Kq = 0.6294  [nominally 0.9]

The consequent results for the full data set were:

Error Pct. = 2.60461625195
Error Size = 18.0521292217

  +1 bb : 0.34
  +1 1b : 0.49
  +1 2b : 0.80
  +1 3b : 1.11
  +1 hr : 1.37

For comparison, these were the results with nothing but Kbr optimized:

Error Pct. = 2.68385581113
Error Size = 18.6292217327

  +1 bb : 0.33
  +1 1b : 0.49
  +1 2b : 0.80
  +1 3b : 1.11
  +1 hr : 1.45

Something I noticed is that the various coefficients seem much more entangled and sensitive than those for TOP.  It took me many dozens of iterations to get the optimum values, and in each iteration some tiny change in one needed larger changes in the others to restore the optimum error percentage, which settled very, very slowly (I was fighting hundredths and, often, thousandths of a percent through numerous iterations).

For comparison, I re-did the TOP Ks, initially setting each to simply 1.  Here are how the iterations proceeded:

 after 1st: 2.42195997146% average error
 after 2nd: 2.35940453291%    "      "
 after 3rd: 2.33056148940%    "      "

 best ever: 2.30764004983%    "      "

That is, for TOP a single pass produced coefficients that were at least workable, whereas for BsR I found the Ks tending to oscillate a good deal.  What--if anything--that demonstrates, I have no idea, but I mention it as a phenomenon noted.

I also tried the odd-year/even-year approach (optimize coefficients from one set, then apply them to the other), but I need to double-check a few things before reporting on it.  But the preliminary indication is that there is no great change.

My philosophy about the coefficients in any equation is that they have a definite optimum value that does not change over time, and expresses the best a given equation can do.  When we optimize those coefficients for any given data set, we are only making an approximation to the true best set.  If that be so, then clearly the largest data set will give the best approximation of those coefficients, just as the average of all physics experiments gives us the best approximation of the Planck constant, which we do not believe changes from time to time or place to place.

I thus find questionable the assertion that using coefficients derived from the largest data set available to us is somehow “cheating” when some smaller data set, which is a subset of the largest, is evaluated with such coefficients.  I have, as posted earlier, shown that coefficients from a large dataset, when run against a smaller not part of that original set, still work satisfactorily, and that the small relative advantage of one equation over another remains fairly consistent; I would hope that that substantiates my belief.

As to the question of a suitable denominator for fraction-scored factors, while I can see the appeal of one that self-limits to 0 and 1, it remains so that we are dealing with probabilities, and so that factor should be expressing a probability.  Probability is events divided by opportunities; in baseball, the opportunities are simply all plate appearances.

(Some lesser factors are situational, meaning that they are possible only in some subset of PA; SH is an example.  But when we derive a coefficient for SH/PA, we are really just deriving the product of two separate coefficients: the runner-advance weight of an SH when it is possible, and the fraction of PAs in which it is possible.  We need not know each individually: only their product concerns us.)

That occasionally, at the atomic (inning) level such factors can, given certain weighting coefficients, exceed 1 is an issue, but imposing a simple brute-force limit of 1 seems to yield satisfactory practical results; moreover, it does not make a huge difference.  Here are those nearly 400,000 individual innings’ results:

Uncapped: 0.283126599758 run average error
  Capped: 0.281321013483  "     "      "
That’s 0.001805586 of a run average difference, about 0.64% of the average; obviously, it is pretty rare even at the inning level for a PA-based factor to produce impossible values.

(The cap puts a ceiling on by not allowing R greater than actual OB; it also puts a floor by not allowing LOB over 3 per inning.)
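One way to read that cap, as a sketch: since OB = R + LOB, limiting LOB to 3 means R cannot be less than OB - 3, and R can never exceed OB. The function name is hypothetical, and the clamp at zero is an added assumption.

```python
def cap_inning_runs(projected, on_base):
    """Clamp a projected inning run total to the range the inning allows.

    on_base is the actual number of batters who reached and were not put out
    (runs plus left on base).
    """
    floor = max(0.0, on_base - 3)   # at most 3 runners can be left on base
    ceiling = on_base               # no more runs than runners
    return min(max(projected, floor), ceiling)


print(cap_inning_runs(2.4, 2))   # projection above the ceiling
print(cap_inning_runs(0.2, 5))   # projection below the floor
print(cap_inning_runs(0.5, 2))   # projection already within bounds
```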

That’s probably all I can do for a while, as Christmas draws nigh and there are things to be done and free time wanted.


#105    Patriot      (see all posts) 2009/12/13 (Sun) @ 22:55

Eric, there’s nothing about BsR that demands the coefficients to take the form that you are using, namely:

[Ktb*TB] - [Kh*H] - [Khr*HR] + [Kbbhb*{BB - IBB + HB}] + [Kq*{SB - CS - GDP}]

Using this construction for the B factor locks the relationship between the hit types in to only what is reflected in differences in TB. And it forces SB coefficient = CS coefficient = GDP coefficient. It’s little surprise that with these constraints, the accuracy doesn’t move at all.
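To make the constraint concrete, the TB-based expression can be expanded into per-hit weights: a single gets Ktb - Kh, a double 2*Ktb - Kh, a triple 3*Ktb - Kh, and a home run 4*Ktb - Kh - Khr. A short sketch (the function name is hypothetical, and the overall Kbr multiplier is left out):

```python
def per_hit_b_weights(ktb=1.4, kh=0.6, khr=3.0):
    """Per-hit B weights implied by Ktb*TB - Kh*H - Khr*HR."""
    return {
        "S": 1 * ktb - kh,
        "D": 2 * ktb - kh,
        "T": 3 * ktb - kh,
        "HR": 4 * ktb - kh - khr,
    }


w = per_hit_b_weights()
print(w)
print(w["D"] - w["S"], w["T"] - w["D"])   # both gaps equal Ktb by construction
```

Whatever values the tuning picks, the single-to-double and double-to-triple gaps are forced to be identical. The freely fitted BsR weights in post #95 (0.866, 1.93, 3.81) have gaps of about 1.06 and 1.88, which this form cannot reproduce.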


#106    Tangotiger      (see all posts) 2009/12/13 (Sun) @ 23:06

Yes, I don’t understand why Eric would have forced those conditions, using TB instead of breaking it down to S, D, T, as I noted earlier.  I hadn’t even noticed the SB,CS, GIDP.

Eric: why didn’t you just let each coefficient be separate?

And, as noted elsewhere, TOP uses reaching on error and sac hits and his version of BsR doesn’t.

This is my version of BaseRuns:
http://tangotiger.net/bsrexpl.html


#107    Eric Walker      (see all posts) 2009/12/14 (Mon) @ 07:35

For clarity: all I did was take the original formulation, as widely published, and convert all set number values to coefficients that were individually tuned over numerous iterations.

I did not try to alter the original form, such as by breaking TB down to individual types of hit or by adding in other stats.  Do enough of that and eventually you have something which is no longer BsR but something new.  It is not my product and I didn’t feel it right to tamper with it.

My tuning method (for any formulation) is almost risibly brute-force simple: start with some nominal value for each coefficient, then (usually in arbitrary order) find by progressively narrower trial-and-error runs the value for each that gives the lowest average error; do each coefficient in turn and that is one iteration.  Repeat iterations till accuracy can no longer be improved (there is a limit owing to quantization of the runs figures to integers).

(The individual coefficients are, in a given iteration, taken to as many significant digits as make a difference to quantized accuracy.)
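The procedure described is essentially a coordinate-wise search. Below is a generic sketch of that idea, not a reconstruction of the actual program: the function name, step sizes, and the toy loss are all assumptions, and the integer quantization of projected runs is not modeled.

```python
def coordinate_search(loss, start, step=1.0, shrink=0.5,
                      min_step=1e-6, max_iters=100):
    """Tune one coefficient at a time with progressively narrower steps."""
    coeffs = list(start)
    best = loss(coeffs)
    for _ in range(max_iters):           # one pass over all coefficients = one iteration
        improved = False
        for i in range(len(coeffs)):
            s = step
            while s >= min_step:
                moved = False
                for delta in (s, -s):
                    trial = coeffs.copy()
                    trial[i] += delta
                    value = loss(trial)
                    if value < best:     # keep the move only if the error drops
                        coeffs, best = trial, value
                        moved = improved = True
                        break
                if not moved:
                    s *= shrink          # narrow the trial range for this coefficient
        if not improved:
            break                        # a full pass changed nothing: stop
    return coeffs, best


def toy_loss(k):
    """Toy loss with a cross term, so the two coefficients are entangled."""
    u, v = k[0] - 2.0, k[1] + 1.0
    return u * u + v * v + 0.5 * u * v


print(coordinate_search(toy_loss, [1.0, 1.0]))
```

When the coefficients interact, as in the toy loss, each pass shifts the best value of the others, so several passes are needed before the values settle.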

For BsR, I started with the nominal published numerical coefficient value.  (For TOP, as the previous post showed, just starting with 1 for everything quickly leads to usable values.)

After all that, over the 55-year period, BsR’s improved average error was about 18 runs (18.052), while TOP’s is about 16 runs (15.933).  Adding more stats in might bring it to comparable numbers; as I have said, this is not about some inherent superiority of method--both seem, to me, very good.

TOP itself is not (yet) complete as to stats: opponents’ wild pitches, balks, miscellaneous errors (those other than the ones that allow an otherwise-out batter to reach base), and perhaps some other things are reasonably available stats and might add to the accuracy.


#108    Tangotiger      (see all posts) 2009/12/14 (Mon) @ 08:01

"as widely published”

That one is not the “widely published” one.  There actually is no “widely published” version.  If you click on the link I posted, you will see that I did list each item separately, all with a nice and easy-to-use table, and it is still BaseRuns. 

PLEASE, for all your future testing, use the exact same variables in TOP that you use in BsR.

It may very well be that the “something/PA” (TOP-based BsR) is a better construction than “something/(something+outs)” (BsR).  The “advance rate” is where all the fun happens.

As long as TOP (or RC if Bill James ever listens to us) breaks the HR away into its own term as BsR does it, then every equation is now part of the BsR family.  The original formulation of TOP did not do that.  Eric’s current formulation does.  So, on that basis the BsR-derived TOP could be the better equation.  I don’t know.

But, having the bounds of 0 and 1 that BsR offers is certainly something very appealing. 

Until Eric (or whoever) tests all equations with the same components and the same in-sample best-fit coefficients and the same out-of-sample test cases, we won’t know which is more accurate for a particular dataset.


#109    Eric Walker      (see all posts) 2009/12/14 (Mon) @ 21:19

Re There actually is no “widely published” version. If you ask Google for results on base runs, the first three hits up are:

http://en.wikipedia.org/wiki/Base_Runs

http://gosu02.tripod.com/id108.html

http://www.tangotiger.net/wiki/index.php?title=Base_Runs

Each of them shows the BsR version I have been using.  If that is not a “widely” published version, it is at least a multiply published one.

I would be pleased to level up, as best I can, the playing field, but as I said I don’t want to seem to be calling some variant I create “Base Runs”.  If you want to either post here or email me an exact BsR version to use in testing, with whatever stats you want added or broken out, I will--as holiday time allows--be glad to give it the same coefficient-derivation treatment as the others and report back here.

As to breaking out home runs: as I have said before, a proper equation would have four separate advance factors, one each for batters initially reaching each of the four bases; it is just easy to see that the HR advance factor--the probability of advancing a runner initially reaching home on to home--is necessarily 1.  But ideally, and perhaps really in the future, all four cases of initial base reached should be separated out, so the advance factors are really just that (whereas right now, in most equations, the 2B and 3B factors effectively have “boosts” to allow for the greater ease of scoring from those bases than from first).


#110    Tangotiger      (see all posts) 2009/12/14 (Mon) @ 23:35

"If you want to either post here”

I already did in post 106.


#111    Eric Walker      (see all posts) 2009/12/15 (Tue) @ 07:48

Sorry, I imagine I’m being dense here, but where would I get those data for the period 1955 - 2009?  The data I presently have in my db are:

G
R
PA
AB
1B
2B
3B
HR
BB
HBP
SB
CS
SO
SH
SF
GDP
IBB
CI
Eb (errors leading to an otherwise-out runner reaching base, not all E)
Outs (opponents’ IP x 3)
LOB
OR
W

I don’t have, but ought to, and will soon be seeking:

WP
Bk
PB

But the rest I wouldn’t know where to get till I take the time to create some software for extracting from RetroSheet Event Files, which may be a good while yet.

I suppose that if I’m going to take those events and try to tune for better values, I could just omit what I don’t have.  It’s late now, and I have a dental appointment tomorrow.  I’ll ponder on it.


#112    Brian Cartwright      (see all posts) 2009/12/15 (Tue) @ 07:56

For extracting data from Retrosheet event files, Retro provides BEVENT, while Ted Turocy’s CWEVENT provides additional fields.

Either parses the play codes into a .csv file which can be imported into a spreadsheet or database.


#113    Tangotiger      (see all posts) 2009/12/15 (Tue) @ 08:06

Eric, just use the SAME components you use for TOP.  I’m showing you that it is “approved” to split the components out for BaseRuns (which you had a concern on), and I’m showing you how I do it with a complete data set.

My step-by-step for creating a Retrosheet database is here:
http://www.tangotiger.net/wiki/index.php?title=Retrosheet_Database

If you google
Colin Wyers Retrosheet database
you will get an even easier way.


#114    Eric Walker      (see all posts) 2009/12/16 (Wed) @ 22:32

OK, before I proceed to the slogwork, let’s see if we are in agreement.  What I propose to try is this:

PartA = H + BB + HB - HR -(Kibb x IBB) + Eb;
WeightedTB = (K1 x [1B + Eb]) + (K2 x 2B) + (K3 x 3B) + (K4 x HR)
PartB = Kbr x (WeightedTB - 
         [Kh x H] - 
         [Khr x HR] + 
         [Kbbhb x {BB - IBB + HB + CI}] + 
         [Ksb x SB] - 
         [Kcs x CS] - 
         [Kdp x GDP] +
         [Ksh x SH] +
         [Ksf x SF])
PartC = (AB - H) + CS + GDP
PartD = HR

A few comments and questions:

1. For PartC, would it not make more sense to just use total Outs made?  (That is, 3 x Opponents’ IP)

2. In PartA, what is the rationale for subtracting some weighted fraction of IBB?  In an advance factor, that makes sense, as IBBs have less advance weight than UBBs, but it puzzles me in what otherwise appears to be a raw OB factor.  Would not simple NetOB work better?  (See #4)

3. In PartB, UBB are used, but IBB not at all; is this a sort of “make-good” for the reduced IBB value in PartA?  To me (and in TOP), it makes more sense to include all BB in the equivalent of PartA and use UBB and a down-weighted IBB in the equivalent of PartB.

4. I am puzzled by the use of CS in PartB; one would think it should go somewhere in PartA.  The TOP equivalent of PartA is simply NetOB; would that not work in BsR?  Ditto GDP.

5. The constants I have labelled Kh and Khr don’t seem to mix ‘n’ match with the constants in what I have called “WeightedTB”.  I suspect that the portion of PartB given by--

(K1 x [1B + Eb]) + (K2 x 2B) + (K3 x 3B) + (K4 x HR) -
(Kh x H) - (Khr x HR)
--is duplicative, and perhaps ought to be replaced by just WeightedTB (with whatever weights fall out from the optimizing).

6. All in all, it looks as if the chief difference between TOP and BsR is that TOP relates events to PA, while BsR relates them to advances+outs (the rationale for which escapes me).

Thoughts before I proceed?


#115    Eric Walker      (see all posts) 2009/12/16 (Wed) @ 22:34

Sorry, meant to mention: I use Linux, so Windoze-based programs are useless to me.  DOS-based might be made to work under an emulator, but all in all when the time comes I’ll roll my own, probably using PHP, which is mostly platform-independent.


#116    Colin Wyers      (see all posts) 2009/12/16 (Wed) @ 23:31

Eric,

Chadwick is available on Linux. You simply have to compile it yourself from source. (In fact, I had to install a bash shell on Windows to compile it for Windows, so it’s fairer to say that it’s a Linux program that’s also available on Windows than the other way ‘round.) I’m pretty sure that Ted Turocy, who wrote it, uses Linux as well.


#117    Patriot      (see all posts) 2009/12/17 (Thu) @ 01:06

Some feedback on Eric’s bullet points:

1. That would be one way to do it, if you’re going to use all outs in C.  Obviously the published versions are

2 & 3. Not all BsR constructions handle the IW with a fractional on base weight; Tango’s doesn’t, for instance.  When used, the fractional weighting is an attempt to avoid having a heavy negative weight for walks in the B factor (it’s -.48 in Tango’s version). 

4. There are a number of different ways that people have constructed the various factors in BsR, and there’s no consensus on which option is best.  Some approaches use what one could call “initial” baserunners in the A factor, ignoring the outs we know to be made on base later (like CS).  Others use “final” baserunners, and take out CS and GDP.  Arguments can be made either way.  If you want to set it up as final baserunners to match the construction used by TOP (and RC), that should be fine.

5. Yes, this is what Tango and I have been trying to get across for the last X posts.  Forget TB and H; use a*S + b*D + c*T + d*HR.


#118    Tangotiger      (see all posts) 2009/12/17 (Thu) @ 07:55

Ditto what Patriot said.  Ditto what Colin said.

The ONLY difference with the new version of TOP and any of the many versions of BsR is that TOP does something/PA, while BsR does something/(something+outs).

The “rationale” is one where the bounds are set to 0 and 1.  BsR adheres to this, and TOP doesn’t.  But, seeing that the testing is going to be done at a level where we’re not going to see anything approaching 0 or 1, then I presume it’s not going to matter much.

Had you tested, say, on pitching-lines, where Gooden and Pedro and Gibson provide a MUCH better range of performances to test against, then we might see this effect.

I will reiterate again that the HUGE HUGE HUGE difference is simply in moving HR into its own term.  That TOP has chosen to do this ALREADY makes it a BsR-version.  This is the genius of BsR, that it doesn’t let the HR have a continually rising value.  It caps it.

The B version to optimize is really where all the work happens, and it’s going to be nuanced to whatever dataset you choose.

***

And, no matter what the results will be in Eric’s test, NOTHING can match the results from a properly programmed simulator.


#119    Eric Walker      (see all posts) 2009/12/22 (Tue) @ 22:01

Owing to the holiday season and other odds and ends of life, I have had little time to proceed with the rather tedious and time-consuming evaluation of formulae, but as a sort of placeholder post, I want to address a couple of what seem to me misapprehensions.

First is the idea that pulling HRs out and adding them back in at the end is some revolutionary stroke.  It was a clever observation--applicable to any multiplicative approach--but scarcely radical in effect.  A tuned BsR-like formulation with the HRs so treated differs from a tuned variant with HRs left in and treated like any other hit by, depending on exact formulation, something from .001 to .010 percent--not worth the price of a telegram.

All multiplicative approaches have the same basic approach: OB x Advance.  But, equally, the true form for all is (where “B1b” is batters reaching first base):

   (B1b x Advance1) + 
   (2B x Advance2) + 
   (3B x Advance3) + 
   (HR x Advance4)
Since Advance4 is necessarily 1, that component can be pulled out, very slightly increasing overall accuracy.  Clever, as I said, but not any radical break.

Second is the belief that BsR’s advance factor being constrained between 0 and 1 is probative of correctness.  In reality, the constraint is not a consequence of baseball relevance but rather of the sheer mechanical form of the algebraic formulation of the net advance factor, B/(B+C).  Such a form must necessarily (so long as C is not of opposite sign to and greater than B) be constrained to fall between 0 and 1 regardless of its relevance.

Leave B as some weighted runner-advance factor, as it is in every multiplicative formulation; C could then as well be the game-day height of the infield grass expressed in millimeters, or the price of a pound of butter in Mumbai expressed in rupees and the factor is still constrained between 0 and 1. Moreover, it is true that as B goes to zero, the net expression goes to zero, and that as B becomes huge, the net expression approaches 1; but it is also true that as C, whatever it may represent, varies from 0 to huge, the net factor varies from 1 toward zero regardless of the value or significance of B.

That is not to disdain the BsR formulation, because it does work very well; it is to point out that the reasons adduced for its success are largely irrelevant.  The true crux is the concept of making C signify outs; but I have seen no reasoning explaining why that approach should work as well as it does.

Indeed, while we’re at it, let’s notice a few other things.  Outs is just PA minus NetOB (by definition, since NetOB necessarily is R + LOB).  That means that the factor B/(B+C) can be rewritten as B/(B + PA - NetOB), which can in turn be rewritten as B/(PA + [B - NetOB]).  In most multiplicative approaches, that denominator is simply PA, so the question arises as to why adding the term (B - NetOB) to plain PA would improve accuracy (if it does).

(Curiously, that B-NetOB is something crudely analogous to “isolated power”; in fact, were we dealing only with hits and using the simplistic 1-2-3-4 weighting of TB, it would be just TB-H.)
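A quick numeric check of that rewriting, under the stated premise that C equals total outs and outs = PA - NetOB. The function names and the numbers are hypothetical.

```python
import math


def advance_original(b, outs):
    """B / (B + C), with C taken to be total outs."""
    return b / (b + outs)


def advance_rewritten(b, pa, net_ob):
    """B / (PA + [B - NetOB]), using outs = PA - NetOB."""
    return b / (pa + (b - net_ob))


# Invented team-season values.
pa, net_ob, b = 6200, 1850, 2000.0
outs = pa - net_ob
print(advance_original(b, outs), advance_rewritten(b, pa, net_ob))

# Hits-only case with 1-2-3-4 weights: B is TB and NetOB is H, so B - NetOB = TB - H.
s, d, t, hr = 1000, 280, 30, 160
tb, h = s + 2 * d + 3 * t + 4 * hr, s + d + t + hr
print(tb - h)
```

The two expressions are the same number; the rewriting only makes visible what BsR adds to the PA denominator used by the other forms.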

I do hope to get some results up before Christmas Eve; the delay is some fiddling with how to best treat the lesser situation-conditional advances (SB, CS, IBB, SH, SF), and every new idea takes quite a while to try out.


#120    Eric Walker      (see all posts) 2009/12/22 (Tue) @ 22:03

Argh.  I keep meaning to add congratulations to the Mariners and those associated with that front office.  Now I have.


#121    Patriot      (see all posts) 2009/12/23 (Wed) @ 00:30

Eric, your comments suggest to me that you are digging in with respect to the mindset that team seasonal data is the only test that really matters, and that is why there is a continuing lack of common ground here.  Otherwise, your claim that taking the HR out was “scarcely radical in effect” is baffling.

The treatment of the home run by RC causes it and the other models of its type to break down whenever anything remotely extreme happens with home runs on a micro level.  This can be seen in Tango’s article which is linked under my name.  As soon as there are three homers hit in a game, RC starts seriously overestimating runs scored.

Or just consider an inning in which there is a solo home run and three outs--a perfectly normal, run-of-the-mill occurrence.  The RC model I posted in post #95 predicts that the team will score 1.23 runs--a 23% error simply because the home run was not taken out.

Suppose it was an inning with a walk, a double, a home run, and three outs.  Nothing particularly unusual about this case either.  The RC model estimates 2.79 runs scored; the BsR model estimates 2.15.  In this case we don’t have an exact theoretical answer to how many runs will be scored, but we have a 30% discrepancy, solely because of the treatment of the home run.

With regard to B/(B + C), I don’t think anyone has claimed that the form exists because of baseball relevance.  I know that whenever I’ve written about BsR, I’ve always taken pains to say that B/(B + C) is something that works empirically but that it is the biggest area for possible improvement in the model.  That being said, though, I’d consider using the rupee price of butter in Mumbai if that meant I had a model that satisfied more of the theoretical constraints to run scoring, which are baseball relevant.


#122    Eric Walker      (see all posts) 2009/12/23 (Wed) @ 01:02

I certainly intend to apply whatever I do to the inning level as well as the team-season level, but when I did that some while back, there was no great difference.  But we will see what we see, and no use in anticipating the data.

I am more or less OK with empirical coefficients, but I am not happy with “magical” formulae that “just work”.  That they do work is nice, but it’s hard to have much faith in them till we have some idea why they work (quantum mechanics notwithstanding).

But, again, we will see . . . .


#123    Patriot      (see all posts) 2009/12/23 (Wed) @ 01:17

I certainly intend to apply whatever I do to the inning level as well as the team-season level, but when I did that some while back, there was no great difference.

Unless I am missing something, your innings-level test did not include an A*B/C model--it included BsR, XR, and TOP (the new TOP, which makes the adjustment for the HR).

Colin’s study in the 2009 Hardball Times Annual did include an A*B/C model--and found a RMSE difference of .583 to .425 in favor of BsR. 

So according to his study, treating the HR specially allowed us to move from a dynamic model that was less accurate than a static linear model on the micro-level to one that is more accurate.  I’d say that’s a pretty significant breakthrough.  That it is obvious to everyone after the fact does not diminish David Smyth’s achievement in being the one to enlighten us.

With respect to B/(B+C), acknowledging that it’s an open issue is not the same as simply saying “it’s magical”.  It’s a form of good things/(good things + bad things), which is a completely reasonable construct to guess might estimate something adequately, with the nice side effect of enabling the model to comply with a theoretical constraint.  If you have a convincing explanation for why B/PA goes beyond B/(B+C) in satisfying you on this account, I’ve not seen it.


#124    Tangotiger      (see all posts) 2009/12/23 (Wed) @ 01:53

Mariners: thank you.

***
Eric said:

but I am not happy with “magical” formulae that “just work”.

Assigning a run value to the double that is +.15 runs above the single IS something that “just works”.  It has no logic to it whatsoever. 

In post 75, Patriot said:

I ran Eric’s new version through the +1 methodology, as best as I could (these are for the 2009 AL).  This was for the one based on team seasonal data, although I don’t think using the inning version would change things much as the coefficients are very similar:

.57 S, .71 D, 1.10 T, 1.42 HR, .37 W+HB, -.19 SH, -.116 AB-H+SF, .22 SB, -.35 CS, .69 E

How, for example, can a single be worth .57 runs and a SB be .22 runs, while the double is worth LESS (0.08 runs less to boot) than the two combined?  A single+SB puts the runner on 2B.  A double puts the runner on 2B… AND it advances the guys at 1b and 2b at least one more base than the single does.
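
The arithmetic behind that question, as a short sketch using only the three weights quoted above (the function name is chosen here for illustration):

```python
# Run values quoted above from Patriot's +1 results (2009 AL).
WEIGHTS = {"S": 0.57, "D": 0.71, "SB": 0.22}

def single_plus_sb_minus_double(w: dict) -> float:
    """How much more a single followed by a steal is 'worth' than a double."""
    return (w["S"] + w["SB"]) - w["D"]

gap = single_plus_sb_minus_double(WEIGHTS)
print(f"single + SB exceeds double by {gap:.2f} runs")
```

A positive gap means the fitted weights rate a double below a single plus a steal, even though the double also moves any existing runners further.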


#125    dave smyth      (see all posts) 2009/12/23 (Wed) @ 09:26

From Eric/119: “The true crux is the concept of making C signify outs; but I have seen no reasoning explaining why that approach should work as well as it does.”

To me, the logic is that outs are the true clock in baseball. They are the true opportunity factor. And C is meant to signify the limiting counterbalance to B advancement. When you use PA instead, you are IMO doing nothing more than letting PA stand in as a proxy for outs. The reason it works on a team/season level is only that the PA per team doesn’t have much variation (because the OBA only varies from .315 to .350 or whatever).
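
A rough sketch of how little PA can vary at the team-season level. The assumptions here are illustrative only: every PA that does not reach base is exactly one out, a season is 162 nine-inning games, and the OBA range is the .315 to .350 mentioned above.

```python
def pa_needed(outs: int, oba: float) -> float:
    """PA required to make `outs` outs if each PA not reaching base is one out."""
    return outs / (1.0 - oba)

SEASON_OUTS = 27 * 162  # assumed: 27 outs per game, no extra innings

low = pa_needed(SEASON_OUTS, 0.315)
high = pa_needed(SEASON_OUTS, 0.350)
spread = high / low - 1.0  # relative PA gap between the two OBA extremes

print(f"{low:.0f} PA vs {high:.0f} PA, a spread of {spread:.1%}")
```

Under those assumptions the two extremes differ by only about five percent in PA, so PA and outs are nearly interchangeable across team-seasons. Within a single inning, by contrast, outs are fixed at three while PA can be anything from three upward.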


#126    Tangotiger      (see all posts) 2009/12/23 (Wed) @ 10:20

David is right about the lack of spread in the team seasonal lines. We are definitely going to learn almost nothing by the continued focus on team-seasonal aggregation. 

We have about 4860 team-game lines every year, or say some 150,000 - 200,000 games over the last 50 years.  Why in the world would we want to REDUCE the size of the sample, by aggregating on something like “batting team”, which itself will introduce a bias?

And if you are going to aggregate on “batting team”, why not on “fielding team”?  Why not on “starting pitcher”?  Why not on “home park”?

Why not increase the sample, and aggregate by inning instead of game?  After all, what in the world does a HR hit in the 7th inning have to do with how many runs scored in the 2nd inning?

Every time a best-fit equation is established based on team-seasonal lines, the difference between a double and single is always less than .20 runs, when the true value is around .30 runs.  That, by itself, should stop anyone from wanting to pursue the team-seasonal approach as a way to learn something about how runs are scored.

And, as I keep saying, nothing at all can beat a well-constructed simulator.  I’ve used that simulator, and I published the results in The Book, notably in the batting order chapter, where I give the run values by lineup slot.


#127    Eric Walker      (see all posts) 2009/12/24 (Thu) @ 08:01

Here are some results of comparing a multiplicative PA-based formula with a multiplicative Outs-based formula.

In each case, the same data, from the same databases, were used, as well as identical optimizing methods.  In both cases, the linear best fit was used (a y = mx+b form).

The base formula is this:

   NetOB = PA - Outs
   WeightedTB = (K1 x [1b + Eb]) + (K2 x 2B) + (K3 x 3B) + (K4 x HR)
   Passes = Kpass x (UBB + HB + CI)
   Misc = (Ksb x SB) + (Ksh x SH) + (Ksf x SF) + (Kibb x IBB) + (Kcs x CS) + (Kdp x GDP)
   Advance = WeightedTB + Passes + Misc
   Factor = Advance / Whatever
   NonHR = (NetOB - HR) x ([Kslope x Factor] + Kb)
   Rproj = NonHR + HR

For the Q-based, “Whatever” was all outs; for the PA-based, it was all PA.  Everything else was the same, except--of course--that the various coefficients were determined separately.  A second set of coefficients was optimized for the period 2000 - 2008 and with Kdp forced to zero; that was to match the only innings database I have (which does not have DP data).
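
For anyone who wants to try it, here is a minimal Python sketch of the base formula. It is a reading of the formula as posted, not the code actually used. The dictionary keys, the `projected_runs` name, and the sample inning are assumptions made for illustration; the coefficients are copied from the 55 team-season lists further down; and the Q-based denominator follows the correction in post #128 (Advance + Outs).

```python
PA_55 = {"K1": 2.378, "K2": 3.404, "K3": 6.52, "K4": 3.933, "Kpass": 1.0021,
         "Ksb": 1.3204, "Kcs": 0.85, "Ksf": 4.018, "Kibb": -0.4077,
         "Ksh": -0.192, "Kdp": -0.0265,
         "Kslope": 0.468135382894, "Kb": 0.0498042497648}
Q_55 = {"K1": 1.8273, "K2": 2.8922, "K3": 6.5, "K4": 3.51, "Kpass": 0.2163,
        "Ksb": 1.197, "Kcs": 2.221, "Ksf": 5.72, "Kibb": -1.31,
        "Ksh": -0.0376, "Kdp": 0.6901,
        "Kslope": 1.0729881254, "Kb": -0.185299352505}

STATS = ("PA", "Outs", "1B", "Eb", "2B", "3B", "HR", "UBB", "HB", "CI",
         "SB", "SH", "SF", "IBB", "CS", "GDP")

def projected_runs(line: dict, k: dict, basis: str) -> float:
    """Rproj for one stat line; basis is "PA" or "Q" (outs-based)."""
    s = {name: line.get(name, 0) for name in STATS}  # missing stats count as 0
    net_ob = s["PA"] - s["Outs"]
    weighted_tb = (k["K1"] * (s["1B"] + s["Eb"]) + k["K2"] * s["2B"]
                   + k["K3"] * s["3B"] + k["K4"] * s["HR"])
    passes = k["Kpass"] * (s["UBB"] + s["HB"] + s["CI"])
    misc = (k["Ksb"] * s["SB"] + k["Ksh"] * s["SH"] + k["Ksf"] * s["SF"]
            + k["Kibb"] * s["IBB"] + k["Kcs"] * s["CS"] + k["Kdp"] * s["GDP"])
    advance = weighted_tb + passes + misc
    if basis == "PA":
        whatever = s["PA"]
    elif basis == "Q":
        whatever = advance + s["Outs"]  # per the correction in post #128
    else:
        raise ValueError("basis must be 'PA' or 'Q'")
    factor = advance / whatever
    non_hr = (net_ob - s["HR"]) * (k["Kslope"] * factor + k["Kb"])
    return non_hr + s["HR"]

# The solo-homer inning from post #121: one HR and three outs.
solo_hr_inning = {"PA": 4, "Outs": 3, "HR": 1}
print(projected_runs(solo_hr_inning, PA_55, "PA"))
print(projected_runs(solo_hr_inning, Q_55, "Q"))
```

Because the home run is pulled out of NetOB before the factor is applied, the solo-homer inning comes out at exactly one run on either basis, whatever the coefficients are. The sketch does not guard against an empty line, where the denominator would be zero.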

The initial results for 55 team-seasons were these (all results quantized to whole numbers of runs):

 Q-based: 2.23677516425
PA-based: 2.23345855803

 or with fewer digits:

 Q-based: 2.237
PA-based: 2.234

Not much difference: PA-based is better, but only by a small amount.

The inning results for 9 seasons (389,042 innings) were these:

 Q-based, quantized: 0.213673587942
PA-based, quantized: 0.241356974311
 Q-based, raw: 0.259583723566
PA-based, raw: 0.273973519179

   or with fewer digits:

 Q-based, quantized: 0.214
PA-based, quantized: 0.241
 Q-based, raw: 0.260
PA-based, raw: 0.274

Here, the Q-based is a little better; unlike the last runs (way upthread), made with fewer data, this time quantizing benefitted the Q-based method more than the PA-based method.  My feeling is that the effects of quantizing at the atomic level, as far as benefit, are random, and tend to average zero, but that’s only a feeling.

For those wondering how one might decide which method is better portraying what is actually going on, here is something to chew on.  These are not +1 methods--I haven’t had time to set those up, and won’t for days now--but if, for a given run (method and seasons), we take the coefficient for SB and that for CS and add them, then take the ratio of SB to that, here’s what we see:

         55 team-seasons:    9 seasons' innings:
------------------------------------------------
PA: sb%     60.84%                 53.97%
 Q: sb%     35.02%                 34.54%

That ought to be an approximation of where the break-even point for base-stealing lies, ceteris paribus.
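
The percentages in that table can be reproduced from the Ksb and Kcs values in the coefficient lists below. A short sketch; the `sb_share` name and the dictionary layout are chosen here for illustration only.

```python
# (Ksb, Kcs) pairs copied from the four coefficient lists in this post.
COEFFS = {
    ("PA", "55 team-seasons"): (1.3204, 0.85),
    ("PA", "9 seasons' innings"): (1.18, 1.0062),
    ("Q", "55 team-seasons"): (1.197, 2.221),
    ("Q", "9 seasons' innings"): (1.172, 2.221),
}

def sb_share(ksb: float, kcs: float) -> float:
    """Ratio of the SB coefficient to the sum of the SB and CS coefficients."""
    return ksb / (ksb + kcs)

for (basis, sample), (ksb, kcs) in COEFFS.items():
    print(f"{basis:>2} {sample:<20} {sb_share(ksb, kcs):.2%}")
```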

Anyway, nothing more from me till at least after Christmas day.  A happy yuletide of your preference to all.




Here are the actual coefficients found and used:

  PA-based, 55 team-seasons:
----------------------------
  $K1 = 2.378;
  $K2 = 3.404;
  $K3 = 6.52;
  $K4 = 3.933;
  $Kpass = 1.0021;
  $Ksb = 1.3204;
  $Kcs = 0.85;
  $Ksf = 4.018;
  $Kibb = -0.4077;
  $Ksh = -0.192;
  $Kdp = -0.0265;
  $Kslope = 0.468135382894;
  $Kb = 0.0498042497648;

  PA-based, 9 seasons' innings:
----------------------------
  $K1 = 2.3706;
  $K2 = 3.3022;
  $K3 = 6.38;
  $K4 = 3.96;
  $Kpass = 0.9512;
  $Ksb = 1.18;
  $Kcs = 1.0062;
  $Ksf = 4.183;
  $Kibb = 0.008;
  $Ksh = -0.282;
  $Kdp = 0;
  $Kslope = 0.472088356958;
  $Kb = -0.0502527685008;


   Q-based, 55 team-seasons:
----------------------------
  $K1 = 1.8273;
  $K2 = 2.8922;
  $K3 = 6.5;
  $K4 = 3.51;
  $Kpass = 0.2163;
  $Ksb = 1.197;
  $Ksf = 5.72;
  $Kibb =  -1.31;
  $Kcs = 2.221;
  $Kdp = 0.6901;
  $Ksh = -0.0376;
  $Kslope = 1.0729881254;
  $Kb = -0.185299352505;

   Q-based, 9 seasons' innings:
----------------------------
  $K1 = 1.9025;
  $K2 = 2.8866;
  $K3 = 6.573;
  $K4 = 3.5082;
  $Kpass = 0.276;
  $Ksb = 1.172;
  $Kcs = 2.221;
  $Ksf = 5.72;
  $Kibb = -1.3;
  $Kdp = 0;
  $Ksh = -0.0177;
  $Kslope = 1.06659779649;
  $Kb = -0.184299458998;


#128    Eric Walker      (see all posts) 2009/12/24 (Thu) @ 08:04

Oops.

Of course the “Whatever” used for the Q-based method was actually:

  Advance + Outs

Haste makes &c &c


#129    terpsfan101      (see all posts) 2009/12/24 (Thu) @ 09:00

Eric, before you do any more experimenting, I suggest you read Tango’s articles on run estimation:

http://www.tangotiger.net/runscreated.html
http://www.tangotiger.net/rc2.html
http://www.tangotiger.net/rc3.html

Baseruns and linear weights are the best methods out there for measuring run creation. Why continue to tinker with an inferior, more complex, and less flexible model? I bet you if you tested Baseruns and TOP using only low OPS and high OPS games, Baseruns would have a smaller error.


#130    Tangotiger      (see all posts) 2009/12/24 (Thu) @ 09:47

I think it’s a great effort by Eric to at least show that this is not the way to do it, especially if the plus 1 numbers will come out as I suspect they will.  Basically, we have to go through this process of regression to show that regression is not what we want.

And the SB/CS thing is really irrelevant.  He best-fitted it, and we expect no logic out of it.

