THE BOOK--Playing The Percentages In Baseball

Monday, November 21, 2011

Clay’s housekeeping

By Tangotiger, 11:13 AM

Clay notes how embarrassing, and otherwise confusing, decades-old code looks.  All I can say to that is: guilty!  When you don’t follow standards, things look so messy after a few years that you not only try to avoid looking at the code, sometimes you just end up rewriting it entirely.

That’s a lesson for you kids: get it right the first time, by taking your time.  That’s why what I’ve done for the last several years is include a “readme” file in every new folder I create.  It’s basically what we call a “run book”, so that if someone comes in cold, they know exactly what needs to be done, starting from scratch.  It’s tremendously helpful.

Anyway, that’s not really the reason I linked to his article.  What caught my eye is this:

I’ve also been validating projection systems from the 2011 season. While I’m pleased with how my system (which ran with some of Nate Silver’s ideas on PECOTA, threw out some of them, replaced them with some of my own tools and approaches, resulting in a chimeric Sildavenverport monster) graded, and I was also pretty shocked at just how little difference even the most complex systems made when compared to an ordinary three-year average.

This is exactly why I created Marcel some eight years ago:
http://www.tangotiger.net/archives/stud0346.shtml

And why I thought so little of forecasting systems that I published the code so you can create it yourself.  And I thought so little of systems precisely because I spent countless hours trying to beat myself each time.  I’d come up with the basics, then think of different parameters, and try to combine them in different ways, to improve my system.  And each time, the gains would be so negligible that they were hardly worth the time.

Even things like park factors, which I presumed would make a huge difference, hardly made a dent.  And when it came to pure rookies (guys who never played in MLB), systems that were designed with extreme intelligence on the matter (Rally, MGL, ZiPS, PECOTA) were barely any better than if we just presumed the players were ALL THE SAME (while Marcel uses league average out of convenience, it’s better to just use the first-year average, or about a wOBA of 15 points under league average).  The rookies thing, the MLEs, is ripe for selection bias that makes it basically impossible for those systems to beat the most basic system.  Not to mention it makes an enormous difference if the rookie is going to be a reliever or starting pitcher.

It’s not like I just took some position on the matter, and am defending it.  I took this position because this is where the path has led me after countless hours spent studying this matter in as many ways as I can.  And it’s been re-affirmed when testing systems of other people smarter than me and who spent more time than I have, only for those people to be perhaps one step above Marcel.  If you need a visual here, we all started on Canal Street, and while Marcel is at Penn Station, those systems are on 35th street, and Times Square is just outside our reach.

Any forecaster honest with himself, and his readers, will attest to this.


#1          (see all posts) 2011/11/21 (Mon) @ 12:14

Even things like park factors, which I presumed would make a huge difference, hardly made a dent.

Doesn’t this imply that the way we grade forecast systems against each other isn’t telling us what we need to know?


#2    Tangotiger      (see all posts) 2011/11/21 (Mon) @ 12:25

Mike: can you expand your thought?  I’m not following you (yet).


#3          (see all posts) 2011/11/21 (Mon) @ 13:10

Park factors certainly matter for some players.  I guess they don’t matter for enough players to move the needle on an overall RMSE? Or the amount of noise they introduce is enough smaller than the overall error of the projection systems that we can’t see it?

The conclusion I would draw from what you’ve said is not “park factors don’t matter much for projections” but rather that our measurements of what matters are too crude/noisy.

I’ve had that concern for a long time about comparing projection systems.  I’m not sure of a better way to do it, but the overall RMSE comparisons just don’t tell us enough.  (And I think the fantasy ranking comparisons are likely even noisier.)


#4    Lex Logan      (see all posts) 2011/11/21 (Mon) @ 13:45

My brother mentioned one technique of doing the development work in one programming language, then rewriting it for production in a different one. This forces you to clean things up. Of course commenting is crucial. I demonstrated that to one of my tutees recently when he could not fathom what the code he wrote the previous week did. I’ll see how religious he is about commenting going forward.


#5    aweb      (see all posts) 2011/11/21 (Mon) @ 14:01

Diminishing returns on increasingly complex statistical models sum up why many of us are happy with ERA instead of measures like FIP, SIERA, etc. Likewise with OPS being replaced by wOBA/wRC/etc. Not a perfect analogy for projection systems, but close enough…

At some point, the extra effort to create and understand a new analysis isn’t justified for some of the population. For some, if you can explain 70% of everything you need with the triple crown stats (pitching and hitting), that’s plenty. For others, you get to 90% and that’s plenty. Still others strive onwards, and they have my thanks, because sometimes I jump on board.

An aside - given only the triple crown stats for hitters and pitchers, which group of players can you evaluate the best?


#6    Tangotiger      (see all posts) 2011/11/21 (Mon) @ 14:12

Mike: ah, ok.

When I’ve published my test results, I would always look for various subsets.  For example, I’d test to see how veterans/rookies did by the various forecasting systems.  Or good/bad players.  Or regulars/bench.  And so on.

While I never tested with guys who played for the same team / switched teams, that is definitely a valid test.

***

UPDATE: Editor’s note.  Every mention of RMSE in the post below applied to Correlation, and NOT RMSE.  I had the right argument, but I used the wrong term.  Sorry for the brain freeze.

There’s no question that we CANNOT use RMSE.  RMSE does TWO things, only one of which is allowed in our testing.

The one thing it does do, and is permitted, is to set the baseline so that we are not testing who predicted that the overall league runs per game would fall or not.  That is just such a silly thing to test, that it is required to treat each forecasting system as its own universe.  Indeed, the preference is really that all forecasting systems would present results against league average because I promise you one thing: every single forecasting system uses the league average.  The last step in every forecasting system is to apply the above/below average against some “reasonable expected” league average baseline so we can all see things like a forecasted .380 OBP, rather than a +.050 OBP.

The second thing RMSE does, and should NOT be allowed, is to change the SLOPE.  An exaggerated example is to forecast everyone with a .340 to .320 OBP.  But, RMSE will then stretch that out to .380 to .280 OBP, because that’s what will best fit to the actual results.

It’s a ridiculous thing that RMSE does, and I’m always disappointed when people who test these forecasting systems allow this to happen.

***

My preferred method of testing is as follows:

1. Baseline (using differential method only)

2. Calculate the error as:
(estimated rate minus actual rate) x actual playing time
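The two steps above can be sketched in Python. This is only an illustration under assumptions that are not in the post: the function names and the toy numbers are made up, and "baseline" is read here as subtracting each set's playing-time-weighted mean, so that every system is compared on its above/below-average differentials.

```python
def weighted_mean(values, weights):
    """Playing-time-weighted average of a list of rates."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def playing_time_errors(forecast_rates, actual_rates, actual_pa):
    """Per-player error: (estimated rate - actual rate) x actual playing time.

    Both rates are first expressed relative to their own weighted mean
    (an assumed reading of the "differential method").
    """
    f_mean = weighted_mean(forecast_rates, actual_pa)
    a_mean = weighted_mean(actual_rates, actual_pa)
    return [
        ((f - f_mean) - (a - a_mean)) * pa
        for f, a, pa in zip(forecast_rates, actual_rates, actual_pa)
    ]


# Made-up OBP forecasts, actual OBPs, and actual plate appearances.
forecast = [0.380, 0.340, 0.320]
actual = [0.360, 0.350, 0.300]
pa = [600, 450, 200]

errors = playing_time_errors(forecast, actual, pa)
print(errors)  # errors are in units of times on base, one per player
```

Because both sides are baselined with the same weights, the errors sum to zero across players; a system is then judged on the size of the individual misses (for example their root mean square), not on whether it guessed the league level.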

***

This of course presumes that we are not trying to estimate playing time.

The reality is that the single most important thing a forecasting system can do is estimate playing time!  And that is the one thing most tests of forecasting systems do not evaluate!

Given a choice between Marcel rates and Marcel playing time, or Community rates and Community playing time, I’d choose the latter.  Indeed, I would likely even choose just 2011 rates and Community playing time in 2012, rather than Marcel.  The playing time is simply that important.


#7    Tangotiger      (see all posts) 2011/11/21 (Mon) @ 14:16

aweb: I don’t like the analogy to ERA and OPS.  The only reason that those are “simple” is because someone else is doing all the work.  After all, what is an ER?  Someone else has to calculate that for you. 

Same with OPS. You need to calculate an at bat first, and that is not something that is necessarily simple.  And the same thing for a time on base, whereby you need someone to calculate what counts as a time on base.

However, had you said RA9 instead, then your point would have been perfect.


#8    KJOK      (see all posts) 2011/11/21 (Mon) @ 14:20

"Even things like park factors, which I presumed would make a huge difference, hardly made a dent.”

But I think that’s mostly because players are not changing parks enough to make a real difference.  A 2012 Marcel for Josh Hamilton may not explicitly take park factors into consideration, but since he played in the same park since 2008 that he’ll be playing in for 2012 the park factors ARE having an impact.  If players were randomly distributed to 2012 teams I think you’d definitely see that adjusting for the 2012 park would then matter.


#9    Tangotiger      (see all posts) 2011/11/21 (Mon) @ 15:31

Kevin: then I’d like to see that test, for players that did switch parks.


#10    Bill Waite      (see all posts) 2011/11/21 (Mon) @ 16:03

@Tango/6

I think you’re wrong about RMSE. If OBP is a random variable, and each player’s expected OBP is centered around some true-talent value, then expected squared error will be minimized when our projection nails the true mean OBP for each player.

I wonder if you’re thinking of some other phenomenon. For example, if we’re trying to predict something stupid, like what OBP will the best-performing (unnamed) player have in 2012, then the values we predict will be “stretched out” relative to the values we would predict if we were predicting the expected OBP of each named player.

Because what you’re saying about RMSE just isn’t true if we measure squared error appropriately (as the square of the difference between a named player’s actual OBP and his projected OBP).


#11    Tangotiger      (see all posts) 2011/11/21 (Mon) @ 16:15

Bill: ack, what was I thinking?

I mean that you can’t do what I said with CORRELATION.  Yikes.  I had a brain freeze.

RMSE is the correct thing to use, as it does what you say.

CORRELATION baselines both the mean (like RMSE), and the slope (which is the silly thing it does and should not do).

I’ve said these two points repeatedly over many years, arguing constantly in favor of RMSE over correlation for this reason.

Sorry for getting all confused.
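A small Python sketch of the corrected point: squeezing a set of forecasts toward their mean leaves the correlation with the actual results untouched, while the RMSE gets worse. The data are invented for illustration (the compressed forecasts run from .340 to .320, echoing the exaggerated example in #6), and the helper names are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)


def rmse(xs, ys):
    """Root mean squared error between forecasts and actuals."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(xs, ys)) / len(xs))


actual = [0.400, 0.350, 0.330, 0.300, 0.270]   # made-up actual OBPs
wide = [0.380, 0.355, 0.330, 0.305, 0.280]     # a forecast with a sensible spread

# Same ordering, same mean, but the spread is compressed to one fifth.
mean_wide = sum(wide) / len(wide)
narrow = [mean_wide + 0.2 * (f - mean_wide) for f in wide]

print(pearson(wide, actual), pearson(narrow, actual))  # identical correlations
print(rmse(wide, actual), rmse(narrow, actual))        # RMSE penalizes the narrow set
```

Correlation cannot tell the two forecasts apart because it refits the slope for each one; RMSE takes each forecast at face value.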


#12    Tangotiger      (see all posts) 2011/11/21 (Mon) @ 16:19

Here’s the relevant discussion (post 45):

http://www.insidethebook.com/ee/index.php/site/comments/testing_the_2007_2010_forecasting_systems_official_results/#45

“this is standardizing the slope”

Right.  But WHY would we standardize the slope?  There’s no reason to.  I already showed the example of what happens if you standardize Marcel’s slope.  That’s why I’m against correlations, because it forces the slope.

Marcel has decided that the top-end wOBA is Pujols at .439.  By standardizing the slope (changing the SD from .026 to .030 let’s say), you are giving Marcel a de facto forecast for Pujols of .030/.026*(.439-.335)+.335 = .455.

This is why I call it mathematical gyrations.  By standardizing, you are hiding what you are in reality doing.

What you are really doing is giving each of the forecasts the same SD for its forecasts.  But hiding it by making the SD = 1.  But, make the SD = .030 for everyone, and then see what you get.  It’s obvious that Marcel gets killed here.

But Marcel CHOSE to have a more shallow slope.  You can’t then tell it to have a steeper slope just to fit it to what you want!

Furthermore, you are also standardizing the OBSERVED!  So, Pujols’ actual .449 will get standardized down to (.449-.335)*.030/.044+.335=.413.

So, now you have Marcel’s standardized wOBA as .455 compared to his standardized actual of .413.  Even though in reality Marcel said .439 compared to Pujols’ .449.

Sorry.  This is mathematical gyrations, standardizing because that’s what is normally taught to do in stats class.  You can’t do it here.

...

This is the reality you have to deal with:

1. Marcel and the others forecasted the wOBA, and they did not forecast a z-score.  The slope is the slope and you can’t change it.

2. We are testing against the observed, which means true + random luck.  We are not presuming that luck is linked to true, which is what standardizing the observed implies.
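The Pujols arithmetic in the quoted post can be reproduced with a few lines of Python. The function name `rescale` is hypothetical; the numbers (.335 mean, SDs of .026, .030 and .044, and the .439 and .449 wOBAs) are the ones quoted above.

```python
def rescale(value, mean, sd_from, sd_to):
    """Stretch or shrink a value's distance from the mean by sd_to / sd_from."""
    return (value - mean) * sd_to / sd_from + mean


# Marcel's .439 forecast, with the forecast SD forced from .026 up to .030.
marcel_rescaled = rescale(0.439, 0.335, 0.026, 0.030)

# The observed .449, with the observed SD forced from .044 down to .030.
actual_rescaled = rescale(0.449, 0.335, 0.044, 0.030)

print(round(marcel_rescaled, 3), round(actual_rescaled, 3))  # 0.455 0.413
```

After the rescaling, a forecast that was 10 points below the observed value looks as if it were 42 points above it, which is the distortion the post objects to.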


#13    KJOK      (see all posts) 2011/11/21 (Mon) @ 17:12

Tom - I’ll send you the file so you can do whatever statistical test you’d like.  My criterion was that a player had to play for two completely different sets of teams between 2008 and 2009 in order not to be a ‘same park’ player.

I used 2009, which had two new parks in New York, in order to get a decent amount of ‘park switchers’.  I don’t think this should bias the sample TOO much, but not entirely sure.  Most rookies were in the ‘not same park’ group, and perhaps they need to be excluded.


#14    Tangotiger      (see all posts) 2011/11/21 (Mon) @ 17:19

Kevin: I’m not sure what file you intend to send me.  If it’s getting players who were pure team-switchers, I can derive that myself.

In order to do the testing I’m envisioning, I need the Marcel, Oliver, Chone, PECOTA, ZiPS, etc forecasts for those players.


#15    KJOK      (see all posts) 2011/11/21 (Mon) @ 17:29

Tom - Sorry, I don’t have the data for all of those. 

What I envisioned was just looking at Marcels and seeing if MARCELS were less accurate for park-switchers than for non-park-switchers.
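A minimal sketch of that comparison in Python, assuming each player has already been tagged as a park-switcher or not. The function name, the group labels, and the four rows of data are made up for illustration; they are not from the file mentioned in this thread.

```python
from collections import defaultdict
from math import sqrt


def rmse_by_group(rows):
    """rows: iterable of (group_label, projected_rate, actual_rate).

    Returns a dict mapping each group label to the RMSE within that group.
    """
    squared = defaultdict(float)
    count = defaultdict(int)
    for group, projected, actual in rows:
        squared[group] += (projected - actual) ** 2
        count[group] += 1
    return {group: sqrt(squared[group] / count[group]) for group in squared}


# Invented wOBA projections and outcomes.
rows = [
    ("same park", 0.350, 0.340),
    ("same park", 0.320, 0.335),
    ("switched", 0.360, 0.330),
    ("switched", 0.310, 0.345),
]

result = rmse_by_group(rows)
print(result)
```

With real data, a clearly larger RMSE for the switchers than for the same-park players would be consistent with the argument in #8; rookies would probably need to be excluded or given their own group, as noted in #13.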


#16    Tangotiger      (see all posts) 2011/11/21 (Mon) @ 17:29

Kevin: ok, got your file.  I can only take a look when I get home.  Note: in future, please send to tom~tangotiger~net


#17    Brian Cartwright      (see all posts) 2011/11/21 (Mon) @ 18:42

It’s been my opinion that the problem Clay is talking about is that Marcel and all of the projection systems project true talent to a degree that is smaller than a single season’s worth of random variation.

With around 600 PA, there’s about .025 wOBA of random variation. Made up numbers for example purposes only, but if God knows that Oliver can estimate to .005 and Pecota to .010, how can we test that when the single season (out of sample) that we are comparing to is .025? Every measure smaller than the single season random variation, regardless of how good it might be, when compared to that single season will get roughly the same result - that single season value. That’s why the test won’t show anybody better than Marcel. I call it the event horizon - we can’t see inside it.

The projection test you ran last winter resorted to some alternate methods of getting separation, such as matching two systems head to head on each player. If they had different projections, which one was better, then add up the wins, losses and ties.
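The head-to-head idea can be sketched as follows. This is an assumed implementation, not the code used in last winter's test: "better" is taken to mean the smaller absolute miss for that player, and the function name and data are invented.

```python
def head_to_head(proj_a, proj_b, actual, tol=1e-9):
    """Count players where system A was closer, B was closer, or they tied."""
    wins = losses = ties = 0
    for a, b, y in zip(proj_a, proj_b, actual):
        err_a, err_b = abs(a - y), abs(b - y)
        if abs(err_a - err_b) <= tol:
            ties += 1          # same miss (including identical projections)
        elif err_a < err_b:
            wins += 1          # A was closer to the actual result
        else:
            losses += 1        # B was closer
    return wins, losses, ties


system_a = [0.350, 0.320, 0.300, 0.340]
system_b = [0.345, 0.330, 0.300, 0.360]
actual = [0.360, 0.335, 0.310, 0.350]

record = head_to_head(system_a, system_b, actual)
print(record)  # (wins, losses, ties) for system A against system B
```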


#18    Bill Waite      (see all posts) 2011/11/21 (Mon) @ 21:04

17/Brian

I don’t think it’s true that you “can’t see” the accuracy of each method just because random variation inherently creates some amount of error.

If the variance due to randomness is .025, and we perfectly nail the true talent level of every player, we’ll expect to see a mean squared error in our predictions of .025.

If the variance due to randomness is .025, and we MISS the true talent level of each player, the mean squared error will be larger. For example, if our mean squared error relative to players’ true talent levels was .01, then we would expect to see a mean squared error relative to players’ single-season performance of .035 (.025+.01).
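A quick simulation can check the additivity claimed here. The sketch below assumes that the random variation and the projection miss are independent and normally distributed, which the comment does not state; the function name, seed, and sample size are arbitrary, and .025 and .01 are the example MSE values from the comment, not real baseball numbers.

```python
import random


def simulate_mse(noise_var, miss_var, n=100_000, seed=0):
    """Average squared gap between a noisy observation and a projection.

    noise_var: variance of the single-season random variation.
    miss_var:  variance of the projection's miss relative to true talent.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        true_talent = 0.0  # the level itself does not matter, only the gaps
        projection = true_talent + rng.gauss(0.0, miss_var ** 0.5)
        observed = true_talent + rng.gauss(0.0, noise_var ** 0.5)
        total += (observed - projection) ** 2
    return total / n


perfect = simulate_mse(0.025, 0.0)    # projection nails true talent
missed = simulate_mse(0.025, 0.01)    # projection misses true talent

print(perfect, missed)  # close to .025 and .035 respectively
```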


#19    MGL      (see all posts) 2011/11/21 (Mon) @ 21:17

Right, you can see it (sometimes). It is just that the certainty of what you see is not necessarily large.  If one system beats another system in any sort of reasonable test, then it is more likely than not that that system is better.  How much better and the certainty of that statement (that one is more likely than not to be better) is another story and not only depends on the parameters of the test (sample sizes, standard errors, etc.) but on our “a priori’s” as well (remember that almost anything in life is a Bayesian problem).


#20    Bill Waite      (see all posts) 2011/11/21 (Mon) @ 21:28

17/Brian and 18/myself

On the other hand, random variation does create an RMSE “floor” which you can’t possibly go below, and I think that makes many RMSE improvements (which are orders of magnitude smaller than the RMSE floor) psychologically unsatisfying, even though some subset of players might be much more accurately modeled than they have been in the past.

I’ve run across this problem trying to model opponents in poker. I can take an objectively bad model (like one that stupidly predicts that some things have a negative 5 percent chance of happening) and replace it with a model that appears to be quite good, and see an RMSE improvement less than .001. And when the RMSE goes down from .1893 to .1885 (or something), the difference LOOKS tiny.

I find that a better way (psychologically) to make myself feel good about my improvements is to focus in on the category of hands that I was mispredicting before, and in addition to looking at overall RMSE, I look at binned averages of predictions and outcomes.

This category may make up only 5-10% of the total number of hands, and even for that category of hands, the RMSE difference might not appear to be very large even for a significant misprediction, so I convert it to something that’s more visible.

When I see that my old model had an average prediction of 12% for some category of hands where the average outcome is 27%, and my new model predicts 26% (or whatever), it’s a lot more satisfying psychologically than just seeing an RMSE improvement that SEEMS so small.
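The binned comparison described above can be sketched like this. The bin edges, the function name, and the five data points are assumptions for illustration; outcomes are coded as 1 when the event happened and 0 when it did not.

```python
def binned_averages(predictions, outcomes, edges):
    """For each bin [edges[i], edges[i+1]) return
    (count, mean prediction, mean outcome); the last bin includes its upper edge.
    Empty bins are reported as (0, None, None).
    """
    bins = [[0, 0.0, 0.0] for _ in range(len(edges) - 1)]
    for p, y in zip(predictions, outcomes):
        for i in range(len(edges) - 1):
            is_last = i == len(edges) - 2
            if edges[i] <= p < edges[i + 1] or (is_last and p == edges[-1]):
                bins[i][0] += 1
                bins[i][1] += p
                bins[i][2] += y
                break
    return [
        (n, sum_p / n, sum_y / n) if n else (0, None, None)
        for n, sum_p, sum_y in bins
    ]


predictions = [0.10, 0.14, 0.12, 0.60, 0.70]
outcomes = [0, 1, 0, 1, 1]
edges = [0.0, 0.25, 0.5, 1.0]

table = binned_averages(predictions, outcomes, edges)
for row in table:
    print(row)
```

A bin where the mean prediction sits far from the mean outcome is the kind of misprediction that an overall RMSE barely registers but that is obvious in this view.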


#21    Bill Waite      (see all posts) 2011/11/21 (Mon) @ 21:45

Mike/3 and 20/myself

I’m inclined to agree that RMSE doesn’t provide much context (aside from better/worse than last model tried). If some change creates an RMSE improvement of .001, it isn’t immediately obvious what that means in terms of wins or in terms of any individual player’s prediction.

On the other hand, if we look at binned averages (e.g. we take the small percentage of hitters who moved from a pitcher’s park to a hitter’s park each year since 1960) we may see that for some category of hitters, the new model predicts wOBAs an average of 15 points higher than the old model, and 15 points closer to the true next-season wOBAs (or 5 points, or 2 points). A difference like that is immediately understandable.

RMSE is still the ideal for simple head-to-head comparison to figure out which model is better overall. If we only looked at cherry-picked subcategories, we could justify the use of just about any POS, but RMSE doesn’t lie.

But there are a lot of other things you can do on top of RMSE to help you understand the context of each RMSE improvement and decide how much the difference means to you.


#22          (see all posts) 2011/11/21 (Mon) @ 23:18

If the variance due to randomness is .025, and we MISS the true talent level of each player, the mean squared error will be larger. For example, if our mean squared error relative to players’ true talent levels was .01, then we would expect to see a mean squared error relative to players’ single-season performance of .035 (.025+.01).

Right, except that errors add in quadrature.  If the RMSE due to randomness is .025 and the RMSE due to System A is .010 and the RMSE due to System B is .005, the total observed RMSE for System A is .0269 and for System B is .0255.  And that’s for a pretty significant difference across all the players being tested.  If there is a subset of players (unknown to us) where one system does that much better, but it’s only 10% of the players, the RMSE difference over the whole set of players will be practically indistinguishable (.0250 for both).


#23          (see all posts) 2011/11/21 (Mon) @ 23:27

Correction on my last sentence in #22.  I should have written, “the RMSE difference over the whole set of players will be practically indistinguishable (.0269 for A and .0268 for B).”
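The figures in #22 and the correction above can be checked with a short Python sketch. It assumes, as those comments do, that random variation and system error are independent so their squares add; the function names are hypothetical, and the reading of the 10% case (System B is at .005 for one tenth of the players and matches System A's .010 for the rest) is an interpretation of the comment.

```python
from math import sqrt


def total_rmse(noise_rmse, system_rmse):
    """Observed RMSE when noise and system error add in quadrature."""
    return sqrt(noise_rmse ** 2 + system_rmse ** 2)


def mixed_rmse(noise_rmse, rmse_most, rmse_subset, subset_share):
    """Observed RMSE when a share of players is projected with a different error."""
    mse_most = noise_rmse ** 2 + rmse_most ** 2
    mse_subset = noise_rmse ** 2 + rmse_subset ** 2
    return sqrt((1 - subset_share) * mse_most + subset_share * mse_subset)


system_a = total_rmse(0.025, 0.010)
system_b = total_rmse(0.025, 0.005)
system_b_subset_only = mixed_rmse(0.025, 0.010, 0.005, 0.10)

print(round(system_a, 4), round(system_b, 4), round(system_b_subset_only, 4))
# 0.0269 0.0255 0.0268
```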


#24    Bill Waite      (see all posts) 2011/11/21 (Mon) @ 23:45

Mike/22

In 18, I’m talking about MSE (not RMSE), which does add up in the simple way I described. (I just used example MSE numbers so I could conveniently add them.) Not that it’s all that important.

But what I said in 18/20/21 still stands; RMSE will detect the improvements, even if the difference between .0269 and .0268 (for example) looks small to the human eye. But RMSE doesn’t give us much context outside of simply better or worse (and possibly degree of confidence that a particular model is better).


#25    tangotiger      (see all posts) 2011/11/21 (Mon) @ 23:53

Even if it’s (statistically) significant, how is it practically useful?

That is, how big a deal is it if the best forecasting system will have forecasts that will beat Marcel 52% of the time?  And what do you do with the systems that beat Marcel 48% of the time?

I mean, create enough forecasting systems, say 10 of them, and, just by luck, 5 will beat Marcel and 5 won’t.  So, ZiPS one year is better than Marcel, and another year it’s worse.  Chone is better two years in a row, and PECOTA is worse two years in a row.  What does it all mean?

This follows with the green jelly beans, and the 19 other colors.
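To put a rough number on the luck argument, here is a back-of-the-envelope Python sketch. It assumes every system is truly no better than Marcel (a 50% chance of beating it in any year) and that systems and years are independent, which real forecasting systems are not, since they share most of their inputs. The function name is made up.

```python
def prob_at_least_one_streak(n_systems, n_years, p_beat=0.5):
    """Chance that at least one system beats the baseline in every one of
    n_years, when each system beats it with probability p_beat per year."""
    p_streak = p_beat ** n_years
    return 1 - (1 - p_streak) ** n_systems


expected_winners = 10 * 0.5                      # systems beating Marcel in a year
streak_chance = prob_at_least_one_streak(10, 2)  # someone wins two years running

print(expected_winners, round(streak_chance, 3))  # 5.0 0.944
```

Under these assumptions, seeing some system beat Marcel two years in a row is close to a sure thing even if none of the ten is actually better.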


#26    Bill Waite      (see all posts) 2011/11/22 (Tue) @ 00:25

Yeah, reading the article you linked to grading the past performance of the different methods, I see that the year-to-year variance in RMSEs is around the same order of magnitude as the system-to-system variance. So you’d have to look at a large number of seasons (perhaps even more seasons than have ever been played) to say with much confidence that one system is the best.

As to practical value, I don’t know. If I had a system that beat Marcel 52% of the time, I certainly wouldn’t know how to make money off of it. But if an MLB GM could significantly out-predict everyone else for some identifiable subset of players (even if it’s just 2% of the player population), it must be worth something to his team. But I guess identifying that subset of players would take a lot more work than just a single RMSE comparison.

