THE BOOK--Playing The Percentages In Baseball


Monday, April 13, 2009

Evaluating the 2007/08 Forecasting Systems

By Tangotiger, 12:32 PM

Dan Matt gives it a go.  His conclusion:

--CHONE was the best at projecting most things.

--PECOTA was very close behind but had some systematic biases, specifically for speedy players’ BABIPs, which ZIPS struggled with as well.

I will correct him here:

...and many young players were not projected by MARCEL

As my Marcel page points out:

FAQ: “But, what about a player who’s never played MLB? Where’s his forecast?” That’s simple. His forecast is the league mean over 200 PA, 60 IP (starter) or 25 IP (reliever). If you want to know what the league mean is, just take the average of anyone forecast with a reliability of 0.00. So, Marcel’s official forecast for anyone coming over from Japan is that.
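A minimal sketch of how a tester might apply that rule when a player is not listed in the forecast file. The dictionary layout, the stat names, and the names `league_mean_line` and `forecast_for` are assumptions for illustration, not the actual format of the Marcel file.

```python
from statistics import mean

def league_mean_line(forecasts):
    """Average the stat lines of every player forecast with reliability 0.00."""
    baseline = [f for f in forecasts.values() if f["reliability"] == 0.0]
    if not baseline:
        raise ValueError("no forecast with reliability 0.00 to average")
    stats = [k for k in baseline[0] if k != "reliability"]
    line = {k: mean(f[k] for f in baseline) for k in stats}
    line["reliability"] = 0.0
    return line

def forecast_for(player, forecasts):
    """Return the listed forecast, or the league-mean line for an unlisted player."""
    return forecasts.get(player) or league_mean_line(forecasts)

# Hypothetical numbers, for illustration only.
marcel = {
    "veteran": {"reliability": 0.85, "obp": 0.370, "slg": 0.480},
    "callup_a": {"reliability": 0.0, "obp": 0.330, "slg": 0.410},
    "callup_b": {"reliability": 0.0, "obp": 0.330, "slg": 0.410},
}
unlisted = forecast_for("import_from_japan", marcel)
```

With this fallback in place, a test can include every player who actually played, whether or not he has his own row in the file.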

There were 5000 nonpitchers in pro ball in the USA last year.  It makes no sense for me to generate 4000 identical stat lines for the guys who never played in MLB in 2006-08.  And so, I make the declaration above.  Please, do not let the fact that I have not listed his forecast in the file preclude you from using that player in your tests.  Marcel HAS forecasted that player already.  I simply choose to not make the file obscenely larger than it needs to be by repeating the same line 4000 times.


#1    dan    2009/04/13 (Mon) @ 13:04

Matt wrote that article. I’m not that smart ;)


#2    Jared Cross    2009/04/13 (Mon) @ 13:25

Really interesting stuff.  Particularly the discussion of BABIP, I thought.  I checked the stdev of BABIP of the steamer projections and it’s .0165, so it’s likely regressing to the mean more than other projection systems, which is interesting to know (I guess I can’t be sure of this since the pool of players isn’t the same).

The one question I’m left with is whether CHONE was more successful going by RMSE simply by virtue of predicting the league average better.  Would CHONE have outperformed these other systems if evaluated using correlations (which would have no penalty for missing the league average)?  I realize that using correlations has its own downside since there’s no penalty for having badly missed the spread of results.  I’m wondering if the best way to analyze projection systems might be to adjust all of the projection systems based on how the projected averages compared to the actual averages and then find the RMSEs… if that makes any sense.

Anyway, good stuff here.


#3    Tangotiger    2009/04/13 (Mon) @ 13:43

Jared, yes, we talk about this all the time. 

My position is:
1. You treat the forecasts as their own universe
2. You rebaseline every forecast plus/minus to the forecasted league average
3. RMSE or absolute average error

You CANNOT use correlation, since that allows you to change the slope in addition to the intercept.  Otherwise, imagine I forecast Johan at 50% closer to the league mean, and CC 50% closer to the league mean, and everyone 50% closer to the league mean.  Guess what?  My slope will now become “2”, and I end up with the same correlation as I would have had if I had forecasted them properly.
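A small numerical illustration of that point, using made-up ERA numbers. Every forecast is pulled 50% closer to the forecast mean; the correlation with the actuals is unchanged, the slope of actual on forecast doubles, and only an error measure such as RMSE notices the difference. The function names and data are hypothetical.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def slope(xs, ys):
    """Least-squares slope of ys regressed on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / sum((x - mx) ** 2 for x in xs)

def rmse(xs, ys):
    """Root mean squared error between two equal-length lists."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(xs, ys)) / len(xs))

# Hypothetical ERA forecasts and actual results, for illustration only.
forecast = [3.00, 3.40, 4.20, 4.60, 5.30]
actual = [2.90, 3.70, 4.00, 4.90, 5.00]

# Pull every forecast 50% closer to the forecast mean.
mean_fc = sum(forecast) / len(forecast)
shrunk = [mean_fc + 0.5 * (f - mean_fc) for f in forecast]

r_full, r_shrunk = pearson(forecast, actual), pearson(shrunk, actual)
b_full, b_shrunk = slope(forecast, actual), slope(shrunk, actual)
e_full, e_shrunk = rmse(forecast, actual), rmse(shrunk, actual)
```

Here `r_full` equals `r_shrunk` and `b_shrunk` is twice `b_full`, while `e_shrunk` is larger than `e_full` for these numbers.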

In essence, we are not forecasting Beltran to hit 25 HR, but that he will hit 10 more HR than the average player.  It’s irrelevant if Chone says that average player is 17 and Marcel says it’s 12 (all numbers for illustration purposes).

This is because the “average” is the same if you use each system on its own: 260$ for 23 players.

Therefore, I reject any evaluation that:
1. gives a plus if the forecaster correctly nails the league average
2. allows the slope of the forecast to change

What we care about is the difference between forecast and mean of the population.  The wins or runs to dollars conversion should be the way to think about this.
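The three steps above can be sketched as follows. This is one reading of "rebaseline", assuming it means shifting each system's forecasts by a constant so that its forecasted mean matches the actual mean (which is the same as comparing each player's plus/minus against the average). The HR numbers and function names are hypothetical.

```python
from math import sqrt

def rebaseline(forecasts, actuals):
    """Shift forecasts by a constant so their mean equals the actual mean."""
    shift = sum(actuals) / len(actuals) - sum(forecasts) / len(forecasts)
    return [f + shift for f in forecasts]

def rmse(forecasts, actuals):
    """Root mean squared error."""
    return sqrt(sum((f - a) ** 2 for f, a in zip(forecasts, actuals)) / len(actuals))

def mean_abs_error(forecasts, actuals):
    """Average absolute error."""
    return sum(abs(f - a) for f, a in zip(forecasts, actuals)) / len(actuals)

# Hypothetical HR totals for the same four players.
actual_hr = [30, 22, 12, 8]
system_a = [27, 20, 14, 11]   # forecasted league mean of 18
system_b = [22, 15, 9, 6]     # forecasted league mean of 13, same plus/minus per player

raw_a, raw_b = rmse(system_a, actual_hr), rmse(system_b, actual_hr)
adj_a = rmse(rebaseline(system_a, actual_hr), actual_hr)
adj_b = rmse(rebaseline(system_b, actual_hr), actual_hr)
```

On the raw numbers, system B looks much worse only because it missed the league average. After rebaselining, the two systems score identically, since they made the same statement about each player relative to the average.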


#4    Jared Cross    2009/04/13 (Mon) @ 14:05

Makes sense.  Sorry to be the guy walking into the middle of a conversation here.

I realize now (what should have been obvious before) that this same thinking applies to making our projection system.  It was designed by looking at past years and seeing how we could have projected them best and we looked at which equations gave us the highest correlation and (sometimes) lowest RMSE between our projected results and actual results.  Really, we should have been looking at what equation gave the lowest RMSE (or possibly just average error) once our projected mean was adjusted to the actual mean.


#5    ubelmann    2009/04/13 (Mon) @ 15:05

I think we should be using something like a weighted RMSE, similar to a chi-square statistic, to test how well the projection systems predict player performance.

Say we have a set of 100 coins.  Some of the coins are flipped 30 times, some are flipped 150 times, some of the coins are flipped 600 times, etc.  When judging whether someone understands how the coins are weighted, we shouldn’t throw out all of the coins that were flipped fewer than 300 times--we should just acknowledge that the variance of the sample mean with a smaller sample is going to be larger than the variance of the sample mean with a large sample and weight accordingly.

Using a strict RMSE test presumes, for one, that all of the data points are equally reliable, whereas I think that systems should be expected to be closer on Jose Reyes when he gets 763 PA than Omar Vizquel when he gets 300 PA.  Also, by making a sharp cut-off at 300 PA, we’re throwing out a lot of data points that projection systems ought to know something about, even if we don’t expect them to be projected as precisely as those players with 700+ PA.
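To put rough numbers on the coin example: under a binomial model, the standard deviation of an observed rate over n trials is sqrt(p(1-p)/n), so a 30-flip coin is far noisier than a 600-flip coin. The second function is one possible reading of "similar to a chi-square statistic", dividing each squared miss by the sampling variance implied by the forecast; it is an assumption, not a method stated in this thread.

```python
from math import sqrt

def standard_error(p, n):
    """Standard deviation of the observed rate over n binomial trials with true rate p."""
    return sqrt(p * (1 - p) / n)

def chi_square_like(forecasts, actuals, trials):
    """Sum of squared misses, each scaled by the forecast's binomial sampling variance."""
    return sum(
        (a - f) ** 2 / (f * (1 - f) / n)
        for f, a, n in zip(forecasts, actuals, trials)
    )

# A fair coin flipped 30 times versus 600 times.
se_30 = standard_error(0.5, 30)
se_600 = standard_error(0.5, 600)

# The same 0.10 miss counts for much more on the 600-flip coin.
score = chi_square_like([0.5, 0.5], [0.6, 0.6], [30, 600])
```

Because each term scales with n, small samples stay in the test but cannot dominate it, which is the point of not cutting off at 300 PA.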


#6    Tangotiger    2009/04/13 (Mon) @ 15:29

I agree with ubelmann.  When I run my tests, I weight the samples based on how many actual PA they ended up with.

For those who are interested, we talked about all this 1.5 years ago:
http://www.insidethebook.com/ee/index.php/site/comments/forecast_evaluations/

Please read the main thread and at the very least the first post.

The error for each player is:

abs(actual minus adjustedForecast) * actualPA

Sum that over all players, and divide by the sum of actualPA.
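The formula above, written out as a short function. The OBP and PA numbers are hypothetical, and `adjusted_forecasts` is assumed to already be rebaselined to the league average.

```python
def pa_weighted_abs_error(actuals, adjusted_forecasts, actual_pa):
    """Average absolute error, with each player weighted by his actual PA."""
    weighted_sum = sum(
        abs(a - f) * pa
        for a, f, pa in zip(actuals, adjusted_forecasts, actual_pa)
    )
    return weighted_sum / sum(actual_pa)

# Two hypothetical players: a 700 PA regular and a 300 PA part-timer.
actual_obp = [0.350, 0.300]
adjusted_obp = [0.340, 0.330]
plate_appearances = [700, 300]

error = pa_weighted_abs_error(actual_obp, adjusted_obp, plate_appearances)
```

The regular's 10-point miss counts for 700 of the 1000 PA and the part-timer's 30-point miss for 300, so the result sits closer to the regular's error than a simple average would.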


#7    Rally    2009/04/13 (Mon) @ 15:59

Very good article by Matt.  Very interesting. I did some forecast evaluations a few years ago, but stopped for two reasons:

1) As a forecaster, I could be biased
2) Too many other things to do

But I was just looking at OPS and ERA.  Very interesting to see how the systems stack up on the details.


#8    dan    2009/04/13 (Mon) @ 17:51

Okay, just read the article. The thing that stood out most to me was that CHONE did so well for righties, but struggled for lefties. I don’t see any reason why that would be so.

But in general, I guess this article could help each forecaster improve his model a little bit.


#9    Brian Cartwright    2009/04/13 (Mon) @ 20:52

Yeah, there’s quite a bit there to go through. I knew Matt was working on it, but didn’t know how many comparisons he was going to do. I will definitely evaluate where Oliver can be improved, and I’ve already got a short list of things to work on.

Hopefully by next week (I have a paid programming project and a tax return higher in the queue) I will look at how the four systems project minor league players, using Marcel as the MLB baseline. Unfortunately, there are too few years to build a very large sample size, but I will see what I can do. In addition to RMS, I’m going to try a similarity score, based on the t-scores for each player in
1. (h-hr)/bip
2. (do+tr)/(h-hr)
3. hr/(ab-so+sf)
4. bb/pa
5. bb/pa
using the distance formula recently discussed. My idea is to see how closely each system is able to project each minor leaguer on those five stats, and for those five collectively how well each system can profile the player into a score similar to what Marcel says the MLB record is.
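A rough sketch of what such a similarity score could look like. The distance formula referred to is not given here, so plain Euclidean distance is assumed, and "t-score" is read as a standardized score (distance from the population mean in standard deviations). The means, SDs, and rate values are made up, and the five list positions simply stand for the five rates listed above.

```python
from math import sqrt

def standardize(value, mean, sd):
    """Standardized score: how many population SDs a rate sits from the population mean."""
    return (value - mean) / sd

def profile_distance(projected, baseline, means, sds):
    """Euclidean distance between two standardized five-rate profiles (assumed formula)."""
    return sqrt(sum(
        (standardize(p, m, s) - standardize(b, m, s)) ** 2
        for p, b, m, s in zip(projected, baseline, means, sds)
    ))

# Hypothetical population means and SDs for the five rates, in list order.
means = [0.300, 0.220, 0.040, 0.090, 0.180]
sds = [0.030, 0.050, 0.020, 0.030, 0.050]

marcel_line = [0.300, 0.220, 0.040, 0.090, 0.180]   # stand-in for the later Marcel
projection = [0.330, 0.220, 0.040, 0.090, 0.180]    # one SD high on the first rate only

d = profile_distance(projection, marcel_line, means, sds)
```

A smaller distance means the system profiled the minor leaguer more like what Marcel later says his MLB record is; averaging the distance over all players would give one score per system.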


#10    Matt Swartz    2009/04/13 (Mon) @ 22:39

I did not realize that Marcel did actually project all of those players.  I didn’t do the spreadsheet stuff in a terribly efficient way, so it will take a while for me to redo it with those projections in there, but I will definitely try to at some point.  Using PA-weighted tests might be helpful too, so I’d like to try that at some point too, at least after I do something on pitching projections.

I did do correlation tests as well, and the results were pretty similar.  The only difference was that Oliver did a bit better on correlations for the three true outcomes since it was missing the mean on them.

Rather than Chone struggling with lefties, I would argue that Pecota, by using comparable players, may have been able to capture something unique about these hitters.  Also, lefty hitters are less likely to play C, 2B, 3B, and SS, so perhaps Pecota was just better able to project power hitters in general: Pecota was better than Chone at projecting OPS for guys who hit over 15 homers, and those guys are bound to be disproportionately left-handed.

Brian, I like those ideas for tests.  I think that in general running these kinds of tests is important, since just testing summary statistics like OPS misses the point when people are interested in the projection of the individual outcomes that make it up.


#11    2009/04/14 (Tue) @ 01:11

It just goes to show (what we all knew) how hard it is to eke out that extra 1% accuracy in projection systems. And what for?

Would love to see the weighted RMSE scores but great job to Rally for consistently beating PECOTA despite (I’m guessing) 5% of the development time ...


#12    Rally    2009/04/14 (Tue) @ 09:36

I doubt 5% of development time.  Especially in the winter of 2006-07, I was working on it almost every day. Unless Nate had a team of interns working on it, no way he was putting significantly more work into the system, there just isn’t enough time in the world for one person to do that. 5% publicity maybe.

In regards to the system working better for righties and switch hitters than lefties, that is either a fluke or a product of covariance (lefties more often hitting for power or something like that).  I don’t even bring batter handedness into the system.


#13    Brian Cartwright    2009/04/15 (Wed) @ 19:30

One of the things Clay said at the BP2009 book signing was that PECOTA is currently being migrated from spreadsheet to database.

I consider it a great programming feat to have done all of what PECOTA did in a spreadsheet. I know I couldn’t do it, but that’s also what bulked it up and slowed it down. It took so long they could only run it a few times a year. Clay said once it’s in a db they can run it daily to update BP’s site.

I had the main concepts of Oliver in my head for 20 years, then spent a year in Excel, doing much copying and pasting. It only took a long weekend to move all the formulas over to Access, and build in some new features. I’ve been tweaking it in the 9 months since.


#14    Trev    2009/04/17 (Fri) @ 17:37

Has anyone tried retroactive evaluations?  We’ve already evaluated CHONE v.2008 vs. actual 2008, but how does CHONE v.2008 (assuming some improvements over CHONE v.2007) do against the actual 2007?

I suppose this would have to be internal, but I’m wondering if any projection creators have ever done this.


#15    Brian Cartwright    2009/04/17 (Fri) @ 18:17

I’m looking at that right now, working on a future FanGraphs article. Matt and I have CHONE 2007-2009, ZiPS & PECOTA 2006-2009, and Oliver 1998-2009 (it’s mine so I can do back seasons easy enough).

Matt was comparing to MLB stats in various categories; I am looking at projecting players still in the minors, by level and age, compared to a Marcel done the year after their rookie season. For example, a 2007 rookie (defined as <=150 PA before 2007 and 150+ PA in 2007) uses the three-year Marcel done at the end of 2008.

Still analyzing and discussing with a few friends.


#16    Rally    2009/04/18 (Sat) @ 14:14

Trev, yes I have.  Even have a name for it - retrojections.  In the Matt Wieters thread a few weeks ago I answered the question of what projection systems would have thought about guys like Jim Rice, Andre Dawson, and Wade Boggs the year before they were rookies.

But I haven’t done it for a whole forecast year.  Just don’t have the time.

