THE BOOK--Playing The Percentages In Baseball


Tuesday, January 16, 2007

Unified Forecasts

By Tangotiger, 02:15 PM

I was going to do this for all the players in the league, and Ryan here gets a head start on the A’s.  If someone wants to help me out, provide me with a spreadsheet of player names, ID, and OPS or ERA for whatever forecasting system(s) you’re going to input or parse (ZiPS, PECOTA, Shandler, James, Tippett, CHONE, etc).  If you want to include the full batting or pitching line, that’s ok too.

The “ID” should be the playerid from the Lahman database, or Retrosheet.  If not, then something else to help identify him, like AB, H, or IP from 2006.  This way, we can present a composite of all the forecasting systems out there, which should be better than any single one of them.

Anyone want to help out?


#1    Trader Joe      (see all posts) 2007/01/16 (Tue) @ 14:45

Aren’t the PECOTAs copyrighted?  I imagine you can publish the combined (averaged) figures from several systems, but can you publish the individual PECOTA forecasts?


#2    Tangotiger      (see all posts) 2007/01/16 (Tue) @ 14:59

I wasn’t going to publish anything individual, just overall averages.

***

I respect the copyright of any author.  That said, whether Ron, Nate, Tom, Dan, Bill, Sean would have any copyright claim, I have no idea.  Algorithms are not copyrightable.  Ideas are not copyrightable.  Expressions of ideas are. 

I published the Leverage Index numbers.  Do I have a claim over it?  I don’t know.  It is original, but if there’s only one way to represent it, then, I don’t think it holds water.  (In any case, I revealed the secret recipe, so anyone can have at it.)

I published Win Expectancy numbers in The Book.  Do I have a claim over it?  I don’t know.  It’s based on an algorithm, and there’s nothing really original in there.

If someone wants to debate the issue, go right ahead.


#3    Trader Joe      (see all posts) 2007/01/16 (Tue) @ 15:50

I think you probably have to distinguish between publishing the raw data (forecasts) in the case of PECOTA and using the data in an analysis.  I think BP claims a copyright over the data for publication but certainly expects people to report parts of it for discussion (under some notion of “fair use”). And it certainly expects people to use and discuss the data as many people have in the past, not just for their own analyses and fantasy baseball purposes but also by comparing forecasts from PECOTA against other forecasts (for individual players, teams, or what have you). But that’s different from publishing/distributing the raw estimates in a spreadsheet.

In my business I use a number of data sources that I may acquire from (copyrighted) books, government docs and websites, and other sources (e.g., data collected by other researchers but distributed in data banks or directly to me in response to a request to share their data).  In one sense all of these data are available for use and analysis once they are published, and if someone is making scientific claims based on them. 

But if somebody has provided those raw data with a reservation of some kind (such as BP claims), I can’t replicate and redistribute the raw data without permission except for a few representative or illustrative figures or in the form of summaries or as input to an analysis.  That is, I can analyze and compare the data to other figures from other sources, but I can’t reproduce all the data and distribute or publish it for someone else to use.


#4    Rally      (see all posts) 2007/01/16 (Tue) @ 15:57

I have no idea what copyright laws apply, but for the CHONE projections what’s on my blog is public use.

Sorry about not having player IDs.  The reason is that I use minor league stats, which don’t have the Lahman ID.  I combine their stat lines just with player names.  When I get a conflict, I just find a way to make the names unique, such as using Alex and Sea Bass Gonzalez.


#5    John Beamer      (see all posts) 2007/01/16 (Tue) @ 16:21

I think the LI example is different because the methodology is published so is relatively easily replicated (same as Marcel). But if I suddenly copied the CHONE forecasts and called them, BEAMER, for instance I’d imagine you probably wouldn’t speak to me ... if I then charged people $10 for them (a) they’d be stupid because they could get CHONE for free, and (b) you would probably be within your rights to sue me.


#6    Tangotiger      (see all posts) 2007/01/16 (Tue) @ 16:27

Joe, I should reiterate that I’d only be distributing the average results, like Ryan on the A’s site, and not the individual data. 

Now, “data” is not copyrightable, as noted in the landmark Feist/Rural case.  The question is: what is data?  I can reproduce almost the entire Retrosheet website, as-is, and not have an issue (just like Feist/Rural).  There is almost nothing original in its presentation, and almost all its data belongs to the public domain.

Now, what is a “forecasted data”?  That is likely not “data”, but just an idea represented as a number.  Ideas are not copyrightable.  Expressions of ideas are.  Is a forecasted data an expression of an idea?  Probably not. 

As for “terms of use”, those are the most worthless kind of agreement around.  Just because a TOS link is published doesn’t mean the reader actually clicked it, read it, and/or agreed to it.  That’s why with software, they make you actually click “I agree” before proceeding.

A TOS in a book would be different.  If you have been granted a licence to an analysis (as forecasted data would be), and that licence says not to redistribute, then you can’t.  It doesn’t even fall under “fair use”.

However, that doesn’t mean that contract law will preempt copyright law, or vice versa.  Only a judge of a specific case knows the real answer.

I’m not a lawyer, and only play one in cyberspace.  Don’t take legal advice from me, or anyone on this site.


#7    dkappelman      (see all posts) 2007/01/16 (Tue) @ 16:28

You might have an issue if you were to go out and sell the conglomerated projections without the permission of all the projections’ “owners”, since it’s most likely a derivative work.  Otherwise, it probably falls under fair use.

You could reverse engineer the projection systems and publish your own identical set under a different title.  I’m assuming Pecota and Shandler’s systems are considered trade secrets and unless you legally obtain the knowledge on how to replicate them, you’re out of luck.


#8    Tangotiger      (see all posts) 2007/01/16 (Tue) @ 16:28

John: but what about LI, prior to my revealing the recipes?


#9    dkappelman      (see all posts) 2007/01/16 (Tue) @ 16:49

Tom: If you didn’t release the methodology of how to do LI, then it’s most likely a trade secret and can be considered IP, until someone figures out how to do it themselves.

I’d think even if someone knew how to calculate it and decided to just “steal” it off your site or from your book without putting in the work, that would be a problem.

Baseball stats are more or less public domain, but let’s say someone wanted to “scrape” ESPN’s baseball stats (STATS LLC stats) and use them in a for profit way, that could easily result in a cease and desist letter.

That’s why baseball-databank.org is nice, because from what I can tell, it has zero limitations on use.  Few things out there do.


#10    Tangotiger      (see all posts) 2007/01/16 (Tue) @ 16:55

IP is about trademark, copyright, and patents.  (In the US anyway.) None of which applies here.

“Trade secret” would be bound by contract law, not IP.  And the “trade secret” is the process, not the result.  And, if I can deconstruct it, I don’t see how it is a “trade secret”.

My provision that I’m not a lawyer still applies.


#11    tangotiger      (see all posts) 2007/01/16 (Tue) @ 17:43

I stand corrected on “Trade secret”:

http://www.uspto.gov/web/offices/ac/ahrpa/opa/museum/1intell.htm


#12    dkappelman      (see all posts) 2007/01/16 (Tue) @ 18:03

I should note I’m playing lawyer too.  But it’s always fun to play one, until a real IP lawyer shows up and tells me I’m wrong....

Speaking of contracts, at least one subscription-based baseball website does a pretty good job keeping its formulas relatively safe with its “click wrap” contract.  Just shows why it’s actually worth reading the TOS before you tick the checkbox.


#13    Rally      (see all posts) 2007/01/16 (Tue) @ 19:38

How does a “click wrap” contract work?


#14    Rally      (see all posts) 2007/01/16 (Tue) @ 19:40

Tango, have you done anything on the CHONE spreadsheet yet? 

I could get most of the Lahman IDs in for the players with major league experience, but I don’t want to duplicate work if you’ve already done this.


#15    tangotiger      (see all posts) 2007/01/16 (Tue) @ 20:03

Nope, haven’t done anything yet.  And maybe I’ll ask Chris or Jeff to provide their minor league IDs, so we can get something more standardized going too.


#16    dkappelman      (see all posts) 2007/01/16 (Tue) @ 20:29

Usually right before you subscribe to a website, you’re required to check the box that says you agree to the TOS for that particular site.

Basically you’re willingly engaging in a contract with the provider and are subject to the terms that you “should” have just read.

For instance, on BP’s site: https://baseballprospectus.com/store/expresspurchase.php?t=newsub&q_sub_type=premium

requires you check that you agree to this:
http://www.baseballprospectus.com/tos/

where there’s a bit about IP and Fair Use. 

I don’t see anything out of the ordinary in BP’s (not that I read the whole thing), but there’s occasionally extra “stuff” in there.


#17    Rally      (see all posts) 2007/01/16 (Tue) @ 20:30

I did a vlookup on it, and now I’m just editing the cases where my name is off a bit from the “last, first” name in the master file.  There are quite a few of those cases.

I can probably give you a decent file by the end of the week.


#18    Rally      (see all posts) 2007/01/16 (Tue) @ 20:31

Another cool feature of unified projections is we could measure the amount of agreement for each player.
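
A rough sketch of one way to measure that agreement: take the standard deviation of the systems' forecasts for each player, where a small spread means the systems agree. The player names, system labels, and OPS values below are made up for illustration, and `forecast_spread` is a hypothetical helper, not anything from an existing tool.

```python
from statistics import mean, pstdev


def forecast_spread(forecasts):
    """Return (mean, standard deviation) of one player's forecasts across systems."""
    values = list(forecasts.values())
    return mean(values), pstdev(values)


# Hypothetical OPS forecasts, for illustration only (not real projections).
players = {
    "Player A": {"sys1": 0.800, "sys2": 0.810, "sys3": 0.805},
    "Player B": {"sys1": 0.700, "sys2": 0.800, "sys3": 0.750},
}

for name, forecasts in players.items():
    avg, spread = forecast_spread(forecasts)
    print(f"{name}: mean OPS {avg:.3f}, spread {spread:.3f}")
```

Here the systems agree closely on Player A and much less on Player B, even though both get a single unified number.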


#19    dkappelman      (see all posts) 2007/01/16 (Tue) @ 20:37

Tom/Rally, would you have any objection to me doing something with Marcels and/or CHONE projections on FanGraphs?  I thought it might be neat to add them to the stats pages, maybe even do something with the usual set of graphs.


#20    tangotiger      (see all posts) 2007/01/16 (Tue) @ 22:40

Have at it.  As I did with Hardball Times, anyone who wants to do anything with the Marcels, feel free.  As long as you don’t charge people for it, and link back to my site, it’s for anyone to use.

As for BP’s (or anyone else’s) TOS, for subscribers/purchasers, you are bound to the licence.  For non-subscribers, you are not. 

I’m not a lawyer, so don’t listen to my legal advice.


#21    philly      (see all posts) 2007/01/17 (Wed) @ 00:07

I’m in the process of compiling projections for Red Sox players.  When I’m done I can send whomever a spreadsheet with just the averages.

I’ve only included the players expected to make the majors.  The Sox have 13 established position players.  The pitching staff is more fluid, but I’m only doing what looks like the top 12.  Some systems have Matsuzaka projections and some don’t.  None of them, not even PECOTA (which surprised me), has projections for Hideki Okajima.

Just a quick summary of the cumulative team projections.

OPS
ZiPS - 823
Marcel - 824
CHONES - 820
James - 821
PECOTA - 831
Shandler - 841

Not too surprisingly, there’s pretty strong agreement there.  The top 4 are dead on, with PECOTA and Shandler a bit more optimistic.

ERA/FIP
ZiPS - 4.53/4.25
Marcel - 4.57/4.35
CHONES - 4.18/4.28
James - 3.79/3.88
PECOTA - 4.55/4.22
Shandler - 4.23/4.15

Note: these are all with Matsuzaka excluded.  His very good projections generally bring the cumulative projection down by about 0.10.

Much greater spread here, with ZiPS, Marcel and PECOTA around 4.5, Shandler and CHONE around 4.2, and James a major outlier at 3.79.  I’m really curious whether the James projections are that much lower across the board, but James/BIS is the only one that isn’t released as a spreadsheet.

I included FIP because when the Sox signed Matsuzaka Silver posted his PECOTA which was relatively high and then pointed out that that was actually a good projection because of the league/park/defense environment.  From that I assumed that PECOTA would have a relatively large ERA-FIP difference and that’s true.

The CHONES and James actually project the ERA to be lower than the FIP.  That’s a change from CHONES 2.0 to CHONES 2.1.  In that change the Sox ERA projections dropped quite a bit, but the FIP ERAs mostly stayed the same.  I would bet that at least for the Sox the 2.0 projections would be better mostly for that reason.

Note that the FIP ERAs are clustered much more closely than the ERAs.  James is still a major outlier but the other 5 systems are between 4.15 and 4.35.
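
As a quick check on that clustering, the snippet below computes the range (max minus min) of the team ERA and FIP figures listed above, with and without James. The numbers are the ones quoted in this comment; `value_range` and `without` are hypothetical helper names used only for this sketch.

```python
# Team-level Red Sox figures quoted in this comment (Matsuzaka excluded).
era = {"ZiPS": 4.53, "Marcel": 4.57, "CHONE": 4.18,
       "James": 3.79, "PECOTA": 4.55, "Shandler": 4.23}
fip = {"ZiPS": 4.25, "Marcel": 4.35, "CHONE": 4.28,
       "James": 3.88, "PECOTA": 4.22, "Shandler": 4.15}


def value_range(values):
    """Spread between the highest and lowest forecast."""
    values = list(values)
    return max(values) - min(values)


def without(table, system):
    """All values in the table except the named system's."""
    return [value for name, value in table.items() if name != system]


print(f"ERA range, all six:    {value_range(era.values()):.2f}")
print(f"FIP range, all six:    {value_range(fip.values()):.2f}")
print(f"ERA range, no James:   {value_range(without(era, 'James')):.2f}")
print(f"FIP range, no James:   {value_range(without(fip, 'James')):.2f}")
```

With James included, the ERA range is 0.78 against 0.47 for FIP; dropping James, it is 0.39 against 0.20.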


#22    Rally      (see all posts) 2007/01/17 (Wed) @ 09:58

Dave, feel free to post the CHONEs on your site.  Just give me credit somewhere with a link.


#23    bedir than average      (see all posts) 2007/01/18 (Thu) @ 12:56

Tango, last year, working with insidethepark.com we compiled composite projections of OPS and ERA for the Mariners.  We used Marcel, ZiPS, the BJ Handbook but then we also included projections from a handful of scouts as well as from the posters on the board through a series of polls.  I then tried to adjust for batting order and playing time to get a composite projection for the team OPS and ERA in order to project team wins.

On an individual basis I felt that the individual projections were fairly accurate, but that my playing time estimates were way off.  I found intriguing the idea of including “scouting” style information in the projection by going both to the masses and to insiders.  It is somewhat similar to your Fan Scouting Report of Defense, but projecting offense/pitching ahead.


#24    tangotiger      (see all posts) 2007/01/18 (Thu) @ 13:11

bedir, I was thinking the same thing.  I was going to do a “Wisdom of the Crowd” / “Community” forecast, in addition to compiling a unified forecast from the pros.  In essence, repeat this project:
http://www.tangotiger.net/forecastFinal.html

But do it for all players. 

It’d take some effort, and I’m not sure I can invest my time on it.


#25    John Beamer      (see all posts) 2007/01/18 (Thu) @ 13:12

Rally, am I right in thinking that you looked at the R of some combined forecasts for 2006 and found that some of the individual forecasts (I am thinking of PECOTA specifically) still had a higher R?


#26    Rally      (see all posts) 2007/01/18 (Thu) @ 14:25

Yes, last year Pecota beat the combined forecasts.  Can they do it again?  We’ll see.


#27          (see all posts) 2007/01/19 (Fri) @ 01:19

Would the unified/combined forecasts weigh each individual forecast equally?  Might this skew the results toward those of forecasting engines that are most similar to each other?

I might guess that the most open algorithms tend to be similar since they use widely discussed techniques.  Proprietary engines (PECOTA, Shandler) are probably more difficult to replicate.  I wonder if this might explain why PECOTA beat the combined forecast last year.  Maybe a better combined forecast would give only half weight to two individual forecasts that are most similar (or just drop one of them altogether).
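
A minimal sketch of that half-weight idea, assuming a simple weighted average. The system labels and OPS values are invented for illustration, and `weighted_forecast` is a hypothetical helper; nothing here says which real systems would count as "most similar".

```python
def weighted_forecast(forecasts, weights):
    """Weighted average of one player's forecasts across systems."""
    total_weight = sum(weights[system] for system in forecasts)
    weighted_sum = sum(forecasts[system] * weights[system] for system in forecasts)
    return weighted_sum / total_weight


# Hypothetical OPS forecasts: A and B are the two "similar" systems.
forecasts = {"A": 0.820, "B": 0.824, "C": 0.850}

equal_weights = {"A": 1.0, "B": 1.0, "C": 1.0}
half_weights = {"A": 0.5, "B": 0.5, "C": 1.0}  # A and B share one "vote"

print(f"equal weights: {weighted_forecast(forecasts, equal_weights):.4f}")
print(f"half weights:  {weighted_forecast(forecasts, half_weights):.4f}")
```

With equal weights the two similar systems pull the composite toward their shared view; halving their weights gives the dissimilar system as much say as the pair combined.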


#28    tangotiger      (see all posts) 2007/01/19 (Fri) @ 07:58

Let’s not forget that PECOTA beat the combined last year for hitters only, and this was based on 100 or so batters.  I doubt the difference would be statistically significant.

I also strongly suggest that the proper way to do the test is:
1 - each forecasting system provides a population mean (rather than letting the tester use the sample mean)
2 - use RMSE, not correlation

Why?  Remember that we are looking at
y=mx+b

That “b” cannot be floating, based on what your sample is.  If I look at 100 batters, maybe the sample OPS mean of Shandler is higher than Bill James, but if I look at 300 batters, it could be the reverse.  That makes no sense.  So, you absolutely need to do OPS minus pop mean (for each forecaster).  That’s why, when I did my mini-study three years ago, I asked each forecaster to provide the league mean they were basing their OPS against.

For the second case, you can’t have the slope (m) be anything but 1.  In a correlation, it will assign whatever m it wants (though typically very close to 1.00 in our cases here).  Still, there’s no reason for the correlation to assume anything but 1.  It’s cheating otherwise.  Let’s say all forecasters have forecasted OPS of +/-.250 (say Pujols is 1.000, and some bum is .500).  But, someone else comes along and calls Pujols .875 and that same bum .625, and everyone in-between is similarly scaled.  Guess what?  When you run the correlation, you will get the exact same correlation, because all that’s changed is the slope (correlation will assume m=2).

That’s why you need to do RMSE (or simply a straight absolute differences from the mean).
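
To illustrate the point, here is a small sketch with invented numbers: one forecaster spreads players from roughly .500 to 1.000, and a second forecaster halves every distance from a .750 mean (so 1.000 would become .875 and .500 would become .625). The correlation with actual OPS is identical for both, while the RMSE is not. The OPS values and the helper names `pearson_r` and `rmse` are assumptions for this example only.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def rmse(forecast, actual):
    """Root mean squared error, which implicitly holds the slope at 1."""
    return sqrt(sum((f - a) ** 2 for f, a in zip(forecast, actual)) / len(actual))


# Hypothetical actual and forecast OPS values, for illustration only.
actual = [1.000, 0.900, 0.750, 0.600, 0.500]
full_spread = [0.980, 0.880, 0.760, 0.640, 0.490]

# Second forecaster: same ordering, but every player pulled halfway to .750.
MEAN_OPS = 0.750
compressed = [MEAN_OPS + (f - MEAN_OPS) / 2 for f in full_spread]

print(f"r:    {pearson_r(full_spread, actual):.4f} vs {pearson_r(compressed, actual):.4f}")
print(f"RMSE: {rmse(full_spread, actual):.4f} vs {rmse(compressed, actual):.4f}")
```

The correlation cannot tell the two forecasters apart because it absorbs the change in scale into the slope; RMSE penalizes the compressed forecasts for missing the actual spread.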


#29    Trader Joe      (see all posts) 2007/01/20 (Sat) @ 14:38

If I recall correctly, in his first comparative assessment of PECOTA a few years ago, Silver used R-sq *and* RMSE, and included all cases (players) who were included in the preseason estimates of all the forecasters (i.e., a common set of players) from DiamondMind, Shandler, and a couple of others.  Also he used far more than the 100 or so cases that Chone or Szymborski reported recently for 2006.  (Since PECOTA makes some 1600 projections, the smaller N of 100 to 400 or so that we see in the typical comparison of systems is almost entirely due to omissions by the other forecasts.  Even Marcel provides no predictions for rookies or others with little major league experience.)

I also recall that at that time many on Primer were bitching about Silver’s use of RMSE! However, I agree with you that this is really what we’re after:  minimizing the “average error” in predicted player performance, not the correlation as such.  You could get a perfect 1.0 correlation (and explain 100% of the variance—R-sq of 1.0) and still be far off in your average error if you had some sort of bias built into your system (and I’m not just referring to the intercept being off due to case selectivity or something like that).


#30    tangotiger      (see all posts) 2007/01/20 (Sat) @ 17:25

“Even Marcel provides no predictions for rookies or others with little major league experience.”

Correction: if you have 1 PA or BFP, Marcel’s got your number.  Marcel further assumes that anyone with 0 career PA or BFP gets the league average.  So, Marcel provides predictions for anyone who will play in 2007.


#31    bedir than average      (see all posts) 2007/01/20 (Sat) @ 17:28

Tango, is that new this year?  I ask because Marcel did not have a projection for Kenji Johjima last year.  ZiPS did and it was rather high.


#32    tangotiger      (see all posts) 2007/01/20 (Sat) @ 19:40

It is implied in the Marcel process, since the regression toward the mean component automatically starts everyone off with a career 1200 PA at league average.

So, if let’s say the league average OBP is .3333, then everyone starts off with 400 times on base, and 1200 PA.  If you go 0-0 (have no career MLB), then your forecast is .3333.  If you go 1-1, then your forecast is .3339 (401/1201).  If you go 0-1, then your forecast is .3331.

I was thinking that I really should use a mean component of .310 or .320.  But, when it comes to testing, we typically only look at players with at least 200 or 300 PA.  And, in those groups of players, if you are a rookie, chances are you’re closer to the league mean than not.  It’s a little cheating, but, what the heck.  Marcel doesn’t want to spend more than 2 minutes to run the forecast.
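
The arithmetic above can be sketched as follows. This covers only the regression-toward-the-mean step described here (1200 PA of league-average ballast), not the rest of the Marcel process; the function and parameter names are hypothetical.

```python
def regressed_rate(times_on_base, pa, league_rate=1 / 3, ballast_pa=1200):
    """Blend a player's record with ballast_pa plate appearances at league_rate."""
    ballast_on_base = league_rate * ballast_pa  # 400 times on base at a .3333 OBP
    return (ballast_on_base + times_on_base) / (ballast_pa + pa)


# The three cases from the comment: no MLB record, 1-for-1, and 0-for-1.
for on_base, pa in [(0, 0), (1, 1), (0, 1)]:
    print(f"{on_base}-{pa}: forecast OBP {regressed_rate(on_base, pa):.4f}")
```

Passing a different `league_rate` (say .310 or .320) shows what a lower starting point for players with no record would look like.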


#33    tangotiger      (see all posts) 2007/01/25 (Thu) @ 12:38

Great work over at Fangraphs:
http://www.fangraphs.com/statss.aspx?playerid=1857&position=C

David is now showing the forecasts for 3 systems, including Bill James.  Now, we just need to get ZiPS, Shandler and/or PECOTA over there…


#34    dkappelman      (see all posts) 2007/01/25 (Thu) @ 15:21

Dan has graciously allowed me to post ZiPS, so those should be up later today or tomorrow.  I’m not going to hold my breath for Shandler or PECOTA....

I will throw Shandler and PECOTA in a dev database at some point with the BIS playerIds.  I’m not sure what date I should use for the Shandler projections since they are constantly being updated and they do not continue to carry their older projections.  Not sure if the same is true about PECOTA?

I could use the ones from the book I suppose, but I’m not so into the idea of doing that much data entry.


#35    Trader Joe      (see all posts) 2007/02/02 (Fri) @ 22:27

Keep in mind that PECOTA comes in waves. The first one was posted on BP around Jan. 15, the second has just been posted (Feb. 2).  If I recall correctly, there will be at least two more iterations before the season starts.  Each iteration involves some adding of players to the overall set of forecasts, some changes of teams, the use of increasingly refined estimates of defensive support for pitchers, and the fitting of the estimates to revised and more plausible depth charts.  See today’s BP Unfiltered post: http://baseballprospectus.com/unfiltered/?p=176

The changes with successive iterations may not matter that much for the aggregate comparisons you’re planning to make, but presumably the last iteration is Silver’s best set of forecasts. They’re also what he uses for his forecasts of team RA and RS.


#36    HarryAbles      (see all posts) 2007/02/20 (Tue) @ 22:36

This is really a minuscule issue, but what’s the “correct” way to actually average the projections?  I thought of three different ones, shown here using the CHONEs and ZiPS for Pujols:

#1 - Just a straight average of all the counting stats, and then AVG/OBP/SLG based on those numbers (.32153/.42494/.64396)

#2 - Average the PAs (just H+BB+HBP here), and then the rates for each event - i.e. CHONE has him at .2755 H/PA, ZiPS says .2697, for a .2726 average.  Multiply average rates by average PAs to get counting stats, then same as above (.32159/.42490/.64399)

#3 - Same as #2, except use harmonic mean to get average rates (.32156/.42479/.64383)

The two projections are virtually identical, and I think somebody more controversial like Hanley would produce bigger differences, though they’d probably still be minimal.  Just wondering which is the right way to go about it.
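
For concreteness, here is a sketch of the three methods applied to a single rate, H/PA. The two H/PA rates are the ones quoted above (.2755 and .2697); the PA totals are hypothetical placeholders, since the projected PA are not given in this comment, and the function names are made up for the example.

```python
from statistics import harmonic_mean, mean

# H/PA rates quoted above; the PA totals are hypothetical placeholders.
systems = {
    "CHONE": {"pa": 680, "h_per_pa": 0.2755},
    "ZiPS": {"pa": 650, "h_per_pa": 0.2697},
}


def pooled_rate(systems):
    """Method 1: combine the counting stats, then derive the rate."""
    hits = sum(s["pa"] * s["h_per_pa"] for s in systems.values())
    pa = sum(s["pa"] for s in systems.values())
    return hits / pa


def mean_rate(systems):
    """Method 2: straight average of each system's rate."""
    return mean(s["h_per_pa"] for s in systems.values())


def harmonic_rate(systems):
    """Method 3: harmonic mean of each system's rate."""
    return harmonic_mean([s["h_per_pa"] for s in systems.values()])


for label, func in [("pooled", pooled_rate), ("mean", mean_rate), ("harmonic", harmonic_rate)]:
    print(f"{label:>8}: {func(systems):.5f}")
```

Method 1 is a PA-weighted average of the rates, so it leans toward whichever system projects more playing time; methods 2 and 3 ignore PA when averaging the rates.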


#37    Chris Miller      (see all posts) 2007/02/20 (Tue) @ 23:19

I had the same question at Lookout Landing about the USSM/LL community projections.  They’re averaging the counting stats, then deriving the rates from that.  Is that the best way, or is there a better way?  I would think it would overvalue higher PA projections, but it was brought up that there is a higher level of error in the low PA projections.  I’m definitely no expert on an issue like that, so it’d be interesting to hear some takes on the subject.


#38    tangotiger      (see all posts) 2007/02/21 (Wed) @ 09:20

I would do it on a per-PA basis: average the per-PA rates, then average the PA, then multiply the two.

Fans are estimating a player’s talent level, and not figuring “well, if he has 200 PA, he might hit .380”.  We might observe something like that in reality, but no one will make such a forecast.

However, for pitchers, it’s different.  You *would* do such a thing for say Papelbon.  If you forecast him as a reliever, his rate stats will be much better. 

In the end, for USSM, it won’t matter, since most fans will forecast a PA level that is around the same.  And in any case, I’d go with the median, not the mean.


#39    tangotiger      (see all posts) 2007/02/21 (Wed) @ 13:27

David at Fangraphs writes a 2-page article for SI, of which I’m linking to the second page:
http://sportsillustrated.cnn.com/2007/baseball/mlb/02/19/fangraphs.projections/1.html

This is the kind of stuff I was going to do, so I’m very glad someone else is taking the bull by the horns.

If it wasn’t an SI article, I would have suggested not using “ranking” but “OPS minus populationOPS” as the measuring stick.

First, it keeps the forecasting system in context.  Second, we really do care how good Ryan Howard will be, since he might cost $27 in one system and $34 in another.

You *could* try to equalize it even more by taking the differential method and dividing by the standard deviation.  I’m not a fan of that, since the forecasting system intentionally sets the slope, and we, as analysts, shouldn’t be in the business of trying to normalize that portion of a forecaster’s data.

***

When I’m going to evaluate the forecasts for the project, I’m going to use:
PA * (forecastOPSdifferential minus actualOPSdifferential)
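
A sketch of how that measure might be computed, under two assumptions that are not stated above: the per-player error is taken as an absolute value, and the PA-weighted errors are divided by total PA to get an average. "Differential" here means OPS minus the relevant population mean. All numbers and names in the example are invented.

```python
def pa_weighted_error(players, forecast_pop_mean, actual_pop_mean):
    """Average of PA * |forecast differential - actual differential|, per PA.

    Assumption: absolute errors, normalized by total PA.
    """
    total_error = 0.0
    total_pa = 0
    for player in players:
        forecast_diff = player["forecast_ops"] - forecast_pop_mean
        actual_diff = player["actual_ops"] - actual_pop_mean
        total_error += player["pa"] * abs(forecast_diff - actual_diff)
        total_pa += player["pa"]
    return total_error / total_pa


# Hypothetical players, for illustration only.
players = [
    {"pa": 600, "forecast_ops": 0.900, "actual_ops": 0.850},
    {"pa": 300, "forecast_ops": 0.700, "actual_ops": 0.760},
]

# The forecaster's stated population mean and the actual league mean (both assumed).
error = pa_weighted_error(players, forecast_pop_mean=0.760, actual_pop_mean=0.750)
print(f"PA-weighted average error: {error:.3f}")
```

Using each forecaster's own population mean keeps the intercept fixed, and comparing differentials directly keeps the slope at 1, in line with the earlier comment about RMSE versus correlation.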


#40    HarryAbles      (see all posts) 2007/03/16 (Fri) @ 10:22

Don’t know what has been done already, but I’ve got CHONE, ZiPS, Shandler, BJ/BIS, and PECOTA lined up and averaged for everybody (~2500 total players, ~800 listed by all five systems.) Let me know if someone could use it.


#41    tangotiger      (see all posts) 2007/03/16 (Fri) @ 11:37

Oooo.... that’d be great… saves me the trouble of doing it myself.  Please email me at:
tangotiger
at
yahoo
dot
com

Preferably, something like:
name, playerid, chone, marcel, pecota, shandler…
would be best.

If you don’t have playerid, that’s ok.
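
As a sketch of what could be done with a file in that layout, the snippet below reads the columns, averages whatever systems are present for each player, and falls back to the name when playerid is blank. The sample rows, IDs, and values are invented, and `unified_forecasts` is a hypothetical helper.

```python
import csv
import io
from statistics import mean

# Invented sample in the requested column layout; blanks mean "no forecast".
SAMPLE = """name,playerid,chone,marcel,pecota,shandler
Player A,playa01,0.810,0.800,0.820,0.830
Player B,,0.700,0.720,,0.710
"""

ID_COLUMNS = ("name", "playerid")


def unified_forecasts(csv_text):
    """Map each player (by playerid, else name) to the mean of available forecasts."""
    result = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        key = row["playerid"] or row["name"]
        values = [float(value) for column, value in row.items()
                  if column not in ID_COLUMNS and value]
        if values:  # skip players with no forecasts at all
            result[key] = mean(values)
    return result


for player, ops in unified_forecasts(SAMPLE).items():
    print(f"{player}: unified OPS {ops:.3f}")
```

Matching on name alone is fragile when two players share a name, which is why an ID column is preferred when it is available.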


#42    bedir than average      (see all posts) 2007/03/17 (Sat) @ 15:34

I can’t wait to see what you do with that tango

