THE BOOK--Playing The Percentages In Baseball


Wednesday, July 21, 2010

Hot Pitchers better than Cold Pitchers, of equal talent

By Tangotiger, 03:13 PM

Bill James study.  The degree to which we observe a difference was pretty big for me:

From the years 2000 to 2009, I identified 504 “matched sets” like this in which two starting pitchers had nearly-identical records, but one was hot and the other was not.  Details:
...
For example:
Randy Johnson as of September 5, 2000, had made 30 starts with a won-lost record of 17-6, 2.45 ERA, 299 strikeouts.
Randy Johnson as of September 7, 2001, had also made 30 starts with the same won-lost record (17-6), same ERA (2.45), but 320 strikeouts.  But he was ten degrees hotter at that time in 2001 than he was in 2000.
...
The “hot” pitchers, in their 504 “next starts”, had a won-lost record of 199-175, an ERA of 4.28, and an average Game Score of 50.62. The “cold” pitchers, in their 504 next starts, had a won-lost record of 177-177, an ERA of 4.74, and an average Game Score of 47.94.

I like the overall idea.  The results seem pretty large.

I think we need better controls.  Pitchers with a 177-177 record (i.e., .500 on 354 decisions) don’t have a 4.74 ERA.  If they did, they would have to come from hitter’s parks in the higher run seasons of the 2000-2009 decade.  In order to do a study like this, you have to eliminate potential sources of bias.  So, runs allowed relative to league average, adjusted for park, would have been preferable.  I don’t know that we even need to look at the W/L record.  There are a few more tweaks that can be done.

Overall, I like it, I like the idea, I like the execution.  It just needs different eyes looking at it from a similar angle.


#1    Guy      (see all posts) 2010/07/21 (Wed) @ 15:44

How are “hot” and “cold” defined?  Does cold mean they’ve recently pitched poorly, or just aren’t “hot?”


#2    Tangotiger      (see all posts) 2010/07/21 (Wed) @ 16:01

It’s basically a running total of Game Scores, with more weight to the latest start.
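
For readers who want to see the idea in concrete form, here is a minimal sketch. The `game_score` function follows Bill James's original Game Score formula. The `hotness` function is only an illustration of "more weight to the latest start": the thread does not give the actual weighting, so the exponential `decay` parameter and the use of a weighted average (rather than a raw running total) are assumptions.

```python
def game_score(outs, strikeouts, hits, earned_runs, unearned_runs, walks):
    """Bill James's original Game Score for a single start."""
    innings_completed = outs // 3
    score = 50
    score += outs                               # +1 per out recorded
    score += 2 * max(0, innings_completed - 4)  # +2 per inning completed after the 4th
    score += strikeouts                         # +1 per strikeout
    score -= 2 * hits                           # -2 per hit allowed
    score -= 4 * earned_runs                    # -4 per earned run
    score -= 2 * unearned_runs                  # -2 per unearned run
    score -= walks                              # -1 per walk
    return score


def hotness(game_scores, decay=0.8):
    """Recency-weighted average of Game Scores (oldest first).

    `decay` is a hypothetical parameter: each start counts `decay` times
    as much as the start that follows it.
    """
    n = len(game_scores)
    weights = [decay ** (n - 1 - i) for i in range(n)]
    return sum(w * g for w, g in zip(weights, game_scores)) / sum(weights)


# A complete-game three-hitter with 10 strikeouts, 1 walk, and no runs.
print(game_score(outs=27, strikeouts=10, hits=3, earned_runs=0, unearned_runs=0, walks=1))
```

Under this sketch, two pitchers with the same season-long average Game Score can have different `hotness` values if one of them has pitched better in his most recent starts.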


#3          (see all posts) 2010/07/21 (Wed) @ 16:15

My first thought was a home/road issue, then days of rest. I’d also want to look at team wins and losses. Not all no decisions are equal, of course.


#4    guy      (see all posts) 2010/07/21 (Wed) @ 16:25

Game score is the answer. It’s basically FIP, with Ks weighted especially high. So you are basically adding info on true talent under the guise of ‘hot’ and ‘cold.’


#5    Tangotiger      (see all posts) 2010/07/21 (Wed) @ 16:31

Guy, you are suggesting that in the illustration above, where Randy Johnson “looks” identical at that point in time, was not in fact identical because one had a better FIP to that point in time.

And, by looking at Game Score (which has FIP components like BB and HR) as the source to decide whether RJ is hot or cold, it really has a talent bias that he didn’t select on to begin with?

Makes sense to me.  So, yeah, what James should have done is selected on average Game Score to that point in time, and then looked at his recent Game Score to decide hot/cold.  Then looked at performance after that.

Good stuff.

Lesson for all of us: when you look at a study, always ask yourself “what could contribute to selection bias?”


#6          (see all posts) 2010/07/21 (Wed) @ 16:42

I agree that there’s selection bias there, as Guy points out.  But is there really enough there for half a run?  That seems huge.

Although I suppose that when one pitcher is “hot” and one is “cold,” you’re comparing two extremes, so maybe half a run is about right.  A pitcher with a 4.00 ERA and few Ks might indeed be half a run worse in talent than a pitcher with 4.00 ERA and a lot of Ks.  Especially when the 4.00 is over less than a full season.

Good catch by Guy, as usual.


#7    MGL      (see all posts) 2010/07/21 (Wed) @ 17:03

As much as I admire Bill, I don’t know why it is, but his studies tend to be terrible, this one not excepted.  Plus, it would be nice if he were to cite similar work such as that presented in The Book.  It would also be nice if he were to give us more data in order to analyze his study, such as the collective stats for each group during the hot/cold streak, season-to-date before the “next start,” the “next start” only, and the end-of-year.

Lesson for all of us: when you look at a study, always ask yourself “what could contribute to selection bias?”

That is a gigantic lesson/issue in these types of studies.  It cannot be emphasized enough.

Honestly, when I see a BJ study, I don’t get too excited.  They are usually goofy and poorly constructed.


#8    studes      (see all posts) 2010/07/21 (Wed) @ 17:43

Game score is the answer. It’s basically FIP, with Ks weighted especially high. So you are basically adding info on true talent under the guise of ‘hot’ and ‘cold.’

Except the comparables are also based on things like strikeouts and walks.  In IP, K’s, BB’s, etc., the matched pairs are quite similar.

If anything, the fact that Game Score emphasizes FIP components reinforces Bill’s point.  Truly “hot” pitchers in the matched pairs study (one of three studies in the same article) are those whose K’s are up and BB’s are down, not those who are lucky on BABIP.


#9    Guy      (see all posts) 2010/07/21 (Wed) @ 18:03

Studes:  I was relying only on Tango’s short summary above.  If the matched pairs are identical (essentially) on Ks and BBs and HRs, then I agree using GS to establish hot/cold may not create a bias.  I’d be interested in knowing the mean GSs for the two groups.  And I’d really like to know their pre-season projections. 

I don’t find the result all that implausible (though .5 runs is a lot), when you consider injury and aging.  If a guy was hurt earlier in season, he will look “hot” now—and in fact be better than his total stats suggest.  If he is hurt now, he will look cold and in fact be an inferior pitcher.  And did James provide ages on the two pools?


#10          (see all posts) 2010/07/21 (Wed) @ 18:10

A note about Game Scores: it includes hits, but not home runs outside of hits and runs allowed. I’m not sure if this changes anything about the study (the Game Score formula is kind of arbitrary), but it is not “basically FIP” by a long shot.


#11    Guy      (see all posts) 2010/07/21 (Wed) @ 18:39

Game Score is also heavily influenced by IP, so “hot” can mean pitching deep into games.  Since total season IP is the same, this could mean “hot” pitchers are relatively healthy compared to earlier in the season, and “cold” pitchers the reverse.


#12    Xeifrank      (see all posts) 2010/07/21 (Wed) @ 19:08

I heavily agree with MGL on this one (from what is described of the study here, site blocked at work).  The selection bias is huge.

I would rather compare pitchers based on which one has been lucky or unlucky (instead of hot vs cold) based on inputs into FIP or xFIP.  Of course doing the proper adjustments for competition and park etc…


#13          (see all posts) 2010/07/21 (Wed) @ 19:27

Wait ... on rereading, Bill DID control for K in the two groups.  So I don’t see where the selective sampling would have a big effect.

Suppose you have two pitchers with the same ERA/W-L/K stats, but one has been hot lately in game score and the other has been cold.  Is there any reason to suggest that the first group would be different in talent than the second group?

You could argue that the first group might have fewer walks, because those weren’t controlled for in ERA/W-L/K.  But they are *partially* controlled for, in ERA.  Same for HR.  So where’s enough selective sampling to create a half run a game?  I don’t see it.

Here’s another suggestion: the ones that are “hot” now were likely colder earlier in the season.  That’s because they’re the same now, so hot now means cold then.  What kind of pitcher is still in the league if he was cold in April?  A pitcher who’s already proved himself—i.e., a better pitcher. 

X and Y are equal mediocre talent, ERA of 5.00.  X gets lucky and goes 4-0 in April.  He plays only to his normal talent in August and is now at 15-15.  He’s still in the league because he’s 15-15.  But he looks cold lately for 15-15.

Y gets cold and goes 0-4 in April.  He moves to the bullpen and is out of the study.

Z is a better pitcher, ERA talent of 4.00.  He’s unlucky in an 0-4 April, but keeps his job because everyone knows he’ll improve.  He then goes 15-11 and winds up 15-15 in August.  He looks “hot” in August, but he’s close to his actual talent, not that different for a normal month from a good pitcher.

X doesn’t get compared to Y.  He gets compared to Z, who’s a better pitcher.  And that’s part of the reason for the effect Bill found.

What do you think?


#14    Tangotiger      (see all posts) 2010/07/21 (Wed) @ 22:24

He controlled for K, but not BB and not HR and not H.


#15          (see all posts) 2010/07/21 (Wed) @ 22:54

BB/HR/H are an issue, but are they that big an issue that they would cause a half run difference? 

That is: suppose you found the players for whom there was the biggest discrepancy between expected game score and expected ERA, both ways.

What would those pitchers’ stats look like?  Suppose you took two of those players, one from each side of the discrepancy, whose expected game score was 50.  They would have to have different talent levels for their expected game scores to be the same.  What would that discrepancy be in expected ERA? 

It would have to be a lot more than 0.5 in order that Bill James’ sample, which is based on only 5 games or so, came out to 0.5.  Wouldn’t it?


#16          (see all posts) 2010/07/21 (Wed) @ 22:58

Maybe the study could be done the following way, although I don’t know if the biases discussed here would go away.

Run a regression using whatever stats you think best. The dependent variable could be what the pitcher does in a particular game, and the independent variables could be his season stats up to that game and then his stats for just the last 3-4 games. Maybe you could only look at starts in the second half of the season and only look at guys who had some minimum number of IP in the first half. Then see how the coefficients differ on your independent variables, the last 3-4 games and the whole season. If being hot matters, it seems like the most recent starts would have a higher coefficient.


#17    Guy      (see all posts) 2010/07/22 (Thu) @ 07:01

Does James report the average game score for the two pools of pitchers (prior to “next game”)?  That would tell us a lot.

There is also a big potential publishing bias here.  How many hot hand studies has Bill done that turned up no interesting result?  We’ll never know....


#18    Tangotiger      (see all posts) 2010/07/22 (Thu) @ 07:09

Actually, in that same article, he did three different studies, of which two turned up nothing.

Indeed, it’s quite common for James to publish something that shows nothing.


#19    Guy      (see all posts) 2010/07/22 (Thu) @ 09:10

OK, good to know. 

I find James’ fascination with the hot hand, clutch hitting, outperforming pythag win%, “pitching to the score,” and similar topics to be quite ironic.  It’s as though he is trying to undo the damage he knows he did to many of the game’s most cherished myths.  He clearly wants there to be more to the game than context-neutral talent and luck.

But the genie isn’t going back in the bottle....


#20    dq      (see all posts) 2010/07/22 (Thu) @ 10:23

I think James got to 95% of the answer with his work, and has been trying to find out if he can answer the last 5%.

If runs created = 780, and the team scored 810, then why?

if the team scored 800 runs and allowed 800 runs, but went 86-76, why?


#21    KY      (see all posts) 2010/07/22 (Thu) @ 10:23

Why are matched-sets necessary when one could run a big multiple-regression?  Seems to me that all the matched-sets do is limit the sample size (500 or so in BJ’s case), and the smaller the sample size, the higher chance there is for extreme results.


#22    Tangotiger      (see all posts) 2010/07/22 (Thu) @ 10:42

Because with a regression, it is cold, and speaks to 5% of the population, and is forgotten two weeks from now.

I mean, we could have run regressions all over The Book, and where would that leave us?

In order to sell anything, you need to put a face on it.


#23    Guy      (see all posts) 2010/07/22 (Thu) @ 10:43

BTW, Gamescore IS a pretty good proxy for pitchers’ non-K talent.  I ran a regression to predict FIP, including K/9 as a predictor variable, and average GS is extremely significant and a good predictor.  I suspect that if you tried to predict pitcher ERA in Y2 using Y1 ERA, Y1 K-rate, and Y1 avg GS, the Y1 GS would have some predictive power.  So I do think a “hot” GS average means the hot pitchers are probably a bit better.  It could certainly explain a lot of the difference in W-L record.

That said, I agree with Phil that the underlying talent difference isn’t likely to be anywhere as large as .5 R/G.  My guess is the remaining gap is some combination of pitcher health, defense or other team/park factors (were team switchers included?), and—maybe—some small hot hand effect.  In likely order of importance: health, uncontrolled talent differences, team, hot hand.


#24    KY      (see all posts) 2010/07/22 (Thu) @ 10:44

sometimes I am a little slow and dimwitted, so I am asking to make sure I am not misinterpreting...is it correct to interpret post #22 as sarcasm?


#25    Guy      (see all posts) 2010/07/22 (Thu) @ 10:46

"I think James got to 95% of the answer with his work, and has been trying to find out if he can answer the last 5%.”

Well, the last 5% (at least) is luck.  What James is trying to do now is attach some kind of “meaning” to the random variance.

It’s his time, of course.  But as a fan of his work, I find it rather disappointing....


#26    KY      (see all posts) 2010/07/22 (Thu) @ 10:50

nevermind my post #24.  I just re-read Chapter 2 in the Book.


#27    MGL      (see all posts) 2010/07/22 (Thu) @ 17:25

Remember, we did find a small but significant hot hand factor for pitchers.  I don’t think it is as large as James’ study suggests it is (it really can’t be that large), but I think it is on the order of .1 or .2 rpg, depending on the magnitude of the hotness or coldness, how recent, etc.


#28          (see all posts) 2010/07/22 (Thu) @ 18:01

It looks like you say it is about .3 runs per game on pages 61 and 62 of The Book. Maybe I am misinterpreting or misunderstanding something.


#29    MGL      (see all posts) 2010/07/22 (Thu) @ 23:09

Whatever it says, that’s what we found.  But it entirely depends on the parameters of the hotness or coldness, so putting a number to it is meaningless, without also stating the magnitude and time element of the “streak”.

Plus, as with any empirical study with limited sample sizes (less than ginormous), we can only estimate the size of the effect with any reasonable level of certainty, right?  As Tango likes to say, “We are 95% (or whatever) certain the effect is non-zero, and our best estimate is Y.”


#30    Tangotiger      (see all posts) 2010/07/23 (Fri) @ 00:11

0.30 runs per game, meaning 0.30 runs per around 39 batters, means .008 runs per PA, or .009 points in wOBA.

It’s pretty acceptable to me that a true .320 pitcher, pitching “hot”, is actually a .311 pitcher.

The perception from fans on the hitter side is that there’s a 20 point wOBA difference in clutch (that a true .320 hitter will hit .340 in clutch).  Accepting that clutch is about half that is entirely fine with me.  That’s like giving a hitter +6 runs to hit total because of his clutchness.

So if James observed a 0.50 difference, I’d accept that the true may be as much as 0.25 (minus whatever bias he didn’t account for).
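
The conversion above can be checked with a few lines of arithmetic. This is only a sketch: the 39 batters per game comes from the comment, while the `woba_scale` of 1.15 (the factor relating wOBA points to runs per PA) is an assumed typical value, not something stated in this thread.

```python
runs_per_game = 0.30     # hot-hand effect size discussed above
batters_per_game = 39    # batters faced per game, as stated in the comment
woba_scale = 1.15        # assumption: typical wOBA-to-runs scale factor

# Runs per plate appearance implied by 0.30 runs per game.
runs_per_pa = runs_per_game / batters_per_game

# Equivalent change in wOBA, under the assumed scale factor.
woba_points = runs_per_pa * woba_scale

print(f"{runs_per_pa:.3f} runs per PA, {woba_points:.3f} wOBA")
```

With these inputs, a true .320 pitcher moves to roughly .311, in line with the figures in the comment.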


#31    MGL      (see all posts) 2010/07/23 (Fri) @ 23:01

I think what is a much more interesting task is trying to determine whether it is a uniform “hot (or cold) hand effect” or whether it is simply due to a pitcher being healthy or not.  Most civilians of course, think it is a “hot hand effect” and that whether a pitcher is healthy or not is a separate thing.  In reality, there probably isn’t a bright line between the two.


#32    james      (see all posts) 2010/07/24 (Sat) @ 12:16

I took a statistical approach to see if the difference between the hot and cold pitchers is just chance.

The W-L difference is not statistically significant in a chi-squared test (p=0.52), although with only 504 pairs of games you have little chance of detecting a small difference. BJ should have estimated the fog first (done a power calculation): you would need 4000 starts to be even 75% likely to detect a difference of 3% in W-L records.

Trying to determine if the ERA difference is significant is a bit more tricky.
I added 10% to the ERA to represent unearned runs and then used the Tango distribution of a pitcher with the E (and U) RA of the cold pitchers to sample 3000 innings (representing the 504 starts) to get the RA/9 of these pitchers. I did this 250 times and worked out the standard deviation of the 250 simulations. Not surprisingly this was significant, with a 2.5 SD difference between the two groups of pitchers, as the SD for 3000 innings is around 0.18 runs. For 162 innings, though, the SD is .80, which was a lot higher than I thought it would be.

I don’t know how you could see if the game scores were significantly different.

Does this approach make sense?

James
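
One way to set up the W-L comparison described in this comment is sketched below. It treats the two records from the post (199-175 for hot pitchers, 177-177 for cold pitchers) as a 2x2 table and runs a Pearson chi-square test using only the standard library. The choices here (no continuity correction, no-decisions ignored) are assumptions, so the p-value it prints need not match the figure quoted in the comment.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # With 1 degree of freedom, the upper-tail probability is erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Rows: hot pitchers, cold pitchers. Columns: wins, losses (from the post).
chi2, p_value = chi_square_2x2(199, 175, 177, 177)
print(f"chi2 = {chi2:.3f}, p = {p_value:.3f}")
```

Whatever the exact setup, the conclusion in the comment holds: the W-L gap alone is well short of conventional significance.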


#33          (see all posts) 2010/07/27 (Tue) @ 12:15

Guy, comment #19 said:

I find James’ fascination with the hot hand, clutch hitting, outperforming pythag win%, “pitching to the score,” and similar topics to be quite ironic.

Guy, do you have a link to James’ article on “pitching to the score”?  I don’t subscribe to James’ site, perhaps I should.


#34    Guy      (see all posts) 2010/07/27 (Tue) @ 12:52

Vic, the article I was thinking of was an analysis he did of Bert Blyleven’s career (who is considered by some to have done a poor job of “pitching to the score”).  It appeared in one of the Hardball Times annuals, I think 2 or 3 years ago.  (Don’t have my copies handy.)


#35          (see all posts) 2010/07/27 (Tue) @ 12:58

Tango #18:

Yeah, I’d almost take that second item to be a swat at runs tests.  And the first item to be a swat at the way he used to execute these studies when he first started writing about baseball.

I haven’t read this article, only Alan Reifman’s summary.  Alan has been given permission by Bill to publish the full article on his site, but at this time he has not done so.

I don’t know if you ever listen to NPR (I get WBUR out of Boston here) but I think that James was heavily influenced by his time with the Red Sox.  Just talking to Theo, and hanging around the batting cages ... I think he started questioning his own math.

His work since that time ... it’s as if it’s from a different author.  I don’t think the guy is through the looking glass or anything, but he may well be banging against it.  Very interesting cat, this recent version of Bill James.

These weighted rolling averages that he appears to be using here in the third study (with some warts of unknown size, as several commenters have pointed out) are very sensitive.

Most importantly, he’s asked the right question to begin with.  “Do hot pitchers usually get better results in their next game?” That, and wonderful writing, have always been his gifts in my opinion.


#36          (see all posts) 2010/07/27 (Tue) @ 13:11

Thanks Guy, I’ll google for that.  Sounds like he didn’t take a global approach. The specific player analyses are more readable, no doubt.  Still, that’s a shame.


#37          (see all posts) 2010/07/27 (Tue) @ 13:14

Phil #13,

That’s a terrific point.  Obvious once stated, but I hadn’t thought of it.  You should forward that to James, methinks.


#38    J. Cross      (see all posts) 2010/07/27 (Tue) @ 13:15

Not surprisingly this was significant with a 2.5 SD difference between the two groups of pitchers. As the SD for 3000 is around 0.18 runs.

James, if I’m understanding this correctly and the SD for each 3000 is 0.18, then the SD for the difference between these two averages should be sqrt(2(.18^2)) or 0.25.

so (4.74-4.28)/0.25 = 1.8, falling just shy of statistically significant.  Anyway, this certainly could have happened by chance but might be borderline unlikely enough to think about alternative theories?
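
The calculation in this comment can be reproduced as follows. This is a sketch that takes the 0.18 SD per group from the earlier simulation at face value and assumes the two group averages are independent and roughly normal. The two-sided p-value is an added step, not something computed in the thread.

```python
import math

sd_each = 0.18                    # SD of RA/9 over ~3000 innings, per the simulation
era_hot, era_cold = 4.28, 4.74    # next-start ERAs from the post

# SD of the difference of two independent averages with equal SDs.
sd_diff = math.sqrt(2 * sd_each ** 2)

# How many SDs apart the two groups are.
z = (era_cold - era_hot) / sd_diff

# Two-sided p-value under a normal approximation.
p_two_sided = math.erfc(z / math.sqrt(2))

print(f"sd_diff = {sd_diff:.2f}, z = {z:.1f}, p = {p_two_sided:.2f}")
```

Using the unrounded SD of the difference gives a z of about 1.8, consistent with the comment.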

