THE BOOK--Playing The Percentages In Baseball

Wednesday, May 04, 2011

Overcast, with a chance of good hitting

By Tangotiger, 01:50 PM

Study:

Earned runs allowed by home pitchers were lowest on clear days at 3.93, climbing to 4.26 on cloudy days. For visiting pitchers the ERA was 4.50 in the clear and 4.68 under the clouds.
...
The analysis is based on statistics from 10,758 major league day games obtained from STATS LLC and weather data collected by the National Climatic Data Center, showing the conditions at the nearest National Weather Service office to each stadium at game time. The findings are published in the current issue of the journal Weather, Climate and Society.

Kent said he had expected to see better hitting in cloudy conditions but was surprised by how strong the effect was on strikeouts. Home pitchers averaged 6.65 strikeouts on clear days, but in cloudy conditions that fell to 6.22. For visiting pitchers, the drop from clear to cloudy was from 6.14 to 5.67.
...
On clear days, home teams won 56 percent of their games...When it was cloudy, that fell to 52 percent home wins…
...
Home teams had 0.98 home runs per game for clear days and 0.96 when it was cloudy. For visitors, the change from clear to cloudy was from 0.95 to 1.01. Home pitchers gave up 3.37 walks on clear days and 3.43 when it was cloudy. Visiting hurlers averaged 3.56 walks for clear days and 3.50 under clouds.

Anyone who’s seen an outfielder lose a ball in the sun won’t be surprised to hear there are more errors on clear days than cloudy ones.

The difference is largest for visiting teams who would not be accustomed to the glare and light angles in someone else’s stadium. Visiting teams averaged 0.80 errors on clear days and 0.73 on cloudy days. For home teams the decline was from 0.77 on clear days to 0.75 as it got cloudier.

If someone finds the paper, please post below.


#1    Elwin    2011/05/04 (Wed) @ 14:57

Looks like it is in the January 2011 issue of Weather, Climate, and Society.

Here is a link to the abstract.


#2    2011/05/04 (Wed) @ 15:14

Hope to see it on Scribd soon.


#3    MGL    2011/05/04 (Wed) @ 23:46

Wow, that is fascinating!  I can and will duplicate the study, as I have STATS and Project Scoresheet “clouds” data for games going back at least 5 or 10 years…


#4    MGL    2011/05/05 (Thu) @ 00:56

I wonder if they distinguished between earned runs and runs allowed.  If there are more errors on clear days, that would actually lower earned run average while raising RA.

My guess is that most or all of the differences are because of this:

On sunny days, many stadiums have some kind of shadow situation at certain times of the year and certain times of the day, which makes it really difficult for hitters to hit and which would greatly increase K’s and decrease all other offensive events.  On cloudy days there is no such thing of course, since there are essentially no shadows.  Looking at innings and times would tease that out.  And of course you would also expect the home team to be more familiar with their kinds of shadows, especially since some stadiums have shadows and others don’t, depending on the architecture of the ballpark…

As I said, I’ll see if I can look at some of this stuff as it fascinates me…


#5    MGL    2011/05/05 (Thu) @ 01:04

In the article referenced, they found that night games had greater run scoring than sunny day games and less than cloudy day games. If you compare night to day games, you must control for the pool of offensive players.  I have found that if you do, overall, day games score more than night games, presumably because it is warmer.  The reason that day and night games have about the same level of scoring (which they do) when you don’t control for the pool of batters is that day games have worse lineups, because first-string players, especially catchers, tend to get rested in day games after night games…


#6    MGL    2011/05/05 (Thu) @ 07:03

There are lots of things you would have to control for of course.  You would have to control for precipitation.  For example, if some of the “cloudy” days had precipitation before or during the game, run scoring would be higher because run scoring is higher on wet fields.

You would obviously have to control for the park. If all your cloudy days and sunny days had two different pools of parks, which they likely do, then the numbers will mean nothing of course without that control.

You might even have to control for the pool of pitchers and batters, as well as the IP per game for the starting pitchers and batters, within the day games (cloudy or sunny).  Managers might tend to rest players on hot, sunny days rather than cool, cloudy ones.  They also might select certain pitchers based on whether it is a sunny or cloudy day, or yank their starter earlier (and end up with a worse long reliever) on a hot, sunny day.  And they might tend to take out their starters (batters) at the end of a blowout on a hot, sunny day.

You also have to control for temperature and month.  Maybe sunny days happen more on colder days or maybe certain months have more sunny days than other months (for example, April has higher run scoring than May, even though it is colder).

I sure hope that the authors controlled for these things.  I wrote one of them and asked him if he could share the full study with me.

In any case, I ran some preliminary data.  I did not control for anything.  I simply looked at all games that were not under a dome.  I split them into 3 categories, night games, day games that were sunny, and day games that were listed as cloudy or overcast (by retrosheet or Project Scoresheet).  I don’t know the difference between cloudy and overcast.  Maybe cloudy is lots of clouds and overcast is complete cloud cover.

I also eliminated all teams with retractable domes since they would tend to close the roof when it is cloudy and when it is hot (in HOU and ARI).

Here are the results for 98-10:

                    N games   Av temp   rpg    vrpg   hrpg   hwp
Night games          16,106    74.1     9.54   4.67   4.87   .544
Day games             7,834    72.6     9.65   4.72   4.93   .548   (Ed note: rpg originally showed 9.55)
Day/sunny games       3,222    74.1     9.65   4.73   4.92   .544
Day/cloudy games      4,337    71.5     9.64   4.70   4.94   .552

So, according to this data, and with no (requisite) controls whatsoever, the runs per game in the cloudy games are less than that of the sunny days, which conflicts with the above study as far as we can tell, however…

Given the temperature difference between the cloudy and sunny days, it looks like slightly more runs are scored on cloudy days after controlling for temperature (around .014 marginal rpg are scored per 1 degree change in temperature), although I am pretty sure that the difference is not even close to being statistically significant (I don’t know the standard error for rpg off the top of my head).
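The temperature adjustment described here can be sketched in a few lines of Python. This is only a rough illustration: it uses the uncontrolled sunny and cloudy figures from the table above together with the .014 rpg per degree rule of thumb, and it assumes the temperature effect is linear. The function and variable names are made up for the example.

```python
RPG_PER_DEGREE = 0.014  # marginal rpg per 1 degree, as quoted in the comment


def adjust_rpg(rpg, temp, target_temp, slope=RPG_PER_DEGREE):
    """Restate rpg as if the games were played at target_temp (linear assumption)."""
    return rpg + slope * (target_temp - temp)


# Uncontrolled day-game figures from the table (rpg, average temperature)
sunny_rpg, sunny_temp = 9.65, 74.1
cloudy_rpg, cloudy_temp = 9.64, 71.5

# Put the cloudy games on the same temperature footing as the sunny games
cloudy_at_sunny_temp = adjust_rpg(cloudy_rpg, cloudy_temp, sunny_temp)
gap = cloudy_at_sunny_temp - sunny_rpg

print(f"cloudy rpg restated at {sunny_temp} degrees: {cloudy_at_sunny_temp:.3f}")
print(f"cloudy minus sunny after adjustment: {gap:+.3f}")
```

Under these assumptions the cloudy games come out a few hundredths of a run ahead once temperature is equalized, which is the direction described above, though nothing here speaks to statistical significance.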

As far as home field advantage, it appears that my data is completely the opposite of their data!  I get that the HFA is much greater on cloudy days than on sunny days, although the difference is less than 1 standard error.  They get the opposite.
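For anyone who wants to check the "less than 1 standard error" remark, here is a minimal Python sketch. It treats each game as an independent win/loss trial (an assumption, not something established in the thread) and uses the uncontrolled sunny and cloudy home winning percentages and game counts from the table; the helper name is hypothetical.

```python
import math


def win_pct_se(p, n):
    """Binomial standard error of a winning percentage p over n games."""
    return math.sqrt(p * (1 - p) / n)


# Uncontrolled day-game figures from the table (home win pct, games)
sunny_hwp, sunny_n = 0.544, 3222
cloudy_hwp, cloudy_n = 0.552, 4337

# Standard error of the difference between two independent proportions
se_diff = math.sqrt(win_pct_se(sunny_hwp, sunny_n) ** 2
                    + win_pct_se(cloudy_hwp, cloudy_n) ** 2)
diff = cloudy_hwp - sunny_hwp
z = diff / se_diff

print(f"difference: {diff:.3f}, SE of difference: {se_diff:.4f}, ratio: {z:.2f}")
```

With these inputs the 8-point gap works out to roughly 0.7 standard errors, consistent with the remark that it is less than 1.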

Seriously, do we ever get results consistent with these non-subject matter experts?

Further data from me will be forthcoming…


#7    Peter Jensen    2011/05/05 (Thu) @ 07:33

MGL - Why are your runs per game in all day games (9.55 in 7834 games) so much lower than the totals for runs per game for cloudy and clear day games (9.64 in 7559 games)?  That would make the run scoring in the missing 275 games 6.96 per game.  That can’t be right, can it?


#8    2011/05/05 (Thu) @ 10:19

MGL - Since you have the data right now, can you publish which parks have higher K and BB rates in the day (possible shadow between P and C) than at night?


#9    MGL    2011/05/05 (Thu) @ 13:59

Peter, good question. I’ll check later.  Those extra games would be day games that were not coded as either cloudy or sunny.

Jeff, I’ll have things like that when I break down the data further and control for all the things I mentioned in my last post.

As I said, I think this is fascinating stuff, and as usual, you really need a subject matter expert to do (or to assist with) this kind of research…


#10    2011/05/05 (Thu) @ 15:00

Here’s a link to the article: http://www.sabometrics.com/cloudcover.pdf

I haven’t had a chance to read it yet but I’m hoping to later this evening.


#11    Tangotiger    2011/05/05 (Thu) @ 15:32

By the way, those guys are using reliable weather data, while Retrosheet simply has, well, whatever the stringer thinks, and whatever he happened to record (accurately or not).

Ideally, someone would have linked Wunderground historical data, and those guys seem to have done something like that.


#12    MGL    2011/05/05 (Thu) @ 17:15

As far as the weather data, while they are using reliable weather information, it is from the nearest airport.  The stringers are at least at the game or watching on video. How hard can it be to tell whether the sky is clear, partly cloudy, or mostly cloudy?  Plus, for this kind of study, I wouldn’t think that accuracy is all that critical and you would not expect much bias, although you never know.

Anyone know where you can get historical weather databases without paying an arm and a leg?

I read the study.  They basically looked at each stadium individually and ran 3 or 4 different regression analyses.  So in essence they did control for park, although I am not sure how they combined the results.  Mainly they just told us in how many parks the relationship between the dependent and independent variables went one way and in how many it went the other way.

Then they told us whether that ratio was statistically significant using the binomial model (for example, if across 25 parks it went one way more than 70% of the time, that would be significant at the 2 sigma level, since one SD is 10%).
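The 10% and 70% figures in that example follow from the standard deviation of a proportion. A minimal Python sketch, assuming each of the 25 parks is an independent coin flip under the null (p = 0.5); the function name is hypothetical:

```python
import math


def proportion_sd(n, p=0.5):
    """Standard deviation of an observed proportion over n independent trials."""
    return math.sqrt(p * (1 - p) / n)


n_parks = 25
sd = proportion_sd(n_parks)      # sqrt(0.25 / 25)
two_sigma_cutoff = 0.5 + 2 * sd  # share of parks needed to reach 2 sigma

print(f"one SD: {sd:.2f}, 2 sigma cutoff: {two_sigma_cutoff:.2f}")
```

With 30 parks instead of 25 the SD shrinks a little (to about 9%), so the cutoff moves slightly below 70%.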

They also talked a lot about whether the differences found by the regressions were significant at each park, but I question that kind of results reporting.  If you are running tests on many parks, say, 25 or 30, can you report the significance of the results for each park independently?  Isn’t that data mining? Don’t you have to adjust your significance levels for the number of parks you looked at?

I don’t know whether they did that or not.  For example, if I looked at 25 parks, it would be likely that at least one had a significant result at the 2 sigma level (2.5% for a one-tail test) by chance alone.
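To put a number on the "by chance alone" point, here is a small Python sketch. It assumes the per-park tests are independent, which real parks sharing teams and seasons may not be. The Bonferroni correction shown is one common way to adjust significance levels for the number of tests; it is offered as an illustration and is not something the study or this thread says was used.

```python
def prob_at_least_one(n_tests, alpha):
    """Chance of one or more false positives across n independent tests."""
    return 1 - (1 - alpha) ** n_tests


n_parks = 25

one_tail = prob_at_least_one(n_parks, 0.025)  # 2 sigma, one-tailed
two_tail = prob_at_least_one(n_parks, 0.05)   # 2 sigma, two-tailed

# Bonferroni: divide the per-test threshold by the number of tests
bonferroni_alpha = 0.025 / n_parks
corrected = prob_at_least_one(n_parks, bonferroni_alpha)

print(f"one-tailed: {one_tail:.2f}, two-tailed: {two_tail:.2f}")
print(f"per-park alpha after Bonferroni: {bonferroni_alpha:.4f} "
      f"(family-wise rate {corrected:.3f})")
```

Under these assumptions the one-tailed chance of at least one spurious "significant" park is a bit under a coin flip, and it rises to roughly 70% if either direction counts.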

Anyway, there is lots of data, presented in a rather confusing manner, and I really have no idea what to make of it overall.

As usual, using regressions (they use regular and binary) makes things very difficult to follow.  I would never do that for a study like this.  I would simply report the results for each category while controlling or adjusting for each of the variables I think could be confounding.

As far as I could tell, they did NOT address some of those variables, like the batter and pitcher pools, the temperature, and precipitation, either before or during the game.

As I said, I will try and run some data later today.  I have a feeling I am not going to find much…


#13    Tangotiger    2011/05/05 (Thu) @ 17:20

MGL, there are also transcription errors.

As for weather data, here it is, for free:
http://tangotiger.net/sabrmatt/

Courtesy of Wunderground and Matt.

I think someone started mapping them to Retro games, but I’m not sure.  They mentioned some airports were missing.


#14    MGL    2011/05/05 (Thu) @ 17:54

Wow, thanks for the database!  I’m not sure what you mean by transcription errors, but I wouldn’t think there would be much bias in them, and how many errors (percentage-wise) can there be?  We can say that about anything.  And the retrosheet files are pretty darn meticulous, from my experience.

I have no issues doing a study like this using retrosheet weather data. None whatsoever…


#15    tangotiger    2011/05/05 (Thu) @ 18:49

In terms of transcription errors, there was a game listed at 10 degrees in Texas.  As it turns out, it should have been 100 degrees.

You will get sloppy stuff like that because they don’t check it against something else, unlike, say, total HR hit in a game.

In any case, since you are going to debunk the other study, the other researcher can simply claim he was true to his source of data (which may be more reliable), and therefore, can legitimately say that both studies are valid.

Obviously, he should release ALL his data.  Why is this not standard practice?  I’ll even host it on my site. I mean, Sakes/Hauer gave me their data, I posted it, and it was pretty easy to see where they went wrong once I went through the data.


#16    Nathaniel Dawson    2011/05/05 (Thu) @ 19:55

Peter:

Day games

7834, 72.6, 9.55, 4.72, 4.93, .548

4.72 plus 4.93 = 9.65

I wondered the same thing when I first looked at it, but it’s just a typo.


#17    Tangotiger    2011/05/05 (Thu) @ 20:00

I’ll fix MGL/6 to reflect that.


#18    MGL    2011/05/05 (Thu) @ 21:23

One of the authors, Prof. Sheridan, was kind enough to send me a link to their weather data!

FWIW, if I find a ridiculous looking number, like 10 degrees in July in Texas, I discard it.  If an occasional 78 gets transcribed as 87, I don’t think it is going to change anything.

For the record, I have not debunked their study at all, at least until I control for park, which seems to be the only thing they controlled for.

Maybe someone with more statistical acumen than I can explain (in English) exactly what they did in the study.  Oh how I hate regression when it is not necessary…


#19    MGL    2011/05/05 (Thu) @ 21:29

Sorry, I did have a typo. Actually the total number is 9.64 and not 9.65 for all the day games.  If you add the hscore and vscore, it is 9.65 only because of rounding…


#20    MGL    2011/05/06 (Fri) @ 02:41

OK, I controlled for park by using the delta method for temperature, total runs scored, home team runs scored, visitor runs scored, and home winning percentage.

For example, here is what I did:

If in park 1 the sunny days had 9.5 rpg in 50 games and the cloudy days had 9.0 rpg in 30 games, for a total of 80 games at 9.3125 rpg, I took 9.5 minus 9.3125 weighted by (times) 50 games, which is +9.375.  That is the “delta” for the sunny games in that park.  Let’s say that for the cloudy games in that park, the delta is 9.0 minus 9.3125 times 30 (games), which is -9.375.

For park 2, I did the same thing.  Let’s say that their sunny delta was +6.0 in 20 games and their cloudy delta was -6.0 in 30 games (within a park, the game-weighted deltas always sum to zero).

So the average sunny difference (from the average day game), controlling for park (by using the delta method), is (9.375 + 6.0) / (50 + 20).  And the average cloudy difference is (-9.375 - 6.0) / (30 + 30).

No need for regressions!
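The delta method walked through above can be sketched in Python as follows. This is an illustration, not MGL's actual code. Park 1 uses the numbers from the example; the raw park 2 inputs (10.0 rpg sunny, 9.5 rpg cloudy) are assumptions chosen only so that the sunny delta comes out to +6.0 in 20 games. Because each park's deltas are measured against that park's own game-weighted average, they sum to zero within a park, so park 2's cloudy delta in this sketch is -6.0.

```python
def park_deltas(parks):
    """Average per-game difference from each park's own day-game average.

    parks maps park name -> {condition: (rpg, games)}.
    Returns {condition: sum of game-weighted deltas / total games}.
    """
    delta_sum = {}
    games_sum = {}
    for splits in parks.values():
        total_games = sum(games for _, games in splits.values())
        park_avg = sum(rpg * games for rpg, games in splits.values()) / total_games
        for condition, (rpg, games) in splits.items():
            # Delta for this condition in this park, weighted by games played
            delta = (rpg - park_avg) * games
            delta_sum[condition] = delta_sum.get(condition, 0.0) + delta
            games_sum[condition] = games_sum.get(condition, 0) + games
    return {c: delta_sum[c] / games_sum[c] for c in delta_sum}


parks = {
    "park 1": {"sunny": (9.5, 50), "cloudy": (9.0, 30)},
    # Hypothetical raw inputs, picked to give a sunny delta of +6.0
    "park 2": {"sunny": (10.0, 20), "cloudy": (9.5, 30)},
}

result = park_deltas(parks)
for condition, avg_delta in result.items():
    print(f"{condition}: {avg_delta:+.3f} rpg vs. park average")
```

Since every delta is taken relative to the park where the game was played, a condition that happens to occur more often in hitter-friendly parks does not get credit for the park itself.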

Anyway, here is the real data, 1998 to 2010 all games that were not played under a dome:

                    rpg     vrpg    hrpg    hwp    Av temp
night games         9.530   4.663   4.867   .545   73.6
day games           9.652   4.730   4.922   .547   73.6
sunny day games     9.726   4.786   4.941   .542   75.8
cloudy day games    9.593   4.686   4.907   .552   71.9

So we have quite different data from their study.  The game time temperature in day and night games is the same, but in day games .122 more rpg are scored despite presumably having worse lineups.  I assume that is because it is easier for the batter to see in day games, more balls might get lost in the sun than in the lights, and it gets warmer as the game goes on.

For day games, there is more scoring in sunny conditions than cloudy ones.  However, that is pretty much consistent with the warmer temperatures, almost 4 degrees.  Plus there may be a sun effect on fly balls (and sometimes throws to first base).

Again, in the study, they say that the HFA is higher on sunny days than on cloudy days, presumably because the home team knows how to play the “glare” from the sun better than the visiting team.  I get exactly the opposite result - on cloudy days, the HFA is 10 points higher.  I have no idea why that might be.  It is probably just a random fluctuation, as probably is their opposite result.

I doubt that the integrity of the data has much influence on the result, as I already discussed…


#21    Parrish    2011/07/30 (Sat) @ 18:51

If you are trying to determine the actual impact of cloud cover vs sun, why use runs at all? I’d concentrate on component stats like K% per PA, BB% per PA, or, for balls lost in the sun, maybe (error+hits)/flyball%. But wind gust speed would probably be the most important factor there, not sun. I’ve also seen (it seems) just as many balls lost in the lights as lost in the sun.

