Monday, February 18, 2008
I Could Use Some Help With Some Data
A few months ago, I wrote an article for THT in which I looked at essentially two things: one, how speed, as measured by a Bill James-like speed rating, correlates with defense, as measured by UZR (a lot, especially in the OF, as would be expected); and two, whether fast players do comparatively better than slow players in large outfields, as CW assumes.
In the article, here is what I wrote about the latter issue:
Average difference in UZR per 150 between a player in a small park and all other parks
Slow players: +5.0
Fast players: +.9
Average difference in UZR per 150 between a player in a large park and all other parks
Slow players: +3.3
Fast players: +7.6
As you can see, fast players do indeed have a much bigger advantage in large parks and slower players have an advantage in small parks, just as conventional wisdom would suggest. In a small park, a fast player basically achieves the same UZR score (1 run better) as when he plays in any other park. For a slow player, his UZR score improves by five runs, or four runs more than the fast player.
In a large park, while both types of players “improve” over all other parks, the slow player improves only by a little more than three runs, while the fast player improves by almost eight runs, or four runs more than the slow player.
Now, because of sample size issues and the potential problems described above (possible scorer bias in large and small parks, etc.), it is not an ironclad conclusion that a team with a large outfield can benefit from fast outfielders and that a team with a small outfield can “hide” poorer (slower) defenders, but the data seem to suggest that that is the case.
There are two things that added a lot of noise to my data, which I have wanted to correct ever since I did the research for that article. One, I wanted to add more years to increase my sample size (I only used 07 for the article), and two, I wanted to break down the OF into 3 sections and treat each section separately. In the original analysis, if an entire OF was a “large” one, I assumed that LF, CF, and RF were all large and lumped all outfielders, RF, CF, and LF, into the “large” bucket. In reality, of course, parks may have a small LF, large RF, average LF, large CF, or any combination thereof, although there obviously will be some correlation between the size of each outfield section in each park.
Anyway, I corrected those two things. One, I compiled data from 04 to 07. Two, I broke each park down into 3 equal-sized sections (30 degrees each), measured the square footage of each section, and then lumped things into 9 possible buckets: small LF, small CF, small RF, average LF, etc.
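For anyone who wants to check the section sizes, here is a minimal sketch of one way to estimate the square footage of a 30-degree section. It is not necessarily how the numbers in this post were measured. It assumes you have fence distances from home plate sampled at a few angles across the section, and it uses the polar-area formula (one half the integral of r squared over the angle) with the trapezoid rule. The function name and the sample distances are made up for illustration, and the area runs all the way from home plate to the fence, so it includes the infield slice of that wedge.

```python
import math


def section_area(angles_deg, fence_ft):
    """Approximate area (sq ft) of the wedge between home plate and the fence.

    angles_deg: increasing angles (degrees) at which the fence was sampled.
    fence_ft:   distance from home plate to the fence at each angle.
    Applies A = 1/2 * integral of r(theta)^2 d(theta) with the trapezoid rule.
    """
    if len(angles_deg) != len(fence_ft) or len(angles_deg) < 2:
        raise ValueError("need matching lists with at least two samples")
    area = 0.0
    for i in range(len(angles_deg) - 1):
        dtheta = math.radians(angles_deg[i + 1] - angles_deg[i])
        mean_r_squared = (fence_ft[i] ** 2 + fence_ft[i + 1] ** 2) / 2.0
        area += 0.5 * dtheta * mean_r_squared
    return area


# Hypothetical LF section: angles measured from the LF line, distances invented.
lf_area = section_area([0, 10, 20, 30], [330, 350, 370, 385])
print(round(lf_area))
```

Sections could then be sorted by area and the extremes labeled small or large; where to draw those cutoffs is a judgment call that this sketch does not address.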
Here is what I got (all parks not listed, which are the majority, are around average):
LF
small
BOS, CLE, HOU, PHI, TOR
large
TEX, ANA, COL, KCA, MIL, WAS, PIT
CF
small
ANA, BOS, MIL, CHN
large
DET, COL, KCA, HOU, WAS, TBA
RF
small
BAL, PHI, MIN, MIL, HOU, PIT, SEA, CHA
large
ARI, DET, COL, BOS, KCA, WAS, CHN
I again left COL out of all the data collection, as well as all COL players in other parks. Data from COL and from COL players seems to screw up just about everything.
BTW, Petco park, despite all the posturing from management, media, and fans, does NOT have an overly large OF, even before the fences were slightly moved in in 2006. It is slightly above average in size, but the reason for the low HR factor is that an average sized park at sea-level and with relatively cool temperatures, will generally play like a pitcher’s park. Plus, HR factors are mostly determined by the distance in the alleys (where most HR’s are hit) and include the height of the walls of course, and not just the area of the OF (distance to the fences).
Anyway, I computed the UZR of all fielders in large parks, small parks, average parks, and of course, all parks. For a fielder playing RF, a “large” park was a park that had a large RF, etc. So, for example, “Griffey in a large park” was Griffey when he played RF in a park with a large RF and when he played CF in a park with a large CF.
Just as in the original article, I also computed a speed score for all players based on triples rate, baserunning lwts, SB att. rate, and SB success rate, in 04-07. The speed score ran from 1 to 5, in increments of one tenth (e.g., 2.7, 3.2). If a player was 2.5 or less, I called him slow. If he was 3.5 or more, I called him fast. I did not include the “medium speed” players in most of the study. So now we have these 8 buckets:
1) slow players in small parks.
2) slow players in average parks.
3) slow players in large parks.
4) slow players in all parks.
5-8 for fast players in small, average, large, and all parks.
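The speed cutoffs and the buckets can be sketched in a few lines. This only encodes the thresholds stated above (2.5 or less is slow, 3.5 or more is fast); the function names and the tuple layout of a bucket are my own choices for illustration, and the speed score itself is taken as a given input.

```python
def speed_group(score):
    """Classify a 1-5 speed score using the cutoffs described in the post."""
    if score <= 2.5:
        return "slow"
    if score >= 3.5:
        return "fast"
    return "medium"  # left out of most of the study


def bucket(score, section_size):
    """Pair a player's speed group with the size of the OF section he played.

    section_size is expected to be "small", "average", or "large" and refers
    to the LF, CF, or RF section matching the position played.
    """
    return (speed_group(score), section_size)


print(bucket(2.3, "large"))
print(bucket(3.8, "small"))
```

The “all parks” buckets are then just the three size buckets combined for each speed group.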
Again, all for 04-07. BTW, while there were a few changes and new parks in that time period, it does not change the classification of whether a park had a small, large, or average LF, CF, or RF, other than for MON/WAS. In 04, they played in Olympic and some games in PR. I accounted for that in compiling the data.
Again, by “small (or large, etc.) parks,” I mean a small RF, CF, or LF, depending upon what position the player was playing when the data is put into one of the buckets.
Now, it is simply a matter of looking at “deltas,” like we do with lots of these kinds of studies, to see the average difference between fast players playing in small or large parks, and slow players playing in small or large parks. The hypothesis, at least according to CW, is that fast players will see a “benefit” in large parks, as compared to slow players. For example, let’s say that slow players lose 5 runs in UZR (per 150) between a small and large park (they do better in small parks). We would then expect a fast player to lose less than 5 runs. Or, let’s say that fast players do exactly the same in small or large parks. We would then expect the slow players to lose some runs going from a small to a large park.
To cut to the chase: in the original research, using only 07 data and not breaking parks down into LF, CF, and RF, I found CW to be supported by the data. With the new research, using the more robust data, I found just the opposite.
I don’t have the time to debug all of my programs and methodology. I am putting all of the relevant data here, and I am wondering if someone can duplicate the study using my raw data and see if they come up with the same results. It is pretty simple. All you have to do is the “delta” method: match up each player in the “fast player in a small park” bucket with the same player in the “large park” bucket, take the difference in UZR per 150, and weight (multiply) that number by the lesser of the two “number of chances.” Then add these numbers up for all fast players, do the same for all slow players, and see what the “average weighted difference” is. There are two files in which I did exactly that, but I would like someone to check my results from the raw files, independent of these files. If the difference between small and large parks (small minus large) is negative, that means that the group of players does better in large parks, and vice versa. If CW is correct, then small minus large for slow players should be greater than small minus large for fast players. I got just the opposite, as I said. Small minus large was greater for the fast players, suggesting that they do “better” in small parks, as compared to slow players, who do “better” in large parks.
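For anyone attempting the replication, here is a minimal sketch of the delta method as described above. It assumes each matched player has already been reduced to four numbers: UZR per 150 and chances in small sections, and UZR per 150 and chances in large sections. It also assumes that the “average weighted difference” means the sum of the weighted differences divided by the sum of the weights; that normalization is my reading, not something spelled out above. The function name and the sample rows are invented and are not taken from the linked spreadsheets.

```python
def weighted_delta(rows):
    """Weighted average of (small minus large) UZR per 150 over matched players.

    rows: iterable of (uzr150_small, chances_small, uzr150_large, chances_large),
          one tuple per player who appears in both the small and large buckets.
    Each player's difference is weighted by the lesser of his two chance totals.
    """
    weighted_sum = 0.0
    total_weight = 0.0
    for uzr_small, chances_small, uzr_large, chances_large in rows:
        weight = min(chances_small, chances_large)
        if weight <= 0:
            continue  # no usable sample on one side of the match
        weighted_sum += weight * (uzr_small - uzr_large)
        total_weight += weight
    if total_weight == 0:
        raise ValueError("no matched players with chances in both buckets")
    return weighted_sum / total_weight


# Two invented players, for illustration only.
sample_rows = [
    (2.0, 100, 6.0, 50),    # difference -4.0, weight 50
    (-3.0, 30, -1.0, 90),   # difference -2.0, weight 30
]
print(weighted_delta(sample_rows))  # (-200 - 60) / 80 = -3.25
```

Run this once on the slow players and once on the fast players. A negative result means the group does better in large sections, and a positive result means it does better in small ones.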
Here are some of the numbers I came up with (these are all OF’ers of course):
Speed AVG UZR per 150
Slow = -.3.2
Fast = 2.2
All others = -.6
Park sections UZR
small = -.8
large = .6
all = 0.0
Keep in mind that these numbers reflect the home players that tend to play in small, large, and average sections of the OF. It does NOT mean that small sections of the OF are harder to play in because the average UZR of all players in those sections is -.8 and it is .6 in large sections. The pool of players in small and large sections is different.
BTW, the average outfield section PF for small, large, and average OF sections is not significantly different. The way that I do my park factors makes it such that, in general, OF sections with high walls have the lowest PF’s (like Fenway) because balls off the wall are counted as “balls in the park” that are not caught. If the darn data told me whether a ball was hit high off a wall, I could treat it the same as a HR, but alas it does not.
Park Speed UZR
small slow -9.7
small fast 3.3
small medium -.5
large slow .4
large fast 2.6
large medium -2.1
all slow -2.9
all fast 2.1
all medium -.5
Here are the important results I got:
If I take all the slow players and compute each player’s difference between small and large parks (ignoring average ones), weighted by the minimum of the chances in each matched pair, I get -4.7. That is small minus large park. So that means that the slow players GAIN 4.7 runs per 150 when “going” from a small to a large park.
For the fast players, it is +.6 runs, which means that they LOSE .6 runs when going from a small to a large park. So, as I said, the slower players do a lot better than the faster players in a large park relative to a small one, exactly the opposite of CW and perhaps intuition. The difference is over 5 runs per 150.
Can someone run with the data? Thanks. Here is the data from which I classified sections of the outfield in each park as small, large, or average.
Here is the rest of the data. BTW, does anyone know if you can link to more than one document in Google docs without having to link to more than one address?
04 UZR in Small OF Sections
http://spreadsheets.google.com/pub?key=p4mB-r5bxU8i17xp_amvfAQ
04 UZR in Large OF Sections
http://spreadsheets.google.com/pub?key=p4mB-r5bxU8icPAiMJC7wXQ
05 UZR in Small OF Sections
http://spreadsheets.google.com/pub?key=p4mB-r5bxU8j5-a8m__FQRw
05 UZR in Large OF Sections
http://spreadsheets.google.com/pub?key=p4mB-r5bxU8j0FzCnGa6nVA
06 UZR in Small OF Sections
http://spreadsheets.google.com/pub?key=p4mB-r5bxU8jWm_ggpFXRLg
06 UZR in Large OF Sections
http://spreadsheets.google.com/pub?key=p4mB-r5bxU8hPCfAQDe0cLQ
07 UZR in Small OF Sections
http://spreadsheets.google.com/pub?key=p4mB-r5bxU8jsWT7xYN6yWg
07 UZR in Large OF Sections
http://spreadsheets.google.com/pub?key=p4mB-r5bxU8hdJR_bB-tZag
Slow Players UZR in “matched” small and large parks
http://spreadsheets.google.com/pub?key=p4mB-r5bxU8hYtAniiOJwBQ
Fast Players UZR in “matched” small and large parks
http://spreadsheets.google.com/pub?key=p4mB-r5bxU8g8kVZUQk9eQw
As for linking to more than one Google docs document from a single address: instead of putting each data set in a separate spreadsheet, you could use one spreadsheet and put the separate data sets in different “sheets.” I am not sure how much data in total is allowed in a spreadsheet, though.