THE BOOK--Playing The Percentages In Baseball


Monday, February 18, 2008

I Could Use Some Help With Some Data

By , 03:02 AM

A few months ago, I wrote an article for THT in which I looked at essentially two things: one, how speed, as measured by a Bill James-like speed rating, correlates with defense, as measured by UZR (a lot, especially in the OF, as would be expected); and two, whether fast players do comparatively better than slow players in large outfields, as CW assumes.


In the article, here is what I wrote about the latter issue:

Average difference in UZR per 150 between a player in a small park and all other parks

Slow players: +5.0
Fast players: +.9

Average difference in UZR per 150 between a player in a large park and all other parks

Slow players: +3.3
Fast players: +7.6

As you can see, fast players do indeed have a much bigger advantage in large parks and slower players have an advantage in small parks, just as conventional wisdom would suggest. In a small park, a fast player basically achieves the same UZR score (1 run better) as when he plays in any other park. For a slow player, his UZR score improves by five runs, or four runs more than the fast player.

In a large park, while both types of players “improve” over all other parks, the slow player improves only by a little more than three runs, while the fast player improves by almost eight runs, or four runs more than the slow player.

Now, because of sample size issues and the potential problems described above (possible scorer bias in large and small parks, etc.), it is not an ironclad conclusion that a team with a large outfield can benefit from fast outfielders and that a team with a small outfield can “hide” poorer (slower) defenders, but the data seem to suggest that that is the case.

There are two things that added a lot of noise to my data, which I have wanted to correct ever since I did the research for that article. One, I wanted to add more years to increase my sample size (I only used 07 for the article), and two, I wanted to break down the OF into 3 sections and treat each section separately. In the original analysis, if an entire OF was a “large” one, I assumed that LF, CF, and RF were all large and lumped all outfielders, RF, CF, and LF, into the “large” bucket. In reality, of course, a park may have a small LF, a large RF, an average CF, or any other combination, although there obviously will be some correlation between the sizes of the outfield sections in each park.

Anyway, I corrected those two things. One, I compiled data from 04 to 07. Two, I broke each park down into 3 equal-sized sections (30 degrees each), measured the square footage of each section, and then lumped things into 9 possible buckets: small LF, small CF, small RF, average LF, etc.
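For anyone following along, the bucketing step can be sketched roughly as below. This is only an illustration, not the program used for the study: the cutoff of 1,500 square feet from average is the one mentioned later in the comments, while the function name, the park labels, and the areas are made up for the example.

```python
from statistics import mean


def classify_sections(areas_sq_ft, cutoff_sq_ft=1500.0):
    """Label one outfield section (say LF) in every park as small/average/large.

    areas_sq_ft maps a park code to the measured area of that section.
    A section is "large" or "small" only if it is at least cutoff_sq_ft
    above or below the average of all parks; otherwise it is "average".
    """
    avg = mean(areas_sq_ft.values())
    labels = {}
    for park, area in areas_sq_ft.items():
        if area - avg >= cutoff_sq_ft:
            labels[park] = "large"
        elif avg - area >= cutoff_sq_ft:
            labels[park] = "small"
        else:
            labels[park] = "average"
    return labels


# Hypothetical LF areas in square feet, for illustration only.
example_lf = {
    "PARK_A": 31000.0,
    "PARK_B": 33500.0,
    "PARK_C": 34000.0,
    "PARK_D": 36500.0,
}
print(classify_sections(example_lf))
```

Running the same function once each for LF, CF, and RF would give the 9 buckets described above.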

Here is what I got (all parks not listed, which are the majority, are around average):

LF
small: BOS, CLE, HOU, PHI, TOR
large: TEX, ANA, COL, KCA, MIL, WAS, PIT

CF
small: ANA, BOS, MIL, CHN
large: DET, COL, KCA, HOU, WAS, TBA

RF
small: BAL, PHI, MIN, MIL, HOU, PIT, SEA, CHA
large: ARI, DET, COL, BOS, KCA, WAS, CHN

I again left COL out of all the data collection, as well as all COL players in other parks.  Data from COL and from COL players seems to screw up just about everything.

BTW, Petco Park, despite all the posturing from management, media, and fans, does NOT have an overly large OF, and did not even before the fences were moved in slightly in 2006. It is slightly above average in size, but the reason for the low HR factor is that an average-sized park at sea level with relatively cool temperatures will generally play like a pitcher’s park. Plus, HR factors are mostly determined by the distance in the alleys (where most HR’s are hit) and of course by the height of the walls, and not just by the area of the OF (distance to the fences).

Anyway, I computed the UZR of all fielders in large parks, small parks, average parks, and of course, all parks.  For a fielder playing RF, a “large” park was a park that had a large RF, etc.  So, for example, “Griffey in a large park” was Griffey when he played RF in a park with a large RF and when he played CF in a park with a large CF.

Just as in the original article, I also computed a speed score for all players based on triples rate, baserunning lwts, SB att. rate, and SB success rate, in 04-07.  The speed score was from 1-5, in tenths of an integer (e.g., 2.7, 3.2).  If a player was 2.5 or less, I called him slow.  If he was 3.5 or more, I called him fast. I did not include the “medium speed” players in most of the study.  So now we have these 8 buckets:

1) slow players in small parks.
2) slow players in average parks.
3) slow players in large parks.
4) slow players in all parks.

5-8 for fast players in small, average, large, and all parks.
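The speed cutoffs and the eight buckets listed above are simple enough to write down in a few lines. The sketch below only covers the classification step; how the 1-5 speed score itself is built from triples rate, baserunning lwts, and SB numbers is not spelled out in this post, so it is taken as given. The function and variable names are made up for the example.

```python
from itertools import product


def speed_bucket(speed_score):
    """Map a 1-5 speed score to the labels used in the post."""
    if speed_score <= 2.5:
        return "slow"
    if speed_score >= 3.5:
        return "fast"
    return "medium"  # left out of most of the study


SPEEDS = ("slow", "fast")
PARK_SIZES = ("small", "average", "large", "all")

# The 8 buckets: every speed group crossed with every park-size group.
BUCKETS = [f"{s} players in {p} parks" for s, p in product(SPEEDS, PARK_SIZES)]

for number, name in enumerate(BUCKETS, start=1):
    print(number, name)
```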

Again, all for 04-07.  BTW, while there were a few changes and new parks in that time period, it does not change the classification of whether a park had a small, large, or average LF, CF, or RF, other than for MON/WAS.  In 04, they played in Olympic and some games in PR.  I accounted for that in compiling the data.

Again, by “small (or large, etc.) parks,” I mean a small RF, CF, or LF, depending upon what position the player was playing when the data is put into one of the buckets.
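To make the position-dependent meaning of “small” and “large” concrete, here is a rough lookup built from the park lists given earlier in this post. It is a sketch under simplifying assumptions: it ignores the special handling of MON/WAS in 04, it treats every unlisted park as average, and the function name and the `None` return for COL are choices made for the example.

```python
# Park lists copied from the post; everything not listed is treated as average.
SECTION_SIZES = {
    "LF": {
        "small": {"BOS", "CLE", "HOU", "PHI", "TOR"},
        "large": {"TEX", "ANA", "COL", "KCA", "MIL", "WAS", "PIT"},
    },
    "CF": {
        "small": {"ANA", "BOS", "MIL", "CHN"},
        "large": {"DET", "COL", "KCA", "HOU", "WAS", "TBA"},
    },
    "RF": {
        "small": {"BAL", "PHI", "MIN", "MIL", "HOU", "PIT", "SEA", "CHA"},
        "large": {"ARI", "DET", "COL", "BOS", "KCA", "WAS", "CHN"},
    },
}


def section_size(park, position):
    """Return the size bucket for the section a fielder is playing in.

    Returns None for COL, which was left out of the data collection.
    """
    if park == "COL":
        return None
    sizes = SECTION_SIZES[position]
    if park in sizes["small"]:
        return "small"
    if park in sizes["large"]:
        return "large"
    return "average"


# The same park can land in different buckets depending on the position.
print(section_size("BOS", "LF"), section_size("BOS", "RF"))
```

So a right fielder and a left fielder playing in the same game at Fenway would have their chances put into different buckets.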

Now, it is simply a matter of looking at “deltas,” like we do with lots of these kinds of studies, to see the average difference between fast players playing in small or large parks, and slow players playing in small or large parks. The hypothesis, at least according to CW, is that fast players will see a “benefit” in large parks, as compared to slow players. For example, let’s say that slow players lose 5 runs in UZR (per 150) going from a small to a large park (they do better in small parks); we would then expect a fast player to lose less than 5 runs. Or, let’s say that fast players do exactly the same in small and large parks; we would then expect that the slow players would lose some runs going from a small to a large park.

To cut to the chase, while in the original research, using only 07 data, and not breaking parks down into LF, CF, and RF, I found CW to be supported by the data, I found just the opposite with the new research using the more robust data.

I don’t have the time to debug all of my programs and methodology. I am putting all of the relevant data here, and I am wondering if someone can duplicate the study using my raw data and see if they come up with the same results. It is pretty simple. All you have to do is the “delta” method: match up a fast player in the “small park” bucket with the same player in the “large park” bucket, take the difference in UZR per 150, and weight that number by (multiply it by) the lesser of the two “number of chances.” Then add these numbers up for all fast players, do the same for all slow players, and see what the “average weighted difference” is. There are two files in which I did exactly that, but I would like someone to check my data from the raw files, independent of these files. If the difference between small and large parks (small minus large) is negative, that means that the group of players does better in large parks, and vice versa. If CW is correct, then small minus large for slow players should be greater than small minus large for fast players. I got just the opposite, as I said. Small minus large was greater for the fast players, suggesting that they do “better” in small parks, as compared to slow players, who do “better” in large parks.
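For anyone attempting the replication, the “delta” method described above can be sketched as follows. This is a minimal illustration and not the original program: it assumes the “average weighted difference” means the weighted sum divided by the sum of the weights, and the tuple layout, function name, and sample numbers are invented for the example (the real spreadsheets will have their own columns).

```python
def weighted_delta(pairs):
    """Weighted average of (small minus large) UZR per 150 over matched pairs.

    Each item in pairs describes one player who appears in both buckets:
    (uzr150_small, chances_small, uzr150_large, chances_large).
    The weight for a pair is the lesser of the two chance counts.
    """
    weighted_sum = 0.0
    total_weight = 0.0
    for uzr_small, n_small, uzr_large, n_large in pairs:
        weight = min(n_small, n_large)
        weighted_sum += weight * (uzr_small - uzr_large)
        total_weight += weight
    if total_weight == 0:
        raise ValueError("no matched pairs with any chances")
    return weighted_sum / total_weight


# Two made-up players, for illustration only.
sample_pairs = [
    (2.0, 100, -1.0, 50),   # weight 50, difference +3.0
    (-4.0, 30, 0.0, 80),    # weight 30, difference -4.0
]
print(weighted_delta(sample_pairs))
```

A negative result means the group does better in large parks; running it once on the slow players and once on the fast players gives the two numbers to compare.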

Here are some of the numbers I came up with (these are all OF’ers of course):

Speed AVG UZR per 150

Slow = -.3.2
Fast = 2.2
All others = -.6

Park sections UZR

small = -.8
large = .6
all = 0.0

Keep in mind that these numbers reflect the home players that tend to play in small, large, and average sections of the OF.  It does NOT mean that small sections of the OF are harder to play in because the average UZR of all players in those sections is -.8 and it is .6 in large sections.  The pool of players in small and large sections is different.

BTW, the average outfield section PF for small, large, and average OF sections is not significantly different. The way that I do my park factors makes it such that, in general, OF sections with high walls have the lowest PF’s (like Fenway), because balls off the wall are counted as “balls in the park” that are not caught. If the darn data told me whether a ball was hit high off a wall, I could treat it the same as a HR, but alas it does not.

Park Speed UZR

small slow -9.7
small fast 3.3
small medium -.5

large slow .4
large fast 2.6
large medium -2.1

all slow -2.9
all fast 2.1
all medium -.5

Here are the important results I got:

If I take all the slow players and compute each player’s difference in UZR per 150 between small and large parks (ignoring average ones), weighted by the minimum of the chances in each matched pair, I get -4.7. That is small minus large park. So that means that the slow players GAIN 4.7 runs per 150 when “going” from a small to a large park.

For the fast players, it is +.6 runs, which means that they LOSE .6 runs when going from a small to a large park. So, as I said, the slower players do a lot better than the faster players in a large park relative to a small one, exactly the opposite of CW and perhaps intuition. The difference is over 5 runs per 150.
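As a quick arithmetic check on the two deltas quoted above, using only the numbers from this post (the variable names are just for the example):

```python
# Small-minus-large weighted deltas in UZR per 150, as reported in the post.
slow_delta = -4.7
fast_delta = 0.6

# CW predicts that slow players lose more (or gain less) than fast players
# when going from a small to a large park, i.e. slow_delta > fast_delta.
cw_supported = slow_delta > fast_delta

# Size of the gap between the two groups.
gap = round(fast_delta - slow_delta, 1)

print(cw_supported, gap)
```

The gap works out to 5.3 runs per 150, in the direction opposite to what CW predicts.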

Can someone run with the data?  Thanks.  Here is the data from which I classified sections of the outfield in each park as small, large, or average.

Here is the rest of the data. BTW, does anyone know if you can link to more than one document in Google docs without having to link to more than one address?

04 UZR in Small OF Sections
http://spreadsheets.google.com/pub?key=p4mB-r5bxU8i17xp_amvfAQ
04 UZR in Large OF Sections
http://spreadsheets.google.com/pub?key=p4mB-r5bxU8icPAiMJC7wXQ
05 UZR in Small OF Sections
http://spreadsheets.google.com/pub?key=p4mB-r5bxU8j5-a8m__FQRw
05 UZR in Large OF Sections
http://spreadsheets.google.com/pub?key=p4mB-r5bxU8j0FzCnGa6nVA
06 UZR in Small OF Sections
http://spreadsheets.google.com/pub?key=p4mB-r5bxU8jWm_ggpFXRLg
06 UZR in Large OF Sections
http://spreadsheets.google.com/pub?key=p4mB-r5bxU8hPCfAQDe0cLQ
07 UZR in Small OF Sections
http://spreadsheets.google.com/pub?key=p4mB-r5bxU8jsWT7xYN6yWg
07 UZR in Large OF Sections
http://spreadsheets.google.com/pub?key=p4mB-r5bxU8hdJR_bB-tZag
Slow Players UZR in “matched” small and large parks
http://spreadsheets.google.com/pub?key=p4mB-r5bxU8hYtAniiOJwBQ
Fast Players UZR in “matched” small and large parks
http://spreadsheets.google.com/pub?key=p4mB-r5bxU8g8kVZUQk9eQw

#1    SirKodiak      (see all posts) 2008/02/18 (Mon) @ 09:16

BTW, does anyone know if you can link to more than one document in Google docs without having to link to more than one address?

One way around this might be that instead of putting each data set in a separate spreadsheet, you could use one spreadsheet and put separate data sets in different ‘sheets’.  I am not sure how much data total is allowed in a spreadsheet though.


#2    SirKodiak      (see all posts) 2008/02/18 (Mon) @ 09:48

Another way is to just make a Document like this:
http://docs.google.com/Doc?id=dgt4h7mx_1fc9jtkch

When you publish it, it asks if you want to publish it to a blog and allows custom blogs, so this might be useful in the future.


#3    Bobby Swift      (see all posts) 2008/02/18 (Mon) @ 15:39

Very interesting, I’ll take a look at it. The only thing that worries me is that OF are not randomly assigned to small or large parks. Teams with large parks might target speedy OF, and teams with small parks might have no problem signing slow, bad OF. I’m not sure how to correct for this selection problem.


#4    MGL      (see all posts) 2008/02/18 (Mon) @ 18:32

Bobby, yes that may be true, and probably is, but it should not matter when doing “matched pairs.” All I am doing is looking at all players who played some games in a small OF and some games in a large OF and accumulating the difference between their UZR in the small OF and the large OF (weighted by the lesser of the two sample sizes).  It should not matter if better or faster players tend to play in large home OF’s or not.  Do you see what I mean?  The overall numbers I gave in the post reflect a possible bias or selection problem but the “delta” numbers shouldn’t.

Tango, you do these “matched pair” studies all the time.  Any chance you can check my numbers from the raw data files (the xxuzrsmall and xxuzrlarge files)?  It should not take that long, I don’t think.


#5    Bryan      (see all posts) 2008/02/19 (Tue) @ 01:44

It might be easy to do but I’m a little CPU inept.  It might be good to just look at the middle players too.  If they follow the same trend you show above then that reinforces it a little.  If the average speed players don’t fall between the extremes then you’ve probably just got too much noise in the system to see anything significant.


#6    MGL      (see all posts) 2008/02/19 (Tue) @ 04:25

Yeah, I’m worried about the noise as it is.


#7    Excalabur      (see all posts) 2008/02/19 (Tue) @ 17:10

I’m kind of surprised at the implication that TOR’s LF counts as “small” but the RF isn’t, given that the SkyDome is symmetrical.  Are MLB parks much more likely to have large RFs rather than LFs in square footage terms?


#8    MGL      (see all posts) 2008/02/19 (Tue) @ 18:35

I don’t have the numbers in front of me (you can add them up from the docs I linked), but yes, I think that the average RF is quite different from the average LF.

Plus, I should have averaged LF and RF for all symmetrical parks, I guess, but I didn’t think of that.  There is definitely measurement error when using the screen tracing program to measure the parks.  Plus I am not 100% sure that all the park diagrams I use are drawn exactly to scale. They come from Clem’s Baseball page.  In it, he claims that the diagrams are very well constructed.

I only classified a section as large or small if it was at least 1500 square feet more or less than average, which corresponds to about an extra 5 feet in distance to the wall. Maybe the problem is that the difference between parks is not all that great. When people talk about large or small outfields, especially the former, you would think that an outfielder would need a horse to get to the wall. But as I just said, the typical large section, as compared to an average one, is 5-10 feet farther to the wall. If there is an effect wrt fast and slow players, maybe it is not enough to make up for the noise (relatively small samples) in the data. With relatively small samples, you can only hope that a suspected (or not-suspected) effect, if it exists, shows up large enough. If it does exist, sometimes it shows up and sometimes it does not (Type II error). Obviously the smaller the sample, the more it tends not to show up (to any degree) even if it exists. Such are the perils of working with sample data.


#9    MGL      (see all posts) 2008/02/19 (Tue) @ 18:50

According to my numbers, the average RF, including Coors, is 34,000 square feet and the average LF is 33,700, so that would tend to make a symmetrical park come out smaller (relatively speaking) in RF and not in LF. But I also have Skydome as 1,000 square feet smaller in LF than RF, which is obviously incorrect, as you point out. One thing I have to do is double check my parks and certainly correct the symmetrical ones to at least make sure that RF and LF are the same. Thanks!


#10    Excalabur      (see all posts) 2008/02/19 (Tue) @ 21:59

Hm. 

Makes me wonder if your area-measuring methodology is a significant error source if the Skydome can come out 1000 sq. ft. different in LF and RF.


#11    MGL      (see all posts) 2008/02/20 (Wed) @ 02:01

Well, it is obviously SOMEWHAT of an error source. I doubt it is that much of an error source, especially since I don’t care exactly how big any area is, as long as I get most of the categorization right, which I am absolutely certain that I did. For that, all you have to do is look at the diagrams of all the parks (to get MOST of them right).

Not to defend the measuring (actually, my son did most of it), but picking out a likely worst case scenario is not an indication of how bad or good the overall situation is. Now, if we randomly chose 3 parks and carefully redid them and found out that I was an average of 1,000 square feet off per park, now THAT would be a problem. Getting a bad error from a measurement that someone pointed out made no sense does not tell us a whole lot about the overall error rate. In fact, I could have predicted a high likelihood of at least one park being significantly screwed up.

In any case, it doesn’t really matter. I’ll redo the measurements and redo the calcs.  I’m pretty sure that it won’t change anything much, but you never know.

I sure would like someone to double check the calcs though.  All the data is there.

