Friday, January 23, 2009
Best OF arms, 2008
John’s annual update.
Xei:
Well, the article presumes that each OF faced an even quality of baserunners and batters. That seems to be fair.
And the fact that the results confirm what our eyes see is another good indication that we don’t need to ascertain the quality of opponents.
Finally, as for your proposal, it only measures one part of the “baserunner kills”, his arm strength and accuracy. You need to measure how long it takes for him to get the ball, and how quickly he can transfer the ball from glove to release.
All in all, of all the fielding articles to criticize in terms of holes or sampling bias, this is one of the last ones you should concern yourself with.
Yes, there are many other things besides strength and accuracy (that timing the ball with an adjustment for accuracy would produce) that go into how “productive” an arm is (which is all we pretty much care about), not the least of which is knowing when to throw to what base. Another factor is throwing low enough to hit the cutoff man so the infielders can decide if and when to let the throw through and risk the batter and trailing runners taking the extra base.
The best way to do this is exactly how the author did this. He can improve the model by incorporating where the ball was hit, what type of hit it was, and how hard it was hit (and the speed of the baserunner).
Lo and behold, that is exactly what I do (other than adjusting for the speed of the base runner which I might add at some point), and my numbers should be up on Fangraphs shortly!
Of course, if you knew the strength and accuracy of each outfielder’s arm, you could use that number to improve the model, just like you can use the pure speed of the outfielder to improve a defensive model like UZR (by simply regressing UZR to a mean of the “population with that speed").
My outfield arm numbers, BTW, are almost exactly the same as these, even with the refinements.
Bottom line, in case anyone wants to know, is that for RF, you are looking at plus or minus 10 runs (2.5 SD I would think), in center, 7 or 8 runs, and in left, 4 or 5 runs. That is for one year only, which includes sample error. For a career or “true talent-wise”, you are probably looking at 5, 3, and 2 runs, for 2-3 SD in true talent level, per season, for RF, CF, and LF.
yay for MGL’s OF arms at FanGraphs…
(all that’s left is baserunning)
...although they need to leave something for THT and other free access places to do. It’s all good.
MGL (or anyone else), just curious, when projecting a player’s “arm performance,” do the same rules of thumb apply as for defensive range? That is, do we need at least three years of data before we can begin to draw conclusions about how good/bad a player’s arm is?
No way! For arms, you’d probably need just half a season. And if they incorporated the Fans Scouting Report, they’d probably need half of that.
Tangotiger, would you mind elaborating a bit on why you think that you would only need half of a season? Looking at John Walsh’s data, it seems that in half of a season many outfielders only get 50-70 opportunities, which is a fairly small sample size in general for a baseball statistic. Why would we be so confident that we could draw reasonable conclusions from that little data?
Similar to needing so little data to establish someone’s K rate skill.
2. Of course how quickly you get to the ball is important, and how long it takes to release the ball etc..., but I thought we were talking about “Best Arms”. If you are talking about “Best Arms”, don’t you have to throw out how long it takes the fielder to get to the ball? I am willing to live with how quickly the player takes in releasing the ball (ie - throwing motion), but not how quickly you get to the ball, because that is now measuring something else completely. Matt Kemp is very fast on his feet and would most likely get a large bonus in this kind of “Best Arm” study due to his foot speed. Adding foot speed is fine, and necessary in an overall study of defense, but it seems out of place for a “Best Arm” study imho.
vr, Xei
Additional things I would control for (but should even out with larger sample sizes) are
1. Score differential
2. Outs
Tango/5,7 - I haven’t done a careful study yet, but I think you need much more than 1/2 season to be confident of the results. These guys are throwing out 10 runners a year, can you establish a K rate after 10 strikes? (The analogy is approximate, but you get the idea.)
Brian/9 - I do control for outs; well, I distinguish between 2 outs/fewer than 2 outs. Runner advancement attempts increase drastically when there are 2 outs. The differences between 0 and 1 out are much less.
Certainly some skills take less time to establish than others, but we can’t really know how well a pitcher throws strikes after 100 pitches, so it still seems to me that the pool of data is a little too small for half a season of data to be enough.
Xei: semantics then. John’s study “Best Arms” clearly includes the entire act of getting and throwing the ball, and not just a “throwing arm” contest. What term do you think would be less ambiguous then?
***
You guys make a good point. It’s clear that you don’t need 400 throws to be equivalent to 400 catches, in terms of reliability. The reason is that there is a high variance in “kill rates” and a low variance in “out rates” (because it’s impossible to survive otherwise).
The question therefore is how many throws would be equivalent. I took a WAG that you need 0.5 seasons in throwing numbers to match 2.0 seasons in catching numbers. I could be wrong.
The reason you need a lot more than 100 pitches from a pitcher than 100 “kill chances” for an OF is because a pitcher MUST be great to begin with. So, the variance among pitchers is superlow. There is no pitcher who is as horrible as Juan Pierre. Almost all pitchers are like Ichiro.
If all you had in the OF was Ichiro and Beltran, etc, then for sure you’d need a lot more kill opps to know how good someone was.
That’s why you don’t need a lot of PA to figure out a K rate… it’s simply not something a player is selected much on (there’s other ways to be successful with a high K rate or low K rate), and so you have high variance. And when the pop has a high variance, you will get high correlation.
Sorry, but I think I’m missing what’s going on the discussion that I sort of started. I guess my question was about projecting for the future. If 1/2 season tells the story, then which half-season do we pick for e.g., Mark Teahen? I don’t know the “splits” for each season, but in 2007 his Runs200 in RF was 5.3, while in 2008 it was -8.3. Did we have the “real” Teahen in 2007, or in 2008, or do we need more data? Maybe I asked the question the wrong way the first time, or this time…
The high variance in kill rates would be important if it were actually due to differences in the skill of the outfielder but it is not likely that it is. Take right fielder kills on singles with a man on first and no runner on 2nd. There were only 29 in 2008. Less than 1 per team. Only 16 were the result of a direct throw from the right fielder and only 8 of those were on the lead runner. That means that the other 8 were on the batter at 2nd and the man on 1st went to third without a throw. So the right fielder would get a no hold and a kill on the same play. The 13 other kills were 4 outs at home after relays from the cutoff, 5 outs at 2nd on the batter after relays from the cutoff, 2 outs at 2nd after an unsuccessful throw to third not cut off, 1 out at home after a rundown between 1st and 2nd, and 1 out at home after an error on the throw by the RF. Clearly, the majority of these kills were not due specifically to a skill by the RF, but to a combination of RF skill, cutoff man skill and bad base running decisions.
12. Probably something along the lines of keeping runners from advancing and/or throwing out runners. Does the fans scouting report say that Matt Kemp has the majors’ strongest or “Best” arm?
vr, Xei
Xeifrank, I understand and respect your point. Kind of hard to separate the actual arm strength and accuracy from the time getting to the ball, with data. With observation you can of course. But you definitely have a good point.
Yes, I forgot to mention it, but of course I treat the number of outs (0,1 or 2) as a separate bucket. While the big gap is with 2 outs as John says, going from first to third (and tagging from 2 to 3) is quite a bit different between 0 and 1 out.
I can run some y-t-y correlations to see how many years of data is “adequate” (devil fingers) although I can’t stand the idea of thinking that after a certain number of years, a sample metric suddenly turns reliable. If 3 years, what about 2.9 years? And if 2.9 years what about 2.8 years? Next thing you know, you are down to .1 years! How about we just use what we have and attach a reliability based on sample size and the spread of talent in the population (as Tango says, because there is a fair amount of spread in talent with “arm,” we will tend to need less data to have the same reliability as with another metric of which the spread of talent is less).
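The idea of attaching a reliability to whatever sample you have, rather than waiting for a magic number of years, can be sketched in a few lines. This is only an illustration under a simple assumed model (observed value = true talent + independent noise, with the noise shrinking as the sample grows); the function name and the spreads used below are hypothetical and do not come from anyone's actual arm data.

```python
def reliability(talent_sd, noise_sd_per_season, seasons):
    """Share of observed variance that is true talent, for a given sample size.

    Assumed model: observed = talent + noise, with independent noise.
    """
    talent_var = talent_sd ** 2
    # Random noise in an average shrinks in proportion to the sample size.
    noise_var = noise_sd_per_season ** 2 / seasons
    return talent_var / (talent_var + noise_var)


# Hypothetical spreads, in runs per season, chosen only for illustration.
wide_talent = reliability(talent_sd=2.0, noise_sd_per_season=2.0, seasons=1.0)
narrow_talent = reliability(talent_sd=1.0, noise_sd_per_season=2.0, seasons=1.0)
tenth_of_season = reliability(talent_sd=2.0, noise_sd_per_season=2.0, seasons=0.1)

print(f"wide talent spread, 1 season:    {wide_talent:.2f}")
print(f"narrow talent spread, 1 season:  {narrow_talent:.2f}")
print(f"wide talent spread, 0.1 seasons: {tenth_of_season:.2f}")
```

Under these made-up inputs, the same amount of data is more reliable when the spread of talent is wider, and reliability falls off smoothly as the sample shrinks instead of dropping at some cutoff.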
MGL/17—I see your point. The problem was in my statistically-ignorant wording. I was/am just looking for what sort of sample size, is, well, I don’t know the word.
Maybe giving a concrete example rather than asking a (misworded) general question will help me get at what I’m trying to understand. Take Teahen’s case, cited above. From his Runs/200 scores in 2007 (5.3) and 2008 (-8.3), what would be the best estimation (leaving aside scouting reports for the moment) of what sort of outfield arm he might “really” have? Certainty isn’t the right word, but we can say, for example, that Endy Chavez is a very good fielder because his UZR/plus-minus/etc. have been very good for so many seasons. How would the reliability of whatever we decided about Mark Teahen’s arm (below average, I assume) after two seasons in the outfield compare with what we know about Endy’s fielding?
DF, well, as usual, take some kind of weighted (by recency) average of whatever data you have, in this case, the 5.3 and the -8.3, and then regress toward zero the appropriate amount. How much to regress, we don’t know yet. I would guess that for two years, you would regress around 30-40%. So, if you weighted the -8.3 1/3 more than the 5.3 (A “4/3” weighting by year), you would have -2.7 as a weighted average. Regress that 35% toward zero and you have around -1.8. That is your estimate of Teahen’s true arm value. That would be the same if a player were -2.7 and -2.7 in 07 and 08, or +20 in 07 and -10 in 08, or any other combination that comes out to a weighted average of -2.7.
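The recipe above (recency-weighted average, then regress toward zero) can be written out as a short sketch. The 4:3 weights and the 35% regression are the guesses given in the comment, not established values, and the function names are just illustrative. Note that a literal 4:3 weighting of 5.3 and -8.3 works out to about -2.5, and about -1.6 after regression, a little different from the rounded figures quoted above.

```python
def weighted_average(values, weights):
    """Weighted mean of the yearly values."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def regress_toward_zero(estimate, regression_fraction):
    """Shrink an estimate toward zero (league average) by the given fraction."""
    return estimate * (1.0 - regression_fraction)


# Teahen's Runs/200 in RF for 2007 and 2008, as cited in the thread.
runs_per_200 = [5.3, -8.3]
# Assumed recency weights: the more recent year counts 1/3 more (3 : 4).
weights = [3, 4]

average = weighted_average(runs_per_200, weights)
estimate = regress_toward_zero(average, regression_fraction=0.35)

print(f"weighted average: {average:.2f}")
print(f"regressed estimate of true arm value: {estimate:.2f}")
```

Any other pair of seasons with the same weighted average and the same number of opportunities would produce the same estimate, which is the point made in the comment.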
People think that if a player’s stats are, for example, -8 and +5 in two successive years, that that is somehow less indicative of “true talent” than if a player’s stats were, say, -2 and -2. That is nonsense. The two are exactly equal. Our minds would like to see players have “consistent” sample stats in any given time periods, but that just ain’t that likely to happen. Most of all, whether they are consistent or inconsistent does not mean or change a thing. It does not mean that we are more certain in one case and less certain in another, given the same sample sizes.
(Caveat: For offense at least, players who are less consistent tend to be a little less reliable in terms of a projection than players who are consistent, for various reasons having to do with injury and the fact that player true talent can in fact change at any point in time, as well as the fact that context often changes and it is difficult to account for and adjust to that context in a perfect manner. I don’t know about defense or arm ratings or what have you. And the effect is not large anyway.)
I like to report the size of the sample needed to get an r=.50
So, for batters it’s around PA=200, pitchers is PA=300, and fielders (range) is BIP=400.
Those correspond to roughly: 50 games, 70 IP, and 120 games, respectively.
The question before us is how many games do we need for arms, to get an r=.50. This way, we can see in this continuum, where it fits.
This is different from PizzaCutter, who used the minimum number of games from the threshold to the maximum games in the sample, to get an r=.7 (which is close to r-squared=.5).
The reason I do not like this at all is that the threshold could be n=min 50 games = 50 to 162 games = average of ???. I don’t know. It could be 60, 100, 120. And if he looks at it at a career level, it’s even harder.
I beg Pizza every time he does this to NOT do the minimum business, and instead do it by a fixed number. (Or at the very least, give the average n that results, so that I can reverse engineer it to get a fixed n myself.)
Eventually, he’ll get the treatment I give Sean Forman for ERA+. But, not yet
MGL/19. Thanks again. That’s what I would have thought. I guess I was just a bit thrown by Tango’s initial comment (#5) about needing only half a season, and let that get the better of me. Your comments re: reliability and sample size in general were very helpful.
just to clarify: I didn’t think that Teahen’s widely disparate results would be less indicative than if they were closer together, rather, my confusion was based on my misunderstanding the context of Tango’s initial comment about needing only half a season.
DF, sure, I meant to say that you did not necessarily imply that. I just thought I would mention that because there are many people who think that a disparity (inconsistency) in yearly numbers somehow means that “there is something wrong” and we can’t estimate a player’s true talent with any reliability. People naturally love it when a player’s OPS is .800, .803, .796. They think, “I am pretty certain this guy is a true .800 hitter. Look, he hits around .800 every year!” But if they see a guy, with even MORE PA (opps) who hits, .720, .900, .760, and .820, even though it averages to around the same thing, they think, “I can’t estimate this guy’s true OPS! Look, he is all over the board. I have no idea what to make of it. This OPS stat must be really bad to get those kinds of wild fluctuations from the same player!” (When people see UZR or some other stat that they don’t understand and/or initially trust, and they see ‘wild’ year to year fluctuations, even from a couple of players, they naturally think there is something wrong with the stat.)
All of that is nonsense. Both guys are functionally equivalent. Our estimates of their true talents, and hence, what we expect from them in the future is pretty much the same, and with the same certainty (actually the second guy with the more opps, probably has a greater certainty).
I have no idea why I asked the arms question in the Kent thread. I must have had this and that thread pulled up at the same time.
Tom, you didn’t really answer my second question (though, I guess you did by saying only 2 runs, which isn’t really significant) in that thread. Seeing how a player like Francoeur can be worth 13 runs with his arm, do you think arms could make enough of a difference to change the notion that RF/LF are interchangeable?
If I look at players who play different positions, I get that the difference between RF and LF is 1 run (in favor of RF of course), the difference between LF and CF is .9 runs (in favor of CF), and RF and CF, .9 runs also (in favor of RF).
So it looks like the difference between RF and LF is 1-2 runs.
Here are the y-t-y correlation coefficients for players who played in back to back years and a min of 100 defensive games in each year:
All positions: r=.420 N=177 Avg number of games per year=136.
LF: r=.227 N=46 Avg number of games per year=134.
CF: r=.463 N=65 Avg number of games per year=136.
RF: r=.500 N=66 Avg number of games per year=137.
So it looks like you regress around 50% for one year for CF and RF, while it takes around 3 1/2 years to regress LF 50%. I am not sure why that is other than perhaps almost no one goes first to third on a single (or second to third on a tag) to left so that there is not that much to distinguish a good arm in left field from a bad arm.
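The step from a one-year correlation to "years needed to regress 50%" can be sketched as follows. It assumes the y-t-y r stands in for the reliability of a single season of roughly 136 games and that reliability grows with sample size as r(n) = n / (n + k), which is a common rule of thumb and not something established in this thread; the function names are hypothetical.

```python
def years_for_half_regression(r_one_year):
    """Seasons of data at which you would regress 50% (reliability of .50).

    Assumes r(n) = n / (n + k); with n = 1 season this gives k = (1 - r) / r.
    """
    return (1.0 - r_one_year) / r_one_year


def reliability_after(n_years, k):
    """Reliability after n_years of data, given the half-regression point k."""
    return n_years / (n_years + k)


# One-year y-t-y correlations reported above (min 100 games in each year).
yty_r = {"LF": 0.227, "CF": 0.463, "RF": 0.500}

for position, r in yty_r.items():
    k = years_for_half_regression(r)
    print(f"{position}: about {k:.1f} seasons to reach r=.50")
```

Under that assumption LF comes out at roughly 3.4 seasons, in line with the "around 3 1/2 years" figure, while CF and RF come out at about one season each.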
More recent data on players who play different outfield positions indicates that the difference between LF and RF is 2.1 runs. LF and CF is .1 runs favoring CF, and CF and RF, 1.9 runs. These add up almost perfectly, so I trust them more than the prior numbers I reported.
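A quick way to see what "add up" means here: the LF-to-CF gap plus the CF-to-RF gap should come out close to the directly measured LF-to-RF gap. The sketch below runs that check on both sets of numbers quoted in this thread; the function name is just illustrative.

```python
def chain_gap(lf_to_cf, cf_to_rf, lf_to_rf):
    """Absolute mismatch between the chained gap (LF->CF->RF) and the direct LF->RF gap."""
    return abs((lf_to_cf + cf_to_rf) - lf_to_rf)


# Earlier numbers: LF->CF .9, CF->RF .9, LF->RF 1 (all in runs).
earlier_mismatch = chain_gap(0.9, 0.9, 1.0)
# More recent numbers: LF->CF .1, CF->RF 1.9, LF->RF 2.1.
recent_mismatch = chain_gap(0.1, 1.9, 2.1)

print(f"earlier numbers mismatch: {earlier_mismatch:.1f} runs")
print(f"recent numbers mismatch:  {recent_mismatch:.1f} runs")
```

The earlier set is off by about 0.8 runs and the recent set by about 0.1 runs, which is the sense in which the newer numbers are more internally consistent.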
I did not control for age, so there could be somewhat of an age bias, which would make these gaps a little smaller than they would be if there were no age bias. By that, I mean that if some players were, for example, moved from RF to LF as their arm got worse, it would make it look as if there were a smaller gap than there was.
I agree with MGL re LF, it had been my interpretation as well that because there is a much smaller variance in the observed performances in LF, it therefore takes more observation to be able to measure the true talent.
Tom, I just picked up on your critique of my usual reliability method. I do use minimum inclusion criteria when I do the quick-and-dirty year-to-year or intra-class correlation stuff, but that’s usually just to establish that I have something legit in my hand and that I’m not chasing after rainbows. Direction before precision.
When I do my more extensive split-half stuff, which has been most of my recent stuff, when you see 500 PA, I’m comparing samples of exactly 500 PA (to another exactly 500 PA sample) for each player.
Pizza: ah, well great!
Seems like a good way to measure the best arms would be to clock the time it took for the ball to leave the fielder’s arm, until the time it arrived to the base he threw the ball to, plus any offset for accuracy, which could be measured by the time it took for the fielder to glove the ball and apply the tag/force out. The way the article measures it is purely by baserunner. That might be a good way to measure baserunning, or base path speed, but best arm?
vr, Xei