THE BOOK--Playing The Percentages In Baseball

Monday, May 21, 2007

The fielding system approach I’ve been preaching

By Tangotiger, 01:26 PM

http://stat.wharton.upenn.edu/~stjensen/research/safe.html

Whereas UZR and other systems use discrete zones, this approach uses a continuous function.  It doesn’t look like they use as many parameters as MGL’s UZR (park, GB/FB tendency, base/out, etc.), but nonetheless, they’ve got the basic framework down.  I’m not sure how they convert plays to runs either.

(Hat tip: David A.)


#1    MGL      (see all posts) 2007/05/21 (Mon) @ 15:59

The methodology is brilliant of course.  There are a few things they need to do to make the results more accurate, especially in the short run.  Remember that the main point in making things more granular, as this methodology does, compared to the “discrete function” ones, is to increase accuracy in the short run.  They definitely need to separate the fly balls in play into at least line drives and non-line drives, if not line drives, fly balls and pop-ups (and fliners?), as all of these obviously have very different probability functions.

As well, they need to incorporate other parameters, as Tango mentions, like runners, outs, handedness of batters and/or pitchers, and certainly park effects.  For example, the numbers for Manny cannot be taken seriously without a park adjustment.

Now, most of these additional parameters will even out in the long run, but some of them will not: players who stay on the same team tend to play behind a pitching staff that remains fairly constant, so the handedness of pitchers/batters may be biased across a player’s 4-year career data.  And obviously park effects don’t even out over the long run.

But, as Tango says, the basic methodology is brilliant and correct.

I would also like to see them estimate each fielder’s average positioning on the field as compared to the average fielder, which they should be able to do from the data.  For example, does Jeter in fact cheat up the middle?  Does Edmonds play shallow?  Not that a fielder’s positioning should not be included in his “skill set,” but it would be nice to know it anyway.


#2    Peter Jensen      (see all posts) 2007/05/21 (Mon) @ 23:39

The basic approach seems to be the same as Dewan’s plus/minus system but with curve smoothing and the addition of a translation from plays made to runs.  The calculation of the average fielder’s position is really superfluous to the overall evaluation, as the plus/minus numbers for each individual fielder are still calculated exactly as Dewan calculates them: the difference of the fielder’s plays made at each vector for each speed compared to the average fielder’s plays made at the same vector and speed.  The only change from Dewan is that Dewan uses the raw data for each vector and SAFE has generated a function that creates a smooth curve.  Dewan essentially did his own form of curve smoothing by increasing the angle of each vector slice in the later generations of his system.  Am I missing something?

I don’t like using average run values for hit locations in the outfield as outfielders should be changing their positioning according to the run potential (or win potential) of the base out state.


#3    tangotiger      (see all posts) 2007/05/22 (Tue) @ 00:05

Ideally, you’ll have the “center point” different for each park (applies mostly to OF).  You can also isolate a player’s positioning skill from his range skill.


#4    dcj      (see all posts) 2007/05/22 (Tue) @ 01:14

They go into more detail on how plays are converted into runs on page 31 of this PDF presentation.

Here’s how they do it. Let’s say that for a certain location and velocity for a ground ball in the hole, the league average probabilities are

3B converts it to an out: 0.3
SS converts it to an out: 0.1
No out made: 0.6

In addition, the average run value of a non-out on one of these BIP is +0.5 runs.

They look at all BIP against this team and construct smooth functions for all the fielders. Let’s say that for this type of ball, the probabilities are

3B converts it to an out: 0.4
SS converts it to an out: 0.05
No out made: 0.55

Then the 3B gets a value of (0.4-0.3)*(+0.5) = +0.05 runs for balls of this type, and the SS gets a value of (0.05-0.1)*(+0.5) = -0.025 runs.
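
As a quick check of that arithmetic, here is a minimal Python sketch. The function and variable names are illustrative assumptions, not anything taken from the SAFE authors; only the probabilities and the +0.5 run value come from the example above.

```python
def runs_vs_average(player_out_prob, league_out_prob, non_out_run_value):
    """Runs saved above average on one BIP type, as in the example above."""
    return (player_out_prob - league_out_prob) * non_out_run_value

# League-average and team out probabilities for this type of ground ball.
league = {"3B": 0.30, "SS": 0.10}
team = {"3B": 0.40, "SS": 0.05}
RUN_VALUE_NON_OUT = 0.5  # average run value of a non-out on this BIP type

values = {
    pos: runs_vs_average(team[pos], league[pos], RUN_VALUE_NON_OUT)
    for pos in league
}

for pos, value in values.items():
    print(f"{pos}: {value:+.3f} runs")
```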

In this way they get a function for each fielder, which gives runs saved above average in terms of the type, location and velocity of the BIP. Call this function f(x), where the variable x contains the TLV information.

There is a density function showing the league average probability for different types of BIP. Call this g(x). The integral over all TLV of g(x) should be 1.

The player’s total runs saved per generic BIP is given by the integral over all TLV of f(x)*g(x).
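
To make the integral concrete, here is a rough numerical sketch in Python. Everything in it is an assumption for illustration: x is collapsed to a single made-up dimension on [-1, 1], and f and g are invented shapes, not the SAFE curves. It only shows the mechanics of weighting f(x) by the league density g(x).

```python
def integrate(fn, lo, hi, steps=10_000):
    """Midpoint-rule approximation of the integral of fn over [lo, hi]."""
    width = (hi - lo) / steps
    return sum(fn(lo + (i + 0.5) * width) for i in range(steps)) * width

# Hypothetical one-dimensional stand-in for the TLV variable x.
LO, HI = -1.0, 1.0

def g(x):
    """Made-up league BIP density on [LO, HI], chosen to integrate to 1."""
    return 0.75 * (1.0 - x * x)

def f(x):
    """Made-up runs saved above average for a BIP at x."""
    return 0.02 + 0.05 * x

density_total = integrate(g, LO, HI)                      # should be close to 1
runs_per_bip = integrate(lambda x: f(x) * g(x), LO, HI)   # rate per generic BIP

print(f"integral of g: {density_total:.4f}")
print(f"runs saved per generic BIP: {runs_per_bip:+.4f}")
```

With real data the same sum would run over type, location and velocity together, and f would be the fitted player curve minus the league curve, scaled by run value.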

I don’t know how they get from this rate stat to the counting stats listed on their page. It could be that they just multiply by the total number of BIP while that player was on the field. Or they could do something with the actual distribution of BIP faced by the team, but I doubt it.


#5    dcj      (see all posts) 2007/05/22 (Tue) @ 01:23

The calculation of the average fielder’s position is really superfluous to the overall evaluation, as the plus/minus numbers for each individual fielder are still calculated exactly as Dewan calculates them: the difference of the fielder’s plays made at each vector for each speed compared to the average fielder’s plays made at the same vector and speed.

...

Am I missing something?

Not as far as I can tell. The average starting position is important for them because their smoothed curves have a bump at the peak, representing the starting spot. For infielders, the “going to his right” and “going to his left” best fit curves are calculated separately.

This fails to the extent that fielders position themselves differently for different hitters. It would be better if they at least broke it down by batter handedness, and estimated a fielder’s starting spot from his own data as MGL suggests.


#6    Pizza Cutter      (see all posts) 2007/05/22 (Tue) @ 01:32

MGL already hit on the biggest weaknesses: that they are lumping fliners, liners, and fly balls together, and especially that they don’t take park effects into consideration.

A few additional suggestions: Infielders also catch pop ups, and the occasional liner.  Not a problem with the methodology, just some extra leg-work needed.  I’d also love to hear how they calculated the average run value of an event.  In this case, if they have data on the average run value for different parts of the park, that could be a nice little study unto itself.  IIRC, Baseball Prospectus (or one of them...) broke things down by type of hit earlier in the year.  These guys may have more than they even realize.

But, add my feet to the bandwagon.  That’s some right nice work there.  Any measure that shows the Yankees in last place gets my vote!


#7    dcj      (see all posts) 2007/05/22 (Tue) @ 02:10

A few other issues:

1. It appears that they are not subtracting off the value of the out when they calculate the run value for a particular play. That would be compressing their numbers.

2. I think their system over-rewards “popup hoggers” and the like. This can be seen from the example in post #4.

3. From page 42 of the PDF presentation, it appears they are including HR in their count of BIP. This just emphasizes the need for park factors of some kind. With 4 years of data they could create a separate model for each park, though they’d have to adjust for the effects of that team’s hitters and pitchers.

4. On a similar note, their model for outfielders (see pages 34-35 of the PDF) doesn’t look like it handles the shape of the fences very well.

5. I am just speculating that the way they get a counting stat is by multiplying the rate stat by total number of BIP while that player was in the field. Since the measure is really about (unregressed) true talent rather than actual performance, a counting stat may not even be appropriate. Wouldn’t Dewan’s method be better for this purpose?

The way I see it, the natural form for this stat is as a rate per generic BIP, or that rate multiplied by the average number of BIP a team sees in 162 (or 150) games.


#8    dcj      (see all posts) 2007/05/22 (Tue) @ 02:13

A few additional suggestions: Infielders also catch pop ups, and the occasional liner.

I think they include this. “Below, we give the SAFE values for each infielder, averaged over the 2002-2005 seasons. These values consist predominately of grounder balls-in-play (g-bip) but also include any infield pop-ups or liners.”


#9    Peter Jensen      (see all posts) 2007/05/22 (Tue) @ 02:40

Pizza Cutter - They explicitly say in the article that they have already included pop ups and line drives in their infielder ratings.

MGL and others - Since they are inferring the positioning from the optimum position for making the play, I don’t see how they could generate a starting position for each player.  To try and do so would result in the circular reasoning that a player started from a position that maximized his chances to make all possible hit balls rather than a position that maximized the possibility of making the play on the type of hit ball likely given the particular batter, pitcher, pitch type, base out situation, and game runs that existed for each hit ball.  The latter would be the ideal positioning for a player, which of course might not be the player’s actual positioning.

Given that deciding whether a ball in the air is a fly, liner or fliner is a judgement call by an observer and that we have no evidence that these judgement calls are either repeatable or consistent from park to park in BIS data, it may not add any accuracy to the evaluation to break out the hit types individually.  In any case the resultant run change for any individual for doing so is relatively minor, on the order of a run or 2 a year.

I still don’t see this system as any advancement over Dewan; same data, same strengths, same drawbacks.  That doesn’t mean that I don’t think it is good.  If either Dewan or SAFE was adjusted for park effects it would probably be as good as any system we can get from BIS data.  Although adjusting for batter handedness is important for zone systems (since zones of responsibility would change), I believe it remains to be studied whether it would make any difference to a system like this where every hit ball is evaluated regardless of zone.  Pitcher handedness would affect the number of balls per inning pitched that each fielder would receive and therefore counting stats, but shouldn’t affect the rate stats (runs +- for each fieldable BIP) for each player.


#10    Peter Jensen      (see all posts) 2007/05/22 (Tue) @ 02:45

There is also the problem that Dewan found much of the BIS PBP data from 2002 and 2003 (and some from 2004) unsuitable for his +- analysis and essentially threw it out.  This is not addressed by the authors of SAFE.


#11    dcj      (see all posts) 2007/05/22 (Tue) @ 04:08

Peter, good points. On the question of determining initial positioning, say for an outfielder. For each region we can figure out the fraction of balls falling in that region that were caught. It would have to be broken out by type (fly, liner) and velocity. Then wouldn’t a sort of center of mass calculation do the trick?

This could be done separately for LHB and RHB to come up with two different starting positions, or taken further to incorporate the game situation, though at that point the sample size gets a lot smaller.
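
A bare-bones sketch of that center-of-mass idea in Python, under loud assumptions: the region centers, ball counts and catch counts below are invented, and the weighting (catch fraction per region) is just one plausible reading of the suggestion. In practice it would be run separately by ball type, velocity and batter handedness.

```python
def estimated_start(regions):
    """Catch-rate-weighted centroid of region centers.

    regions: iterable of (x_feet, y_feet, balls_in_region, balls_caught).
    """
    total_weight = 0.0
    sum_x = 0.0
    sum_y = 0.0
    for x, y, balls, caught in regions:
        if balls == 0:
            continue  # an empty region carries no information
        weight = caught / balls  # fraction of balls in this region that were caught
        total_weight += weight
        sum_x += weight * x
        sum_y += weight * y
    if total_weight == 0.0:
        raise ValueError("no catches in any region")
    return sum_x / total_weight, sum_y / total_weight

# Invented data: five regions around a nominal CF spot, fewer catches going back.
regions = [
    (-20.0, 300.0, 10, 9),
    (0.0, 300.0, 10, 10),
    (20.0, 300.0, 10, 9),
    (0.0, 280.0, 10, 8),
    (0.0, 320.0, 10, 6),
]
start_x, start_y = estimated_start(regions)
print(f"estimated start: ({start_x:.1f}, {start_y:.1f})")
```

Because the deep region has the lowest catch rate here, the estimate lands slightly shallower than 300 feet.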

---

On the question of assigning runs per play. In my example from above, the league average probabilities were

p_3B = 0.3
p_SS = 0.1
p_hit = 0.6

and the team’s probabilities were

q_3B = 0.4
q_SS = 0.05
q_hit = 0.55

("Hit" covers all plays on which an out is not recorded.) Suppose the average swing in run value between an out and a hit is 0.8 runs. Then for a BIP of this type, the 3B and SS should get values of

3B_value = (p_hit*q_3B - q_hit*p_3B)*(0.8) = +0.06 runs
SS_value = (p_hit*q_SS - q_hit*p_SS)*(0.8) = -0.02 runs

This is based on the following logic. The average BIP of this type has a run value of 0.4*0.3 + 0.6*-0.5 = -0.18 runs for the defensive team. An out is 0.48 runs above average and a hit is 0.32 runs below average.

For each ball that goes for a hit, the 3B should lose (0.3/0.4)*0.32 runs and the SS should lose (0.1/0.4)*0.32 runs. For each BIP gotten by the 3B or SS, he gets 0.48 runs and the other guy gets zero.

Working from there gives the formula above.
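
Here is a short Python check that the credit-and-blame logic and the one-line formula agree. The function names are mine and purely illustrative; the probabilities and the 0.8-run swing are the ones from the example.

```python
SWING = 0.8  # assumed run swing between an out and a hit

league = {"3B": 0.30, "SS": 0.10}  # p_3B, p_SS
team = {"3B": 0.40, "SS": 0.05}    # q_3B, q_SS

p_hit = 1.0 - sum(league.values())  # 0.60
q_hit = 1.0 - sum(team.values())    # 0.55

def direct_formula(pos):
    """(p_hit*q_pos - q_hit*p_pos) * swing."""
    return (p_hit * team[pos] - q_hit * league[pos]) * SWING

def credit_and_blame(pos):
    """Credit each out at 'runs above average', split the blame for hits."""
    p_out = 1.0 - p_hit
    out_credit = p_hit * SWING         # an out is 0.48 runs above average
    hit_debit = p_out * SWING          # a hit is 0.32 runs below average
    blame_share = league[pos] / p_out  # 0.3/0.4 for the 3B, 0.1/0.4 for the SS
    return team[pos] * out_credit - q_hit * blame_share * hit_debit

for pos in league:
    print(f"{pos}: {direct_formula(pos):+.3f} vs {credit_and_blame(pos):+.3f}")
```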


#12          (see all posts) 2007/05/22 (Tue) @ 04:51

To try and do so would result in the circular reasoning that a player started from a position that maximized his chances to make all possible hit balls rather than a position that maximized the possibility of making the play

Although they assume the player started from the maximum position, that isn’t the correct interpretation. The right interpretation is that the function represents the probability of an outfielder making the play. The position of the CF, say, in this system is irrelevant.

The benefit of a continuous function over discrete micro-zones is that it allows us to get over sample size issues that small zones have (although UZR has big enough zones so that this isn’t a problem).

I agree though that in the current incarnation it is not as good as UZR/ plus-minus


#13    Joe Arthur      (see all posts) 2007/05/22 (Tue) @ 07:17

Another source of information is this interview.

Given the sophistication Shane Jensen’s system shows overall, I’m inclined to assume that he treats line drives together with fly balls for outfielders as an acquiescence to the substantial discrepancies in the BIS data from year to year, rather than because he doesn’t know better. [As far as Peter Jensen’s comment about Dewan “throwing out unsuitable data” - I wonder if the reality is that Dewan had missing observations rather than bad observations. My only evidence for this is an old comment by Pinto in one of his early discussions of PMR - that BIS data was incomplete that year (2003?) because they score from video, and there was no video available for some games.]


#14    Joe Arthur      (see all posts) 2007/05/22 (Tue) @ 08:06

The way I read the system, it sounds like it does the following:
1) aggregates data to create a smoothed function for the average player, based on hypothetical average starting position for the fielder
2) creates a smoothed function for the individual player based on his own data. From the few examples, it seems to pivot around the same hypothetical average starting position. It is not clear to me what impact this assumption would have on the smoothing in the individual player’s function.
3) applies a typical frequency of balls in play to the fielding position and uses that to weight the differences in plays made between the player and a typical fielder.
4) applies a typical weight for bases allowed

I am uncomfortable with this many layers of adjustment. If I’m correct, the SAFE results are a hybrid between actual performance and ability, like a player’s real batting line being reconstructed according to a standard percentage of left-handed and right handed pitching faced.

For the infield, the fielder’s movement has a direction in the model (going to his left vs going to his right). In the outfield though, the model seems to depend simply on “radius” to the ball from the hypothetical starting position, as if it didn’t matter to what extent the outfielder’s movement was lateral, coming in or going back. You’ve got tradeoffs between greater average hangtime on deep balls and the greater difficulty of turning around and running looking over your shoulder. ( I don’t know how those play out.) It would be interesting to see how well the circular smoothed model really fits the actual results. Progress might be possible with a more complicated model of outfielder’s movement.


#15    tangotiger      (see all posts) 2007/05/22 (Tue) @ 08:47

I didn’t get the sense that it was a radius/circle, but rather a continuous function.  An ellipse would be an example.


#16    Peter Jensen      (see all posts) 2007/05/22 (Tue) @ 10:32

Joe - I think you are correct about the data being missing rather than bad.  However, I was quoting Dewan’s representative’s email response to me about why some players’ individual “outs made” numbers in the Fielding Bible were fewer than the number of “outs made” as calculated from Retrosheet in 2003 and 2004.  When the distance/vector information was absent Dewan apparently just ignored that entire play.  Since some players had many more plays missing than other players, both the individual players’ and the composite “average” player numbers were affected.  As I said in the previous post, so far it is unclear how SAFE deals with the missing data in its player evaluations.

John Beamer - I think you are agreeing with everything I said in post #2.


#17    joe arthur      (see all posts) 2007/05/22 (Tue) @ 11:24

Tango, look at pp.34-35 of the pdf dcj linked to in #4 above. The picture shows a circle and the equation uses one radius. An ellipse of course requires 2 radii ...


#18    Mike Green      (see all posts) 2007/05/22 (Tue) @ 12:03

Hmm.  I don’t see this as much of an advance. David Pinto’s charts do the same thing for me in terms of showing a defender’s range over a continuous scale.

If one is seriously trying to put a number to runs saved/allowed by a defender (and that is a monstrously difficult task), one has to do much, much more.  The list:

-park adjustment
-removal of pop-ups
-double play efficiency consideration and apportionment
-consideration of bases gained and lost (particularly for outfielders)

On the latter point, the system favours the outfielder who plays shallow.  The triple over his head is credited the same as the single that falls in front of the outfielder who plays deep.


#19    MGL      (see all posts) 2007/05/22 (Tue) @ 12:42

The reason that the data and ensuing functions should be broken down into LHB/RHB is positioning.

And yes, they should provide a rate stat based on league-wide average BIP or something like that.  After all, we want to use these SAFE results to be able to compare one fielder’s ability with another’s, in a context neutral environment.

In the long run, it does not matter much whether they lump all air balls together, as the % of LD, pop flies, etc., should even out for all fielders, but definitely in the short run, lumping them all together is NOT a good idea.  Imagine an outfielder who gets nothing but line drives his way (and catches almost none of them).  His performance curve is going to be compared to the league-wide fly/line/pop/etc. performance curve and he is going to look like a horrible defender.  I am not sure if they use “speed” of the batted ball for air balls though.  If they do, then I suppose that may be a decent proxy for the height of the ball as well.

I agree that because of the adjustments, UZR and Dewan’s (with park adjustments) are going to generate much better numbers.  However, as soon as they clean up all of the adjustment problems, this method is going to be the most accurate.


#20    tangotiger      (see all posts) 2007/05/22 (Tue) @ 12:44

Joe/17:  there are two functions.  One is how much time a fielder has to get to a spot, and the second is how much time it takes for a batted ball to get to that spot.

I don’t really have a problem making the first one with a constant radius.  That is, if a fielder has 4 seconds to run, how many feet would he cover moving toward 2B, LF, RF, or the fence?  I agree that if you want to model this right, then do it right.  But, is the circle a poor approximation?  I’d think it’d be an excellent one.

When you combine the two variables, and you want to get “% of balls caught by hang time”, I think the two-function model works as a starting point.  Obviously we want to know more.  And they all relate to the fielder positioning (park, handedness of batter, type of spray pattern of batter/pitcher, etc) or time in air (park, GB/FB tendency of batter/pitcher, power hitter, etc).
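
A toy version of the two-function idea in Python, for illustration only. The reaction time, the running speed and the constant-speed assumption are all invented placeholders; the point is just the shape of the comparison between how far the fielder can get and how long the ball is in the air, using the same radius in every direction.

```python
import math

def reachable_radius(time_available, reaction_time=0.5, top_speed=27.0):
    """Feet a fielder can cover in time_available seconds.

    Assumes (hypothetically) a fixed reaction delay and then constant speed,
    with the same radius in every direction: the circle approximation.
    """
    return max(0.0, time_available - reaction_time) * top_speed

def can_catch(start, landing, hang_time):
    """True if the landing spot lies inside the fielder's reachable circle."""
    return math.dist(start, landing) <= reachable_radius(hang_time)

start = (0.0, 300.0)     # assumed starting spot, in feet
landing = (60.0, 340.0)  # about 72 feet away

print(can_catch(start, landing, hang_time=4.0))  # 94.5 ft of range
print(can_catch(start, landing, hang_time=2.5))  # 54.0 ft of range
```

A direction-dependent radius (less range going back than coming in) would replace `reachable_radius` with a function of the angle as well as the time.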

That’s where the focus is.  Where the fielder starts off, where he goes, and how long does it take for him and the ball to get there.  And, you’d like to have a continuous function that handles all the parameters, rather than having a separate equation for every discrete slice or zone.


#21    Pizza Cutter      (see all posts) 2007/05/22 (Tue) @ 13:07

A circular/radial model might have a conceptual problem.  An outfielder will probably have more range in front of him than behind him because he can run at the ball and look at it at the same time.  Going back, he will slow up a bit because he has to run while looking over his shoulder (which throws his weight the wrong way for a split second plus distracts his attention), plus he has to worry about crashing into the wall.  I don’t know if these would actually shake out in the data, but they’d be worth looking at.

If I read the SAFE method for assigning starting fielding positions right, they go from the middle of the area where the fielder made 100% of the plays (or something close).  If there’s more range going forward and less back (and if MLB outfielders adjust for this with their positioning), it could mean that the starting places that have been assigned are actually a few steps closer to the infield than they should be.


#22    tangotiger      (see all posts) 2007/05/22 (Tue) @ 13:46

The way I would model it would be to fit the data.  Just because point x,y has 99.9% balls caught wouldn’t make that the starting point.

What you have to do is presume a starting point, take the actual ball distribution, and figure out if the model out % matches reality.  For example, presume a starting point for the LF that is right on the line, and 250 feet up.  Obviously, we’re going to be pretty far off in the out rates for every x,y spot on the field.  You try it 5 degrees over, at the same distance in feet, and you’ll be off a lot, but a bit less than before.

With this approach, you’ll be able to figure out:
- the fielder’s starting spot
- the fielder’s optimum starting spot

This presumes a constant starting spot, for the parameters in question.  You could have a different starting spot for LHH, RHH, and so, you’d minimize the differences based on those parameters.

It also requires you to create a function of reaction time and acceleration.  And of course, preferably for each fielder, to further minimize the differences.

It all goes back to trying to model what we see:
1. where the fielder is
2. how fast does he get to a point
3. how long does it take a ball to get to that same point
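
The presume-and-compare search can be sketched in a few lines of Python. This is a toy under stated assumptions: the out-probability model (falling linearly with distance) is invented, the grid is in x,y feet rather than degrees and feet, and the "observed" out rates are generated from a known starting spot so the search can be checked. Real data would add hang time, reaction time and acceleration.

```python
import math

def model_out_prob(start, spot, max_range=100.0):
    """Hypothetical out probability: falls linearly with distance from start."""
    return max(0.0, 1.0 - math.dist(start, spot) / max_range)

def fit_error(start, observations):
    """Sum of squared gaps between modeled and observed out rates."""
    return sum(
        (model_out_prob(start, spot) - rate) ** 2 for spot, rate in observations
    )

def best_start(candidates, observations):
    """Presumed starting spot whose modeled out rates best match the data."""
    return min(candidates, key=lambda start: fit_error(start, observations))

# Synthetic observations built from a known start, so the result is checkable.
TRUE_START = (-120.0, 260.0)
spots = [(x, y) for x in range(-200, -39, 20) for y in range(180, 341, 20)]
observations = [(spot, model_out_prob(TRUE_START, spot)) for spot in spots]

# Try every presumed start on a 10-foot grid.
candidates = [(x, y) for x in range(-200, -39, 10) for y in range(180, 341, 10)]
found = best_start(candidates, observations)
print("best-fitting start:", found)
```

Running the same search separately on balls hit by LHH and RHH would give the handedness-specific starting spots described above.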


#23    Peter Jensen      (see all posts) 2007/05/22 (Tue) @ 13:47

The discussion about a fielder’s starting position is silly.  Any good outfielder (and most infielders) changes his position on each pitch depending on whether the pitcher has the count advantage or the batter and what the pitcher is likely to throw and how successful he has been throwing it so far in the game and how successful the hitter is in hitting that pitch, etc.  To try and ascertain a “probable” starting position from aggregate data is not useful information.  If Tango eventually gets his way and all players wear GPS devices or video covers the entire field and the actual starting position is known, then maybe you could learn something useful.  But you can’t divine from the data something that is not present to begin with.

In my opinion knowing the fielder’s actual starting position is not going to be worth the trouble for any analysis that we are going to do.  It might help coaches to improve a player’s positioning, but they are probably pretty good at that already.  It is part of a player’s job to position himself to make the best play to help his team win the game given his knowledge of all the factors about where the ball is likely to go if hit on every pitch, in every situation, and given his knowledge about his own skills and limitations.  The only thing that really counts is whether he makes the play or not, not how he makes it.


#24    Tangotiger      (see all posts) 2007/05/22 (Tue) @ 13:53

The positioning skill and the range skill are two components.  If you have sample data that shows you how you perform in positioning and how you perform on your range, they each need to get regressed.  And they’ll be regressed differently.

A hitter’s job is to create runs, and it’s irrelevant if he does it with HR or singles.  A run is a run.  But, in order to know how true his performance is (how well his actual performance is linked to his true talent level), you need to know how he did it, so you know how much to regress his sample performance.
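
For illustration, here is the usual shrinkage form of regression toward the mean in Python, applied to two components with different regression constants. The constants `k`, the opportunity counts and the observed values are all made up; nothing here is an estimate of how much positioning or range should actually be regressed.

```python
def regress_to_mean(observed, league_mean, n, k):
    """Shrink an observed value toward the league mean.

    n is the number of opportunities observed; k is the (assumed) number of
    league-average opportunities to blend in. Larger k means more regression.
    """
    weight = n / (n + k)
    return league_mean + weight * (observed - league_mean)

# Hypothetical fielder: +10 runs observed in each component over 400 chances.
range_estimate = regress_to_mean(10.0, 0.0, n=400, k=400)         # keeps half
positioning_estimate = regress_to_mean(10.0, 0.0, n=400, k=1200)  # keeps a quarter

print(f"range: {range_estimate:+.1f}, positioning: {positioning_estimate:+.1f}")
```

The same observed number ends up as a different true-talent estimate depending on which component it came from, which is why the split matters.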

***

The additional point about the positioning is that it’ll help with park adjustments.  If LF are playing Fenway a lot closer than other parks, you can still use that data in the continuous function, if you have starting position as one of the parameters.  So, rather than applying a park factor after the fact in whatever form, it is introduced as part of the function.


#25    Peter Jensen      (see all posts) 2007/05/22 (Tue) @ 15:10

You are right and I was wrong.  Positioning skill and range skill are two separate components and can be regressed individually.  I’ll even go further than that and concede that range skill is itself made up of several components; visual acuity, reaction time, acceleration, and top speed.  And positioning skill is probably also at least two components; initial positioning, and play judgement.  But it will probably remain impossible to ever separate out the effects of each of these components.

I do think you will have to wait until you have the real data on actual positioning to do the type of analysis that you want to do rather than using an inferred positioning derived from aggregate data.


#26    dcj      (see all posts) 2007/05/22 (Tue) @ 15:16

A couple quotes from the interview that Joe linked:

When we estimate our smooth curves for each player, we estimate a different curve for liners, fly balls, and grounders.

Thus far, we have only taken into account park geography in a couple of extreme cases, such as the green monster, but we are planning to incorporate all differences in outfield geography into our models in the near future.

So it seems like they do separate fly balls from line drives, and also they have a couple park adjustments already.

Also about the radius thing for outfielders. They have two parameters, radius (from the outfielder’s assumed starting position) and distance from home plate. So the model can and does distinguish between going back and coming forward. In fact this is mentioned in the interview with respect to Erstad.

Finally I want to agree 100% with this from Joe:

If I’m correct, the SAFE results are a hybrid between actual performance and ability, like a player’s real batting line being reconstructed according to a standard percentage of left-handed and right handed pitching faced.

As far as I can tell, the only thing you need to do to make it a measure of pure ability is regression to the mean. I think that should be the goal of the system—if you wanted to measure actual performance, why go to the trouble of constructing the player-specific smoothed curve?


#27    Rally      (see all posts) 2007/05/22 (Tue) @ 15:26

Measuring how much of a fielder’s performance is range vs positioning would be tremendously valuable to a team, far more than measure of how many runs saved vs an average fielder.

You could use it to improve the defense of your own players, and target free agents who have great range but poor positioning, assuming they are willing to listen to their coaches and move where they need to.


#28    Tangotiger      (see all posts) 2007/05/22 (Tue) @ 15:59

Peter/25: At the moment, I’m not sure how well inferring positioning works.  It’s an idea I’ve had for a few years, and I tried it with the BIS data I had (from 2004 I think).  The problem is really the lack of data points.  You may be right that the inferring process won’t work well, but it’s definitely a good challenge to get your teeth into.


#29    Guy      (see all posts) 2007/05/22 (Tue) @ 16:03

I think DCJ/7 raises an important point, which is the FB-hogger issue.  In my view, fielders should be rewarded for catching FBs over the aggregate probability the ball would be caught by SOMEONE, not only in comparison to the likelihood the individual player would make the out.  If I’m following the methodology correctly, when a CF makes an out on a ball that is .4/.5/.1 LF/CF/H, then the CF is +.5 and LF is -.4.  But a more accurate allocation would be CF +.1, LF no impact. (And dcj’s formulas in post 11 would do exactly that, if I’m interpreting them correctly).


#30    Tangotiger      (see all posts) 2007/05/22 (Tue) @ 16:24

I agree that when it comes to overlapping zones, you can’t subtract something from someone, who had no opportunities.

First allocating it on a team level, and then distributing that at the player level, seems the most sensible thing to do.  If it’s a .90 out play, then the out gives you +.10 and the hit gives you -.90.

The +.10 goes to whoever gets it.  The -.90 is distributed based on the percentages of outs made for that point (in Guy’s example, 44.4% to LF and 55.6% to CF).
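
A small Python sketch of that team-first allocation, using Guy’s .4/.5/.1 example. The function name and the dictionary layout are assumptions for illustration, and the credits are in plays, not runs.

```python
def allocate_play(league_out_probs, made_by=None):
    """Plays above average credited to each fielder for one BIP.

    league_out_probs: league-average out probability by fielder for this BIP.
    made_by: the fielder who recorded the out, or None if the ball fell in.
    """
    team_out_prob = sum(league_out_probs.values())
    credit = {pos: 0.0 for pos in league_out_probs}
    if made_by is not None:
        # The whole team-level gain goes to whoever made the play.
        credit[made_by] = 1.0 - team_out_prob
    elif team_out_prob > 0.0:
        # The team-level loss is split by each fielder's share of the outs.
        for pos, prob in league_out_probs.items():
            share = prob / team_out_prob
            credit[pos] = -team_out_prob * share
    return credit

probs = {"LF": 0.4, "CF": 0.5}
on_out = allocate_play(probs, made_by="CF")  # CF gets about +.10, LF gets 0
on_hit = allocate_play(probs)                # LF about -.40, CF about -.50

print(on_out)
print(on_hit)
```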


#31    Tangotiger      (see all posts) 2007/05/22 (Tue) @ 16:26

You’ll still have a ball-hogging issue, just not as pronounced.  A guy who is always called off will never get the chance to pile up the +.10, but will always get parts of the -.90.


#32    Guy      (see all posts) 2007/05/22 (Tue) @ 16:43

I notice that O. Hudson is at the top of his 2B rankings.  I may be misremembering this, but I seem to recall from some old PMR discussions that he appears to be a notable ballhog. 

In the OF, it would be interesting to look at how ratings of extreme players’ teammates change (if at all) when they are in/out of the field.  For example, when Crisp isn’t playing LF do his CFs get “better?” For that matter, there may be anti-ballhogs as well (“ball anorexics”?).  Maybe Griffey’s teammates deliberately take FBs to the alleys, to put less strain on him, when in fact he could make the play if he needed to.  Same could have been true for Bernie towards the end.  I’m not saying this is true, just that it could be.


#33    Shane Jensen      (see all posts) 2007/05/22 (Tue) @ 16:50

Hey all,

Just checking in to say hi, and thank you all for the fantastic feedback/questions on the work we’ve been doing with SAFE.  I can’t hope to answer all the questions that have been posed in one comment, but I will give a couple quick answers to some of the more common questions:

Estimating Fielder Starting Point:

As several commenters have inferred, to estimate the starting point of each position, we used the aggregate bip data (over all players and all parks) to calculate the point on the field with highest probability for making a successful play.  We used this same starting position for our average fielder curve and all of the individual fielder curves.  Thus, a particular player’s above-average performance can be a result of either above-average range or above-average positioning, or both.  As pointed out, it is hard to separate the two effects (positioning vs. range) without having data on the starting position of individual fielders.  The comments about fitting different starting points for batter-handedness and park are great suggestions.

Fly balls and Liners are not actually lumped together:

My SAFE website is a little misleading on this point because I do seem to imply that we lumped liners together in with fly balls. This is not actually the case...we estimated separate models for liners vs. fly balls.  The only thing that was lumped together is the overall run value for each individual player: their overall run value is the sum of their run contribution for fly balls plus their run contribution for liners.  On this same point, I should note that we did fit models for pop-ups for infielders, and their overall run contribution given on the SAFE website is the sum of their run contribution on grounders and their run contribution on pop-ups.  However, it is also worth noting that the run contribution on pop-ups is very slight for all infielders relative to the run contribution for grounders. 

Effect of Park:

As pointed out, this is one of the more obvious weaknesses of the current methodology, and the one that we are currently working on.  There is a bit of tradeoff here, since fitting different bip densities for each park would make each bip density a lot more noisy.  However, it is certainly worth it in at least a few cases, such as the effect of the green monster.  I’m also excited to see if throwing in a turf vs. grass factor into our grounder model makes a difference.

Again, thanks for all the feedback and comments, and feel free to contact me directly if you have additional questions.  SAFE is still a work in progress, and one that I do primarily in what little spare time I have, but I’m always looking for improvements. 

Shane.


#34    Tangotiger      (see all posts) 2007/05/22 (Tue) @ 16:53

It’s always been my hope that someone would take the idea from this article:
http://www.tangotiger.net/catchers.html

And apply it—modified for aging among other possible things—to other player-dependent measures.  The putouts of CF/LF, CF/RF, the assists of SS/3B.  The DP of SS/2B.  The putouts/assists of 1B/P.  The RF/2B/C relay.  Whatever.  I realize that some of these would have sample size issues.


#35    Tangotiger      (see all posts) 2007/05/22 (Tue) @ 16:59

Shane: if you are not aware, you’ll probably be interested in the “Part 2” article linked here from MGL’s UZR:
http://www.insidethebook.com/ee/index.php/site/comments/mgl_archives/

***

As for the play-to-run conversion, can you go through an example?  We’re specifically interested to see if you treat the run value as .50 or .80 runs.  (And you’ll get an earful if it’s the former.)

As well, for the non-plays made, do you give the same run value, or do you give a run value based on distance from fielder?  For example, if you are playing shallow, then a ball that falls for a hit over your head will be a double in some extra cases. 

This is similar to the 3B and playing the line.  If the 3B plays the line, he turns 2B into outs, but lets the outs turn into hits in the hole.  So, it becomes important to know how much run value you are giving for outs and hits.


#36    Tangotiger      (see all posts) 2007/05/22 (Tue) @ 17:01

For those interested in the .50/.80 run value, you can read about it here:
http://www.tangotiger.net/archives/stud0247.shtml#1011

If you want to discuss this specifically, I’ll open up another blog entry for it.


#37    Shane Jensen      (see all posts) 2007/05/22 (Tue) @ 18:09

Run value is also estimated on a continuous scale.  Every single point in the outfield is assigned its own run consequence, which is calculated by looking at all the bip that landed at that point and weighting the proportion of these bip that are singles/doubles/triples by the run value of a single/double/triple.  As an example, for some point in the power alley, the proportion of bips might be 20% singles, 60% doubles and 20% triples, in which case the average run consequence for that point would be 0.2*run(single) + 0.6*run(double) + 0.2*run(triple).  Now let’s say that there are an average of 40 bip to that point in a season.  If a particular player gives you a 5% higher chance of making a catch at that point, then that player has saved you 40*0.05*[0.2*run(single) + 0.6*run(double) + 0.2*run(triple)] runs. 
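
A minimal sketch of the single-point arithmetic above. The run values per hit type are placeholders (the thread does not say what SAFE actually uses), and the function names are made up for illustration.

```python
# Placeholder run values per hit type. These are assumptions for illustration,
# not the values SAFE uses.
RUN_VALUE = {"single": 0.47, "double": 0.78, "triple": 1.07}


def run_consequence(hit_mix, run_value=RUN_VALUE):
    """Average run cost of a ball that drops at this point.

    hit_mix maps hit type -> share of dropped balls that became that hit.
    """
    return sum(share * run_value[hit] for hit, share in hit_mix.items())


def runs_saved_at_point(bip_per_season, extra_catch_prob, hit_mix):
    """Runs saved at one point: BIP * extra catch probability * run consequence."""
    return bip_per_season * extra_catch_prob * run_consequence(hit_mix)


# The power-alley example: 20% singles, 60% doubles, 20% triples,
# 40 BIP per season, and a fielder who is 5% more likely to make the catch.
power_alley = {"single": 0.2, "double": 0.6, "triple": 0.2}
saved = runs_saved_at_point(40, 0.05, power_alley)
print(round(saved, 3))
```

With these placeholder weights the point is worth about 1.55 runs per season; the full SAFE number would come from integrating this quantity over every reachable point.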

That entire description was for a single point, but in reality, we have to use numerical integration to calculate the run contribution over all points in the field that an outfielder can possibly reach.  Thus, a player who happens to be better than average moving forward will not have a positive run contribution if he is also worse than average moving back, because the run consequence of missed catches behind him is higher. 

The same general idea also applies to grounders in the infield, but in that case, each bip is not represented by a point in the field, but rather an angle from the third base line.  Each angle has its own run consequence that is calculated by again looking at the proportion of missed balls that end up as singles/doubles/triples.  So, grounders down the first base line have a higher run consequence (since many are doubles or triples) than grounders past the shortstop.  This presents an interesting defensive tradeoff: shortstops have to handle many more bip to their position, but the consequence of a miss is lower than a miss by the first or third basemen. 

Shane.


#38    tangotiger      (see all posts) 2007/05/22 (Tue) @ 19:30

For run(single), is it closer to .50 or .80?


#39    Rally      (see all posts) 2007/05/22 (Tue) @ 21:35

In other words, are you looking at just the value of a hit, or the value difference between a hit and an out?


#40    MGL      (see all posts) 2007/05/23 (Wed) @ 01:08

You will find that on grass fields the ball goes through the IF more but that there are more IF hits and vice versa for turf.  However, with more and more turf fields (actually almost all of them) being “next-turf,” that is becoming increasingly not true and you will see little difference between grass and turf.  Not to mention the fact that there are infield grasses that are kept long and those that are kept short.  Plus, at higher elevations, the ball scoots through the IF more quickly than at lower elevations, given the same grass height.

In UZR, if an average distribution is .4/.5/.1, if a play is made, no one gets docked, but the CF gets .5 credit and the LF gets .6 credit.  If a play is not made, then -.9 is allocated between the CF and LF at a ratio of 5 to 4.  I don’t think you can allocate .1 to either the LF or CF when a play is made as everything will not add up correctly.  Plus, let’s say that in a certain zone or at a certain point, the league average is .01/.2/.79.  This is obviously an area that is difficult for anyone to reach and almost impossible for the LF to reach (.01).  If the left fielder makes the play, it is likely a spectacular catch.  If the CF makes the play, it is likely a very good, but not great catch (.2—edited by Tom, was showing .02). You do not want to give them the same credit when one or the other makes the play.


#41    Tangotiger      (see all posts) 2007/05/23 (Wed) @ 07:27

Isn’t it more likely that a .01/.20/.79 situation is a case of positioning?  That rather than a “1 in 100” play for the LF, that this is a case of a LF playing closer to CF, and therefore, if we had positioning data, this would be .10/.15/.75?

So, the .01/.20/.79 would break down as follows:

90% of the time:
.00/.206/.794

10% of the time:
.10/.15/.75

I would say it’s more prudent to give +.79 to the OF first, and then allocate that.

I’d rather everything add up first.  If the Tampa OF is +20 plays, it’s +20 plays. I wouldn’t want to then see that Carl Crawford is +30, while the rest of the OF is +0.


#42    Guy      (see all posts) 2007/05/23 (Wed) @ 09:00

MGL, I don’t see how your method “adds up.” On the .4/.5/.1 example, the fielders (if average) will receive:
LF: .4*.6 + .1*-.9 = +.15
CF: .5*.5 + .1*-.9 = +.16
How can average performance be a plus value?
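
Guy's check can be written out as a short sketch. It assumes a catch earns the fielder (1 - his own league share), as in MGL's example, and evaluates two readings of the penalty on a hit: each fielder charged the full -.9 (Guy's arithmetic), or the -.9 split 5 to 4 between CF and LF (MGL's wording). Variable names are made up for illustration.

```python
# League-average shares for one zone: LF out, CF out, hit.
p_lf, p_cf, p_hit = 0.4, 0.5, 0.1
p_out = p_lf + p_cf  # 0.9

# Credit for making the play, per MGL's example: 1 minus the fielder's own share.
lf_catch_credit = 1 - p_lf  # .6
cf_catch_credit = 1 - p_cf  # .5

# Reading 1 (Guy's arithmetic): each fielder is charged the full -.9 on a hit.
lf_full = p_lf * lf_catch_credit + p_hit * -p_out
cf_full = p_cf * cf_catch_credit + p_hit * -p_out

# Reading 2 (MGL's wording): the -.9 is split 5 to 4 between CF and LF.
lf_split = p_lf * lf_catch_credit + p_hit * -(p_out * 4 / 9)
cf_split = p_cf * cf_catch_credit + p_hit * -(p_out * 5 / 9)

for label, value in [("LF full", lf_full), ("CF full", cf_full),
                     ("LF split", lf_split), ("CF split", cf_split)]:
    print(f"{label}: {value:+.2f}")
```

Under either reading an exactly average fielder comes out positive, which is the point of the objection.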

And I agree with Tango.  If you’re going to assume the CF really didn’t have a .2 chance on this specific ball, then you have to question whether the LF was truly .01.  It’s a dangerous path to go down.  I think you have to apply the best probabilities you have on each ball.


#43    tangotiger      (see all posts) 2007/05/23 (Wed) @ 10:31

The net effect of what I’m saying is the following:

Given that the LF recorded an out, the probabilities were:
.21/.00/.79

Given that the CF recorded an out, the probabilities were:
.00/.21/.79

Given that no out was recorded, the probabilities were:
.01/.20/.79

So, the 1% of the time that the LF records an out, he gets +.79, the CF gets 0.

The 20% of the time that the CF records an out, he gets +.79, the LF gets 0.

In order to get to zero, then the 79% of the time that no outs were recorded, the LF gets -.01 and the CF gets -.20.

***

Does it make sense?  Not necessarily.  You could change the conditional probabilities, so that, given that the LF recorded an out, the probabilities may have been:
.30/.10/.60

And given that the CF recorded an out, the probabilities may have been:
.00/.61/.39

Given that no out was recorded, the probabilities were:
.009/.10/.891

In this case, you give the CF -.10 for every LF out, +.39 for every CF out, and -.10 for every hit.

You give the LF +.70 for every LF out, 0 for every CF out, and -.009 for every hit.

***

Whatever approach is taken, it should all add up.


#44    Shane Jensen      (see all posts) 2007/05/23 (Wed) @ 11:30

I’ll have to check on the values we used for run(single), run(double), etc.  IIRC, we averaged over the number of runs scored for all singles, doubles, triples, etc. in that season.  However, the excellent point remains...should we be punishing the fielder not just with the number of runs that scored, but also the out that was lost?  Now that I am sitting here thinking about it, that makes sense.  With this adjustment, I would expect that the SAFE magnitudes might increase slightly, though the ranks would probably not change.  This should definitely be done, however.  I’ll move it to the top of the to-do list. 

One additional thing worth mentioning is that the averaged values you see on the SAFE website are shrunk towards the population mean to account for differences in playing time between players.  As an example, if Hiram Bocachica has a raw SAFE value of +10, then we are going to shrink Bocachica’s SAFE value down since we have much less data on him to justify a high SAFE value.  In contrast, if Ichiro has a raw SAFE value of +10, he will not be shrunk as much since we have a lot of data behind this value.


#45    Guy      (see all posts) 2007/05/23 (Wed) @ 11:54

Wow, you mean Francoeur is +14 runs even after you regress based on having only one season of data for him?  That’s incredible. You should definitely highlight this aspect of your methodology, as other researchers might make the mistake of regressing your results a 2nd time when trying to assess players’ true talent level.

You should also consider adjusting your run calculation to avoid giving too much credit to players who take an unusual share (+ or -) of discretionary plays—plays that can be made by multiple players—as we’ve been talking about here.  On a routine FB that is caught 97% of the time (by someone), there’s really no reason to give the LF credit for .6 plays just because a LF usually fields that ball 40% of the time. The TEAM—which is what we care about—has only made a gain of .03 plays.


#46    Peter Jensen      (see all posts) 2007/05/23 (Wed) @ 12:04

Shane - Your post #44 raises an interesting question.  When you give the individual fielder’s SAFE values and say that they are “averaged over the 2002-2005 seasons” exactly what do you mean?  Perhaps you didn’t intend the information on your web site to be read as critically as this blog’s audience is reading it, but the form we like to see data presented in is a rate stat like runs per 150 games or runs per 1000 innings or runs per 600 hit balls in area.  And for us the raw data is better than data that has already been adjusted for small sample size.  This way comparisons to other systems are easier.


#47    Tangotiger      (see all posts) 2007/05/23 (Wed) @ 12:19

Shane/44: post 36 shows why you want to use the extra 0.30ish runs for the out.

***

Shane/44: you are talking about regression toward the mean, I suppose?  In that case, you are better off presenting both numbers.  After all, if someone goes 30-70 for the season, he’s presented as a (sample) .429 hitter with 70 AB, not a (true) .283 hitter after regression.  Presenting true numbers is fine, but only if you show how you convert from sample to true.  Otherwise, presenting sample rates, with opportunities, would be preferable.
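
A hedged sketch of one common sample-to-true conversion: add a fixed amount of league-average "ballast" to the sample. The league mean (.270) and ballast (800 AB) below are assumptions chosen only so the sketch lands near the .283 quoted above; the thread does not state the values actually used.

```python
def regress_to_mean(successes, trials, pop_mean, ballast):
    """Regress a sample rate toward pop_mean by adding `ballast` trials
    of league-average performance to the sample."""
    return (successes + pop_mean * ballast) / (trials + ballast)


hits, at_bats = 30, 70
sample_rate = hits / at_bats

# Assumed inputs: .270 league average, 800 AB of ballast.
true_estimate = regress_to_mean(hits, at_bats, pop_mean=0.270, ballast=800)

print(f"sample: {sample_rate:.3f}  regressed: {true_estimate:.3f}")
```

Publishing the sample rate, the opportunities, and the ballast together lets a reader reproduce the regressed figure or swap in a different prior.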

***

Reading posts 45/46 now, and I echo Guy and Peter’s sentiments.


#48    Shane Jensen      (see all posts) 2007/05/23 (Wed) @ 13:57

You’re right...I should make the raw values by year available as well.  I will prepare that for y’all and post it on the website asap.  Thanks again for all the feedback...it’s great to be chatting with people who really want to dig into the methodology!

Related but off the main topic: in response to your comment Tangotiger/34, I have been working on a paper that looks at throwing ability for both catchers and outfielders.  I will post a link to it as soon as it is up on the web (which should be in the next couple of days).


#49    Tangotiger      (see all posts) 2007/05/23 (Wed) @ 14:05

Shane/48: you may be interested in the work of John Walsh which you can find links to from here:
http://www.insidethebook.com/ee/index.php/site/comments/outfield_arms/


#50    MGL      (see all posts) 2007/05/23 (Wed) @ 16:20

I agree with Guy/45 about discretionary plays, with the assumption that plays that have close to 100% catch rates are discretionary.  I have toyed with the idea of removing them completely or at least reducing their weight/impact.  As Guy states, let’s say that a certain ball is .6/.35/.05.  It is clearly a can of corn somewhere.  If a certain LF is .7 rather than the .6, do we really want to give him plus credit and dock the CF because he is only .25?  Maybe a little, I don’t know.  That of course is the problem with popups on the IF.  Most of them have a high catch rate and there might be no reason why you would want to include them in the data.  Pop-ups behind the IF are probably another story, although they are tricky as there is some discretion also.  However, maybe no more discretion (among the OF’ers) than any other fly ball to the OF.  Dewan uses pop flies behind the IF (for the OF’ers) I think but not pop flies on the IF.  I don’t use any pop flies at all for IF’ers.

And I wholeheartedly agree about the regression issue.  Need to present the raw data.  If they/Shane then want to do a regression toward the mean, that’s fine too.  If they do, I’d like to see the methodology for that (the regression equation or whatever).

One of the ways to figure out how to allot credit is to do a computer simulation and “reverse engineer” a UZR or Dewan type methodology.  I think perhaps that with the methodology that Shane is using you don’t have as much of a problem with the allocation as you do with methodologies that use large (or even small) slices of the field.

For example, if an average LF only catches 5% of all balls in a certain sector then if a certain LF catches 7%, it is likely that he got a few more balls than average in the part of the sector that is closest to him, so you don’t want to give him full credit for that extra 2%.  It is also likely that for that extra 2%, he was playing closer to that sector than the average left fielder.  With the continuous function methodology, there are no sectors, so that if a fielder catches more balls at a certain point on the field, he indeed had more range on that ball than the average LF, but…

it is still likely that he was positioned a little bit closer to that point than the average fielder AND it was also probably a little more likely that the ball was a little more catchable in general than the average ball to that spot (maybe it had more hang time).

One of the issues with not giving “full” credit to the extra catches that a fielder makes (or docking him full credit for fewer catches) is, “When do you want to do the regression?” If you don’t give him the full credit for extra catches because it is likely that some of that is the “luck” of the exact location and hang time of the ball, then you are in effect regressing the numbers in an intermediate stage.  You can also give full credit and then regress afterward and accomplish the same thing.  It is better to do it in the intermediate stages because different things may get different regressions.  If you wait until the end, you are regressing everyone the same regardless of the nature of the raw data with respect to that player.  That is why granular data is always better than non-granular.  In the long run, they usually end up with the same result but if you can reduce the regression you have to do after you have crunched all the data, then you will get more accurate results in the short run using the granular data (and essentially doing the regression during intermediate steps).


#51    tangotiger      (see all posts) 2007/05/23 (Wed) @ 16:48

In the MGL/50, para 1: we are suggesting giving the ballcatcher (be it LF or CF) +.05.  We give the hitallower -.60 to LF and -.35 to CF.

Let’s say Carl Crawford catches it 85% of the time, and Baldelli makes it 10% of the time, and a hit falls for 5% of the time.  So, it’s still the same LF+CF out rate as the league average, except Crawford gets them all.  What’s the result?

Crawford: +.05 * .85 - .60 * .05 = +.0125 outs per BIP
Baldelli: +.05 * .10 - .35 * .05 = -.0125 outs per BIP

If you have 80 such discretionary plays, Crawford gets +1 and Baldelli gets -1.  Hardly worth arguing if Crawford should really be 0, or if we can live with +1.
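
The Crawford/Baldelli arithmetic as a small sketch. The function name is made up for illustration; the rule it encodes is the one described above: a catch earns the league hit rate for that ball, and a hit costs each fielder his league share of the outs.

```python
def outs_above_avg_per_bip(catch_rate, hit_rate, league_share, league_hit_rate):
    """Expected credit per BIP for one fielder on a shared ball.

    A catch earns +league_hit_rate; a hit costs -league_share.
    """
    return catch_rate * league_hit_rate - hit_rate * league_share


# League split on this ball: LF .60 / CF .35 / hit .05.
# Tampa split: Crawford (LF) .85 / Baldelli (CF) .10 / hit .05.
crawford = outs_above_avg_per_bip(0.85, 0.05, league_share=0.60, league_hit_rate=0.05)
baldelli = outs_above_avg_per_bip(0.10, 0.05, league_share=0.35, league_hit_rate=0.05)

plays = 80
print(f"Crawford: {crawford:+.4f} per BIP, {crawford * plays:+.1f} over {plays} plays")
print(f"Baldelli: {baldelli:+.4f} per BIP, {baldelli * plays:+.1f} over {plays} plays")
print(f"Outfield total: {(crawford + baldelli) * plays:+.1f}")
```

The two fielders sum to zero, matching the fact that the outfield as a whole converted this ball at exactly the league rate.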

The long and short of it is that this method does reduce the impact greatly.  We’re not throwing out any data, and we treat it rather fairly.


#52    MGL      (see all posts) 2007/05/23 (Wed) @ 22:07

Tango, sounds reasonable.  I might have to redo my UZR methodology and change the way I allocate credit for balls caught and not caught.  I have to think about it some.


#53    MGL      (see all posts) 2007/05/24 (Thu) @ 00:06

Actually, I am pretty sure that I do it your way (Tango), but I will have to check.  I vaguely remember thinking that in zones which are almost 100% caught zones, it did not really matter much who caught it, which is the way it should be.  Plus if you remove those from the data it is not fair on a team level.  If the average catch rate in a zone is 95% and a team catches 98%, they should get credit for it, regardless of whether the balls in that zone were a can of corn or discretionary or not.


#54    MB      (see all posts) 2007/05/24 (Thu) @ 00:39

Hey Guys. Has anyone ever looked into the speed of the batter and its effect on infield defense? We can pinpoint the probability of a play being made, on average, at a certain point on the field, but obviously I would think there are different probabilities with different batters.

For example, a ball hit up the middle at a certain location might have a .5 chance of being made by the shortstop. But with Manny (or someone of his speed) running it’s .7 and with Reyes it’s .28 (or something like that)…

I’m guessing this is one of those things that evens out over time or at least that you would expect to. But what if a guy played in a division with faster runners or randomly got a high number of difficult plays with fast runners? Is this just a reason why you regress the sample data?

Anyway, more great work being done and it’s always fun reading this blog.


#55    Shane Jensen      (see all posts) 2007/05/24 (Thu) @ 11:40

In Shane/48, I mentioned a manuscript that I have been working on for evaluating catcher/outfielder throwing ability.  I now have a link for download:

http://www.arxiv.org/pdf/0705.3257

Thanks for the Tango/49 reference...I’ll definitely check out that work asap.  I’ll also check in over the next several days with some comments on sharing of consequences and other things that we’ve tried to build into SAFE.


#56    tangotiger      (see all posts) 2007/05/24 (Thu) @ 12:40

This is in my archives, but I figure many of you haven’t seen it:

http://www.tangotiger.net/Scott/UZRcorr.html

It was written by Scott Fischthal (I of course misspelled his name in the google search engine, but google was smart enough to find him), and he says:

I think this shows a fairly impressive level of linear independence of UZRs among different positions on the field, especially for outfielders. The only ones that would make me a bit nervous are SS/3B and SS/2B. SS/3B is particularly troublesome in that the correlation is negative, which would imply that putting two strong fielders next to each other at SS and 3B could cause one of the fielder’s UZRs to be suppressed a bit. Or, it could just imply that teams don’t find it necessary to put great defenders at both SS AND 3B. Who knows…

It’s likely that even if we’re having a tough time cracking the sharing-zones nut, its overall effect is rather muted.


#57    Guy      (see all posts) 2007/05/24 (Thu) @ 14:52

Or, MGL is already calculating UZR in a way that minimizes the impact of ballhogs on the ratings. 

It would be interesting to know if something like RF also shows so little correlation in the OF. Although, if there are only a few extreme hogs, this just may not show up in a league-wide regression.


#58    Pizza Cutter      (see all posts) 2007/05/24 (Thu) @ 15:45

Shane/55 I read through your paper.  Nice to see that others are into using hierarchical models!  In your outfielder arm evaluation, I saw that you divided the outfield into sectors.  Does that mean that you also prorated how far away each of those sectors was based on the park in which the game was played?  (Deep center field in Old Tiger stadium is different than deep center in… well, anywhere else...)

Also, with catchers, I’ve done some work that suggests that you need to factor in whether or not a pitcher has thrown over to first, and common sense says that some sort of speed score as a covariate would make sense.  You noted that Pudge only has a very small percentage of runners try for steals against him.  Are they disproportionately the speedsters of the game?


#59    MGL      (see all posts) 2007/05/24 (Thu) @ 16:11

MB/54, excellent point!  I never thought of that. I might add that as an adjustment to UZR.  Yes, those things tend to even out in the long run (although with imbalanced schedules, things don’t even out as much as they used to as players do not face the same pool of opponents anymore).  There are some adjustments that are made to help with short-term data even though things tend to even out over the long-term.  IOW, with more and more data, there is less need for the adjustments.  Then there are adjustments that are made to account for biases that exist even in the long run, like park effects and to some extent handedness of opposing batters (since a player tends to play on the same team and teams tend to have a fairly stable proportion of LH and RH starters, at least in the moderately long-term).


#60    Shane Jensen      (see all posts) 2007/05/28 (Mon) @ 19:32

Hey all,

Just checking in to say that I have posted a link on the SAFE website to the unshrunken year-by-year SAFE values for each player.

http://stat.wharton.upenn.edu/~stjensen/research/safe.html

I have also tried to clarify the text about our methodology in some places that were causing confusion.


#61    MGL      (see all posts) 2007/05/29 (Tue) @ 02:44

I sent this e-mail to Shane, regarding something on the web site:

“Players that show inconsistent SAFE values across years get a resulting average that is shrunk more towards zero than players that show consistent SAFE values across years.”

Why would this be?  I have never heard of anyone doing this when regressing sample results toward a mean.


#62    Shane Jensen      (see all posts) 2007/05/29 (Tue) @ 14:23

It is true that some regression models shrink all individual values towards the mean by the same amount.  However, it is preferable to use some estimate of the variability of each value to shrink different values by different amounts.  Values with a higher variance should be pulled towards the mean more than values with a lower variance.  The trick is that extra information about the variance of values isn’t always available.  Often, sample size is used as a proxy for variance: if someone has a .340 BA after 100 AB, you are more likely to pull their prediction down moving forward than someone who has a .340 BA after 250 AB.

For SAFE, we use the within-player consistency (or lack thereof) across years as a measure of variance for each player.  We are pulling all values towards the population mean (of zero), but we are going to pull certain players less if their SAFE values are very consistent across years.  As an illustrative example, compare the fictional raw SAFE values of two players across three seasons:

Player A:  +5, +5, +5
Player B:  +10, +5, 0

Both players have the same raw mean SAFE of +5 across those years, but we want to shrink these averages somewhat towards the population mean of zero.  Our model would pull Player B’s value towards zero more than Player A’s value, since we are a little more confident that Player A’s true ability is closer to +5 (due to the across year consistency). 
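
To make the idea concrete, here is a generic precision-weighted shrinkage sketch. It is not the random effects model used for SAFE: the between-player variance (25) and the variance floor (1) are assumed numbers, and the function name is made up for illustration.

```python
from statistics import mean, variance


def shrunk_mean(values, between_var, var_floor=1.0):
    """Shrink a player's raw mean toward zero.

    The weight on the raw mean is between_var / (between_var + within_var / n),
    so noisier (less consistent) players are pulled toward zero more.
    var_floor keeps a perfectly consistent player from getting zero variance.
    """
    n = len(values)
    within_var = variance(values) + var_floor
    weight = between_var / (between_var + within_var / n)
    return weight * mean(values)


player_a = [5, 5, 5]
player_b = [10, 5, 0]

# between_var=25 is an assumption, not an estimate from SAFE data.
a_shrunk = shrunk_mean(player_a, between_var=25.0)
b_shrunk = shrunk_mean(player_b, between_var=25.0)

print(f"Player A: raw {mean(player_a):+.1f} -> shrunk {a_shrunk:+.2f}")
print(f"Player B: raw {mean(player_b):+.1f} -> shrunk {b_shrunk:+.2f}")
```

How far apart the two shrunk values end up depends almost entirely on the assumed variances, so the size of the gap here should not be read as anything SAFE would produce.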

I think I’ve babbled on enough here, but that is the basic idea.


#63    Pizza Cutter      (see all posts) 2007/05/29 (Tue) @ 15:19

Shane, on a technical note, what method/stat are you using as the basis for within-player variability?  Are you using good ole standard deviation/error?  An auto-regressive function?


#64    tangotiger      (see all posts) 2007/05/29 (Tue) @ 15:40

I share MGL’s skepticism that you can infer “consistency” (lack of variability) to any degree that would have the 5/5/5 guy be regressed any differently than a 10/5/0.

What if you have a -15/0/30 player?  Wouldn’t the better inference be that this guy was hurt one year?

In any case, how much are we talking about here?  That the 5/5/5 guy is a true +4.2, and the 10/5/0 is a true +3.8?

***

Also consider a guy who is a “true” 80% outs, and he has a sample 500 BIP in a season.  For that season, 95% of the time, he’ll convert into outs 76.4% - 83.6%.  That’s +/-14 runs!
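
The interval above follows from the binomial standard deviation. A quick sketch, taking 2 SD as the 95% band and assuming .8 runs per play (the .80 figure from the run-value discussion earlier in the thread):

```python
from math import sqrt

true_rate = 0.80   # true out-conversion rate
bip = 500          # balls in play in the season

# Binomial standard deviation of the observed rate.
sd = sqrt(true_rate * (1 - true_rate) / bip)

# Roughly 95% of seasons fall within 2 SD of the true rate.
half_width = 2 * sd
low, high = true_rate - half_width, true_rate + half_width

# Assumption: each play is worth about .8 runs.
runs_per_play = 0.8
runs_swing = half_width * bip * runs_per_play

print(f"95% range: {low:.1%} to {high:.1%}")
print(f"swing: +/-{runs_swing:.1f} runs")
```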

The “consistency” that you see can hardly be attributed to the lack of variability of the player, and more likely simply is noise.


#65    MGL      (see all posts) 2007/05/29 (Tue) @ 17:45

I would like to see the mathematical justification for doing what Shane does.  Pizza or any of the other statisticians?

I told Shane in an e-mail that I can easily model the situation with a computer model and see if that is true.  I can simply set up a distribution of players with different true talents, centered around +5.  It doesn’t even matter what the distribution looks like, although I’ll make it roughly normal.  Then I can simulate 3 seasons and see the average true talent of those who are consistent and those who are not, given a certain mean in those 3 seasons (e.g., +5).  I am suspecting that it will be the same, but I could be wrong, given that I am no statistician.

Of course, in the model, the assumption is that a player’s true talent never changes.  If it does, then players who are less consistent will tend (ever so slightly I would think) to be those whose true talent fluctuates.  Whether that means that their sample values should also be more regressed to estimate their true talent, I don’t know.  And even then, what true talent are we trying to estimate, given that it changes a little from year to year?  The average true talent over those 3 years?

I guess I could model that too, though it would be more complicated and I would have to articulate a lot of parameters and we would have no idea whether they were true in real life (like X amount of players change their true talents each year by Y amount, etc.).


#66    Shane Jensen      (see all posts) 2007/05/29 (Tue) @ 19:56

Pizza/63, I am using a random effects regression model.  The SAFE values in each year are regressed against a separate indicator variable for each player.  This type of model simultaneously estimates both the within-player variance of values, as well as a global variance across all players.  The book I use as a reference for these models is “Bayesian Data Analysis” by Gelman et al. 

http://www.amazon.com/Bayesian-Analysis-Second-Statistical-Science/dp/158488388X

I don’t mind going into more detail about the equations involved if people are interested.  I would also be interested in other people’s approaches to averaging now that the raw year-by-year values are available on the SAFE website.


#67    Pizza Cutter      (see all posts) 2007/05/29 (Tue) @ 19:56

MGL, I read the website and they say that they used a random effects regression.  (How to explain this in one paragraph...)

The idea is that if you take a look at each individual player and the four data points that they have (2002-2005, in their system), you can see how much their scores vary by the slope that line makes.  For example, let’s take a player who went 2, 4, 6, 8.  That player has a nice straight line with a slope of two.  (Stat geek note: I wonder if they also let the intercept vary randomly… I’m guessing they fixed it...) Then, let’s say you’ve got someone who went 5, 5, 5, 5.  They have a personal slope of zero.  (You’ll note that both have the same mean.) Now these are extremely convenient cases that I cooked up for illustration.  Real data are seldom so cooperative.
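
The per-player slope in the two made-up cases can be checked with an ordinary least-squares fit against season index. This only illustrates the slope idea in the paragraph above; whether SAFE's model actually works this way is, as noted, a guess. The function name is made up for illustration.

```python
def ols_slope(ys):
    """Least-squares slope of ys against season index 0, 1, 2, ..."""
    n = len(ys)
    xs = range(n)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    numerator = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    denominator = sum((x - x_bar) ** 2 for x in xs)
    return numerator / denominator


rising = [2, 4, 6, 8]
flat = [5, 5, 5, 5]

print(ols_slope(rising), sum(rising) / len(rising))  # slope and mean
print(ols_slope(flat), sum(flat) / len(flat))
```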

I’m guessing that this is the basis for their measure of individual variability, and I must point out that’s a guess on my part given the (limited) info I have available.  If Shane wants to set me straight, he’s welcome to.  To be honest, I’m really confused on what they’re talking about myself.

I agree with Tango.  It’s more likely to be noise that would cause the variation than some sort of drastic shifts in true talent.  Tango’s example of the confidence interval on the binomial distribution is well-taken.  Everyone should probably be regressed in roughly the same way.  After all, with that much noise, how does one separate whether someone was especially inconsistent or especially variable in their luck?


#68    Shane Jensen      (see all posts) 2007/05/29 (Tue) @ 20:35

Thanks Pizza/67 for helping with some explanation.  I’ll check in again tomorrow with the actual equations I used which will hopefully clarify things further.


#69    tangotiger      (see all posts) 2007/05/29 (Tue) @ 22:15

And perhaps give the results using those samples:
5/5/5
10/5/0
30/0/-15

Assume 600 BIP for each season.


#70    Pizza Cutter      (see all posts) 2007/05/30 (Wed) @ 02:59

To those who are attempting to play around with this data set, a bit of a warning.  There are some odd data points in there.  Take a look at Ricky Ledee’s performance as a CF in 2003.  Despite having only 74 balls hit his way in CF, he managed to be 56 runs worse than the average CF that year.  So, according to this, he was worth 3/4 of a run less per ball(!!!) than the average CF.  According to B-R, in 2003, he made 64 putouts in CF, so that leaves 10 balls that apparently did a lot of damage.  (I realize that he wasn’t Willie Mays, but was he in a coma?) There are a few more suspicious outliers in there.  I’ve found that the data behave a lot better when I restrict the data to those with more than 100 chances at that position.

A few preliminary findings: I looked at the AR1 intraclass correlation for each position.  Over at StatSpeak, I did a column on what it is, but to briefly recap, it’s a measure of how consistent performances are across time within players.  An ICC of .7 can be read like a correlation of .7, and an ICC-squared like an R-squared.  .7^2 = .49, so roughly half of the variance in performance is consistent within players across the data in the set.  I restricted it to guys with at least 100 BIP toward their general areas.  I also used runs saved per ball instead of raw totals.

Results:
1B: .112
2B: .619
SS: .465
3B: .541
LF: .302
CF: .621
RF: .361

Second basemen are the most consistent from year to year in their performances, while first basemen were the most likely to vary.  Perhaps we might re-interpret that to say that performance at first base is much more the product of random fluctuations (might I say… luck?) and second base has much more to do with actual ability?

Even the biggest number, CF at .621, means that about 38.5% of the variance in performance is consistent within players.  That means that there’s a lot of noise to deal with even in the best case.  (noise:signal is about 1.6:1)


#71    MGL      (see all posts) 2007/05/30 (Wed) @ 03:03

OK I ran a sim.  Here are the parameters:

I had 200 players with a mean true talent of .7 caught per BIP.  One standard deviation in talent was around .02 balls caught (I just made that up so that I can make a roughly normal distribution of true talent within the 200 players - I could have made the SD of talent anything I wanted).

So of those 200 players, I made one a true talent of .75 and one a true talent of .65 (2.5 SD from the mean).

I made 5 players .74 and 5 players .66 (2 SD from mean).

I made 26 players .72 and 26 players .68 (1 SD from mean), and 136 players were .70 in true talent.

Each player got 600 BIP per season and “played” for 3 seasons.  For each BIP for each player, I generated a random number between 1 and 100.  If the random number was less than or equal to 100 times their true talent caught rate (.70, .72, etc.), then they caught the ball.  Otherwise they missed the ball.

At the end of season 1, each player got 600 balls and caught X amount, centered around their true talent caught rate.  The variance around that caught rate should simply be the binomial variance.

At the end of 3 seasons, for each player I computed the variance of their actual catch rates for each season.  For example, if a player was .68, .73, .75, he had a high variance.  If a player was .72, .72, .73, he had a low (near zero) variance.

I then separated all players after 3 seasons into 2 groups.  One group were the high variance players and the other group were the low variance players. Both groups had around the same number of players.  The median variance, BTW, was around .00015 in catch rate.

I only looked at players who averaged around .72 caught rate for the 3 years (like a player who averages +5 runs for the 3 years).

Then for each group I kept track of their true talent rates.

I did this 1000 times, so that each player had 3000 seasons (1000 3-season bunches).

We expect of course, that the true talent of all players who averaged .72 for 3 years would be .72 regressed toward the mean of .70, or somewhere around .71 (50% regression, maybe a little more since we have so many average players in my 200 players).

The question is if the high variance group had a lower (more regression) average true talent rate than the low variance group.  The more that I think about it, I will be shocked if there is any difference.  I don’t think that one thing (the variance of their performances in 3 years) has anything whatsoever to do with their true talent rates.  IOW, I can’t think of any reason why the higher true talent guys would have more consistency over the 3 years than the lower true talent guys, given an above-average actual catch rate of .72.

Anyway, here are the results:

The high-variance (inconsistent) players had an average true talent of .7087 and the low-variance (consistent) guys, .7094, essentially the same.

I think that if I run enough sims, those two numbers will be exactly equal.

Again, the assumption is that a player’s true talent never changes.  If we make the assumption that all players’ true talents fluctuate for whatever reasons (injury, age, etc.), then I suppose that the high variance players will be comprised of players who have more fluctuation in their true talent (slightly) but I still don’t see any reason to regress their sample catch rates more toward the mean than players whose true talent does NOT fluctuate as much.

So Shane, I am having a hard time seeing where you got this from.
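
A minimal sketch of the sim described in this post, for anyone who wants to rerun it.  The roster, 600 BIP, 3 seasons and the .72 filter come from the post; everything else is an assumption made for illustration: the normal approximation to the binomial (used for speed in place of a random draw per ball), the ±.005 window around .72, the split at the median variance of the pooled selected players, 300 repetitions instead of 1000, and the names TALENTS and simulate.

```python
import random
import statistics

# True-talent roster from the post: 200 players centered on .70.
TALENTS = [0.75, 0.65] + [0.74] * 5 + [0.66] * 5 + [0.72] * 26 + [0.68] * 26 + [0.70] * 136


def simulate(reps=300, bip=600, seasons=3, target=0.72, window=0.005, seed=1):
    """Return (mean true talent of low-variance group, of high-variance group)."""
    rng = random.Random(seed)
    selected = []  # (variance of the seasonal catch rates, true talent)
    for _ in range(reps):
        for p in TALENTS:
            sd = (bip * p * (1 - p)) ** 0.5  # binomial SD of catches in one season
            # Assumption: normal approximation to the binomial, rounded to whole catches.
            rates = [round(rng.gauss(bip * p, sd)) / bip for _ in range(seasons)]
            # Keep only players whose 3-season average is close to the target rate.
            if abs(statistics.fmean(rates) - target) <= window:
                selected.append((statistics.pvariance(rates), p))
    selected.sort()  # ascending by variance of seasonal rates
    half = len(selected) // 2
    low = statistics.fmean([t for _, t in selected[:half]])
    high = statistics.fmean([t for _, t in selected[half:]])
    return low, high


low, high = simulate()
print(f"low-variance (consistent) group:    {low:.4f}")
print(f"high-variance (inconsistent) group: {high:.4f}")
```

Because true talent is fixed in this setup, any gap between the two groups can only come from the binomial variance depending slightly on the catch rate, so both averages should land close together and a little above .70.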


#72    MGL      (see all posts) 2007/05/30 (Wed) @ 03:37

I ran regressions of UZR per 150 for all players with at least 100 chances.  I regressed 01 on 02, 03 on 04, and 05 on 06.

The y-t-y r across all positions was .501, which suggests a 50% regression toward the mean after only one year of UZR with an average number of chances of 257 per year across all positions.

Broken down by position the y-t-y correlations and average number of chances per season are:

1B .295 164
2B .506 282
SS .435 316
3B .507 241
LF .758 219
CF .549 289
RF .308 232

If we adjust for the fact that the average number of chances per season is different, and convert or normalize each position to 300 chances (a chance is a ball caught by an average fielder), we get:

1B .433
2B .522
SS .423
3B .562
LF .811
CF .559
RF .365
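
The post does not say how the correlations were normalized to 300 chances.  As a sketch, the usual r = n / (n + k) relationship (solve for k from the observed r and n, then re-evaluate at 300) lands within about .001 of every number listed above, so that is the assumption used here.  The function name normalize_r is hypothetical.

```python
def normalize_r(r, n, target=300):
    """Rescale a year-to-year correlation r observed at n chances to `target` chances.

    Assumes r = n / (n + k), where k is the number of chances at which r = .5.
    """
    k = n * (1 - r) / r
    return target / (target + k)


# Observed y-t-y correlation and average chances per season, from the post.
observed = {
    "1B": (0.295, 164),
    "2B": (0.506, 282),
    "SS": (0.435, 316),
    "3B": (0.507, 241),
    "LF": (0.758, 219),
    "CF": (0.549, 289),
    "RF": (0.308, 232),
}

for pos, (r, n) in observed.items():
    print(f"{pos} {normalize_r(r, n):.3f}")
```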

There are only from 35 to 71 total “pairs” in the regression so there is plenty of sample error in those “r”s.

For example, for an observed r of .5 with only 50 pairs of data, the 95% confidence interval is .26 to .69.  So take those above numbers with a large grain of salt.

On the other hand, for all positions combined, I have 457 data pairs.  For a sample “r” of .501, the 95% confidence interval is .43 to .57.
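
For checking intervals like these, a short sketch using the Fisher z-transformation, which is the standard way to put a confidence interval on a correlation.  Whether the numbers above were computed this way is an assumption; the function name pearson_r_ci is hypothetical.

```python
import math


def pearson_r_ci(r, n, z_crit=1.96):
    """Approximate 95% confidence interval for a correlation r from n pairs."""
    z = math.atanh(r)            # Fisher z-transform of r
    se = 1 / math.sqrt(n - 3)    # standard error of z
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)


for r, n in [(0.5, 50), (0.501, 457)]:
    lo, hi = pearson_r_ci(r, n)
    print(f"r={r}, pairs={n}: {lo:.2f} to {hi:.2f}")
```

With 50 pairs the interval spans more than .4, while with 457 pairs it narrows to about .14, which is why the per-position numbers deserve far less trust than the combined one.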


#73    MGL      (see all posts) 2007/05/30 (Wed) @ 03:49

BTW, I just did the 07 UZR numbers so far.  Some notables are (UZR per 150 games):

Jeter -25
Griff -26
Reyes +27
H. Ramirez -36
M. Ramirez -42
Bonds -10
A-Rod +11
Everett +37
Betancourt -12
Ichiro +17
M. Cabrera -17
Crisp +13
Damon +5
Edmonds -7
Erstad +17
S. Green -16
Hardy +16
A Jones +3
Dunn -29
Peralta -29
Polanco +15
Rolen +40
Rowand +7
Soriano +45
Upton -32
Vizquel +20
Victorino +21
Weeks -31
Willits +28
Mike Young -17
Delmon Young +5

Sample size, sample size, sample size, warning!


#74    Rally      (see all posts) 2007/05/30 (Wed) @ 09:27

If there is or isn’t a reason to use more regression on the high variance players, running a sim isn’t going to show it, since we know that the only reason a .72 true talent fielder has a high or low variance in your sim is because of random noise.

What Shane is doing works if he can show that the variance tells you something real about a player’s true talent.


#75    tangotiger      (see all posts) 2007/05/30 (Wed) @ 10:07

Because of the variance we expect purely from luck (1 SD = .02 outs per play, or 10 runs total, with 600 BIP), it is incredibly hard to believe that you would have:
1 - a true variance among players that can even approach that level
2 - that you can infer it, based on whether a guy is 5/5/5 or 30/0/-15

And that even if all that were true, that this would tell you that the guy who shows 5/5/5 is more likely to be a “true” better player than a guy who shows 30/0/-15.


#76    Rally      (see all posts) 2007/05/30 (Wed) @ 10:55

I agree with all of that. My point is that using a sim isn’t going to tell you anything useful here.

We already know that in the sim the variance of these players is entirely due to luck.  If Shane wants to convince us that we need to regress high variance players more, he needs to show there is a reason why real life differs from that sim.


#77    tangotiger      (see all posts) 2007/05/30 (Wed) @ 11:02

My point #1 in post75 should read as “true variance for a given player (i.e., change in true talent level, year-to-year)”. 

The true variance among players is also around .02 outs per play.

***

I agree that MGL’s sim won’t tell us anything.  What MGL has to do is create a model as such:
1. Create a “change in talent level”, with 1 SD = .01

So, 16% of players will have their true talent level increase by at least +.01 outs per play.

Once you create this model, then you have to track all the players that were “consistent” in sample performance, and figure out what their true talent levels were.

I’d bet they’d be the same.


#78    Peter Jensen      (see all posts) 2007/05/30 (Wed) @ 11:26

Here is my explanation of what Shane is trying to do.  As PizzaCutter explained in another thread, the equation for the summation of variances is not Var(obs) = Var(true) + Var(random), but is actually Var(obs) = Var(true) + Var(random) + Var(unexplained).  Any model of observed data is going to have variance from factors that have not been included in that model.  Shane is trying to get closer to the true variance by reducing the effects of the Var(unexplained) in his Var(obs).  Rally is correct that MGL’s sim, because it does not include a Var(unexplained) will not reveal whether Shane’s methodology is correct or not.

When we talk about finding a player’s “true talent” what we really mean is finding the best predictor of what that player will do in the future.  Although adjusting any single year’s observed data by “regressing to the mean” can be an important factor in predicting future performance, its relative importance will have to be weighed against other factors such as age, experience, and health of the player.


#79    Shane Jensen      (see all posts) 2007/05/30 (Wed) @ 11:54

So here is a bit more statistical detail and intuition behind doing individual-specific shrinkage.  Forgive the math notation, but hopefully it will be clear enough. We have raw SAFE values Y_ij for player i in year j.  We can model these observed values as noisy observations of the player’s underlying talent parameter theta_i.

Y_ij = theta_i + epsilon_ij

where epsilon_ij is a Normal random variable with mean 0 and standard deviation sigma_i.  Note that sigma_i is indexed by i, which means that I am allowing each player to possibly have a different standard deviation in their observed values around their true talent.  Now, we have another level of the model...we assume that the true talents theta_i are centered around mu with a certain variance tau^2. 

theta_i ~ Normal (mu, tau^2)

mu is the mean of SAFE values across all players.  The parameter tau^2 captures how much true talent does vary across the population of players.  With this two-level model, the best estimate of theta_i is a weighted average of the player’s average values meanY_i across years j and the population mean mu.  The weights are functions of the individual vs. population standard deviation: 

theta_i estimate = [ (n*meanY_i/sigma_i^2) + (mu/tau^2) ] / [(n/sigma_i^2) + (1/tau^2) ]

That formula is a bit of a mess, I realize, but basically, it means that if player A has a large sigma_i, then their personal average meanY_i will be downweighted more than player B who has a small sigma_i.  This in turn means that the player A is pulled more towards the population mean mu (which is zero in our case). 

The intuition for using shrinkage in many studies has been that we should be skeptical about values far away from the population mean.  I am just adding the extra assertion that we should be even more skeptical about players that have inconsistent values.  Some of that inconsistency may be “true inconsistency” in their true talents, but more likely, that inconsistency is due to the extra “unexplained” term that Peter/78 introduced.  If there is a large inconsistency, it suggests that something funny might be going on with that player, and so we should place less trust in their values.  Less trust translates into more of a pull towards the population mean.


#80    Tangotiger      (see all posts) 2007/05/30 (Wed) @ 12:03

Shane, what would be the result of my post 69?

Assume that the observed sample of all players is 1 SD = 10 runs per 162 GP.


#81    MGL      (see all posts) 2007/05/30 (Wed) @ 13:23

I am starting to come around to Shane’s way of thinking - a little.

Let’s say that for some players, for whatever reason, there is more noise than for other players.

Let’s even say that we have a computer bug such that some players will have random values for all 3 years.

Might not the inconsistent group tend to have more of these players with bugs than the consistent group, in which case you would want to regress them more toward the mean since the players with bugs (their values are worthless so we have to assume that they are average players) would need to be regressed 100% toward the mean?

I think this is more what Shane is talking about, but I am not sure.

I am also not sure that you can justify the different regressions based on any kind of model where some players’ true talents change more than other players’.  Even if certain players’ true talent did change more than others (in which case it WOULD show up ever so slightly in a higher variance for those players), I am not sure that there is any reason to regress their values more.  If a player is a +5,+5,+5, and we assume that his true talent stayed relatively constant over the 3 years, why would that true talent (say, going forward) be higher (less regression) than a player who was 0, -5, +20 with the assumption that his true talent fluctuated a little within those 3 years?  In fact, you could make an argument that you simply use a stronger 3-year weighting for players with higher variance.
In any case, I think it is a matter of degree. 

In any case, as Tango initially said, the noise in the variance among the 3 seasons of data is so high compared to anything else, that even if there were some justification for regressing more or less according to the 3-season variance, I would think that the difference in regression would be very small as to make it inconsequential.

Shane, how much are you varying the regression/shrinkage among players?  For example, what would the regression be for a +5, +3, +1 player versus a -10, +3, +16 player?


#82    Shane Jensen      (see all posts) 2007/05/30 (Wed) @ 13:35

I will definitely check in with some numerical examples later on.  MGL’s analogy to a code bug is a good one.  In that case, you would want to shrink 100% towards the population mean, though of course in practice you never see this extreme amount of shrinkage.  As MGL says, it is a matter of degree.

Also, as MGL (and Tango originally) point out, a weakness of my player-specific shrinkage is that there are only 4 seasons of data, and so the estimates of sigma_i for each player are themselves quite variable.  More seasons of data will eventually help this, but in the meantime, something more intelligent can probably be done to help estimation of the sigma_i parameters as well.


#83    Rally      (see all posts) 2007/05/30 (Wed) @ 14:47

Please use the term ‘regression’ instead of ‘shrinkage’.  Otherwise you’ll keep conjuring up memories of the former Yankees Assistant to the Traveling Secretary.

“THE WATER WAS COLD”


#84    tangotiger      (see all posts) 2007/05/30 (Wed) @ 15:48

Peter said:

When we talk about finding a player’s “true talent” what we really mean is finding the best predictor of what that player will do in the future. 

If you have data from 2003-2006, Shane is not looking to see what the guy’s true talent level was as of Oct 1, 2006 or Apr 1, 2007.  He’s not using a time component (i.e., +30/0/-15 is the same as -15/0/+30, according to Shane’s equations).

He’s looking to see what the guy’s true talent level was on Jan 1, 2005 (or more accurately, his average true talent level from Apr 1, 2003 to Oct 1, 2006, on the days that he happened to play).

***

I created a semi-realistic distribution of players and performance.  Here’s my distribution of players:
freq true
0.01 0.66
0.04 0.67
0.10 0.68
0.20 0.69
0.30 0.70
0.20 0.71
0.10 0.72
0.04 0.73
0.01 0.74

This means that 30% of my players have a true out rate of .700.

Using a binomial, you can then infer the true rates for various sample rates.  For example, given that you observe a .740, and that you have the above true distribution, what is the chance that this was put up by a true .74, a true .73, a true .72, etc… In short, what’s the average true rate, given that you observe .740.  In my test case, the average true rate was .720.  So, given that you observe .740, on average, it was put up by a .720 guy.  That is, 50% regression toward the mean.

In every observation, be it .690, .710, .740… whatever, the regression toward the mean was always 50%.  It wasn’t the case that at certain observation levels, the regression toward the mean would be any different.
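
A minimal sketch of that calculation: weight each true rate by its frequency times the binomial likelihood of the observed outs, then average.  The post does not state the sample size behind each observation, so n = 600 below is an assumption and the regression fraction it produces need not match the 50% quoted above.  PRIOR, posterior_mean and regression_fraction are hypothetical names.

```python
import math

# First true-talent distribution from the post: true out rate -> frequency.
PRIOR = {0.66: 0.01, 0.67: 0.04, 0.68: 0.10, 0.69: 0.20, 0.70: 0.30,
         0.71: 0.20, 0.72: 0.10, 0.73: 0.04, 0.74: 0.01}


def posterior_mean(outs, n, prior=PRIOR):
    """Average true rate of the players who would produce `outs` outs in n chances."""
    # Log of (frequency * binomial likelihood); the binomial coefficient cancels.
    log_w = {p: math.log(f) + outs * math.log(p) + (n - outs) * math.log(1 - p)
             for p, f in prior.items()}
    top = max(log_w.values())  # subtract the max to avoid underflow
    w = {p: math.exp(lw - top) for p, lw in log_w.items()}
    return sum(p * wp for p, wp in w.items()) / sum(w.values())


def regression_fraction(outs, n, prior=PRIOR):
    """Share of the gap between observation and population mean that is given back.

    Not defined when the observed rate equals the population mean.
    """
    pop_mean = sum(p * f for p, f in prior.items())
    obs = outs / n
    return (obs - posterior_mean(outs, n, prior)) / (obs - pop_mean)


# Assumed sample: 444 outs in 600 chances, i.e. an observed .740.
print(f"average true rate: {posterior_mean(444, 600):.3f}")
print(f"regression toward the mean: {regression_fraction(444, 600):.0%}")
```

Swapping in the second distribution, or a different n, only requires passing a different prior dictionary or different counts.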

When I changed my true distribution to this:
freq true
0.03 0.66
0.07 0.67
0.14 0.68
0.16 0.69
0.20 0.70
0.16 0.71
0.14 0.72
0.07 0.73
0.03 0.74

My regression toward the mean equation did change for each observed level.  In this case, a .760 observation would regress 54% toward the mean, as its true was .732.  A .670 observation would regress 65% toward the mean, as its true was .680.

So, if you had an observation of .760, .700, and .670, the underlying true rates for those are .732, .700, .680, for an overall true average of .704.

Those three observations averaged an observation of .710.  If we had three exact observations of .710, the regression would be 69%, implying a true rate of .707.

It looks like you do want to, theoretically, regress each observation point separately.  However, it hardly seems worth the effort.

Let’s see how the numbers compare for Shane with and without his extra regression parameter.


#85    Peter Jensen      (see all posts) 2007/05/30 (Wed) @ 16:32

Tango - I understand that Shane is trying to estimate a player’s average true talent over the time period of his study.  But a player’s average true talent has absolutely no real world utility except as a means of estimating his present true talent and subsequently his future performance.


#86    tangotiger      (see all posts) 2007/05/30 (Wed) @ 16:53

Peter, I don’t necessarily disagree with your point.  I think we’re on the same page now.


#87    MGL      (see all posts) 2007/05/30 (Wed) @ 17:15

Tango, I don’t understand your concept.  If a player has 3 observation points, or however many, you want to take the average and then regress.  Why are you talking about regressing separately?  Are you assuming that the player’s true talent definitely or may change for each of the observations?  If yes, then you have to include as a parameter the distribution of possible changes in true talent.

BTW, you are doing the rigorous solution to the regression problem when we know the exact distribution of talent in the population.  It is a Bayesian probability problem.  Whenever we use regression equations, it is an approximation (best fit) to the rigorous solution, and in fact assumes a symmetrical, roughly normal distribution of talent in the population.

Whenever we do projections, we really want to know the exact distribution of talent in the population the player comes from and then do the exact Bayesian math.  If that distribution is not symmetrical and roughly normal, I don’t think that a linear regression is the rigorous solution, not to mention the fact that the real distribution may not even be smooth and/or continuous, although in baseball I am sure that it is.


#88    tangotiger      (see all posts) 2007/05/30 (Wed) @ 17:30

I was trying to show that if you take three of the same points (that are close to the mean) and three other points, which average to the same, but are farther from the mean, that their true rates are different.

That is, 3 observation points in one group and 3 observation points in another group have the same mean.

The underlying true rates of those points end up not having the same mean.

However, the impact of such a difference is trivial.

I assumed all the points were independent, which is not exactly what we are talking about.

Your point about the regression being an approximation is well-taken.  The approximation holds very well, in this case, and therefore, I don’t see any reason to worry about the “consistency” issue.


#89    Pizza Cutter      (see all posts) 2007/05/30 (Wed) @ 19:09

Shane, I finally have a few spare moments to take a look at the explanation.  Looks like you’re using a pretty standard multi-level model, although I do have a few concerns, one methodological, the other in the formula. 

Methodologically, it appears that you are allowing sigma_i to equal the standard deviation (squared) of a player’s SAFE scores over the 2, 3, or 4 years that he’s in the database (was there weighting involved in this or was just the straight up SD used?).  In #82, I believe you point out that this is a weakness, but I’m really really concerned about this one.  Standard deviations drawn from 3 or 4 observations will be wildly unstable.  With more observations, it will become more stable, but players with 10 observations are players who played in ten seasons, and suddenly we have a selection bias on our hands. 

The other question is the assumptions behind your error term (epsilon_ij).  The error term in any equation is, in reality, epsilon_ij + epsilon_0 (player specific error/unexplained “true” variance + expected random error).  Laying aside the small N problem, am I right in assuming that since the actual player SD’s should, in theory, pick up both types of error on an individual level (although we have to consider that over a small sample size, “random” error might not have had enough chance to shake out randomly… I will win the lottery eventually if I just keep playing, but I’ll probably die first) and that you consider this to be a better parceling out of the total overall error variance term than just assuming equal and unbiased error across all players, the way a straight regression to the mean would work?

If that’s the case, then theoretically your model makes sense, although the practicalities of the small N problem are really getting to me.  I think the solution is that because there’s both random dumb luck lurking in there with (a lot of?) unexplained true variance, the model needs more parameters to fish out some of that unexplained “true” variance.  My guess is that it can be found in park effects, age, etc.

One other minor point: Do you regress based on norms derived by position, or on the whole set of players?


#90    MGL      (see all posts) 2007/05/30 (Wed) @ 19:42

Tango (88), I see.  That is actually interesting and drives home the point that the regression as a function of sample size only is only an approximation.  Using the rigorous Bayesian method, where you know the exact distribution of talent in the population, will generate slightly different regressions (estimates of true talent) for ANY distinct set of observations given the same mean.  IOW, a +5, +5, +5 will generate a slightly different true talent estimate than a +6, +4, +5, etc.  But I agree that it is not worth the trouble and one of the problems is that we don’t know the exact true talent distribution, only an approximation, usually the mean and variance only.  Given that, I have no idea whether you would regress different unique data points differently, given the same mean. I guess that is one of the questions we are trying to determine.

We may be beating a dead horse here.  I think the lesson (Shane?) is to make sure you, one, explain EXACTLY what you are doing in your methodology (we had to pry out of Shane the fact that his numbers were regressed in the first place) and two, that once you explain the entire methodology, either give us the intermediate data/results or at least tell us exactly how the “adjustments” are done.  IOW, how much are you regressing (shrinking) one player versus another?  If there is a large difference in regression among players with around the same number of historical BIP/opps, then that is most likely a mistake.  If the differences are small, then who cares?


#91    tangotiger      (see all posts) 2007/05/30 (Wed) @ 20:54

The entire exercise has been somewhat illuminating.  So, from that standpoint, this is a good thing.

For practical purposes, in terms of final presentation: Occam’s razor.


#92    dcj      (see all posts) 2007/05/31 (Thu) @ 01:33

Shane, if you are still around. Can you go over exactly how the SAFE numbers in the Excel spreadsheet are generated? I understand that for each player, you have a function giving his chances of making an out on any given BIP. Then you take the difference between this function and the league average function, and do a weighted integral based on an average BIP distribution.

How is this turned into the overall SAFE number? And in the spreadsheet, what do the numbers N represent? It seems like SS get about N=1300 in a full season, where a team’s total BIP is something like 4400. The numbers are a lot higher for IF than OF.

Pizza Cutter/70 brought up Ricky Ledee in 2003. Also Hiram Bocachica in 2004 and a few others. What is happening there?


#93    Shane Jensen      (see all posts) 2007/05/31 (Thu) @ 11:08

Tango/69:  I refer people back to Shane/79 for the equation for calculating the true ability theta_i.  Tango requested the calculations for the following three players. 

Player A: 5/5/5
Player B: 10/5/0
Player C: 30/0/-15

The population mean mu is zero, but we still have to have a value for the population SD tau in order to do the calculation.  In practice, this is estimated from the data, but Tango/80 suggests we use tau = 10. 

For Player A:

sigma_A = 0 and meanY_A = 5
theta_i estimate = meanY_i = 5 (no shrinkage)

For Player B,
sigma_B = 5 and meanY_B = 5 so
theta_i = (3*meanY_B/25) / (3/25 + 1/100) = 4.62 (some shrinkage)

For Player C,
sigma_C = 22.9 and meanY_C = 5 so
theta_i = (3*meanY_C/524) / (3/524 + 1/100) = 1.82 (lots of shrinkage)
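
For anyone who wants to check these three cases, a short sketch of the formula from Shane/79 with mu = 0 and tau = 10.  Two things are assumptions here: sigma_i is taken to be the sample standard deviation with an n-1 denominator (inferred from sigma_B = 5 and sigma_C = 22.9), and the zero-variance case is handled as the limit of the formula, which is the player's own mean.  The function name shrink is hypothetical.

```python
import statistics


def shrink(values, mu=0.0, tau=10.0):
    """Player-specific shrinkage estimate of theta_i from one player's yearly values."""
    n = len(values)
    mean_y = statistics.fmean(values)
    var_i = statistics.variance(values)  # sigma_i^2, sample variance (n - 1 denominator)
    if var_i == 0:
        return mean_y  # limit of the formula as sigma_i -> 0: no shrinkage
    w_data = n / var_i        # weight on the player's own mean
    w_prior = 1 / tau ** 2    # weight on the population mean
    return (w_data * mean_y + w_prior * mu) / (w_data + w_prior)


for name, values in [("A", [5, 5, 5]), ("B", [10, 5, 0]), ("C", [30, 0, -15])]:
    print(f"Player {name}: {shrink(values):.2f}")
```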


#94    Shane Jensen      (see all posts) 2007/05/31 (Thu) @ 11:10

Pizza/89: First, the easy clarification...all of the shrinkage is done separately for each position. 

In practice, the values of theta_i do not end up being that crazy.  Take the example of Player C in Shane/93: this is more extreme than any observed case could get (and with 3 data points instead of the actual 4).  Even in this case, the shrinkage is slightly over halfway towards zero. 

The question of whether or not it is even worth doing individual specific shrinkage is certainly a good one.  I chose to do it because I wanted to be cautious about the people with extreme values.  I think that it is preferable to have a ranking that says Player A > Player B > Player C in the Shane/?? posting. If I want to assert that one of these players really is making a positive contribution in the field, I want to be as sure about it as I can, and I feel a lot more secure about Player A’s performance than Player C. 

Of course, as mentioned in previous posts, with more seasons of data we could also start estimating time trends.  As we all guessed, Player C could actually be Bernie Williams in disguise.


#95    Shane Jensen      (see all posts) 2007/05/31 (Thu) @ 11:11

Rally/83: I know what you mean!  “Shrinkage” kept me giggling through grad school. Good times.


#96    Shane Jensen      (see all posts) 2007/05/31 (Thu) @ 11:23

MGL/90, point well taken about being clear in the methodology.  When the SAFE website was first set up, it was geared towards a different audience (an undergrad class I was teaching) and the focus was not on the methodology.  It’s great that these methodological issues are becoming more of a focus now though, and I have this sophisticated audience to thank for that.  This is one of my first forays into serious baseball research and it has been illuminating to see what people see as important issues.  I am actually working on a full statistical manuscript on the entire SAFE methodology right now, and I will post it here when I get it into a readable state.  In the meantime, I’ll continue to chime in and clarify things, and I appreciate all the feedback!


#97    Tangotiger      (see all posts) 2007/05/31 (Thu) @ 12:43

Shane: thanks for running all that.  I find it hard to accept that you have zero regression toward the mean for the Player A.

To test whether what you are doing is accurate, we can look to see how each of those players did in Year 4.

Of course:
- we won’t have such clearly defined examples
- we won’t have anywhere near the number of players to make this exercise useful

However, we can still run correlations to this effect, perhaps even “bucketing” players into three distinct groups (consistent, wild, rest), and see how they did in year 4.

My guess is that we will find the same amount of regression for each of these overall above average players.

***

One last test: can you create a player D of this profile:
x/0/-15
Solve for “x”, so that your overall regressed value comes out to “5”.

That is, how goshdarngood does this guy have to be in one year, to compensate for a -15 in another year.  I’m afraid to hear the answer.


#98    Tangotiger      (see all posts) 2007/05/31 (Thu) @ 15:12

Ok, I think I was able to plug in the formula.

When I put in
30/0/-15
I get
1.82

Changing the “30” to this value… I get that value…
30… 1.82
50… 2.40
70… 2.33
100.. 2.02
infinity… 0.00

And what if I put: 30/0/0?  That gives me a regressed value of 5.

And 60/0/0?  That’s a 4, less than 30/0/0.

Uhhhh… unless I made a mistake, this doesn’t make any sense.
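
The numbers above can be checked without the spreadsheet.  This is a sketch under the same assumptions as the worked cases in #93 (mu = 0, tau = 10, sigma_i as the sample standard deviation with an n-1 denominator); the names shrink and best_x are hypothetical.  It also scans x/0/-15 over whole numbers to see how high the regressed value can get.

```python
import statistics


def shrink(values, mu=0.0, tau=10.0):
    """Shane/79 estimate, with sigma_i^2 taken as the sample variance of `values`."""
    n = len(values)
    mean_y = statistics.fmean(values)
    var_i = statistics.variance(values)
    if var_i == 0:
        return mean_y  # zero observed variance means no shrinkage
    w_data, w_prior = n / var_i, 1 / tau ** 2
    return (w_data * mean_y + w_prior * mu) / (w_data + w_prior)


# Raising one season while holding the other two fixed.
for x in (30, 50, 70, 100):
    print(f"{x}/0/-15 -> {shrink([x, 0, -15]):.2f}")
print(f"30/0/0 -> {shrink([30, 0, 0]):.2f}")
print(f"60/0/0 -> {shrink([60, 0, 0]):.2f}")

# Scan x/0/-15 for the x that gives the largest regressed value.
best_x = max(range(16, 2001), key=lambda x: shrink([x, 0, -15]))
print(f"largest regressed value: {shrink([best_x, 0, -15]):.2f} at x = {best_x}")
```

Under these assumptions the scan peaks well short of 5, so the player D question from #97 has no solution: past a certain point, a bigger first year adds more to the player's variance than to his mean.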


#99    Tangotiger      (see all posts) 2007/05/31 (Thu) @ 15:27

I published an Excel with Shane’s equations here:
http://www.tangotiger.net/files/individual_regr_to_mean_SAFE.xls

Feel free to verify my work.


#100    MB      (see all posts) 2007/05/31 (Thu) @ 15:37

I can’t get as deep into the methodology as you guys, but I’m a bit confused as well.

Wouldn’t 5/5/5 still regress toward the mean because we expect that the player is likely still somewhere closer to average than his performance over the sample. For example, 20/20/20 doesn’t regress at all toward the mean, either? So we expect this player is a “true” +20 player. From all I’ve read, that doesn’t seem right, but I could be way off here.


#101    Rally      (see all posts) 2007/05/31 (Thu) @ 16:12

I tried something like Tango suggested in #97 using TotalZone measures, but I’ve only got 4 years of data, and the sample is so small.  The consistent players did do a little better in Y4, but I hesitate to even type that, because it’s not a big enough sample to draw conclusions.

I just don’t think you can assume that just based on 3 years data, your observed variance can be used as the player’s true variance.  Even if you think players have different true variances.


#102    Shane Jensen      (see all posts) 2007/05/31 (Thu) @ 16:32

MB/100, you’re not way off!  The formulas that I presented do imply that if you have zero variance in your raw values, then you will not do any shrinkage at all.  In practice, this is a non-issue, since every player has some variance, and therefore each player will get shrunk. 

Tango/98, I checked some of your calculations, and I think that you are using the formula correctly.  Here is my attempt at intuition (if there is any) behind the weird behaviour you’re getting with configurations like 50/0/0 and 60/0/0. 

As you increase one point while keeping the others the same, the variance for that player increases faster than the mean for that player.  As the variance increases, the amount of weight we put on that player’s observed data decreases and so the shrinkage towards the mean is actually greater.  The model is saying that the value of 30 within a set of 30/0/0 is suspect but still within the range of believability.  What is believable and what isn’t is controlled by tau. A value of 30 is 3 SDs from the population mean, but that can certainly happen.  However, a value of 60 is 6 SDs from the population mean, which is essentially impossible in a Normal model. 

These situations aren’t really realistic for SAFE, but if you really did have data that looked like 60/0/0, there are a couple of ways of modifying the model to account for outliers like this.  One would be to use a t distribution for the data instead of a normal distribution.  The equations wouldn’t be as nice, but it wouldn’t downweight outlying values as much.  An even better approach, in my opinion, would be to model the sigma_i for each player as coming from a common distribution, so that you are shrinking both the player’s mean and the player’s variance.


#103    Shane Jensen      (see all posts) 2007/05/31 (Thu) @ 16:41

Rally/101, just a couple of quick clarifications.  I’m using four years of data, not three, but your point would presumably remain the same.

I’m not really trying to assert that the sample variance is really a good estimate of the true variance with my shrinkage model. Rather, a large sample variance is taken as a warning sign of a potential problem with the SAFE values for that player (maybe the model fit in a particular year wasn’t good for some reason?). 

I’m penalizing a player with a large sample variance because I am skeptical of raw SAFE values that vary so much for a single player. 

I’m not penalizing that player with a large sample variance because I believe that he is a truly inconsistent fielder and that I was somehow able to capture that true inconsistency based on 4 years of data.


#104    tangotiger      (see all posts) 2007/05/31 (Thu) @ 17:03

Shane, even sticking to “realistic” numbers, your formula is saying that both these guys are true +5:

5/5/5 (sample mean = 5)
30/0/0 (sample mean = 10)

I’m saying it’s impossible.

If you had only one year of data, a sample mean of +10 would regress 50% toward the mean, to give you +5.  But, with 3 years of data, your regression would be closer to +7.5.  If you want to say that the regression approximation doesn’t account for the inconsistency, and that therefore he was probably more like a true +7.0 or +6.5, ok.

Similarly, a sample mean of 5, over 3 years, would regress to +3.75.  If you want to say it should really be +4.0, or +4.25, ok.

But, from where I sit, these two sample means differ by 5 runs, and regression (over 3 years) would say their true difference is 3.75 runs.  If you want to say that their “wild/consistent” numbers really ought to bring their true difference down to 2.5 or 3.0, then ok.

But to call both of them equals is simply unacceptable.  What I call the process that would make these two guys equals is “mathematical gymnastics”.  In essence, you can get whatever result you want by taking steps that each look plausible, but that reach perplexing conclusions.
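For anyone who wants to check the regression arithmetic in this comment, here is a minimal Python sketch. It assumes the usual weight of years / (years + k), with k set to one year so that a single season regresses 50%. That form, the function name, and the population mean of 0 are assumptions made for illustration, chosen because they give the +5, +7.5 and +3.75 figures quoted above.

```python
def regress_to_mean(sample_mean, years, population_mean=0.0, k_years=1.0):
    """Shrink a sample mean toward the population mean.

    k_years is the amount of data at which the estimate is regressed 50%.
    One year is assumed here, matching "one year ... would regress 50%".
    """
    weight = years / (years + k_years)  # share of trust placed in the observed data
    return population_mean + weight * (sample_mean - population_mean)


# The two players from the comment: 5/5/5 (mean 5) and 30/0/0 (mean 10).
for label, sample_mean in [("5/5/5", 5.0), ("30/0/0", 10.0)]:
    one_year = regress_to_mean(sample_mean, years=1)
    three_years = regress_to_mean(sample_mean, years=3)
    print(label, one_year, three_years)
```

Under this weighting every player with the same amount of data is shrunk by the same fraction, so a higher sample mean can never end up with a lower estimate than a lower sample mean.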


#105    tangotiger      (see all posts) 2007/05/31 (Thu) @ 17:12

I also want to point out that:
30/0/0 (regressed to +5.0)
is higher than
40/0/0 (regressed to +4.8)

Being 4 SD from the mean in sample performance is not that uncommon.  Everett, Erstad have each done it in the last few years.

In fact, a 22.5/0/0 and 40/0/0 gives you the same +4.8.

***

What about something more realistic:

30/15/0 (regression equation of +8.57)
31/15/0 is actually lower (+8.51)
and increasing 31 to 32, 33, etc, is progressively worse

How about this:
20/10/0 (7.5)
30/10/0 (7.5)

Sorry, it simply doesn’t work…
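
The numbers in this comment can be reproduced with a short Python sketch. The exact shrinkage formula is not written out in this part of the thread, so the form below is an assumption: weight = tau^2 / (tau^2 + s^2 / n), with tau = 10 (from "a value of 30 is 3 SDs from the population mean"), a population mean of 0, and s^2 the sample variance of the yearly values. It is used here only because it matches every figure quoted above. The function name is made up for illustration.

```python
from statistics import mean, variance

TAU = 10.0  # assumed population SD of true talent ("30 is 3 SDs from the mean")
MU = 0.0    # assumed population mean (an average fielder is 0 runs)


def variance_shrunk(values, tau=TAU, mu=MU):
    """Shrink a player's mean toward mu, trusting it less as his years disagree.

    Assumed form, inferred from the numbers quoted in this thread:
        weight = tau^2 / (tau^2 + s^2 / n)
    where s^2 is the sample variance (n - 1 denominator) of the yearly values.
    """
    n = len(values)
    ybar = mean(values)
    var_of_mean = variance(values) / n          # player-specific "noise" term
    weight = tau ** 2 / (tau ** 2 + var_of_mean)
    return mu + weight * (ybar - mu)


for yearly in ([5, 5, 5], [30, 0, 0], [40, 0, 0], [22.5, 0, 0],
               [30, 15, 0], [31, 15, 0], [20, 10, 0], [30, 10, 0]):
    print(yearly, round(variance_shrunk(yearly), 2))
```

With this assumed form, a player with values x/0/0 gets an estimate of 300x / (900 + x^2), which peaks at x = 30 and falls after that. That is the non-monotonic behaviour being complained about: past that point, raising the best season lowers the estimate.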


#106          (see all posts) 2007/05/31 (Thu) @ 17:37

I have no choice but to relent in the face of such an onslaught!  Seriously though, your last point brings up several interesting cases, such as the equality of 20/10/0 and 30/10/0.  I agree that these players shouldn’t be equal, and what is driving it is again the punishment of the model for large variances.  I do still think that the individual-specific shrinkage works for most configurations of values, but your point is well taken.  Have you already done the regular regression shrinkage for all of the SAFE values?  We should compare values and see which players show substantial differences.


#107          (see all posts) 2007/05/31 (Thu) @ 17:46

If I read everything correctly what Shane is trying to do is to adjust for the variance across years, but not regress to the mean as contributors to this thread would describe it.

Shane, perhaps you could answer this:

If we think about batting average for a minute and given 500 PA per annum what is the true talent of these two players:

-a) .400/.300/.200
-b) .300/.300/.300

Player b) looks consistent and at first glance our best guess for year 4 is .300. SAFE (for hitters) would say .300. Regression to the mean would say something closer to .290 (I’m guessing here).

What about player a)? Based on rally’s observations I’d say the variance probably adds a little, but not too much to the uncertainty. Let’s call it .285 (and the last year was probably injury related). What would SAFE say? Well, assuming it is centered around a line of .260 I’d guess you’d come out around .275ish.

Anyway, perhaps what we should do is use 3 years’ worth of data and use your shrinkage method vs normal regression to the mean to “guess” year 4 and see which is closer.


#108    Tangotiger      (see all posts) 2007/05/31 (Thu) @ 17:57

Shane, before I can do that, can you explain something in your spreadsheet. 

Specifically, Hiram Bocachica, CF, in 2004, has an “N” of 65, and a “safe2004” of +41. 

I’m guessing 65 is the number of flyballs, maybe multiplied by the distribution at which the average CF makes an out, relative to the other two OF positions?  (That is, maybe there are 160 flyballs when Hiram was playing, and 65/160 represents the % of FB that CF make an out on?)

What does the +41 represent?  That he made 41 more outs than the average CF given:
a. 65 FB to CF
b. 700 FB (or 162 games worth) to CF
c. ??

I’m guessing it’s b?


#109    David Gassko      (see all posts) 2007/05/31 (Thu) @ 18:55

Shane,

I don’t have any thoughts that Tom hasn’t already covered, but I do want to say that your openness and willingness to both discuss and alter your system is a breath of fresh air, especially for an academic. If only every baseball researcher could do that!


#110    tangotiger      (see all posts) 2007/05/31 (Thu) @ 19:10

I agree with David’s sentiment.

You have some academics, like my co-author Andy, who treat you as a peer, even if you don’t have the “certification”.  Others are the snooty kind who expect you to put out a journal article before they even will consider your ideas.

It reminds me of a story I heard from Weird Al Yankovic.  (How’s that for a change up?) He was at an awards show, and he was scheduled to sit in the same row as Prince.  And, Prince’s “people” gave out instructions to everyone in the row that said “You are not to make eye contact or speak to Prince”.  Weird Al was so taken aback with that, that he sent out his own instructions that Prince was not to speak or look at Weird Al.

Did Matt Damon not teach any of those professors anything?


#111          (see all posts) 2007/06/01 (Fri) @ 13:38

I would love to pull off a line like “You are not to make eye contact or speak to Shane”, but I don’t think I would be able to keep a straight face.  I’m very keen on having something that is actually useful, and so I consider myself fortunate to be getting so much feedback from people with experience in modeling baseball.  My hope is that the academic and non-academic community can come together a bit more in the future regarding baseball research.  Open-access journals like the Journal of Quantitative Analysis in Sports are a great start.

I am meeting later today with my student that was responsible for the technical coding of SAFE, and I will ask him about the issues raised by Tango/108 and DCJ/92.


#112    Tangotiger      (see all posts) 2007/06/01 (Fri) @ 14:10

This comment is from Andy Dolphin
=========================================
Shane, the issue is that the player’s observed sigma isn’t his “true” sigma.  Just like the issue of regressing player skills, one would have to regress player consistencies.

In addition, you need to account for the statistical noise.  So, the sigma^2 in the regression calculation should really equal sigma_i^2 + random_noise^2, where random_noise^2 is the appropriate calculation for the stat in question, such as OBP*(1-OBP)/PA for on-base average.

Right now you’re assuming statistical noise is zero, when in fact statistical noise makes up the lion’s share of the total sigma—to the extent that the player sigmas can be assumed to be zero with no real impact on the results.

-- Andy
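
As a rough illustration of the noise term Andy describes, here is a small Python sketch of OBP*(1-OBP)/PA and of the combined sigma^2 = sigma_i^2 + random_noise^2. The .340 OBP, 600 PA, and the .005 player sigma are hypothetical inputs picked for the example, not values from this thread, and the function names are made up.

```python
import math


def binomial_noise_variance(rate, trials):
    """Random-noise variance of an observed rate: rate * (1 - rate) / trials."""
    return rate * (1.0 - rate) / trials


def total_variance(player_sigma, rate, trials):
    """sigma^2 = sigma_i^2 + random_noise^2, as described in the comment."""
    return player_sigma ** 2 + binomial_noise_variance(rate, trials)


# Hypothetical hitter: .340 OBP over 600 PA, with an assumed true sigma of .005.
noise_var = binomial_noise_variance(0.340, 600)
total_var = total_variance(0.005, 0.340, 600)

print("noise SD:", round(math.sqrt(noise_var), 4))
print("share of total variance that is noise:", round(noise_var / total_var, 3))
```

With inputs like these the noise term is most of the total, which is the sense in which setting the player sigma to zero would barely move the regression weight.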


#113    Shane Jensen      (see all posts) 2007/06/04 (Mon) @ 11:53

First, to address the issues brought up by dcj/92 and Tango/108, which both pertain to what average BIP distribution we use to calculate our raw SAFE values.  Tango/108 was correct in his guess that we weight all our integrals so that they represent a season’s worth of BIPs.  Specifically, we use the following BIP counts:

1200 flyballs, 1800 grounders, 800 liners

I believe that these BIP totals should be close to the right number of BIPs for a 150 game season (which was close to the average number of games across fielders).  These totals are then multiplied by the overall BIP density (and run consequence density) when doing our integration, so a greater number of those 1800 grounders will definitely be going to the SS compared to 1B, but many of the BIP going to 1B will have higher consequence. 

So, each raw SAFE value is calculated assuming the same number of total BIP, which is why players like Hiram Bocachica and Ricky Ledee can pop up despite limited playing time.  This is also why it was important to keep track of the actual number of BIP hit to these guys, which is also in the spreadsheet of raw SAFE values.  You don’t want to put too much faith in the SAFE values for players with a small N. 

A good next step would be to incorporate these N values into the shrinkage model.  As Andy points out in Tango/112, we should be accounting for random noise in our sigma_i, and these N values give us information about the magnitude of that random noise for each player.  In fact, as I type this, I really can’t believe we didn’t do this the first time through.  Ah well, you can’t get everything right the first time.  I’m hoping that a better shrinkage system will evolve out of our discussion here, and thanks again to everyone for their input.


#114    Tangotiger      (see all posts) 2007/11/13 (Tue) @ 11:13

I saw that SAFE was linked at BTF, so I’m bumping it for anyone looking for this discussion.


#115    tangotiger      (see all posts) 2007/12/07 (Fri) @ 19:14

http://stat.wharton.upenn.edu/~stjensen/research/safe.html

Message from Shane:

Hey guys,

In case you’re interested, I just updated my SAFE website with new SAFE values for the 2002-2005 seasons.

http://stat.wharton.upenn.edu/~stjensen/research/safe.html

The changes come from improved methodology for estimating the individual fielder curves as well as better estimation of shared consequences as well as other minor things.  I’ve also removed the controversial shrinkage-averaging across years...the averages on the website are now just weighted by the numBIP faced for the player in each year.  The website also has a link to an excel file with the year-by-year values.

Shane.
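
For reference, a numBIP-weighted average across years can be sketched in a few lines of Python. The yearly values and BIP counts below are made up for illustration (they are not taken from the spreadsheet), and the function name is hypothetical. The point is only to show how a small-sample season gets less pull than it would in a straight average.

```python
def bip_weighted_average(safe_by_year, bip_by_year):
    """Average yearly SAFE values, weighting each year by the BIP the fielder faced."""
    if len(safe_by_year) != len(bip_by_year):
        raise ValueError("need one BIP count per yearly SAFE value")
    total_bip = sum(bip_by_year)
    if total_bip == 0:
        raise ValueError("no balls in play to weight by")
    weighted_sum = sum(s * n for s, n in zip(safe_by_year, bip_by_year))
    return weighted_sum / total_bip


# Made-up fielder: one extreme season on very few BIP, three ordinary seasons.
safe_values = [40.0, 2.0, -3.0, 1.0]
bip_counts = [60, 300, 320, 320]

print("simple average:  ", sum(safe_values) / len(safe_values))
print("weighted average:", bip_weighted_average(safe_values, bip_counts))
```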


#116    Rally      (see all posts) 2007/12/07 (Fri) @ 23:58

How should I interpret the run values?  Runs per 162 games?


#117    tangotiger      (see all posts) 2007/12/08 (Sat) @ 09:22

I have no idea, which is why we are better off dealing with the spreadsheet provided.


#118    Shane Jensen      (see all posts) 2007/12/08 (Sat) @ 21:16

The SAFE values are scaled up to be runs for an entire season.  There are a few notes about this in the Excel spreadsheet that Tango referenced.

