THE BOOK--Playing The Percentages In Baseball


Tuesday, May 24, 2011

Batted Ball Puzzler

By Tangotiger, 02:19 PM

Colin posts an interesting puzzle.  My answer below the fold:


Given the choice between one or the other, I would definitely choose B.  However, I would use both, with greater weight on B, at about a 60/40 split.

#1    Colin Wyers      (see all posts) 2011/05/24 (Tue) @ 14:53

I’d be interested to hear your thinking, if you wouldn’t mind.


#2    weskelton      (see all posts) 2011/05/24 (Tue) @ 15:01

Is the source of the “after” yet a 3rd source?


#3    weskelton      (see all posts) 2011/05/24 (Tue) @ 15:03

OK scratch that.  Silly question.


#4    Tangotiger      (see all posts) 2011/05/24 (Tue) @ 15:26

Colin: sure.  I didn’t want to influence anyone, but I guess posting here is “hidden” enough.

I started by only looking at players who had at least 300 contacted balls.  That was 303 players.

I ran a correlation of the “A” batted ball stats to BACON_AFTER and WOBACON_AFTER.  (WOBACON is not there, but I just did BACON_AFTER + HRCON_AFTER… I’m guessing this is close enough.)

And I repeated by running the “B” batted ball stats.

The correlation was higher with the B than with the A.

Then I ran a correlation of BOTH A and B against WOBACON, and it was higher still.

This tells me the following:
1. B dataset is likely better
2. Both datasets have value
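
A minimal sketch of the comparison described above, assuming that the "correlation" of a whole dataset means the multiple correlation from an ordinary least-squares fit. The data are synthetic stand-ins, not Colin's file, and `multiple_r` is a hypothetical helper name.

```python
import numpy as np

def multiple_r(X, y):
    """Multiple correlation: correlation between y and its OLS fit on X."""
    X1 = np.column_stack([np.ones(len(y)), X])      # add an intercept column
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)   # least-squares coefficients
    return float(np.corrcoef(X1 @ beta, y)[0, 1])

# Synthetic stand-in data: two noisy four-column views of one underlying skill.
rng = np.random.default_rng(0)
n = 303
skill = rng.normal(size=n)
A = skill[:, None] + rng.normal(scale=2.0, size=(n, 4))
B = skill[:, None] + rng.normal(scale=1.5, size=(n, 4))
y = skill + rng.normal(size=n)

r_a = multiple_r(A, y)
r_b = multiple_r(B, y)
r_both = multiple_r(np.column_stack([A, B]), y)
print(round(r_a, 3), round(r_b, 3), round(r_both, 3))
```

In-sample, the combined fit can never score below either set alone, so a higher combined r is only suggestive that both sets carry value.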

***

Let’s for example say this is what the datasets are: The “B” dataset is the “perfect” one, the one that is completely automated.  And let’s say the “A” dataset STARTS with the B dataset, but then alters it based on human insight.  Now, if there was no human insight, then all of the changes would have been random.  And so, we’d have no extra correlation from the human changes, and so, using B, or B and A, in this case, would give us the same correlation.  But, if there actually was human insight, then they did do a good job in finding something, but they went overboard.  While they added in some places, they took away a lot in other places.

This is just one illustration of what Colin’s datasets could be.

Looking forward to seeing what the datasets actually are!


#5    Rally      (see all posts) 2011/05/24 (Tue) @ 16:39

Tango, are those correlations any better than just taking the prior results and regressing?


#6    Guy      (see all posts) 2011/05/24 (Tue) @ 16:50

Thus far, I don’t see much advantage to either data set if the goal is prediction.  I tried predicting BACON after, looking at those with at least 300 contacts.  I first predicted BACON-after using 3 prior elements: Contacts (BIP%), BACON-prior, HRCON-prior.  That got me an R^2 of .17 and a standard error of .035.  When I added the four A metrics, the predictions didn’t improve at all.  Same thing when I added the B metrics. 

However, based on what I’ve seen so far, if I had to choose I would definitely go with “A.” The reason is that the correlation between BACON (prior) and LDA% is .999.  So there is clearly something wrong with the B data.


#7    Tangotiger      (see all posts) 2011/05/24 (Tue) @ 16:55

If you are getting .999, then something has been forced.

My presumption is that things are observed, and not precalculated.


#8    Guy      (see all posts) 2011/05/25 (Wed) @ 06:00

I don’t read Colin as providing any assurances that the data is based on observation.  And as you say, it can’t be:  we don’t find .999 correlations in nature.


#9    Tangotiger      (see all posts) 2011/05/25 (Wed) @ 09:06

Guy, I get an r=0.46 between BACON_PRIOR and LDA, as well as BACON_PRIOR and LDB.

I suspect you get r=.999 under a peculiar regression process that you need to be careful about.


#10    Tangotiger      (see all posts) 2011/05/25 (Wed) @ 09:07

Guy, and if you just plot the data, you see you won’t get anything close to a straight line.


#11    Guy      (see all posts) 2011/05/25 (Wed) @ 10:21

Tango:
Maybe Colin changed the dataset at some point, and we’re looking at different data?  I downloaded his data again and started from scratch, and still get a .999 correlation between BACON_PRIOR and LDB_RT_PRIOR.  The regression equation is BACON_PRIOR = 2.784 * LDB - .2547.


#12    Tangotiger      (see all posts) 2011/05/25 (Wed) @ 10:40

When I look at the resulting regression equations for A_PRIOR and B_PRIOR, you see that in B_PRIOR having a lot of popups is a huge negative for wOBA.  But in A_PRIOR, it’s almost irrelevant.

This is what it “should” be, in terms of wOBA:
LD: .75
FB: .40
GB: .20
PU: .05

That is, a line drive is like a walk, a popup is almost a guaranteed out, a FB is better than the average contacted PA, and a GB is worse than average.

Now, how do the regression equations for A_PRIOR and B_PRIOR look?

For A_PRIOR:
LD: .41
FB: .57
GB: .26
PU: .42

That is pretty crappy as an indicator.  I mean, the FB and GB are good enough, but the LD is practically useless, as is the popup.

Here is B_PRIOR:
LD: .26
FB: .86
GB: .30
PU: -.39

The Popup is the exact opposite in B_PRIOR as it is in A_PRIOR. 

If we simply do a 50/50 split, we get:
LD: .34
FB: .71
GB: .28
PU: .01

So, we see that the popups come out as it should.  The GB is pretty close.  The LD is useless, while the FB is too high (i.e., too many legitimate LD treated as a FB).  There’s no way that the standard FB is worth as much as a standard walk.  LD, yes, FB, no.
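
The blend is a plain weighted average of the two coefficient sets. A small sketch, using the two-decimal coefficients as posted; `blend` and `weight_b` are hypothetical names. Because the inputs are already rounded, the results land within rounding of the posted figures rather than matching them digit for digit.

```python
# Coefficients as posted in the comment (already rounded to two decimals).
a_prior = {"LD": 0.41, "FB": 0.57, "GB": 0.26, "PU": 0.42}
b_prior = {"LD": 0.26, "FB": 0.86, "GB": 0.30, "PU": -0.39}

def blend(a, b, weight_b=0.5):
    """Weighted average of two coefficient sets; weight_b is the share given to b."""
    return {k: (1 - weight_b) * a[k] + weight_b * b[k] for k in a}

even = blend(a_prior, b_prior)
for name, value in even.items():
    print(name, round(value, 3))
```

Passing `weight_b=0.6` gives the 60/40 split mentioned in the original answer.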

That’s why both datasets should be used.  I think it’s clear that at the very least, the definition of a popup is very different between the two.  The r is only 0.53 between A_PRIOR and B_PRIOR for “popups”.  Line Drives have a correlation of only 0.46, and that’s believable.  How can popups have that much difference as well?

***

And another thing: the spread in observed LD for dataset B is VERY LOW.  The standard deviation is only .022.  If everyone had the same LD skill, we’d still observe a spread of one SD = .048.

So, something very fishy is up with LD_PRIORB.  I don’t believe it’s observed.  I believe that it must have been altered.

Same thing with PU_PRIORB: the observed spread is one SD = .030, when we’d expect, if there was zero talent, to have one SD = .031.  Again, I believe that this figure has been altered as well.
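
The "expected spread with zero talent" is presumably the binomial standard deviation of a rate. A sketch under that assumption; the rate and contact count below are hypothetical inputs, since the comment does not say which values produced .048 and .031, and the real figure depends on each player's own contact count.

```python
import math

def random_sd(rate, contacts):
    """SD of an observed rate over `contacts` trials if every player shared the same true rate."""
    return math.sqrt(rate * (1 - rate) / contacts)

# Hypothetical inputs: a 20% line-drive rate over 75 contacted balls.
print(round(random_sd(0.20, 75), 3))
```

An observed spread well below this value is hard to get from real observations, which is the point being made about LD_PRIORB.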


#13    Tangotiger      (see all posts) 2011/05/25 (Wed) @ 10:42

Guy/11: can you plot the data?  If you have Excel, you highlight the two columns, Click Insert, then Click Scatter, then select the one with points.  If you see a perfect straight line, then yes, r=.999.  If not, then you are doing something wrong with the regression.

(You can also include a “trend line” with that graph, which is the regression equation.)


#14    Rally      (see all posts) 2011/05/25 (Wed) @ 10:46

I can confirm Guy’s finding of the .9999 correlation.  That makes me think that dataset B is not a real dataset, but a fake dataset derived from the player’s actual results.


#15    Rally      (see all posts) 2011/05/25 (Wed) @ 10:50

What I did was simply open it in excel, and use the CORREL function between BACON_PRIOR and LDB_RT_PRIOR.  And I do get a straight line when I put them in a scatter plot.


#16    Colin Wyers      (see all posts) 2011/05/25 (Wed) @ 10:53

“Maybe Colin changed the dataset at some point, and we’re looking at different data?”

I can assure you this didn’t happen.


#17    Tangotiger      (see all posts) 2011/05/25 (Wed) @ 10:55

Ok, I just redownloaded Colin’s data (10:49 AM ET).  I am looking at these columns:
BACON_PRIOR (Column D in Excel)
LDA_RT_PRIOR (Column F in Excel)

I’m looking at all 825 entries.

If I chart the two graphs, I get a blob.  Excel gives this as the trendline:

y = 0.4211x + 0.0729
R² = 0.2185

Did you guys mean LDB? Guy said LDA:
“The reason is that the correlation between BACON (prior) and LDA% is .999.”

Oooookkkay… yes, that’s a perfectly straight line.  So, that means that Colin took a player’s BACON_PRIOR, and converted it to an LD rate in dataset B.


#18    Guy      (see all posts) 2011/05/25 (Wed) @ 11:04

Tango:  Sorry, I had a typo and said contradictory things ("The reason is that the correlation between BACON (prior) and LDA% is .999.  So there is clearly something wrong with the B data.").  It’s LDB that has the .999 correlation.


#19    Tangotiger      (see all posts) 2011/05/25 (Wed) @ 11:06

Seeing that my original correlation only looked at the 4 batted ball types, but that datasetB is in fact 4 batted ball types AND BACON_PRIOR, I decided to re-run the correlation to include BACON_PRIOR in datasetA.  In this case, I get r=.46, which is a huge jump.

datasetB now gives me r=.43.

That makes datasetA a little bit better.  Presumably, this is because we lost information in datasetB by the removal of the LD data in favor of the BACON data.

In light of this, I will have to reverse myself and go with datasetA if given the choice between the two.

If I use ALL the data, then r=.47, so datasetB adds just a tiny smidge of an improvement.

This tells me that datasetB is likely a subset of datasetA, but a bit worse because we lost the LD information.

It sure would have been helpful to know that one parameter was derived from the other!


#20    Tangotiger      (see all posts) 2011/05/25 (Wed) @ 11:15

If I include all 7 parameters (the 4 batted ball types, and the 3 “performance” types), I get r=.57 for the A-based set, and r=.58 for the B-based set.

Looking at the coefficients in the regression, datasetB was definitely altered.  And, the improvement is barely there.


#21    Guy      (see all posts) 2011/05/25 (Wed) @ 11:22

Tango, I’m not quite following:  what do you mean when you report a correlation between all 4 BIP variables and BACON?  And I don’t think you can assume that only the LD data in set B is derived rather than observed. That could be true, but who knows? 

Going back to the original question, why do you want the “A” data?  Does it improve your ability to predict BACON after?  I didn’t see any gain.  Maybe it helps predict HRCON?  (I haven’t looked.)


#22    Tangotiger      (see all posts) 2011/05/25 (Wed) @ 11:33

I’m looking at WOBACON_AFTER, which is really the only thing we care about.  There’s no reason to treat a HR = 1B.

As for the correlation I reported, it was the 4 batted ball parameters AND BACON_PRIOR against WOBACON_AFTER.

I also did all 7 parameters (the 4 batted ball parameters, as well as BACON, BABIP, and HRCON) against WOBACON_AFTER.  In that case, datasetA and B are very similar.

As for the derived issue: definitely the LD was derived, and I’m sure the others were also partially derived (but whether they are or are not is not really relevant to my point).  Seeing how the coefficient for the FB in datasetB jumped up and the PU dropped extremely low, my guess is that a lot of LD hits were placed into the FB pool and LD outs were placed in the PU pool, in datasetB.  This is why we get the exaggerated coefficients for those two terms.  You can never get a negative coefficient for wOBA (the minimum is zero!), so when I saw PU of -.39, this tells me that something drastic occurred here.

And as I said, the observed SD of PU was the same as expected from random, so definitely something happened here.


#23    Guy      (see all posts) 2011/05/25 (Wed) @ 12:16

But here’s my question:  does dataset A add anything?  Or can you predict WOBACON_AFTER just as well using only the prior BACON, HRCON, and CONTACT?  I didn’t find that the “A” variables improved the prediction.  (I’m ignoring B since we know it’s contrived data.)


#24    Tangotiger      (see all posts) 2011/05/25 (Wed) @ 12:54

If I use ONLY:
HRCON_PRIOR

I get r=0.55 against WOBACON_AFTER

If I use ONLY:
BACON_PRIOR
HRCON_PRIOR

I get r=0.56 against WOBACON_AFTER.  That means Batting average on contact adds just a tiny smidge.

If I use
BABIP_PRIOR
BACON_PRIOR
HRCON_PRIOR

I get r=0.56 against WOBACON_AFTER.  That means BABIP didn’t help.

If I use those three AND the A dataset:
GBA_RT_PRIOR
LDA_RT_PRIOR
FBA_RT_PRIOR
POPUPA_RT_PRIOR

I get r=0.57.

So, yes, the batted ball information provided almost no new information.

If I use the main three AND the B dataset:
GBB_RT_PRIOR
LDB_RT_PRIOR <-- extraneous
FBB_RT_PRIOR
POPUPB_RT_PRIOR

I get r=0.58.

As you can see: ONLY knowing the player’s HR rate (and that, on an average of only 75 contacted PA) gave us an r=0.55 for his rest-of-season WOBA on contacted PA.


#25    Rally      (see all posts) 2011/05/25 (Wed) @ 13:28

So just using the results you get .56, and adding in the BBdata you can get to .58.  It’s not much, but not surprising either. 

That’s the way projections work, where if Marcel gets you to .70 from the most basic stats, perfect knowledge of a hitter’s true talent can only get you to .75 because of random variation, and the good projection systems are somewhere in between.

I’m actually a bit surprised you can even get to .56 here, given that the prior data is such a small sample- an average of 73 contacts.


#26    Guy      (see all posts) 2011/05/25 (Wed) @ 13:50

I found that including # of initial contacts (i.e. hitter’s BIP%) slightly improved the prediction. With that included, do the A and B variables still add anything?

In any case, assuming that at least the A data is real (and perhaps some of the B data), I think this tells us that the analyses we often hear this time of year—player X is performing poorly but his low/high LD% means he is toast / due for a comeback—is not worth much at all.


#27    Tangotiger      (see all posts) 2011/05/25 (Wed) @ 13:59

Whoah.  If I do like Guy says, then using HR per contact, and Contact per PA gets me to r=.60!

So, given the choice between
: contacts per PA
: BABIP + BACON + Batted ball dataset A + batted ball dataset B,
the former is better!

If I use EVERYTHING, I get r=.62.

So, yeah, just knowing those two little things is all we need.


#28    Colin Wyers      (see all posts) 2011/05/25 (Wed) @ 14:00

One thing I’d like to point out, without discussing this particular dataset (I will pull back the curtain at some point, but I don’t want to do so while there’s still useful discussion going on). When you add additional variables to the model and see the R2 improve by that little, are you seeing a meaningful relationship or are you causing the regression to overfit by introducing too many variables?


#29    Sky      (see all posts) 2011/05/25 (Wed) @ 14:00

I might have missed this part, but how well can you predict the in-sample and/or future stats with either set of batted ball data and NOT the prior stats? (Sounds like one data set is not real, so maybe just the other set, or only project/regress to WOBACON for that set.)


#30    Tangotiger      (see all posts) 2011/05/25 (Wed) @ 14:04

Colin, it would be helpful if you can give us WOBACON_AFTER.  0.9 for 1B, 1.3 for 2B+3B, 2.0 for HR in the numerator, and contacted PA in the denominator.

BACON_AFTER + HRCON_AFTER is close enough, but I’d like to see accuracy improved by not presuming 1.0 for 1B+2B+3B (which is implied by BACON_AFTER).
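
The requested stat and the stand-in can be written side by side. A sketch using the weights given in this comment; the function and argument names are hypothetical, and the sample line is made up for illustration.

```python
def wobacon(singles, doubles_triples, home_runs, contacted_pa):
    """wOBA on contact with the weights proposed above: 0.9, 1.3 and 2.0."""
    if contacted_pa <= 0:
        raise ValueError("contacted_pa must be positive")
    numerator = 0.9 * singles + 1.3 * doubles_triples + 2.0 * home_runs
    return numerator / contacted_pa

def bacon_plus_hrcon(singles, doubles_triples, home_runs, contacted_pa):
    """The BACON + HRCON proxy: every hit counts 1.0, and home runs count once more."""
    hits = singles + doubles_triples + home_runs
    return (hits + home_runs) / contacted_pa

# Made-up line: 10 singles, 5 doubles/triples, 3 home runs in 60 contacted PA.
print(round(wobacon(10, 5, 3, 60), 4), round(bacon_plus_hrcon(10, 5, 3, 60), 4))
```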


#31    Tangotiger      (see all posts) 2011/05/25 (Wed) @ 14:18

Focusing only on the two important parameters (CON_PRIOR, HRCON_PRIOR), I added the square root of each, the square of each, and the product of the two (so that we end up with 7 parameters based on those two), and r remained at .60.

Therefore, a linear additive model of those two parameters is fine.
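
The seven columns can be built as follows. This is only a sketch of the feature construction, with a hypothetical helper name and made-up input rates; the regression itself would be run on these columns in the usual way.

```python
import numpy as np

def expand(con, hrcon):
    """Build the 7 columns described above from two non-negative rate arrays."""
    con = np.asarray(con, dtype=float)
    hrcon = np.asarray(hrcon, dtype=float)
    return np.column_stack([
        con, hrcon,                    # the two original parameters
        np.sqrt(con), np.sqrt(hrcon),  # square roots
        con ** 2, hrcon ** 2,          # squares
        con * hrcon,                   # product of the two
    ])

X = expand([0.70, 0.75, 0.80], [0.02, 0.05, 0.10])
print(X.shape)
```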

***

Colin: I’m all talked out here, so I think the Wizard of Oz should tell us what we’re dealing with.


#32    Tangotiger      (see all posts) 2011/05/25 (Wed) @ 14:45

The note of Oz was in reference to pulling back the curtain.  Hopefully, no one took it otherwise.


#33    Guy      (see all posts) 2011/05/25 (Wed) @ 14:52

“When you add additional variables to the model and see the R2 improve by that little, are you seeing a meaningful relationship or are you causing the regression to overfit by introducing too many variables?”

That’s very likely.  None of the A variable coefficients are even close to being statistically significant. Nor is BACON_PRIOR.  Only contact rate and HRCON are significant.

And I agree, it’s time for Colin to show us how the bacon was made.....


#34    Tangotiger      (see all posts) 2011/05/25 (Wed) @ 15:00

Guy: damn, I love your pun much better!


#35    Colin Wyers      (see all posts) 2011/05/25 (Wed) @ 15:06

Set A is batted ball data from Retrosheet, from 1996 through 1999. Set B was produced by a series of regressions created using out of sample data (I split the dataset in half, based on whether or not the player’s ID number in the database was even or odd). Only data from the _PRIOR set was input into the regressions. In order to find LD rate… well, Guy has already figured that one out. GB rate was primarily derived using:

GO/(GO+AO+HR)

Where air outs was all non-grounder BIP outs. A smattering of BABIP was included. Popup rate was derived similarly, and fly ball rate was just 1 - GB - LD - PU.
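
A sketch of the two pieces that are fully specified here. The actual GB and popup regressions also mixed in some BABIP and their coefficients are not given, so only the stated ratio and the fly-ball remainder are shown; function names and sample numbers are hypothetical.

```python
def gb_proxy(ground_outs, air_outs, home_runs):
    """GO / (GO + AO + HR), the main input named for the derived GB rate."""
    return ground_outs / (ground_outs + air_outs + home_runs)

def fly_ball_rate(gb_rate, ld_rate, pu_rate):
    """Fly balls are whatever share is left over: 1 - GB - LD - PU."""
    return 1.0 - gb_rate - ld_rate - pu_rate

# Made-up inputs for illustration only.
print(gb_proxy(20, 25, 5), round(fly_ball_rate(0.25, 0.26, 0.08), 2))
```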


#36    Tangotiger      (see all posts) 2011/05/25 (Wed) @ 15:24

Colin, I had a hard time following that.  Can you walk through an example using this player:
3d5d9fbf316315e4

That’s Mark McGwire 1998 from what I can tell.  You took his first 100 PA, of which 53 were contacted.  You took the Retrosheet data, which was 14 GB, 14 LD, 14 FB, and 11 Popups.  That’s dataset A.

Now, for dataset B, you derived it as: 13 GB, 14 LD, 22 FB, and 4 Popups. 

Essentially, you move 7 pops to flyballs. So, what did you do to go from 11 pops to 4?  Did you take the fact that he had 10 HR after 100 PA, and presume that such a hitter would only get 4 pops, not 11?

If so, I like the idea that you can recreate a batted ball profile that is more consistent with his actual performance.  But how far can you go with that?  Were you hoping that by recasting the line, that we might be able to learn more about the hitter?  That’s a good attempt if that’s the case.

As it turns out though, we don’t learn much more than just relying on the most basic of stats.


#37    Colin Wyers      (see all posts) 2011/05/25 (Wed) @ 15:36

I didn’t look at “actual” popup rate at all; I simply took HR_CON, GO/(GO+AO+HR) and BABIP to come up with “popup” rates.


#38    BenJ      (see all posts) 2011/05/25 (Wed) @ 15:42

Colin,

I derived full season H and HR totals from the Prior and After rates, and you already gave us PA.  From those three, it’s pretty easy to tell which players and seasons these are. 

Interestingly, there are a number of other players that are seemingly missing.  McGwire, Sosa, and Griffey’s 1998 seasons are all there, but Greg Vaughn’s 50 HR/156 H line from that season isn’t, as far as I can tell.  Also looks like Tony Gwynn’s 220 hits in 1997 is missing, though Lance Johnson’s 227 and Molitor’s 225 (both in ‘97) are there. 

Shouldn’t there be more than just the 825 players included in your file?  Maybe you addressed this before.  Am I missing something?


#39    Colin Wyers      (see all posts) 2011/05/25 (Wed) @ 15:44

Ben, I split the sample in half, so that I would have one data set for me to experiment with, and then another data set to share.


#40    BenJ      (see all posts) 2011/05/25 (Wed) @ 15:49

Gotcha.  I wonder about the correlations above on the other half?


#41    Tangotiger      (see all posts) 2011/05/25 (Wed) @ 15:54

Colin/37: right, that’s what I said.  You used a player’s profile (HR, CON, BABIP) and came out with the appropriate batted ball profile that would match that.

And, you gain a little bit by doing that.

I showed r=.60 by simply doing a linear regression using HR and CON only.  And you increased it to .61 by creating an implied profile.

At the same time, I showed r=.61 by using the actual profile.

Finally, I showed r=.62 by using both the implied profile and actual profile (in addition of course to the base performance numbers of HR and CON).

So, you are probably adding value by creating that implied profile.  But the gain is very slight, and not what you’d want for the effort of doing it.

I like the basic idea of it though.


#42    Guy      (see all posts) 2011/05/25 (Wed) @ 15:59

“I like the idea that you can recreate a batted ball profile that is more consistent with his actual performance.  But how far can you go with that?  Were you hoping that by recasting the line, that we might be able to learn more about the hitter?  That’s a good attempt if that’s the case.”

I don’t see how creating virtual batted ball data could ever be helpful.  The theoretical value of such data would have to be precisely because it is NOT what we expect given the hitter’s wOBA, and as such contains some additional “true” information about the hitter’s current talent.  If we create the “right” distribution based on the hitter’s wOBA to date, then it will just give us the same prediction as the wOBA itself.

Imagine you created virtual FIP stats for pitchers after their first 50 IP, working backward from their observed ERA.  That can’t possibly give you a better prediction of future performance than just using the prior ERA alone. Right?


#43    Guy      (see all posts) 2011/05/25 (Wed) @ 16:03

“So, you are probably adding value by creating that implied profile.  But the gain is very slight, and not what you’d want for the effort of doing it.”

I agree with Colin’s suggestion this is more the result of overfitting the data.  Maybe some of the more sophisticated stats geeks here can suggest a way to test for this.  But even adding additional variables with random data will marginally improve your fit.

In fact, I think it’s logically impossible for the constructed data to truly improve the prediction, if I’m thinking about it correctly.


#44    Colin Wyers      (see all posts) 2011/05/25 (Wed) @ 16:06

“I agree with Colin’s suggestion this is more the result of overfitting the data.  Maybe some of the more sophisticated stats geeks here can suggest a way to test for this.  But even adding additional variables with random data will marginally improve your fit.”

These are good places to start:

http://en.wikipedia.org/wiki/Bayesian_information_criterion

http://en.wikipedia.org/wiki/Akaike_information_criterion


#45    Tangotiger      (see all posts) 2011/05/25 (Wed) @ 16:14

“If we create the ‘right’ distribution based on the hitter’s wOBA to date, then it will just give us the same prediction as the wOBA itself.”

What I’m thinking about is if you were to square root or square the data.  (Same with FIP)

Right now, the fitting of the regression is based on:
y = mx + b

But what if you get a better fit with:
y = mx^a + b

a is always 1 in the traditional linear regression, but it doesn’t have to be.

So, perhaps by recasting in a non-linear fashion the HR and CON parameters, we are finding a better fit to forecasting the future wOBA.

For FIP, I have no doubt that I could get a better fit using BB, HR, SO, IP in ways other than what I use.  First off, I know the conversion of wOBA to runs is something like:
(wOBA / (1-WOBA) )^1.5 times some constant

So, it would make more sense to first convert BB, HR, SO, IP into a wOBA line, and then use the above equation to turn it into an ERA.

That’s the line of thinking I have when I talk about recasting the performance numbers to try to get a better fit.
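
The wOBA-to-runs conversion mentioned above, as a small function. The comment only says "some constant", so the default of 1.0 is a placeholder assumption and the function name is hypothetical.

```python
def woba_to_runs(woba, constant=1.0, exponent=1.5):
    """(wOBA / (1 - wOBA)) ** exponent * constant, as written in the comment."""
    if not 0.0 <= woba < 1.0:
        raise ValueError("woba must be in [0, 1)")
    return (woba / (1.0 - woba)) ** exponent * constant

# The curve is non-linear: equal steps in wOBA give growing steps in runs.
for w in (0.300, 0.325, 0.350):
    print(w, round(woba_to_runs(w), 4))
```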


#46    Tangotiger      (see all posts) 2011/05/25 (Wed) @ 16:15

I should note that DIPS is in fact the better way to do it (if you use BaseRuns), and that FIP is a shortcut for DIPS.


#47    Colin Wyers      (see all posts) 2011/05/25 (Wed) @ 16:16

So let’s say I want to predict “wOBACON” (BACON+HRCON, in this case). I throw all my variables into a regression - the observed results and the A and B sets. I get an Adjusted R-squared of 0.299989.

Now let’s throw out the A and B sets. R-squared drops to 0.292495. So the additional data was making our model better, right? Probably not - the model with fewer variables performs better in BIC and Hannan-Quinn (slightly worse in AIC, I should note). So there’s at least a strong suggestion here that the number of parameters, not the content of the parameters, is what’s increasing the goodness of fit. And that’s most likely due to overfitting.
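
A sketch of this kind of comparison, assuming ordinary least squares with Gaussian errors, where AIC and BIC can be computed from the residual sum of squares up to an additive constant. The data are synthetic (one real predictor plus pure-noise columns), not Colin's file, and `info_criteria` is a hypothetical helper.

```python
import numpy as np

def info_criteria(X, y):
    """OLS with intercept; returns AIC and BIC (lower is better), up to a shared constant."""
    n = len(y)
    X1 = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    rss = float(np.sum((y - X1 @ beta) ** 2))
    k = X1.shape[1]                              # fitted coefficients, intercept included
    aic = n * np.log(rss / n) + 2 * k
    bic = n * np.log(rss / n) + k * np.log(n)
    return {"aic": float(aic), "bic": float(bic)}

# Synthetic data: one informative column and eight columns of noise.
rng = np.random.default_rng(1)
n = 825
signal = rng.normal(size=n)
noise_cols = rng.normal(size=(n, 8))
y = 0.5 * signal + rng.normal(size=n)

small = info_criteria(signal[:, None], y)
big = info_criteria(np.column_stack([signal, noise_cols]), y)
print(small)
print(big)
```

BIC charges log(n) per extra coefficient against AIC's 2, which is why the two criteria can disagree on a borderline case like the one described.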


#48    Tangotiger      (see all posts) 2011/05/25 (Wed) @ 16:27

Colin: are you saying that you could for example simply have 5 additional columns, all completely random, and in a particular situation, you could end up with a higher correlation, simply because you lucked into some random numbers for that case?

That given enough datasets, you can get r-squared = .298 or .302 or .299, such that overall you get your .292495, but you’ll have some be higher and some be lower.

I can certainly accept that.

Basically, the correlation itself has an uncertainty level around it, and so, .299989 might not be sufficiently different from .292495 to matter.


#49    Colin Wyers      (see all posts) 2011/05/25 (Wed) @ 16:40

That’s not what I’m saying. It has nothing to do with luck, or having enough data sets, or uncertainty level around the r-squared.

Again - our “kitchen sink” regression has an r-squared of 0.299989. Now let’s omit ONLY LDB_RT_PRIOR. I want to note that the kitchen sink model includes BACON_PRIOR, and that LDB_RT_PRIOR is defined as:

0.0915459 + 0.359218 * BACON_PRIOR

In other words, it is absolutely impossible for LDB_RT_PRIOR to introduce any new information into the model; the only reason Guy didn’t note a correlation of 1.000 is because of rounding to three significant digits. In other words, in Set B, line drive rate is TOTALLY USELESS as a predictor of future results once you have a player’s observed outcomes (and in fact, this is essentially why Set B ever existed in the first place, to have “line drive” data that is provably useless).

And yet? Adjusted R-squared drops to 0.295449. That’s because the model is being “overfit.”
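
Inverting the stated definition gives the regression line in the other direction, which comes out close to the slope and intercept Guy reported earlier (2.784 and -.2547). A quick check; the variable and function names are hypothetical.

```python
# Stated definition: LDB_RT_PRIOR = 0.0915459 + 0.359218 * BACON_PRIOR
INTERCEPT = 0.0915459
SLOPE = 0.359218

def ldb_from_bacon(bacon_prior):
    """Derived 'line drive rate' as an exact linear function of BACON_PRIOR."""
    return INTERCEPT + SLOPE * bacon_prior

# Solve for BACON_PRIOR to get the line Guy fitted: BACON = inv_slope * LDB + inv_intercept.
inv_slope = 1.0 / SLOPE
inv_intercept = -INTERCEPT / SLOPE
print(round(inv_slope, 3), round(inv_intercept, 4))
```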


#50    Rally      (see all posts) 2011/05/25 (Wed) @ 16:44

That sounds about right.  I could create a few thousand columns of random numbers, half of which should have negative correlation and half should be positive.  By chance, some of them will look like they have predictive value.

Out of the sample though (and I see Colin held back 1/2 of this group) those variables won’t predict anything.  Or at least are very unlikely to - the columns could get super lucky twice.


#51    Tangotiger      (see all posts) 2011/05/25 (Wed) @ 17:06

“And yet? Adjusted R-squared drops to 0.295449. That’s because the model is being ‘overfit.’”

I’m surprised.  I would expect the correlation to be the same.  Does this have to do with “degrees of freedom”?

This is my dataset:

1    2    3    4     101 
2    3    4    5     102 
3    4    5    6     103 
4    5    6    7     106 
5    6    7    8     105 
6    7    8    9     104 
7    8    9    10     107 
8    9    10    11     108 
9    10    11    12     109 
10    11    12    13     110

The first 4 columns are all dependent on each other.  The last one is what we regress against.

Now, whether I use 1 of the first 4, or all 4, I get R Square 0.905381084.

That’s what I’d expect.

But, the Adjusted R Square of the one with 4 parameters is 0.518553719 and with one parameter is 0.893553719.

I have no idea what an “adjusted R square” is according to Excel’s Data Analysis package.  I presume that the degrees of freedom is causing the issue here?
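
For reference, the textbook adjustment is 1 - (1 - R²)(n - 1)/(n - p - 1), where p is the number of predictors. A sketch applying it to the first column of the dataset above; it reproduces the one-parameter figure. With p = 4 the same formula gives about 0.83, so the 0.5186 that Excel printed must come from some further handling of the redundant columns that this sketch does not attempt to model.

```python
import numpy as np

x = np.arange(1, 11, dtype=float)   # first column of the dataset above
y = np.array([101, 102, 103, 106, 105, 104, 107, 108, 109, 110], dtype=float)

r2 = float(np.corrcoef(x, y)[0, 1] ** 2)   # R^2 of a one-predictor linear fit

def adjusted_r2(r2, n, p):
    """Standard adjustment: 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

print(round(r2, 9), round(adjusted_r2(r2, len(y), 1), 9))
```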


#52    MAH      (see all posts) 2011/05/25 (Wed) @ 17:10

Colin, I haven’t had time to follow/analyze this thread closely, but what is your bottom line as to the usefulness of the A data, that is the Retrosheet trajectory data?


#53    Colin Wyers      (see all posts) 2011/05/25 (Wed) @ 17:15

Difference between r-squared and adjusted r-squared:

http://www.people.vcu.edu/~nhenry/Rsq.htm


#54    Colin Wyers      (see all posts) 2011/05/25 (Wed) @ 17:18

MAH, I would say that the Retrosheet trajectory data has not shown a meaningful ability to help us predict future events, relative to simply using observed results.


#55    Guy      (see all posts) 2011/05/25 (Wed) @ 17:19

To me, the potential usefulness of this data—emphasis on potential—is in identifying a change in a player’s approach (or perhaps health).  A lot is being made now of Jeter’s increased GB%, for example.  It seems possible to me that a sharp change in GB% might tell us, with only a fairly small sample, that Jeter is doing something different (or pitchers are doing something different to him).  Now, hitting more GBs is not necessarily bad.  But if such a change is accompanied by a drop in overall performance, it might mean that the performance change is more likely than usual to reflect a talent change.


#56    Rally      (see all posts) 2011/05/25 (Wed) @ 17:23

Tango,

If you just use the same value multiple times you’ll get the same r^2.  I think transforming the data through a function allows it to vary a bit, similar to how x and x^2 can both be used in a regression equation.


#57          (see all posts) 2011/05/25 (Wed) @ 17:23

You will never decrease your R-squared by adding more variables even if the variables are completely random.  That is why the Akaike information criterion, BIC, and others like it give a penalty for adding parameters.  At the very least, adding completely random parameters will keep your R-squared the same as before.

I work in the insurance industry, and with some of the liability losses so volatile, modelers are happy with models with R-squareds less than .05.  All the modeling best practices I’ve seen include a hold out sample to test your model on the way Colin has done here.  Actually, modelers will use something like 40-40-20 with the data.  40% to find what parameters to use, 40% to find coefficients, and 20% as a final how did we do test.

A lot of models in the industry are tested with lift curves.  Essentially predicted results are ranked and put into 5, 10, etc bins and then look at how much separation there is between the lowest ranked policies and the highest ranked policies in the hold out, or testing, data.  You can get a single number as a result by taking the area under the curve if that’s what you require.  It is also helpful if you don’t care about a certain portion of the population (e.g. this could be used because the top 60% of players will play no matter what and you just want to find separation between the bottom 40% to see who deserves playing time, so you would only pay attention to that portion of the graph to see if there is separation).
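
A rough sketch of the two ideas in this comment: a 40-40-20 split and a binned lift table. The helper names are hypothetical, and the split is a simple random shuffle, which is only one of several reasonable ways to do it.

```python
import numpy as np

def split_40_40_20(n, seed=0):
    """Shuffle row indices and cut them 40% / 40% / 20%."""
    idx = np.random.default_rng(seed).permutation(n)
    a, b = int(0.4 * n), int(0.8 * n)
    return idx[:a], idx[a:b], idx[b:]

def lift_table(predicted, actual, bins=5):
    """Mean actual outcome per bin after ranking rows by predicted value (lowest bin first)."""
    order = np.argsort(predicted)
    groups = np.array_split(np.asarray(actual, dtype=float)[order], bins)
    return [float(g.mean()) for g in groups]

select_idx, fit_idx, test_idx = split_40_40_20(825)
print(len(select_idx), len(fit_idx), len(test_idx))
```

A useful model shows the bin means rising steadily from the lowest-ranked bin to the highest on the held-out rows.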


#58    Colin Wyers      (see all posts) 2011/05/25 (Wed) @ 17:33

Think of it this way, Tango. R-squared tells you how much of the variance your model explains, right? When you start adding too many regression coefficients, the regression starts “explaining” the random variance in addition to the true variance. That’s why it’s overfitting - you’re starting to explain things based on limitations of your sample, not a real relationship between variables.


#59    Tangotiger      (see all posts) 2011/05/25 (Wed) @ 18:13

But those random parameters should just get a coefficient of 0, and therefore, won’t lower the r-squared.

***

Rally, run a regression of this:
y = mx + b
x y
1 1
2 4
3 9
4 16
5 25
6 36
7 49
8 64
9 81
10 100

I get this:
y = 11 x - 22
r-squared = 0.9498

Now, run a regression against this:
y = mx^2 + b

r-squared = 1.000
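
The same two fits can be checked with NumPy on the ten points listed above; this is a sketch, and `np.polyfit` is simply one convenient way to get the straight-line coefficients.

```python
import numpy as np

x = np.arange(1, 11, dtype=float)
y = x ** 2

slope, intercept = np.polyfit(x, y, 1)                 # straight line in x
r2_linear = float(np.corrcoef(x, y)[0, 1] ** 2)
r2_square = float(np.corrcoef(x ** 2, y)[0, 1] ** 2)   # straight line in x^2

print(round(slope, 3), round(intercept, 3), round(r2_linear, 4), round(r2_square, 4))
```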


#60          (see all posts) 2011/05/25 (Wed) @ 18:29

The random variables “should” get a coefficient of 0, but sometimes they don’t because the randoms happen to line up.  Kentucky wins the basketball national championship “should” get a coefficient of 0 for predicting whether the Yankees win the World Series, but sometimes random things align.

Using AIC, BIC, or adjusted R-squared will give a lower value for models with superfluous variables because they would say the random variables aren’t worth the complexity.  It’s the principle of parsimony.


#61    Tangotiger      (see all posts) 2011/05/25 (Wed) @ 18:51

Right, I agree that they could get something other than 0.  The point is that the r-squared can’t go down.  The whole point is to maximize the r-squared, and so, it’ll fit whatever it can to do that.

If adding a parameter DECREASES the r-squared, then you may as well set the coefficient to 0, since that will automatically revert it back to the higher r-squared.
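
Here is a rough sketch of that claim on simulated data, in plain Python. Everything in it is an assumption made for illustration: the helper names solve and ols_r2, the seed, the sample size, and the five pure-noise "junk" predictors. It fits the model by solving the normal equations directly, which is fine for a toy case like this.

```python
import random


def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        # Partial pivoting: move the largest remaining entry onto the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def ols_r2(columns, y):
    """R-squared of y regressed on an intercept plus the given predictor columns."""
    design = [[1.0] * len(y)] + columns
    xtx = [[sum(u * v for u, v in zip(ci, cj)) for cj in design] for ci in design]
    xty = [sum(u * v for u, v in zip(ci, y)) for ci in design]
    beta = solve(xtx, xty)
    fitted = [sum(b * c[i] for b, c in zip(beta, design)) for i in range(len(y))]
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot


rng = random.Random(2011)
n = 60
x = [rng.gauss(0, 1) for _ in range(n)]
y = [2 * xi + rng.gauss(0, 1) for xi in x]           # one real predictor plus noise
junk = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(5)]  # unrelated to y

# R-squared with the real predictor plus 0, 1, ..., 5 junk predictors.
r2_path = [ols_r2([x] + junk[:k], y) for k in range(6)]
print([round(v, 4) for v in r2_path])
```

The in-sample r-squared should never step down as junk columns are added, since the fit can always fall back on a coefficient of 0 for the new column.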


#62          (see all posts) 2011/05/25 (Wed) @ 19:02

If adding a parameter DECREASES the r-squared

Except that doesn’t ever happen.  Adding a parameter always increases the r-squared, even if only by a very small amount.  There is no such thing as a parameter that decreases the r-squared.


#63    Tangotiger      (see all posts) 2011/05/25 (Wed) @ 19:59

Mike: read Colin/49.

Seems to me that the “Adjusted” r-squared may decrease, but the actual r-squared won’t.  So, I agree with you Mike, and was surprised by Colin/49.  That’s why I’m trying to figure out how that happens, and I’m presuming it’s some “adjusted” or other thing that’s happening.


#64    Colin Wyers      (see all posts) 2011/05/25 (Wed) @ 20:03

Tango, there are only two ways you’re ever going to come across a regression coefficient of zero. One is if you have two predictors that are correlated at one, or a series of predictors that correlate with another predictor at one. That’s either a singularity or multicollinearity:

http://dss.princeton.edu/online_help/analysis/regression_intro.htm

The other is if your regression model has an R2 of 1 without the use of the variable that’s being zeroed out.

Now, is it theoretically possible you could find two data sets with ABS(R) = 0.000000000000000000? Yes. Are you ever going to? I rather doubt it.

And of course in this case there are significantly non-zero correlations anyway.


#65    Colin Wyers      (see all posts) 2011/05/25 (Wed) @ 20:04

Tango, adjusted r-squared will never decrease with an additional variable, it just may not INCREASE with an additional variable.


#66    Tangotiger      (see all posts) 2011/05/25 (Wed) @ 20:09

Isn’t Colin/49 saying that the r-squared decreased with the addition of the extra parameter?  Even though that parameter was correlated exactly with another parameter?

So, that parameter should have been given a coefficient of 0, and the r-squared should have been untouched.

I don’t understand therefore why the r-squared went down.  I don’t understand how this can be true:

“And yet? Adjusted R-squared drops to 0.295449. That’s because the model is being ‘overfit.’”

Why would the parameter not have been given a coefficient of 0, since Colin noted that it added zero new information?


#67    Colin Wyers      (see all posts) 2011/05/25 (Wed) @ 20:14

Ah, I see, you misunderstood me. The R-squared dropped after I removed (not added) the variable from the model.


#68    tangotiger      (see all posts) 2011/05/25 (Wed) @ 20:23

Ok, I thought I was going crazy.

So, we agree that as we add parameters, the r-squared can only go up.

Now, Colin is saying he added a new parameter, which, for all intents and purposes, was exactly the same as an existing parameter.  That new parameter should have been given a coefficient of 0, leaving the r-squared untouched.

Instead, the r-squared went up and a non-zero coefficient was given.

And the reason is some “overfit” that I have no idea what it means.

***

In addition, when I ran my test, I showed that the r-squared remained unchanged when I kept adding the same parameter, but the “adjusted” r-squared went down.

***

Everyone who thinks all of this is mathematical bullsh!t, please raise your hand!


#69    Tangotiger      (see all posts) 2011/05/25 (Wed) @ 20:24

“the r-squared can only go up”

or stay flat…


#70          (see all posts) 2011/05/25 (Wed) @ 20:37

Adjusted R-squared can go down by adding another parameter.

From the link in Colin/53

Specifically, if the t-ratio for a predictor is less than one, dropping that predictor from the model will increase the adjusted R-squared.

The R-squared CANNOT decrease by adding another parameter.  The adjusted R-squared CAN decrease if the t-ratio of the variable is less than one.  This is because adjusted R^2 uses degrees of freedom.
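
To show where the degrees of freedom come in, here is a small sketch of the usual adjusted R-squared formula, 1 - (1 - R^2)(n - 1)/(n - p - 1), with n observations and p predictors (not counting the intercept). The function name adjusted_r2 and the example inputs (an r-squared of 0.30 on 303 observations) are assumptions for illustration only, not numbers taken from Colin's regressions.

```python
def adjusted_r2(r2, n_obs, n_predictors):
    """Adjusted R-squared for a model with an intercept and n_predictors predictors."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)


# Hypothetical case: adding a redundant predictor leaves the plain r-squared
# unchanged at 0.30, but costs one more degree of freedom.
before = adjusted_r2(0.30, n_obs=303, n_predictors=1)
after = adjusted_r2(0.30, n_obs=303, n_predictors=2)
print(before, after)
```

With the plain r-squared held flat, the extra parameter can only pull the adjusted figure down, which matches the behavior described when the same parameter was added repeatedly.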


#71    Tangotiger      (see all posts) 2011/05/25 (Wed) @ 20:42

Hence Tango/51 is confirmed… Good guess on my part.


#72          (see all posts) 2011/05/25 (Wed) @ 21:16

“Overfitting” is when you have more variables than you need.  If I ran an 825 variable regression on WOBACON_PRIOR on Colin’s mystery data (one for each player) then I would get an R-squared of 1.000.  However, this would not get an R-squared of 1.000 for WOBACON_AFTER unless every player performed exactly at their true talent every period.

If I did a one parameter model using the average of WOBACON_PRIOR, this model would also not be predictive of WOBACON_AFTER unless everyone was exactly the same.  The most predictive model is somewhere in between, even though it has a lower R-squared than the 825 parameter model and is more complex than the 1 parameter model.  Adjusted R-squared, Akaike Information Criterion (AIC), Bayesian Information Criterion (BIC), and others like them are different ways to measure what the best number of parameters is.
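
A simulation sketch of the two extremes, in plain Python. All of the numbers are invented (825 simulated players, a talent spread and a noise spread of 0.030 around 0.370); this is not Colin's data. The "one parameter per player" model is represented by simply predicting each player's own prior value, which is what 825 player dummies would fit. The halfway-regressed prediction in the middle is my own stand-in for "somewhere in between," not a method anyone in the thread proposed, and halfway is only sensible here because talent and noise were given equal spreads.

```python
import random


def r2(actual, predicted):
    """1 - SS_res / SS_tot, with SS_tot taken around the mean of actual."""
    mean_a = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_a) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


rng = random.Random(825)
n_players = 825
talent = [rng.gauss(0.370, 0.030) for _ in range(n_players)]   # hypothetical true talent
prior = [t + rng.gauss(0, 0.030) for t in talent]              # observed, period 1
after = [t + rng.gauss(0, 0.030) for t in talent]              # observed, period 2

league_mean = sum(prior) / n_players

memorize = prior                                   # one parameter per player
mean_only = [league_mean] * n_players              # one parameter in total
halfway = [league_mean + 0.5 * (p - league_mean) for p in prior]  # illustrative middle

for name, pred in [("825 params", memorize), ("1 param", mean_only), ("halfway", halfway)]:
    print(name, round(r2(prior, pred), 3), round(r2(after, pred), 3))
```

The first column is the fit to the period the model was built on, and the second is how well the same predictions carry over to the later period.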


#73    Tangotiger      (see all posts) 2011/05/27 (Fri) @ 07:03

Thanks for the explanation!


#74    Guy      (see all posts) 2011/05/27 (Fri) @ 09:40

So, is Colin going to tell us what lessons he thinks we should draw from this exercise?


#75    Tangotiger      (see all posts) 2011/05/27 (Fri) @ 09:55

I think I learned the most from Guy’s insight back in post 23.

If you read posts 23 to 27, this is a huge indictment against batted ball data, and a huge point in favor of simply looking at actual performance results, with the focus being on splitting up that performance data.

Really, this brings us all the way back to Voros’ original insights that led to DIPS, and that is, to break up the performance results into binary outcomes.  Did he make contact?  If he made contact, did he hit a HR?

Just answering those two questions is enough to get us almost all the way there.


#76    Colin Wyers      (see all posts) 2011/05/27 (Fri) @ 10:08

Guy, I’m planning on writing up something longer for BP next week about the exercise. I do want to note that, while I was pulling the strings a bit, I did post this as a learning opportunity for me as much as anything (you can’t run a blind test on yourself, obviously).

I think the big takeaway for me is when Tango said:

Given the choice between one or the other, then I would definitely choose B.

B was real results, A was observational data. I think sabermetricians should be more forceful in saying that, given the choice between the two, we should choose results over description - and that to use description at all, we need to show that it significantly improves upon what we can get by using results alone.


#77    Tangotiger      (see all posts) 2011/05/27 (Fri) @ 10:20

I agree that, given only 100 contacted PA (about 75 BIP), the non-HR results were slightly more important than the “description”.  As I said, in that case, it was about a 60/40 split.

Presumably, if I had say 50 BIP, then I’d have a 50/50 split between results and “description”.

So, it seems that it’s around that point where non-HR results take over from description.

But including the HR really upsets the balance here, as it totally changes the dynamic.  The correlation really shoots up when you just know that.


#78    Tangotiger      (see all posts) 2011/05/27 (Fri) @ 10:21

By the way, this is for hitters only.

If you do this for pitchers, we’ll get a much different conclusion.

If this exercise was Colin’s attempt to stop the proliferation of trying to DIPS-ize hitters, then that’s a welcome exercise.


#79    Guy      (see all posts) 2011/05/27 (Fri) @ 10:38

If this exercise was Colin’s attempt to stop the proliferation of trying to DIPS-ize hitters, then that’s a welcome exercise.

I strongly second this (whether or not it was Colin’s intent).  While DIPS contained important insights, it has had three deleterious impacts on saber thinking over time.  One is an excessive infatuation with components, as opposed to aggregate outcomes.  Sometimes the components can tell you more, but there are also serious sample size problems that sometimes get ignored because of excessive faith that component results are “real.” Second, and related to the first, is a tendency to treat components in a dichotomous way as either “luck” or “skill.” In fact, all are on a continuum.  Pitcher HR rates have luck, and BABIP has some skill.  And third (less relevant here) is the idea that y-t-y correlation rates are a measure of how “real” a skill is.


#80    Guy      (see all posts) 2011/05/27 (Fri) @ 11:40

B was real results, A was observational data. I think sabermetricians should be more forceful in saying that, given the choice between the two, we should choose results over description

While I’m broadly sympathetic to Colin’s concerns about observational data, I think this poses too stark a choice.  Why do we have to make a choice?  I don’t think many people would say we should weight BIP-type data 100% and outcomes at 0%.  I guess JCB’s “PROPS” metric did that, and maybe some of the DIPS-type metrics do as well (I’m not sure).  But you should be confronting the strongest case for this data, not the weakest.  And the strong case is Tango’s:  that it adds SOME additional information beyond raw outcomes, and should receive SOME weight. 

It seems clear that stringers make reasonably consistent distinctions between GBs and non-GBs, and this distinction has importance for both pitchers and hitters.  How much does that data tell us beyond outcomes?  That’s a fair question, and the answer may be “only a little.” But there’s no reason we must necessarily choose results “over” the observational data.

