
THE BOOK--Playing The Percentages In Baseball


Monday, October 24, 2011

Small sample sizitis

By Tangotiger, 10:59 AM

Someone asked me about small sample size.  I answered as follows:

==========================

It’s a fair question to ask.  Basically, the choice is presented as follows:

1. Octavio Dotel has faced Ian Kinsler 8 times in his career (and got him out 100% of the time).

2. Octavio Dotel has faced 3800 MLB batters in his career (and got them out 70% of the time).

Therefore, how much weight do we place on the 8 Kinsler PA, compared to the 3800 non-Kinsler PA?  Ian Kinsler does have something in common with everyone else: he’s an MLB player.  That is a huge commonality.  In that group of players are guys who are better hitters than Kinsler, but also guys who are quite a bit worse.

So, if you want, we can limit the 3800 batters down to, say, the 1000 batters faced who are about as good as Kinsler.

Now our choices become:
1. 8 PA against Kinsler
2. 1000 PA against guys who are about as good a hitter as Kinsler

The choice however is not either/or.  You can overweight the Dotel-Kinsler matchups, and I have NO PROBLEM with doing that.  How much do we want to overweight that?  Two times?  Five times?  Ten times?  Give me a number.

So, let’s say that Dotel-Kinsler tells us 10 times as much as Dotel-GoodHitter does in terms of giving us an estimate.  The 8 actual PA becomes 80 weighted PA.  You still have 80 weighted PA to add to the 1000 other PA in the pool.  That 8 actual PA is still only 7% of the conversation when weighted 10 times.
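The weighting arithmetic above can be sketched in a few lines. This is only an illustration: the function names are made up for the example, and the 70% out rate for the 1000-PA pool is an assumption borrowed from Dotel's overall career figure (the post does not give his out rate against good hitters specifically).

```python
def matchup_share(matchup_pa, pool_pa, weight):
    """Fraction of the total weighted PA that comes from the head-to-head matchup."""
    weighted = matchup_pa * weight
    return weighted / (weighted + pool_pa)


def blended_out_rate(matchup_pa, matchup_rate, pool_pa, pool_rate, weight):
    """Weighted average of the matchup out rate and the pool out rate."""
    weighted = matchup_pa * weight
    return (weighted * matchup_rate + pool_pa * pool_rate) / (weighted + pool_pa)


MATCHUP_PA = 8         # Dotel vs. Kinsler, from the post
MATCHUP_OUT_RATE = 1.0 # Kinsler was out all 8 times
POOL_PA = 1000         # Dotel vs. hitters about as good as Kinsler, from the post
POOL_OUT_RATE = 0.70   # ASSUMPTION: Dotel's overall career rate, used as a stand-in

for weight in (2, 5, 10):
    share = matchup_share(MATCHUP_PA, POOL_PA, weight)
    rate = blended_out_rate(MATCHUP_PA, MATCHUP_OUT_RATE, POOL_PA, POOL_OUT_RATE, weight)
    print(f"weight {weight:>2}x: matchup is {share:.1%} of the estimate, "
          f"blended out rate {rate:.3f}")
```

Under those assumptions, even the 10x weighting only moves the expected out rate from .700 to roughly .722.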

Not to mention the reality is that if you study it, as we have in The Book, the matchups are simply not predictive.  This is not a matter of opinion.  It’s a matter of fact. 

If someone ignores fact because they believe in their gut they are right, Colbert coined the word “truthiness” for that. I have no argument against truthiness.  By definition, those who argue based on truthiness can never be wrong.

Tom


#1    SaberTJ      (see all posts) 2011/10/24 (Mon) @ 11:24

Well said.  It’s astounding how often individuals are too lazy to check the facts of their argument, and maybe that’s because their bosses don’t either.  It seems like the story is more important than the facts.

Maybe someday announcers and writers (obviously this is not all of them) will be held accountable for their statements.  But for now, if what is said is entertaining and sparks debate that’s all the mainstream media seems to care about when it comes to sports.


#2          (see all posts) 2011/10/24 (Mon) @ 11:33

I’m old enough to remember when Earl Weaver first made a big deal of match-ups. There were 12 teams in each league then, and teams played every team in their division 18 times, and the others 12 times each year. The match-up numbers became large enough to tell something. Even Weaver said he didn’t pay much attention to them until he had 20 or more at bats to look at, but when Boog Powell was something like 2 for 61 lifetime against Mickey Lolich, it didn’t pay to write his name on the line-up card.


#3          (see all posts) 2011/10/24 (Mon) @ 12:19

Surely the key question is how big the variance is?

Take it the other way around, since the gap between an average PA and a home run is greater than the gap between an average PA and a strikeout.

How many PAs would a batter need to demonstrate that he was better against a particular pitcher than against the league as a whole, assuming that he gets a home run on every PA against the pitcher in question?

My gut suggests somewhere in the 8-10 region.

Anyone fancy doing the math?  What is the sample size required for an all-HR sample to be statistically significant at the 5% level?  Bear in mind that this requires you to adjust for the fact that the hitter will hit against lots of pitchers in a given season, so 5% of those would look significant at the 5% level in a naive test.  I’m not sufficiently qualified to know what the correct test is (chi-squared? t? z?).

Then do the math again for the hitter striking out in every PA against that pitcher.  It will be a lot more PAs before it’s statistically significant than the all-HR sample.


#4          (see all posts) 2011/10/24 (Mon) @ 12:31

Richard/3, I believe you might be looking for something like the Bonferroni Correction or Sidak Correction to handle the problem with multiple comparisons.  However, the statistical significance is really a non-issue.  We care about the effect size.  If you have enough data support, many things will be statistically significant and practically useless because the effect size is trivially small (but not zero).  In particular, we are interested in the predictive value: splits that will have meaningful effect sizes on out of sample data.
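For anyone who wants to see the arithmetic behind the question in #3, here is a rough sketch of the two corrections. It addresses only the significance question, not the effect-size question. The per-PA rates (3% HR, 18% K) and the count of 200 distinct matchups are illustrative assumptions, not figures from this thread, and the function names are invented for the example. The p-value of an all-success streak of length n under a per-PA rate p is simply p^n.

```python
def bonferroni_alpha(alpha, comparisons):
    """Per-comparison threshold under the Bonferroni correction."""
    return alpha / comparisons


def sidak_alpha(alpha, comparisons):
    """Per-comparison threshold under the Sidak correction."""
    return 1.0 - (1.0 - alpha) ** (1.0 / comparisons)


def streak_needed(event_rate, threshold):
    """Smallest n such that n straight events (probability event_rate**n) falls at or below threshold."""
    n = 1
    while event_rate ** n > threshold:
        n += 1
    return n


ALPHA = 0.05
MATCHUPS = 200   # ASSUMPTION: number of batter/pitcher pairs being screened
HR_RATE = 0.03   # ASSUMPTION: illustrative per-PA home run rate
K_RATE = 0.18    # ASSUMPTION: illustrative per-PA strikeout rate

for label, threshold in (("Bonferroni", bonferroni_alpha(ALPHA, MATCHUPS)),
                         ("Sidak", sidak_alpha(ALPHA, MATCHUPS))):
    print(f"{label}: threshold {threshold:.6f}, "
          f"all-HR streak {streak_needed(HR_RATE, threshold)} PA, "
          f"all-K streak {streak_needed(K_RATE, threshold)} PA")
```

With these made-up inputs the all-K streak has to be longer than the all-HR streak, which matches the intuition in #3, but passing the test still says nothing about how much to expect next time.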


#5          (see all posts) 2011/10/24 (Mon) @ 12:43

mickeyg13/4 The sample effects look big - 200/250 points of OBP or more if we have support for regarding the matchup sample as being dominant over the general sample.  But we don’t.

Also, we don’t have lots of data - that’s the whole point of these being small sample sizes.

My point was that it isn’t the sample size that will convince me that there is a big effect, it’s the effect.  If you’ve got a .300 OBP hitter who gets 1.000 against a particular pitcher, and that’s 20-20 then 20 PAs is plenty to convince me that he has a better read on that pitcher than average.

But if the sample is .400, then 8-20 doesn’t convince me of anything.  For a modest effect like 100 points of OBP, you need hundreds of PAs to establish that anything is happening at all.


#6    Tangotiger      (see all posts) 2011/10/24 (Mon) @ 12:49

Right, it’s not whether a NON-ZERO difference is statistically significant, but the magnitude of impact.

Even after ONE PA, you have a certain level of significance that going 1-1 is more predictive than going 0-1.  However, you’d regress 99.5% toward the mean.  So, does it really matter that the 1-1 guy would be forecasted to go .333 OBP while the 0-1 guy would be forecasted to go .328, with an uncertainty level of say +/-.030?
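The numbers in that paragraph can be reproduced with a simple shrinkage formula: add some amount of league-average performance to the observed line. This is a sketch, not the method from The Book. The 199 PA of "ballast" is back-derived from the 99.5% figure (199 / (1 + 199) = 0.995), and the .330 mean is inferred from the .328 and .333 forecasts; both are assumptions of this example.

```python
def regressed_rate(successes, trials, mean, regression_pa):
    """Shrink an observed rate toward `mean` by adding `regression_pa` PA of average performance."""
    return (successes + regression_pa * mean) / (trials + regression_pa)


MEAN_OBP = 0.330      # ASSUMPTION: inferred from the .328 / .333 forecasts
REGRESSION_PA = 199   # ASSUMPTION: chosen so that 1 PA is regressed 99.5% toward the mean

one_for_one = regressed_rate(1, 1, MEAN_OBP, REGRESSION_PA)
zero_for_one = regressed_rate(0, 1, MEAN_OBP, REGRESSION_PA)

print(f"1-1 forecast: {one_for_one:.3f}")   # 0.333
print(f"0-1 forecast: {zero_for_one:.3f}")  # 0.328
```

The two forecasts differ by 1 / 200 = .005, far inside the +/-.030 uncertainty mentioned above.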

So, you go 10-10, and all 10 hits are HR.  A test of statistical significance will NOT say that 10-10 is the true talent rate.  The test will say that the observed 10-10 shows that we have somebody who is above-average. 

But, we don’t need to know if he’s above-average.  We need to know HOW MUCH above-average.

Hence, if you have Ian Kinsler, and he happened to be 0-8 with 8 K against Dotel (he isn’t, but that’s the illustration), then what is our expectation of the next Kinsler-Dotel PA?

Well, it’s going to be that Kinsler is going to K at a higher rate than average, and Dotel will get a higher K rate than his average, but it’ll barely move the needle.

The ONLY purpose of matchup data is for tie-breaker situation.


#7    Tangotiger      (see all posts) 2011/10/24 (Mon) @ 12:57

If you’ve got a .300 OBP hitter who gets 1.000 against a particular pitcher, and that’s 20-20 then 20 PAs is plenty to convince me that he has a better read on that pitcher than average.

Yes, a better read.  We ALL agree.  The question is: HOW MUCH BETTER?

That is the ONLY question on the table here.  You have someone who is a true talent .300 who we OBSERVED at 1.000 over 20 trials.

What is our expectation of performance against this pitcher NEXT TIME?  .900?  .700?  .500?  .400?  .350?  .325?  .301?

Give us the number.  Don’t just say “somewhere between .300 and 1.000”.


#8    MGL      (see all posts) 2011/10/24 (Mon) @ 15:03

People still don’t get it.  If you have little or no predictive value in the population, as we found in the book, then, no matter what you THINK you know, and no matter what SEEMS to make sense, the sample size barely matters.

We can talk about this until we are blue in the face, and we can explain it in detail in The Book, but we still get this:

Oh yeah, I understand, but:

“The match-up numbers became large enough to tell something.”

Tango, I don’t like that you undersell the lack of predictive value we found with batter/pitcher matchups. I mean, you even went so far as to try and increase the size of the samples by using “pitcher families” (OK, not quite the same thing as facing one particular pitcher), and STILL we found nothing. 

Can we please put this to rest?  No number of PA gives us any more than de minimis value (at most - it still might be NOTHING - given that we found nothing).  Use it as a tie breaker if you want - I don’t care.  But please don’t use it otherwise as a decision maker no matter how many PA it is based upon (obviously you can’t have hundreds of PA and even then, there is ZERO evidence that that would have predictive value).

Guys: If 50 or 100 PA have any practical predictive value, then we would find it in 20 or 30 PA.  We didn’t.  We found nothing. Nothing. N-O-T-H-I-N-G.

So, please, pretty please, stop telling us about Earl Weaver and your grandmother who are so smart that they wait for 20 or 30 PA. We looked at lots of guys who had 20 or 30 PA and found zilch. Nada.

We tell you (in The Book) how much to regress clutch (a lot) given a certain sample size.  We tell you how much to regress pitcher BABIP.  We tell you how much to regress windup/stretch splits.  We tell you how much to regress RHB platoon splits.  All of these are a lot even with a fairly large sample size. 

We didn’t tell you how much to regress batter/pitcher matchups.  Do you know why?  Because we found NO predictive value, i.e., no “skill”, i.e., all the various unusual results you see are likely due to random fluctuation, at ANY sample size.

And again, for the 10 thousandth time, you cannot use the argument, “You found nothing at 30 PA, but at 50 PA, there IS something.” If you find nothing at 30 PA, as long as your sample of players is reasonably large, then there is nothing or almost nothing at 50 PA or 100 PA.

Do we know all of this for a fact - i.e. with 100% certainty?  No!  We know virtually nothing for a fact when they are based on inferences derived from sample data, which almost all sabermetric tenets are…


#9    Tangotiger      (see all posts) 2011/10/24 (Mon) @ 15:29

I think the only way for me to sell it is to acknowledge that you CAN overweight, but even severely overweighting will barely make a difference anyway.

***

Dan Fox did something here:

http://www.hardballtimes.com/main/article/tony-larussa-and-the-search-for-significance/

His #1 guy from 2003-2005 with the most at bats, and a p-value under .05 in terms of seeing something was Jeter v Rodrigo Lopez, being 19 hits in 40 AB.

Jeter faced Lopez in 21 AB outside of that sample, and he got 7 hits.  Jeter going 7-21 sounds about normal to me.

#2 was Jack Wilson(!) owning Ben Sheets (!!), where he was 13-32 in that period.  Outside of that sample, Wilson faced Sheets for another 32 at bats, and he got 7 other hits.  Wilson going 7 for 32 (against a super pitcher like Sheets) sounds normal to me.

#3 in Dan’s list was Todd Helton against Odalis Perez (16-31).  In the other 12 at bats outside that sample, he got 4 hits.  Helton going 4 for 12 sounds normal to me.
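Putting the three examples above side by side, using only the hit and at-bat counts quoted in this comment (the helper names are just for illustration):

```python
# (hits, at bats) inside Dan's 2003-2005 sample, then outside it, as quoted above
matchups = {
    "Jeter vs. Lopez": ((19, 40), (7, 21)),
    "Wilson vs. Sheets": ((13, 32), (7, 32)),
    "Helton vs. Perez": ((16, 31), (4, 12)),
}


def batting_average(hits, at_bats):
    """Plain hits divided by at bats."""
    return hits / at_bats


def pooled_average(lines):
    """Batting average over several (hits, at_bats) lines combined."""
    total_hits = sum(hits for hits, _ in lines)
    total_ab = sum(at_bats for _, at_bats in lines)
    return total_hits / total_ab


for name, (in_sample, out_sample) in matchups.items():
    print(f"{name}: {batting_average(*in_sample):.3f} in sample, "
          f"{batting_average(*out_sample):.3f} outside it")

pooled_in = pooled_average([pair[0] for pair in matchups.values()])
pooled_out = pooled_average([pair[1] for pair in matchups.values()])
print(f"Pooled: {pooled_in:.3f} in sample, {pooled_out:.3f} outside it")
```

Three matchups is a tiny check, of course; it only illustrates the kind of in-sample versus out-of-sample comparison being described.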

I mean, we can go on and on here.  And I DID do the same thing in The Book, looking at extreme performances for 1999-2001, and seeing how they did in 2002.  And we saw no carryover effect.

What is needed here is less theory and more roll-up-the-sleeves, because that’s the only way for people to buy this notion.

It’s not like I WANTED to prove this to be true.  I went into The Book not knowing what I would find.


#10    Mr. Red      (see all posts) 2011/10/24 (Mon) @ 15:59

Here’s a hypothetical situation. The opposing team has Ace Pitcher on the mound. Ace is a lefty and is death on lefty hitters. Your starting left fielder is Lance Berkman, who plays subpar defense and sports a career .341 wOBA against lefties. You have two right-handed players on the bench, Player A and Player B. Both are average defensive players, and both have a career .350 wOBA against lefties. Player A is 9 for 20 with 2 walks, a double and a home run against Ace Pitcher. Player B is 2 for 16 with 1 walk against Ace Pitcher. Do you flip a coin as to which player you start over Berkman? Or do you go with Player A?


#11    Tangotiger      (see all posts) 2011/10/24 (Mon) @ 16:06

Mr. Red: do you understand what I mean when I say:

“The ONLY purpose of matchup data is for tie-breaker situation. “

In your scenario, you said that Player A and Player B have the same overall performance stats and fielding talent.  And the ONLY distinguishing characteristic between the two is that one did better historically against the pitcher in question.

So, I ALREADY pre-answered your question.


#12          (see all posts) 2011/10/24 (Mon) @ 16:13

Isn’t part of the problem with accumulating a sample size big enough to find any meaningful effect that, over that time period, batters and pitchers can adjust their approaches against one another?

If a pitcher’s getting me out every time on sliders in the dirt, I’m going to start looking for sliders in the dirt from him and learn to lay off of them.

If I find that I can’t climb the ladder with high fastballs with this batter, and I keep walking him as a result, I’m going to switch to a different strategy.

Same thing would apply more generally to groups of similar batters and similar pitchers.

Those who can’t make adjustments are going to find themselves shortly out of the league and out of the sample, unless their talent level is worlds better to begin with.

This argument about why we might not be able to detect batter-pitcher matchup skill is, of course, at the same time an argument against the usefulness of that kind of data for future prediction.


#13    Mr. Red      (see all posts) 2011/10/24 (Mon) @ 16:33

Tango/#11

My apologies, I must have missed the end of comment #6.


#14    MGL      (see all posts) 2011/10/24 (Mon) @ 16:40

“This argument about why we might not be able to detect batter-pitcher matchup skill is, of course, at the same time an argument against the usefulness of that kind of data for future prediction.”

Exactly.  It might be that the in-sample results were meaningful but that batters and pitchers do adjust. Who knows?  In any case, if it has no predictive value we probably don’t care why.

I guess I shouldn’t get annoyed when people bring these issues up time and time again.  They probably have not read The Book (what are they waiting for?) and they probably have not read these threads in the past…


#15    MGL      (see all posts) 2011/10/24 (Mon) @ 16:42

One of these days I am going to write a Primer for MLB managers - seriously.

Chapter X:  “Mr. La Russa (as a metaphor for most managers), throw those index cards away!”


#16    Tangotiger      (see all posts) 2011/10/24 (Mon) @ 16:44

Red: thanks. 

I can’t tell if people read what I say, and ignore it; or they don’t take the time to read what I say; or they do take the time to read, but then it somehow slips by them.


#17    GeraldY      (see all posts) 2011/10/24 (Mon) @ 17:01

“It is difficult to get a man to understand something, when his salary depends upon his not understanding it!”
---Upton Sinclair

Here, substitute “scouts and announcers” for “man.”


#18    Bill      (see all posts) 2011/10/24 (Mon) @ 20:18

While I don’t disagree with Tango/MGL’s conclusions, GIVEN the approach they took in The Book, I think the supreme level of confidence expressed – virtual certitude – is unwarranted from a statistical standpoint.

Think about what you guys are doing: you are building your entire edifice on the notion of independent draws from a single population. That is your prior – there’s a single population from which you are drawing statistical conclusions (the variance of a binomial or multinomial population, etc.) assuming independence, and if you do not find enough variation in the population, then you proceed no further because you can’t distinguish signal from noise in order to tease out skill differences.

But on the face of it, it is NOT TRUE that your observed data reflect independent draws. In fact, there is almost certainly correlation in the data that you are not accounting for. When you are looking at, say, 5 seasons worth of data points for each of 200 players, you do not have 1000 independent data points, which is what your statistical approach assumes. You likely have both “within” correlation for each player across time, as well as “between” correlation (between the players) in each season. I guess you may strongly believe that these correlations somehow balance out, but that is certainly not a given.

There are many ways one could try to put some structure on the data that accounts (in different ways) for these types of correlations, so no, I don’t think there is a single “right” answer about the correct way to proceed. This is just another way of saying that ALL statistical approaches rely on some initial assumptions that may or may not be valid.

I just wish you guys would recognize that your approach relies on such assumptions as well, and please stop telling us exactly how much we need to regress various stats – your estimates may well be quite reasonable, but then again maybe they’re not given your assumptions.


#19          (see all posts) 2011/10/24 (Mon) @ 20:55

Andy did most of the statistical work in The Book. I am not qualified to address your issues.  I am not a statistician.  Andy is. He is probably on the same level as you are (maybe more, maybe less, I have no idea).

That being said, you are preaching to the choir about “certitude”.  My tone only is a reflection of my exasperation when I read things like, “Yeah, but what about 40 PA?”

I am afraid that no matter how you slice it, and I don’t need advanced statistical acumen to realize that, if you look at a bunch of players in one time frame who had ridiculously anomalous stats against a certain pitcher (in around 20-30 PA) and then you look at those same matchups in another time period, and you find that collectively they hit at a “normal” level, there is only one conclusion that can be drawn.  The certitude of that conclusion is only limited by the standard error of whatever statistical test was done, i.e., the sample size (both the number of players and the PA per player) in the in-sample and out of sample data…


#20    Bill      (see all posts) 2011/10/25 (Tue) @ 00:05

Yes, no disagreement there about any stats based on a small sample size—tells you nothing.


#21          (see all posts) 2011/10/25 (Tue) @ 14:21

I like to conceptualize it as a two times weighting for purely theoretical reasons. For instance, let’s say we have Kinsler’s 8 at bats and Robi Cano also has 8 at bats. Which player’s at bats are more likely to indicate what we should expect of a Kinsler/Dotel matchup? Kinsler, right?

Theoretically, a single at bat from Ian Kinsler will tell you more about Ian Kinsler’s next at bat than any at bat from any other player.

This is all a confusing way of agreeing with Tom, though: attaching meaning to small sample statistics is silly. It’s taking a theoretically sound idea and completely perverting it beyond recognition. Furthermore, these player-to-player match ups are a theoretical construct that doesn’t really work outside of theory-land. Baseball players aren’t static, and those ever-changing skill sets will always confound any effort to evaluate player-to-player match ups.


#22    Tangotiger      (see all posts) 2011/10/25 (Tue) @ 14:32

There’s no question that you want to weight a Dotel-Kinsler PA more than you’d weight a Dotel-Cano PA, to determine the result of a future Dotel-Kinsler PA.

I agree that taking this concept and overweighting the Dotel-Kinsler PA to, say, 100 times that of a Dotel-Cano PA is a ridiculous position to take without research.

You weight Dotel-Kinsler’s 8 PA as 800, and you keep the weighting of the other good hitters as a 1-for-1, so they total 1000 PA, and now, the 8 PA of Dotel-Kinsler tells you about half of what you want, and the other 1000 PA of Dotel-GoodHitter tells you the other half.

That’s one fantastic leap to weight it at 100 times.

But that is the implication here of the Small Samplistas.


#23          (see all posts) 2011/10/25 (Tue) @ 21:52

I actually can’t believe there is not a website or a simple spreadsheet that automatically calculates a player’s talent vs a specific pitcher or type of pitcher (RHP, power, etc). Someone could make a name for themselves by putting all the information in one spot.


#24    MGL      (see all posts) 2011/10/26 (Wed) @ 02:09

Jeff, that should be standard issue for all managers on all teams.  THAT is what the index cards should be used for.

All it takes is a component projection for the pitcher and batter, including platoon and G/F, and a copy of Excel, bootlegged or not.

You could even have not only each match-up but you could attach one number to them depending on the situation.  IOW, you have the component matchup numbers and then you combine them using lwts or wOBA but you use custom lwts (based on WE) for the bases/outs/score/inning.

Did I just give away the specifics of your million dollar idea?


#25          (see all posts) 2011/10/26 (Wed) @ 08:26

#24 - That is pretty much it. I am able to generally figure out the numbers myself, so I haven’t run the program myself. It is just one of those ideas that hasn’t been implemented. I see it like WAR a couple of years ago. People had an idea, but Rally came out with his WAR dataset and took it to a whole new level.


#26          (see all posts) 2011/10/26 (Wed) @ 18:13

Jeff Z/23,

Yahoo Fantasy Baseball offers a dumbed down version of this (the actual data is behind the scenes though). You can find it through the matchups links. They are a good source of comedy at times but don’t provide any useful information. They also only display them for SP matchups.

