THE BOOK--Playing The Percentages In Baseball

Monday, September 27, 2010

Secret Sauce?  No more!

By Tangotiger, 10:18 AM

Colin is doing what’s right:

I took a look at how often the team with the best sauce won each series, from ’06 (the first season the Sauce was used) through ’09, and the result was only 54% success – not significantly better, statistically speaking, than flipping a coin.... It’s possible that the Secret Sauce ran up against a few years where its performance was flukily low. But what we have is a model based on historical data that has thus far been ineffective at predicting results out-of-sample, which doesn’t give us a lot of reason to be confident in it going forward. So for now, we’re retiring the Secret Sauce.

I think Colin should have been more forceful and acknowledged that, regardless of the Secret Sauce’s out-of-sample record, it should have been retired.  As MGL noted, how is completely ignoring a team’s offense supposed to be a good thing?  The best you could have hoped for with indicators like this is that the win% would go from .500 (all other things equal) to .510, maybe .520.  With so little out-of-sample data to work with, you could never achieve anything close to a reasonable uncertainty level.
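To put rough numbers on the sample-size point, here is a minimal sketch in Python. It treats every series as an independent coin flip under the null. It also assumes 28 series for ’06 through ’09 (7 per postseason), which is an assumption and not a figure from Colin’s article, so a 54% success rate would be about 15 correct picks. The function names are made up for illustration.

```python
from math import comb, sqrt, ceil

def binom_tail(n, k, p=0.5):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def series_needed(true_p, sigmas=2.0, null_p=0.5):
    """Series needed before true_p sits `sigmas` standard errors above null_p."""
    sd_one_series = sqrt(null_p * (1 - null_p))  # SD of a single series under the null
    return ceil((sigmas * sd_one_series / (true_p - null_p)) ** 2)

# Assumed record: 15 correct picks in 28 series (about 54%).
p_value = binom_tail(28, 15)
print(round(p_value, 3))    # chance of doing at least this well by coin flip

print(series_needed(0.52))  # series needed to see a true .520 at 2 sigma
print(series_needed(0.51))  # series needed to see a true .510 at 2 sigma
```

Under these assumptions, a coin flipper matches or beats the assumed record a bit more than 40% of the time, and a true .520 indicator would need 2,500 series (10,000 for .510) before it stood 2 sigma clear of .500. At 7 series a year, that is centuries of postseasons.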

Anyway, I love that an outsider like Colin is fixing right what is wrong. 


#1    J-Doug      (see all posts) 2010/09/27 (Mon) @ 18:05

Glad to see that they noticed. I really did admire the effort, even if the result wasn’t what it should have been.


#2    MGL      (see all posts) 2010/09/27 (Mon) @ 23:51

The funny thing is that the results of such a small sample of games since they came up with the “sauce” should not necessarily invalidate the theory if it were any good in the first place.  The real issue is that it was “secret crap” then, and it is “secret crap” now.

If we look at the next 2 years and find a decent correlation between pitchers’ BABIP in one year and the next, would that invalidate the DIPS theory, which is based on much larger samples of data?

When you have out-of-sample data, it is nice to be able to test your theory against that data, since when you come up with a theory based on historical in-sample data, you run the risk of making a Type II error due to data mining or publishing bias.  The out-of-sample testing is somewhat of a check on that.

However, you have to be careful and take several things into consideration: the sample size of the out-of-sample data, the sample size of the in-sample data, the chance that the integrity of the theory was contaminated by publishing bias or data mining, and good old common sense, which is really a proxy for Bayesian a priori probabilities based on what we already know (such as the fact that the relative strengths of playoff teams’ offenses of course influence their chances of winning the series, so any “sauce” which ignores that is incomplete at best).
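One way to picture how a prior and a small out-of-sample record combine is a Beta-binomial update. The sketch below is only an illustration: the prior (centered on .510 and weighted like 500 series of belief) is entirely hypothetical, and the 15-of-28 record is an assumed stand-in for the 54% figure quoted above.

```python
def posterior_mean(prior_a, prior_b, wins, losses):
    """Mean of the Beta posterior after observing wins and losses."""
    return (prior_a + wins) / (prior_a + prior_b + wins + losses)

# Hypothetical prior: "a .510 indicator at best", held about as firmly as 500 series.
prior_a, prior_b = 255, 245
before = prior_a / (prior_a + prior_b)

# Assumed out-of-sample record: 15 correct picks in 28 series.
after = posterior_mean(prior_a, prior_b, wins=15, losses=13)

print(round(before, 4), round(after, 4))
```

With a prior that firm, 28 series barely move the estimate in either direction, which is the sense in which a small out-of-sample record should neither rescue nor sink a theory on its own.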

For those of you that don’t know what data mining and publishing bias are (they are similar)…

Data mining is this:  I look at 100 different possible effects from a sample of data.  By chance alone, of course, I am going to find about 5 that are more than 2 standard deviations from the “mean,” the mean being what I would expect if there were no actual effect.  So I go ahead and report that I found 5 effects (or 2.5 if I only look at one tail, or even just one of the 5), without saying that I looked at 100, and that the effects I found are significant at 2 sigma (2 standard deviations).  That is essentially data mining.
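The “5 out of 100” figure can be checked with a quick simulation. This is a sketch under simple assumptions: each of the 100 effects is pure normal noise with a known standard deviation of 1, and the function name and sample sizes are invented for the example.

```python
import random
from math import erf, sqrt

def false_positives(n_effects=100, n_obs=50, z_cut=2.0, rng=None):
    """Count null effects that clear z_cut purely by chance."""
    rng = rng or random.Random()
    hits = 0
    for _ in range(n_effects):
        # Every "effect" is pure noise: true mean 0, known SD 1.
        sample = [rng.gauss(0.0, 1.0) for _ in range(n_obs)]
        z = (sum(sample) / n_obs) * sqrt(n_obs)  # sample mean divided by its standard error
        if abs(z) > z_cut:
            hits += 1
    return hits

rng = random.Random(2010)  # fixed seed so the run is repeatable
runs = [false_positives(rng=rng) for _ in range(100)]
average_hits = sum(runs) / len(runs)

# Analytic check: two-sided tail area beyond 2 sigma for a normal distribution.
tail = 1 - erf(2.0 / sqrt(2))
p_any = 1 - (1 - tail) ** 100  # chance that at least one of 100 null effects "hits"

print(average_hits, round(100 * tail, 2), round(p_any, 3))
```

The analytic expectation is about 4.5 spurious hits per 100 effects examined, and the chance of finding at least one is above 99%, so reporting only the hits tells the reader almost nothing.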

Publishing bias is similar.  I do 100 experiments where I test the existence of one effect in each experiment.  In 5 of them, I find an effect also significant at 2 sigma, but they really occurred by chance since I conducted so many experiments.

The insidious thing about publishing bias is that it doesn’t have to be me performing 100 experiments (and then presumably I know that I am eventually going to find some effect that appears non-random but isn’t). It can be, and usually is, 100 different and even unrelated experimenters.  If they only publish results when they find an effect, then we will see 5 published results which inadvertently attribute their positive effect to “something” rather than chance since it passes the 2 sigma (or whatever) significance test.

One of the ways to combat publishing bias is to insist that all results of all experiments be published.  Unfortunately that doesn’t happen for various reasons and it is not even that practicable.  That is certainly the case with casual venues like internet blogs, as opposed to prestigious journals.

Lots and lots of studies you read about suffer from these kinds of biases, which is one reason why so many studies are refuted when they are replicated (another way in which the effects of these biases can be mitigated) or tested on out-of-sample data.
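The publish-only-the-hits filter, and what replication does to it, can be sketched the same way. The assumptions here are deliberately stark: no lab is studying a real effect, each lab’s test statistic is a standard normal draw, only results beyond 2 sigma get published, and a replication counts as a success only if it is also beyond 2 sigma in the same direction. All names are illustrative.

```python
import random

def run_labs(n_labs=100, z_cut=2.0, seed=1):
    """Return (published, replicated) counts when no true effect exists."""
    rng = random.Random(seed)
    published = replicated = 0
    for _ in range(n_labs):
        z_original = rng.gauss(0.0, 1.0)  # no true effect: the statistic is pure noise
        if abs(z_original) > z_cut:       # only "significant" results are published
            published += 1
            z_repeat = rng.gauss(0.0, 1.0)  # an independent replication attempt
            if abs(z_repeat) > z_cut and z_repeat * z_original > 0:
                replicated += 1
    return published, replicated

published, replicated = run_labs(n_labs=100_000)
print(published / 100_000)     # share of null experiments that reach print
print(replicated / published)  # share of those that survive one replication
```

Under these assumptions roughly 1 null experiment in 22 gets published, and only a few percent of those survive a single replication, even though every published result looked significant on its own.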


#3    J-Doug      (see all posts) 2010/09/28 (Tue) @ 00:20

I’m guessing that, really, the out-of-sample findings simply prompted them to re-evaluate the formula from multiple angles, and upon re-evaluation they found it wanting.

The formula had its in-sample problems as well. From 1995-2005 (all in sample) WXRL isn’t significant at all. I don’t know how far back Mr. Silver went in his model, but the model shouldn’t have included a variable that was insignificant for 11 seasons (n = 88 on the team level, n = 77 on the series level) from the moment that the playoffs expanded to the moment that they stopped collecting data. This is true even if the model didn’t suffer from obvious Type II error issues.

Moreover, Silver’s explanation as to why these variables were the only significant ones never made sense. If it is true that fielding, starter Ks and closer wins were more valuable in October, then you should see an increase in their relative beta coefficients—not an absence of significance for variables that obviously matter. Silver should know this, and maybe he does.

I’m willing to bet that Colin does.


#4    James      (see all posts) 2010/09/28 (Tue) @ 06:25

On a pedantic point, don’t data mining and publication bias lead to Type I errors (false positives) rather than Type II errors (false negatives)? Those false positives then fail to stand up when a different data set is looked at.
James


#5    MGL      (see all posts) 2010/09/28 (Tue) @ 11:04

Right, Type I. Thanks.


#6          (see all posts) 2010/09/28 (Tue) @ 12:56

It sounds like part of the issue is using the 2 sigma standard.  We’re so anxious to find meaningful relationships that we’ve accepted a standard which, in practice, simply doesn’t serve us well.  I recognize that moving the standard to 3 sigma would require sample sizes that are often onerous. At minimum, it would seem that 2 sigma should be simply the threshold for getting in the door.  Rather than accepting the alternative hypothesis and moving on, we should treat the results as the first round of evidence, suggesting that further research is worth the time, per James’ suggestion in #4.
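For a rough sense of the trade-off between 2 sigma and 3 sigma, the sketch below compares the two-sided false positive rates under a normal approximation. It also estimates how much more data is needed for the same true effect to be expected to reach the stricter threshold, assuming the standard error shrinks with the square root of the sample size. It ignores statistical power, and the function names are illustrative.

```python
from math import erf, sqrt

def two_sided_tail(sigmas):
    """Chance that pure noise lands more than `sigmas` SDs from zero (normal model)."""
    return 1 - erf(sigmas / sqrt(2))

def sample_size_ratio(new_sigmas, old_sigmas):
    """Extra data needed for the same effect to be expected to reach the new threshold."""
    return (new_sigmas / old_sigmas) ** 2

for s in (2, 3):
    rate = two_sided_tail(s)
    print(s, round(rate, 4), round(100 * rate, 2))  # threshold, rate, expected hits per 100 null tests

print(sample_size_ratio(3, 2))
```

Under these assumptions the false positive rate drops from about 4.6% to about 0.3%, at the price of 2.25 times the sample, which is steep when the sample is postseason series.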

The other issue is that statistically significant effects aren’t always meaningful in practice.  It’s possible that the Secret Sauce really does identify a set of characteristics which positively correlate with winning in the playoffs.  But what’s the size of the effect?  If the effect is so weak that it has virtually no use in a given season and is dwarfed by idiosyncratic factors, then frankly, who cares?  By the time the sample gets big enough for the effect to materialize in repeat samples, it’s possible, if not likely, that the underlying conditions will have changed anyway.


#7    dave smyth      (see all posts) 2010/09/28 (Tue) @ 13:27

This is a question about how the small number of PAs in a postseason might affect certain types of players. For example, let’s say we have two teams who each hit .275/.350/.450 over a full season. That’s their true talent. But one strikes out 1250 times, the other only 900. Is there a reason to prefer one or the other team in a short series with only 200 PA?

