THE BOOK--Playing The Percentages In Baseball

Statistical_Theory

Wednesday, December 03, 2008

What would happen if the shootout period was 10 minutes, not 5?

By Tangotiger, 02:31 PM

This is going to be a math-heavy post.  Be forewarned.

Java Geek answers it most correctly, in this blog entry by James Mirtle, with data supplied by Gabriel Desjardins.  Gabe says that the number of goals scored in OT (4-on-4 hockey) is 7.14 per 3600 seconds, and that it’s fairly uniform.  In these kinds of scenarios, it’s always best to answer the question: what are the chances of it NOT happening?  The chance of not scoring in each second is 1 minus 7.14/3600 (or .998).  If you have a 5-minute OT (300 seconds), then you take that .998 figure we just got, raise it to the power of 300, and that tells you the chance of the game still being tied.  That figure is 55% (meaning 45% of the time you have a winner).  If it was a 10-minute OT (600 seconds), then you do the same thing, but raise to the power of 600, not 300.  That figure is 30%, meaning that 70% of the time, you have a winner.

Another way would be to realize that with 62 goals and 54 shootouts in the 5-minute OT, then this means that you were tied 54 out of 116 times, or 47% of the time.  If you had a 10-minute OT, you square that number (that is, if the chance of not scoring per 5 minutes is 47%, then the chance of not scoring per 10 minutes is .47*.47), and you get 22%.  So, 78% of the time, you have a winner.

The short of it is that if you double the length of overtime, you cut the chance of going to a shootout roughly in half.  It works out this cleanly because, per 5 minutes of OT, the game ends still tied about half the time.
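Here is a minimal sketch of both calculations above, using only the figures quoted in the post (7.14 goals per 3600 seconds, 62 OT goals, 54 shootouts). The function and variable names are mine, and the whole thing assumes scoring is uniform across the overtime.

```python
GOALS_PER_HOUR = 7.14  # 4-on-4 OT scoring rate quoted in the post


def prob_still_tied(seconds, goals_per_hour=GOALS_PER_HOUR):
    """Chance that no goal is scored in `seconds` of overtime."""
    p_no_goal_per_second = 1 - goals_per_hour / 3600
    return p_no_goal_per_second ** seconds


# Method 1: per-second scoring rate, compounded over the OT length.
tied_5 = prob_still_tied(300)
tied_10 = prob_still_tied(600)

# Method 2: observed 5-minute tie rate, squared to cover two 5-minute blocks.
empirical_tied_5 = 54 / (62 + 54)
empirical_tied_10 = empirical_tied_5 ** 2

print(f"rate method:  {tied_5:.1%} tied after 5 min, {tied_10:.1%} after 10")
print(f"count method: {empirical_tied_5:.1%} tied after 5 min, "
      f"{empirical_tied_10:.1%} after 10")
```

Both methods land in the same neighborhood: about half the games are still tied after 5 minutes, and somewhere between a fifth and a third after 10.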

(8) Comments • 2008/12/03 • Sabermetrics, Statistical_Theory, Other Sports, Hockey

How to calculate the area of a baseball field

By Tangotiger, 11:03 AM

Suppose it’s 330 down the line, 375 in the gap, and 405 to CF, and the fence is smooth.  What’s the area of the baseball field?  I come up with a figure close to what you get by presuming one-fourth of a circle of radius 370.  Can someone with knowledge walk me through the correct answer?
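One rough way to attack this, offered as a sketch rather than the correct answer: treat the fence as a smooth curve r(θ) in polar coordinates with home plate at the origin, and integrate ½·r² over the 90 degrees of fair territory with Simpson's rule. The big assumption, which is mine and not from the post, is that the 375-foot gaps sit exactly halfway between the foul lines and dead center.

```python
import math

# Fence distances in feet at evenly spaced angles, foul line to foul line.
# Assumption: the 375 ft "gap" is halfway between the line and dead center.
fence_ft = [330, 375, 405, 375, 330]


def fair_territory_area(distances):
    """Area of a 90-degree wedge via Simpson's rule on (1/2) * r(theta)^2."""
    n = len(distances) - 1  # number of intervals, must be even for Simpson
    if n % 2 != 0:
        raise ValueError("Simpson's rule needs an even number of intervals")
    h = (math.pi / 2) / n  # angular step in radians
    total = 0.0
    for i, r in enumerate(distances):
        if i in (0, n):
            weight = 1
        elif i % 2 == 1:
            weight = 4
        else:
            weight = 2
        total += weight * r * r
    return 0.5 * (h / 3) * total


area = fair_territory_area(fence_ft)
quarter_circle = math.pi * 370 ** 2 / 4
equivalent_radius = math.sqrt(4 * area / math.pi)

print(f"Simpson estimate:         {area:,.0f} sq ft")
print(f"Quarter circle, r = 370:  {quarter_circle:,.0f} sq ft")
print(f"Equivalent quarter-circle radius: {equivalent_radius:.1f} ft")
```

Under that assumption the estimate comes out a little larger than the quarter circle of radius 370, so the quick approximation is in the right ballpark.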

(28) Comments • 2008/12/03 • Sabermetrics, Statistical_Theory

Friday, October 24, 2008

First half, second half splits

By Tangotiger, 10:04 AM

Dewey was one of my favorite players.  He also played in only three all-star games, a shockingly low total for a player who is at least a borderline hall of famer.  Jim Rice has been in 8, and Fred Lynn in 9.  Dave Parker in 7.  Dave Winfield 12.  Andre Dawson 8.  Those are his peers, more or less.  I could go on, but I’d guess all his peers were in at least 5.  The reason, as I understood it back when I was a kid and didn’t really study the issue, was that Dwight Evans would get hot in the second half, and so lost out on the half-year/popularity All-Star game.  His career stats show him with about a 13 point improvement in wOBA in the second half based on around 5000 PA.  One SD is around 7 wOBA points, so at nearly 2 SD there may be something there, not only in my memory but in the numbers as well.  Then again, I cherry-picked him and so, we expect some players to be at the 2 SD level, just by chance.

We have a similar player in our midst: Johan Santana.  (Hat tip: Joe Poz reader.) On around 3000 PA, his wOBA difference is around 28 points or so comparing 1st and 2nd half.  One SD is 9 points, so he’s at THREE SD.  You expect to see 99.7% of all data at between -3 and +3 SD.  It’s certainly possible that Johan is among the handful of players who are simply at the extreme range by pure luck.  But, why does it have to be the best pitcher of this decade?
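For anyone who wants to check the arithmetic, here is a small sketch that turns the quoted differences and SDs into z-scores and asks how much of a normal distribution sits inside that range. The helper names are mine, and it assumes the split differences are roughly normally distributed.

```python
import math


def z_score(diff_points, sd_points):
    """Number of standard deviations the observed split is from zero."""
    return diff_points / sd_points


def prob_within(z):
    """Share of a normal distribution lying within +/- z SD of the mean."""
    return math.erf(z / math.sqrt(2))


evans_z = z_score(13, 7)     # 13 wOBA points, one SD = 7 points
santana_z = z_score(28, 9)   # 28 wOBA points, one SD = 9 points

for name, z in [("Evans", evans_z), ("Santana", santana_z)]:
    print(f"{name}: z = {z:.2f}, {prob_within(z):.2%} of players "
          f"expected inside +/-{z:.2f} SD")
```

With a few hundred players in the pool, a couple landing near 3 SD by luck alone is not out of the question, which is the cherry-picking caveat above.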

I’d love to see someone tackle the issue of half-splits (preferably using the All-Star game as the split) by this method and see who are the true extremes, and if the standard deviation of all the z-scores is more than 1, or equal to 1.

Tuesday, October 21, 2008

Odds of the Rays winning

By Tangotiger, 12:34 PM

BPro has the Rays’ chances of winning the series at 48.3%, meaning that each game, they have a 49.3% chance of winning.  This seems quite perplexing.  The Pct3 for the Rays at the end of the season was .593 and it was .536 for the Phillies.  I’m not exactly sure what Pct3 represents (I think it handles strength-of-schedule), but .593 is much bigger than .536.  The average Pct3 for the AL teams is .516 and for the NL teams is .486, which seems much too tight.  Anyway, I think Clay’s got a big bug here.  I’ve been using the average AL team as a .5267 and the average NL team as .4767, and those are almost certainly too conservative.  Anyway, we can bump up the Rays’ .593 by 11 points to make them a true .604.  And we can bump down the Phillies by 10 points to make them a true .526.  That puts the Rays’ chance of winning each game at .579, and the Series at 66.8%.  Again, this is not me, but based on using BPro’s data.
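The .579 and 66.8% figures can be reproduced with a short sketch. I am assuming the per-game number comes from the log5 (odds ratio) matchup formula, since that is what matches .579; the post does not say so explicitly. The series calculation is a plain best-of-seven with the same win probability every game, ignoring home field.

```python
from math import comb


def log5(p_a, p_b):
    """Chance team A beats team B, given each team's true win rate vs. average."""
    return p_a * (1 - p_b) / (p_a * (1 - p_b) + (1 - p_a) * p_b)


def series_win_prob(p_game, wins_needed=4):
    """Chance of winning a first-to-`wins_needed` series at a fixed per-game rate."""
    q = 1 - p_game
    return sum(
        comb(wins_needed - 1 + losses, losses) * p_game ** wins_needed * q ** losses
        for losses in range(wins_needed)
    )


rays_per_game = log5(0.604, 0.526)
rays_series = series_win_prob(rays_per_game)

print(f"Rays per game: {rays_per_game:.3f}")
print(f"Rays series:   {rays_series:.1%}")
```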

Cool Standings has the Rays winning the series at .535, meaning each game is .516 for the Rays.  I have no idea how they figured that out.

MLB Playoff Odds, about whom I know nothing, has the Rays at .545, meaning each game is .521.

Vegas is -135 / +115.  That means that you need to bet $135 to win $100 on the Rays (implied series win% of .574), and $100 to win $115 on the Phillies (implied series win% of .465).  You will note that they add up to 1.040.  So, knock out .020 from each, and you get an implied series win% of .555 for the Rays, or .525 for each Rays game.
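The Vegas conversion can be sketched like this. The moneyline-to-probability step is standard; splitting the bookmaker's margin evenly between the two sides follows the "knock out .020 from each" step above. Backing out the per-game rate uses a simple bisection on a best-of-seven with a constant per-game probability and no home field, which is my simplification.

```python
from math import comb


def implied_prob(moneyline):
    """Break-even win probability for an American moneyline."""
    if moneyline < 0:
        return -moneyline / (-moneyline + 100)
    return 100 / (moneyline + 100)


def series_win_prob(p_game, wins_needed=4):
    """Chance of winning a first-to-`wins_needed` series at a fixed per-game rate."""
    q = 1 - p_game
    return sum(
        comb(wins_needed - 1 + k, k) * p_game ** wins_needed * q ** k
        for k in range(wins_needed)
    )


def per_game_from_series(p_series):
    """Invert series_win_prob by bisection (it is increasing in p_game)."""
    lo, hi = 0.0, 1.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if series_win_prob(mid) < p_series:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


rays_raw = implied_prob(-135)
phillies_raw = implied_prob(115)
margin = rays_raw + phillies_raw - 1      # the bookmaker's cut
rays_fair = rays_raw - margin / 2         # split the cut evenly
rays_per_game = per_game_from_series(rays_fair)

print(f"raw: Rays {rays_raw:.3f}, Phillies {phillies_raw:.3f}, sum {rays_raw + phillies_raw:.3f}")
print(f"fair Rays series win%: {rays_fair:.3f}, per game: {rays_per_game:.3f}")
```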

Anyway, I will await word from MGL and Rally (and anyone else out there).  What I’d like to hear from you guys is:
1. True talent level for the players likely to play in the series, as a team
2. The true talent level you have for the average AL and average NL team in 2008

At this point, accepting the Rays having a .525 chance of winning each game, as Vegas implies, would make the Rays as much above the average AL team as it would make the Phillies above the average NL team.  That doesn’t seem right, but I stand ready to listen to the data.

Wednesday, October 15, 2008

Do you really want the best team to win the World Series?

By Tangotiger, 01:22 PM

Suppose the season was 1 million games long.  The best team in the league wins 60% of their games.  We know that that is just a sample, and that we are 95% sure that their true talent level is somewhere between .599 and .601.  Suppose that the second best team was a true .575 team.  They then meet in the World Series.  If they play long enough (say 1 million games), then the better team will win the World Series virtually all the time.  But, do we want that?  Setting aside the logistics and silliness of 1 million games, do we REALLY always want the better team to win?  Or, do we want some lucky breaks every now and then to dictate the outcome?
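The .599 to .601 range is just the usual binomial interval. A quick sketch, with a 162-game season thrown in for contrast (the 162-game comparison is my addition, not part of the thought experiment):

```python
import math


def win_pct_interval(win_pct, games, z=1.96):
    """Approximate 95% interval for true talent from an observed win%."""
    se = math.sqrt(win_pct * (1 - win_pct) / games)
    return win_pct - z * se, win_pct + z * se


low, high = win_pct_interval(0.600, 1_000_000)
low_162, high_162 = win_pct_interval(0.600, 162)

print(f"1,000,000 games: {low:.3f} to {high:.3f}")
print(f"162 games:       {low_162:.3f} to {high_162:.3f}")
```

This treats every game as an independent coin flip with the same probability, which is the same simplification the thought experiment makes.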

This is my question to you: if a .600 team faced a .575 team in the World Series, how often would you really want the .600 team to win the World Series?

Q2: If a .600 team faced a .550 team, how often do you want the .600 team to win the World Series?

Q3: .600 v .500.  Same question.

Correlation is not causation, example #756,874,127

By Tangotiger, 01:12 PM

An easy way to write a story.

Hat tip: Aaron.

(5) Comments • 2008/10/16 • Sabermetrics, Statistical_Theory

Tuesday, September 30, 2008

HR rates by height

By Tangotiger, 09:37 PM

Reason?  Sampling bias.  Who are the players 6’5” and greater?  And do they appear in both sets (aging)?  You only have 10% of the sample, and so, much more likely for wild swings.  Create 4 groups of 25% of the ballplayers, and I’d bet you get smooth results.

Splitting the batting lines into binomial metrics

By Tangotiger, 10:56 AM

Pizza lays out the idea.  As studes noted, we talked about this a lot in the past.

What Brian suggests in the comments is the way I normally approach the problem, and the way Voros did it.  Here are my aging patterns by these metrics.

I also echo Pizza’s position on where to put the HR.  Sometimes I do it the way Pizza says it, and sometimes the way Voros says it.  The fact of the matter is that you can construct two equally plausible scenarios.

There is an undeniable relationship between K, BB, and HR.  There is also an undeniable relationship between HR and FB (and to a lesser extent LD).  The only right way to do it is to model this relationship.  If for example you do it as Pizza proposes it, then you need to have an additional function on the HR/FB rate that includes the K and BB rate.  If you do it as Voros proposes it, you need to include the FB rate to apply to the HR rate.

Thursday, September 25, 2008

Who wants to run a regression for me?

By Tangotiger, 05:52 PM

Here’s the data:
Age PA1 PA2
24 2250 2100
28 2250 1900
32 2250 1400
36 2250 1000
24 1500 2000
28 1500 1300
32 1500 700
36 1500 350

Your job is to use the first two columns to estimate the third.  Ideally, we want no bias.

I was thinking of something on the order of:
a*(Age-b) + c*(PA^d)
I’ve tried
a=100, b=28, c=.83, d=1
a=100, b=28, c=36, d=0.5

I’m hoping someone can come up with the best combination of the above form.  Or, if you think you need to add an extra parameter, like PA*Age, or PA/Age, by all means, do so.
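As a starting point, here is a sketch that scores the two parameter sets above for bias (mean error) and RMSE against the eight data rows. One caveat that is my reading, not something stated in the post: with a=100 the formula as written rises with age, while PA2 in the data falls with age, so the sketch uses a=-100 (equivalently, subtracting the age term). Function and label names are hypothetical.

```python
# (Age, PA1, PA2) rows copied from the post
rows = [
    (24, 2250, 2100), (28, 2250, 1900), (32, 2250, 1400), (36, 2250, 1000),
    (24, 1500, 2000), (28, 1500, 1300), (32, 1500, 700), (36, 1500, 350),
]


def predict(age, pa, a, b, c, d):
    """Model form from the post: a*(Age-b) + c*(PA^d)."""
    return a * (age - b) + c * pa ** d


def bias_and_rmse(params):
    """Mean error (bias) and root-mean-square error over all rows."""
    errors = [predict(age, pa1, *params) - pa2 for age, pa1, pa2 in rows]
    bias = sum(errors) / len(errors)
    rmse = (sum(e * e for e in errors) / len(errors)) ** 0.5
    return bias, rmse


# Assumption: a is negated so that predicted PA falls as age rises.
candidates = {
    "c=.83, d=1": (-100, 28, 0.83, 1),
    "c=36, d=0.5": (-100, 28, 36, 0.5),
}

for label, params in candidates.items():
    bias, rmse = bias_and_rmse(params)
    print(f"{label}: bias = {bias:+.1f} PA, RMSE = {rmse:.1f} PA")
```

Both candidates come out with small overall bias but miss badly on the 1500 PA rows, too low for the young players and too high for the old ones, which hints that an Age-by-PA interaction term may be worth trying.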

(25) Comments • 2008/10/01 • Sabermetrics, Statistical_Theory

Tuesday, September 23, 2008

Confirmation Bias

By Tangotiger, 12:26 PM

Depodesta:

So, there we sit discussing the skills of a highly qualified and tested group where the distinction between players is very, very thin. However, what becomes clear is that for the players we want to keep in big league camp, we generally talk about what they can do. For the players we want to send down, we tend to focus on what they can’t do, so the decisions seem obvious (which they’re not). Understand, I keep using “we” because every one of us in the room is guilty - we can’t help ourselves!

Later on, someone commented on how Jeter is such an example, to which I replied:

Read More

(19) Comments • 2008/09/24 • Sabermetrics, Scouting, Statistical_Theory

Monday, September 22, 2008

MLB Playoff Race

By Tangotiger, 11:36 AM

One more site for odds of making the playoffs.  His help page doesn’t describe the method of figuring the true talent level of the team, which is of course where the whole ball of wax is. 

(8) Comments • 2008/09/23 • Sabermetrics, Statistical_Theory

True Talent v Sample Performance

By Tangotiger, 09:07 AM

Strength of schedule, using sample or true talent?

Best teams actually having the best record?

Stay tuned…

Thursday, August 14, 2008

Bayes’ Theorem

By Tangotiger, 10:13 AM

Victor looks at Bayes’ Theorem for prospect valuation.

(10) Comments • 2008/08/19 • Sabermetrics, Minors_College, Statistical_Theory

Tuesday, August 05, 2008

Observed Performance Inferring True Talent (OPITT)

By Tangotiger, 11:50 AM

I talked about this at length in the Edgar thread, so let me reserve this thread for more generic and technical arguments and presentation.

Let’s say you have someone who has a .380(*) career wOBA in 10,400 PA (16 seasons of 650 PA).  How many standard deviations (SD) is he from the league mean of .340?  Answer: 8.0

(*)For those new around here, a .380 wOBA is the same thing as a .380 OBP, with a corresponding profile of SLG, something like .475 or so.

A guy with a .380 wOBA in 10,400 PA is roughly +36 wins above average (WAA) and 69 wins above replacement (WAR).  This is around the discussion level of someone being a hall of famer.

Now, suppose someone has a .420 wOBA.  How many seasons does he have to play in order for us to say that he is 8 standard deviations from the league mean of .340?  Answer: 4 seasons.  That gives him a WAR of 26 wins and WAA of 18 wins.

And a wOBA of .460?  A little under 2 seasons.  And a wOBA of .500?  Just one season, with an 11 WAR and +9 WAA.  That is a Bonds-like or Pujols-like season at their best.
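The season counts above follow from the fact that the SD of an observed wOBA shrinks with the square root of PA. A sketch, where the per-PA SD is not given in the post and is backed out from the ".380 in 10,400 PA is 8.0 SD" statement (that inference, and the names, are mine):

```python
import math

LEAGUE_WOBA = 0.340
PA_PER_SEASON = 650

# Assumption: per-PA SD implied by ".380 wOBA over 10,400 PA is 8.0 SD".
SD_PER_PA = (0.380 - LEAGUE_WOBA) * math.sqrt(10_400) / 8.0


def sds_from_mean(woba, pa):
    """How many SDs an observed wOBA over `pa` PA sits above the league mean."""
    return (woba - LEAGUE_WOBA) / (SD_PER_PA / math.sqrt(pa))


def pa_needed(woba, target_sd=8.0):
    """PA required for an observed wOBA to reach `target_sd` SDs above the mean."""
    return (target_sd * SD_PER_PA / (woba - LEAGUE_WOBA)) ** 2


print(f"implied per-PA SD: {SD_PER_PA:.3f}")
for woba in (0.380, 0.420, 0.460, 0.500):
    pa = pa_needed(woba)
    print(f".{round(woba * 1000)} wOBA: {pa:,.0f} PA, "
          f"{pa / PA_PER_SEASON:.1f} seasons")
```

Doubling the gap from the league mean cuts the required PA by a factor of four, which is why the seasons needed drop from 16 to 4 to 1.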

So, is that enough?  Is it enough to say that your performance is 8 standard deviations from the league mean, in order for your Observed performance to infer great talent?

I don’t know.

Now, let’s try asking: how far away are you from a .300 wOBA level, which is right around replacement level?  Here’s how that looks:

Read More

(11) Comments • 2008/08/19 • Sabermetrics, Awards, Statistical_Theory

Boys as smart as girls, but boys more likely to be at the smart (and stupid) end?

By Tangotiger, 10:59 AM

It’s the old college v high school draft, low ceiling v high ceiling.  Phil’s post leads to this discussion.

(2) Comments • 2008/08/06 • Sabermetrics, Minors_College, Statistical_Theory

Thursday, July 24, 2008

Experience, schmexperience

By Tangotiger, 10:01 AM

Studes:

I’ve got a two-year WPA list for batters involved in pennant races, broken out by age and time period (before and during the pressure-filled months).

As you can guess, nothing there.  I have no doubt that there will be something there.  (For example, in The Book, I noted that there is an age effect with a runner on 1B.  Young hitters aren’t as smart in taking advantage of the hole.) However, whatever we find will be some isolated skillset, something that will be real, but in the grand scheme of things, it’s like finding a 50-foot tree in a forest of 30-foot trees.  So, yes, it’s something real, it’s something noticeable, but when you’ve got a forest full of 30-foot trees, if you happen to find a 50-foot tree, it’s not like you’ve found a forest of 50-foot trees.

I’m good at data entry with the numeric keypad.  Really really good.  Or was anyway at one point.  My fingers would fly over those numbers.  But, when it came to typing words, and using the letters on the keyboard, I’d be average.  If you gave me 20 papers to type, and 19 were for a lawyer and 1 was for an accountant, I’d fly on one of them.  But, if all I get to do is expose my real skill 5% of the time, then won’t it be really hard to find that skill if you have 100 people’s results to look at, and you didn’t realize, or think to realize, that one paper might be filled with numbers?  And even if you did think to find it, you realize, “eh… it’s real, but it comes into play so little… how the heck am I supposed to find it?”

(3) Comments • 2008/07/24 • Sabermetrics, Clutch, Statistical_Theory

Tuesday, July 22, 2008

Intraclass correlation

By Tangotiger, 10:56 AM

Pizza talks about it all the time, and has a blog post on it.  But, darned if I know what it’s actually doing.  When I do my thing, I figure the z-score for the stat for each player (number of standard deviations, SD, from the mean), and then calculate the SD of the z-scores.  The correlation is r = 1 - 1/SDzScore^2.  So, if the SD of all the z-scores is 1.41, then r = .50.  If it’s 2.0, then r=.75.  I think this is what Pizza also does, and so, I guess I’m doing an intraclass correlation without even knowing it.  Regardless, what I do seems sound.  I like what Guy said in the comments in response to my comment:
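The formula is a one-liner, so here is a tiny sketch of it. The function name is mine, and the input is the SD of the players' z-scores as described above.

```python
def r_from_z_score_sd(sd_z):
    """r = 1 - 1/SD^2, where SD is the spread of the players' z-scores."""
    if sd_z <= 0:
        raise ValueError("SD of z-scores must be positive")
    return 1 - 1 / sd_z ** 2


for sd_z in (1.0, 1.41, 2.0, 3.0):
    print(f"SD of z-scores = {sd_z:.2f} -> r = {r_from_z_score_sd(sd_z):.2f}")
```

An SD of exactly 1 means the spread is no wider than pure chance would give, so r comes out to zero; an SD of 2.0 works out to .75 under this formula.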

Read More

(12) Comments • 2008/07/24 • Sabermetrics, Statistical_Theory

Friday, June 27, 2008

One of many ways that not regressing toward the mean can get you in trouble…

By , 12:14 AM

Here is a snippet from a BP article by Geoff Young about Adrian Gonzalez, the Padres slugging first sacker (I sound like a real baseball writer!):

So I decided to check out his age 25 stats (from 5/8/07 to 5/7/08) and see just how much he’d built on his success from the previous year. Using the same format from my earlier article, and with the help of David Pinto’s Day-by-Day Database, here’s what I found:
Adrian Gonzalez, Age 24-25
Age AB BA OBP SLG ISO XB/H AB/HR
24 598 .316 .376 .543 .227 .376 18.69
25 650 .282 .344 .498 .216 .432 22.41

Uh-oh. That wasn’t supposed to happen. I had it all figured out: Gonzalez was going to exhibit a slow but steady increase in skills, and the numbers would support what my eyes had led me to believe.

Unfortunately, reality had other ideas.

So, Young thinks that Gonzalez did not progress as a 24 year old should, given that his numbers (say, OPS) went down from .919 to .842, a significant decline.  But wait…

Read More

(15) Comments • 2008/06/29 • Sabermetrics, Statistical_Theory

Sunday, June 01, 2008

Why is the HFA so high this year?

By , 04:21 AM

I did not realize it was until I read this article on ESPN.com.  In it, Buster Olney takes on the question as to why the home teams are winning at a .577 clip so far this year (as of May 29, according to Olney).

In the article, Olney says, according to several GM’s, players, managers and scouts, it might be because of the “influx of young players” who are more familiar with their home environments, or perhaps even party on the road, more so than the average player I guess.

As I said, I hadn’t even noticed that home teams are winning at such a high rate so far this year, and normally I wouldn’t think anything of it anyway.  But given the sample size, the difference between this year and what is typical (around 53-54%) is greater than 2 SD, enough to raise an eyebrow or two.  Plus, there are a lot of weird things going on in baseball so far this year (well, at least one weird thing, which is the low run scoring and especially HR rate in the AL).

Anyway, the “young players” explanation seems a little silly to me on its face.  I mean how many extra young players would it take to make such a difference?

Not one to accept anything at face value, especially that which “scouts, managers, GM’s, and players” posit, I looked at the average age and distribution of ages of pitchers and batters so far this year as compared to last year at the same time (thru May 29).  Each age is prorated by the number of PA or TBF.  The distribution of ages is percentage of total PA or TBF.  Here is what I found:

Read More

(21) Comments • 2008/06/02 • Sabermetrics, Statistical_Theory

Monday, May 19, 2008

Some of the things you wanted to know about Regression Toward The Mean, and didn’t know how to ask

By Tangotiger, 10:03 AM

Eli gives you a boatload.
