THE BOOK--Playing The Percentages In Baseball

Data

Tuesday, January 11, 2011

Minor League Splits Database

By Tangotiger, 12:07 PM

Available!  From Jeff.

I of course have a soft spot for anyone who gives away stuff that they spent loads of time doing, and Jeff here rises way up my list for doing what he is doing. 

(2) Comments • 2011/01/12 • SabermetricsDataMinors_College

Wednesday, January 05, 2011

Geoff Harcourt’s ID project

By Tangotiger, 12:31 PM

This is from Geoff:

=============================

I love stats, although I don’t think I’m particularly original when it comes to sabermetric ideas. For months I’ve been looking for a way I could contribute to the community, and I thought this would be a perfect opportunity to put some of my coding and database knowledge to work. I’ve been continually frustrated by the hassle of trying to compare player data from sources that use different ID systems. I saw a couple efforts had been started to bring ID systems into some kind of crosswalk, but it seems like they all either stalled or didn’t have the columns I wanted. To that end, I’ve started a project on Github called MLB Rosetta that I’m hoping will become a universal crosswalk of ID systems. I’ve currently pulled in 101,000 players courtesy of a file Brian Cartwright sent me. I’ve augmented that file with data from Ted Turocy’s player register project and Tom Tango’s initial Export Map project. Currently I have player IDs from MLBAM, BIS, BP, BDB, BBRef, Westbay, and NPB.

Here are a few notes about the players table:

id - the MLB Advanced Media ID number of the player. I’ve opted to always use this as the unique identifier, so unless MLB “knows” about the player (they have played in MLB, MiLB, WBC, other international exhibition, etc.), they won’t be included in the database. I’m going to establish a second table to track Japanese, Korean, and NCAA players that aren’t yet in MLBAM’s database, but to start I wanted a table where MLB would be the ID provider of record. There are currently several players who were erroneously assigned two MLBAM IDs (Bryce Harper, Ervin/Elvin Santana...), and I’m trying to build a current list of dupes with some help from Brian and Tom.

bis_id, bis_milb_id - the ID numbers used by Baseball Information Solutions, who provide data to Fangraphs. Players who have appeared in the majors have bis_id numbers, and players who have only appeared in the minors have some alternative ID system that I’m assuming is also BIS, which I’m calling bis_milb_id. You can recognize these minor league ID numbers because they start with two letters.

baseball_prospectus_id - ID numbers for Baseball Prospectus player cards. Some of these numbers may be wrong, although I’ve taken pains to ensure that they are correct for current players. Since BP has a predictable system for ID numbers (last name, birthdate, incrementing letter), in any case where there are no potential conflicts (two players with the same last name and birthdate) I’ve gone ahead and populated the ID numbers for players who do not yet have IDs. BP actually seems to have better data on some players from the 1800s than does MLB (many of MLB’s players from that century are missing first names, etc.).

japan_npb_id - ID numbers used by Nippon Pro Baseball, the major leagues of Japan. Anyone who played in Japan in 2010 who is known to MLB has an NPB ID, but this column is still very incomplete. If anyone has lists of players from 2007-2009, I’d love to add them in here.

current - I’ve tried to only mark players who are currently on 40-man rosters as current. I did this purely for my own sanity.

I think all the other columns on the table should be self-explanatory. I’ve checked to make sure that none of the columns currently have duplicates, so if you notice any, please let me know. The most current version of this table will always be posted at https://github.com/geoffharcourt/mlb_rosetta, and I’m hoping that anyone else who wants to contribute will do so (Github makes it easy!). If you have player ID data that you’d like to integrate (CBS, ESPN, STATS, I’m looking at you), please feel free to contribute. My next steps are to create a register of players outside of MLBAM’s database and to flesh out the error/dupes list. Special thanks to Brian, Ted, and Tom, whose previous work probably shaved a few weekends off of this first edition.

-Geoff geoff~harcourt~gmail~com https://github.com/geoffharcourt/mlb_rosetta

===============================

Data file is also here:
http://www.tangotiger.net/files/mlb_rosetta.sql.zip
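
As a rough illustration of what a crosswalk like this is for, here is a minimal sketch using Python's built-in sqlite3 module. Only the column names id, bis_id, and baseball_prospectus_id come from Geoff's notes. The name column, the two stats tables, and every row are made up for the example, so treat this as a sketch under those assumptions and not as the actual schema of the download.

```python
import sqlite3

# Illustrative only: the rows, the "name" column, and both stats tables are
# hypothetical. Only the three ID column names come from the notes above.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE players (
        id INTEGER PRIMARY KEY,          -- MLBAM ID, the identifier of record
        name TEXT,
        bis_id TEXT,
        baseball_prospectus_id TEXT
    );
    CREATE TABLE source_a_stats (bis_id TEXT, woba REAL);
    CREATE TABLE source_b_stats (baseball_prospectus_id TEXT, warp REAL);

    INSERT INTO players VALUES (1001, 'Player One', '501', 'bp-one');
    INSERT INTO players VALUES (1002, 'Player Two', '502', 'bp-two');
    INSERT INTO source_a_stats VALUES ('501', 0.400);
    INSERT INTO source_a_stats VALUES ('502', 0.310);
    INSERT INTO source_b_stats VALUES ('bp-one', 6.5);
    INSERT INTO source_b_stats VALUES ('bp-two', 1.2);
""")


def combined_stats(conn):
    """Join two sources that use different ID systems through the crosswalk."""
    return conn.execute("""
        SELECT p.id, p.name, a.woba, b.warp
        FROM players AS p
        JOIN source_a_stats AS a ON a.bis_id = p.bis_id
        JOIN source_b_stats AS b
          ON b.baseball_prospectus_id = p.baseball_prospectus_id
        ORDER BY p.id
    """).fetchall()


def duplicate_ids(conn, column):
    """Return the values that appear more than once in one ID column."""
    allowed = {"bis_id", "baseball_prospectus_id"}
    if column not in allowed:
        raise ValueError(f"unexpected column: {column}")
    query = (
        f"SELECT {column} FROM players WHERE {column} IS NOT NULL "
        f"GROUP BY {column} HAVING COUNT(*) > 1"
    )
    return [row[0] for row in conn.execute(query)]


rows = combined_stats(conn)
print(rows)
print(duplicate_ids(conn, "bis_id"))
```

The second function is the kind of check Geoff describes running on each column: any value it returns is an ID shared by more than one player row.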

(12) Comments • 2011/05/05 • SabermetricsData

Tuesday, January 04, 2011

Bunts, bunts, bunts

By Tangotiger, 10:41 PM

Note to Lucas: don’t worry about the low correlation.  You will get a low correlation on HR/PA if you look at that month-to-month as well, or OBP if you look at that week-to-week.  Correlations are completely 100% useless unless you also report the number of trials per player. 

I cringe at every reporting of r that is not accompanied by the number of trials.
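
To see why r means little without the number of trials, here is a small simulation sketch. Everything in it is an assumption made for illustration: the talent distribution (mean .330, spread .030), the 1000 simulated players, and the choice of 25 PA as a "week" and 600 PA as a "season". The same players, with the same true talent, give a very different split-sample correlation depending only on how many PA sit behind each observation.

```python
import random


def pearson_r(xs, ys):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def split_sample_r(trials, players=1000, seed=1):
    """Correlate two independent samples of `trials` PA for each player."""
    rng = random.Random(seed)
    first, second = [], []
    for _ in range(players):
        # Hypothetical true on-base talent for this player.
        talent = rng.gauss(0.330, 0.030)
        first.append(sum(rng.random() < talent for _ in range(trials)) / trials)
        second.append(sum(rng.random() < talent for _ in range(trials)) / trials)
    return pearson_r(first, second)


r_week = split_sample_r(25)     # roughly a week of PA per sample (assumed)
r_season = split_sample_r(600)  # roughly a season of PA per sample (assumed)
print(f"r with 25 PA per sample:  {r_week:.2f}")
print(f"r with 600 PA per sample: {r_season:.2f}")
```

The skill is identical in both runs. Only the number of trials changes, which is why a reported r is hard to interpret without it.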

(0) Comments • • SabermetricsData

Thursday, December 23, 2010

Primary Fielding Position

By Tangotiger, 12:49 PM

Kincaid gives it to you, including the SQL.  I didn’t check to see how he handles the Babe Ruth issue. 

Overall, he does it very similarly to the way I do it.

A couple of weeks ago I posted my results (zip).  If someone wants to compare my results to Kincaid’s, and report the differences, that would be lovely.  Maybe I’ll post my SQL as well.

(2) Comments • 2010/12/23 • SabermetricsData

Saturday, December 11, 2010

Linear Weights 1993-2010

By Tangotiger, 11:45 AM

To read the first line:
- BEFORE a home run is hit, the average base/out state has a run expectancy of .526 runs
- DURING a home run, 1.60 runs are scored
- AFTER a home run, the run expectancy is .328 runs

Linear Weights is simply DURING+AFTER-BEFORE.... and that’s 1.398 runs.  Basically, the popular and static 1.40 runs that we’ve come to associate with the home run.  The “error” includes fielder’s choice (where all runners are safe), which is why the starting state has such a high run expectancy (when you have runners on base, your starting state is high… this is apparent with the IBB).  You’ll note that the HR is in a slightly lower run expectancy environment, meaning a disproportionate number of HR are hit with bases empty or with 2 outs.  The same applies to the strikeout, which is why the average strikeout isn’t as costly as we’d think, and why, from 1993-2010, the run value of a strikeout matches that of the regular out.

For the “Runs”, think of it as the number of RBIs for each of those events (if they awarded an RBI for every event).

LWTS    RE_START  RE_END  Runs        N  EVENT_CD  SHORTNAME_TX  LONGNAME_TX
1.398      0.526   0.328  1.60    80271        23  HR            Homerun
1.059      0.540   0.947  0.65    14868        22  3B            Triple
0.773      0.538   0.874  0.44   138138        21  2B            Double
0.527      0.629   0.924  0.23    32213        18  ROE           Error
0.471      0.543   0.786  0.23   459946        20  1B            Single
0.384      0.616   0.967  0.03      303        17  XI            Interference
0.347      0.571   0.887  0.03    25682        16  HBP           Hit By Pitch
0.322      0.514   0.814  0.02   235199        14  NIBB          Nonintentional Walk
0.176      0.743   0.919  0.00    17624        15  IBB           Intentional Walk
-0.294     0.538   0.214  0.03  1429486         2  Out           Generic Out
-0.296     0.510   0.213  0.00   480203         3  K             Strikeout

0.267      0.719   0.835  0.15     4632        10  PB            Passed Ball
0.265      0.703   0.805  0.16    21801         9  WP            Wild Pitch
0.254      0.673   0.777  0.15     2889        11  BK            Balk
0.182      0.593   0.757  0.02    42039         4  SB            Stolen Base
0.124      0.468   0.591  0.00      441         5  DI            Defensive Indifference
-0.270     0.657   0.355  0.03     8809         8  PK            Pickoff
-0.463     0.613   0.146  0.00    14857         6  CS            Caught Stealing
-0.473     0.724   0.226  0.03      916        12  OA            Other Advance
-0.473     0.724      0.226     0.03     916    12    OA    Other Advance

Source data: courtesy of Retrosheet, the sabermetric sliced bread
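
The DURING+AFTER-BEFORE arithmetic can be checked directly against the table. The sketch below copies five rows from it and recomputes the linear weight from the displayed columns. Because Runs is shown to only two decimals, the recomputed values differ from the LWTS column by a few thousandths (the home run comes out at about 1.402 instead of 1.398).

```python
# (event, RE_START, RE_END, Runs, published LWTS), copied from the table above.
ROWS = [
    ("HR", 0.526, 0.328, 1.60, 1.398),
    ("1B", 0.543, 0.786, 0.23, 0.471),
    ("IBB", 0.743, 0.919, 0.00, 0.176),
    ("K", 0.510, 0.213, 0.00, -0.296),
    ("CS", 0.613, 0.146, 0.00, -0.463),
]


def linear_weight(re_start, re_end, runs):
    """DURING + AFTER - BEFORE: runs scored on the play plus the change in run expectancy."""
    return runs + re_end - re_start


for event, re_start, re_end, runs, published in ROWS:
    recomputed = linear_weight(re_start, re_end, runs)
    print(f"{event:>3}  recomputed {recomputed:+.3f}  published {published:+.3f}")
```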

(7) Comments • 2010/12/13 • SabermetricsData

Friday, December 10, 2010

Fangraphs splits + multi-year, on the same request

By Tangotiger, 02:36 PM

Finally.  Finally!  Want to know the best wOBA, for 2006-2010, in high-leverage situations?  Pujols (naturally), Ryan Howard, Carlos Beltran, and then ARod.  In the same situations, Ryan Howard has 0.43 HR per FB!  ARod is at 0.31.

Who pads his stats in low-leverage situations?  Well, that’s ARod.

Great stuff! 

David, are you ready for a challenge?  Let us select the stat (say wOBA), and a split “category” (say leverage), and then show the wOBA by the splits (low, medium, high).  Or say select K%, and then a split category (runners), and then show us that.  And so on. 

In any case, the splits leaderboards are already awesome.

(6) Comments • 2010/12/12 • SabermetricsData

Thursday, December 09, 2010

Retrosheet Data - Batting, Pitching Hands

By Tangotiger, 10:24 PM

http://tangotiger.net/retrosheet/reports/hands.xls (Excel)
http://tangotiger.net/retrosheet/reports/hands.csv (Text)

This will give you the total number of PA by batting hand and by pitching hand.  It also breaks it down by batting hand v opposing pitching hand, and by pitching hand v opposing batting hand.

I then come up with a determination if the batter was a switch hitter or not.

Data from 1950-2010, courtesy of Retrosheet, our sabermetric sliced bread.
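
The post does not say how the switch-hitter determination is made. As a purely hypothetical sketch, one plausible rule is to call a batter a switch hitter when a meaningful share of his PA comes from each side of the plate. The function name, the four input counts, and the 10% threshold below are all assumptions for illustration. They are not the columns of the files above or the rule actually used.

```python
def classify_batter(left_vs_rhp, left_vs_lhp, right_vs_rhp, right_vs_lhp,
                    min_share=0.10):
    """Return 'S', 'L', 'R', or None from PA counts by batting hand and pitcher hand.

    Hypothetical rule: a batter is a switch hitter when at least `min_share`
    of his PA came from each side of the plate.
    """
    pa_left = left_vs_rhp + left_vs_lhp
    pa_right = right_vs_rhp + right_vs_lhp
    total = pa_left + pa_right
    if total == 0:
        return None  # no PA, nothing to classify
    if min(pa_left, pa_right) / total >= min_share:
        return "S"
    return "L" if pa_left > pa_right else "R"


# Made-up PA counts: mostly left vs RHP and right vs LHP looks like a switch hitter.
print(classify_batter(400, 5, 3, 150))
print(classify_batter(500, 150, 0, 0))
print(classify_batter(0, 0, 420, 160))
```

A threshold like this is there to keep a handful of odd PA from the "wrong" side from flipping the label. Where to set it is a judgment call.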

(0) Comments • • SabermetricsData

Wednesday, December 08, 2010

Fangraphs has splits leaderboards

By Tangotiger, 10:37 PM

Great stuff!

(3) Comments • 2010/12/09 • SabermetricsData

Wednesday, December 01, 2010

Data files for 2010 available now

By Tangotiger, 07:07 PM

Thanks to Sean.

(6) Comments • 2010/12/01 • SabermetricsData

Thursday, October 28, 2010

Building a retrosheet database

By Tangotiger, 07:59 AM

I can never get enough of people who roll up their sleeves, and then share their building blocks.  This is really no different than me teaching my kids to share their toys.  And that’s what we’ve got here… toys.  What you do with these toys, well, that’s the value-added you have.  But, these building blocks?  No reason to keep them all to yourself.  Sobchak has proven himself to be fantastic in sharing, so, kudos for continuing to supply us with his toys.

(4) Comments • 2012/05/07 • SabermetricsData

Monday, October 11, 2010

Injury data, 2010

By Tangotiger, 03:43 PM

Jeff posts the 2010 data

(1) Comments • 2010/10/11 • SabermetricsDataTraining_Health

Tuesday, September 14, 2010

THT Graphical Reports

By Tangotiger, 09:18 AM

Download area.

(0) Comments • • SabermetricsData

Thursday, June 24, 2010

Individualized Won-Loss Records Website

By Tangotiger, 10:21 AM

I have made a few more changes, notably:
- all headers now sortable
- added a team history page
- you get the Indis for players as both pitchers and nonpitchers, like with Greg Maddux, Tim Wallach, or Babe Ruth, with their non-dominant role shaded in red

I’ll eventually add AB, H, HR, BB, SB, wOBA, birth date, etc.  Just not right now.

(33) Comments • 2010/06/28 • SabermetricsData

Saturday, June 05, 2010

Korean Boxscore

By Tangotiger, 07:41 PM

Very cool.  I like that the inning is preserved.  I like the pitcher breakdowns by inning as well.

Glove-slap: B.

(2) Comments • 2010/06/06 • SabermetricsData

Tuesday, June 01, 2010

“My stats”: Custom Dashboard

By Tangotiger, 04:07 PM

Finally.  After so many years of waiting for someone to come up with a “my stats” feature, à la My Yahoo’s customized news and sports settings, Fangraphs has the Custom Dashboard.

I’ll give you four reasons why this is smart:
1. The user gets exactly what he wants
2. You as the website get a list of email addresses of users
3. You as the website now know exactly what stats your readers like
4. I had a 4th one, but it has now slipped my mind.  I think faster than I type usually, and sometimes, I type faster than I think.  Neither is helpful.

(7) Comments • 2010/06/02 • SabermetricsData

Wednesday, March 24, 2010

Thank you Mr Forman

By Tangotiger, 03:37 PM

I was on Sean’s case like crazy regarding ERA+, as many of you know.  Basically, while every other index stat in the world did the value of the metric divided by the “average”, ERA+ did the league average divided by the ERA of the player.  In effect, instead of ER per IP, it was doing IP per ER.  What made it worse is when people started to use this in calculations, using it for simple averages etc.  The math did not work out.

I had proposed that he do it the consistent way, which would mean someone who gives up runs at half the league average show as 50, rather than 200.  Sean was rightfully concerned that people are used to “bigger is better”, and so, that would look like a sticker shock.

Guy proposed something very simple: (2 - ERA/lgERA) times 100.  This way, what would look like 50 for me would show up as 150.  And the top end is 200 in the Guy method (or 0 in my method).  And Sean did just that.

I was relentless in such a seemingly small thing.  But it was important for the symmetry of 50 and 150 to hold.  Either my method or Guy’s method would have done that (and the original version of ERA+ did not hold to that).  Kudos to Sean for being good enough to take the brunt of my esotericness (esotericality?).  For all the crap I gave him about it, he deserves his kudos just as much.
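
The three versions are easy to compare side by side. The sketch below uses a made-up league ERA of 4.00 and two made-up pitchers, one at half the league rate above average and one at half the league rate below it. The function names are mine, not anything from Sean's site.

```python
def era_plus_original(era, lg_era):
    """Original ERA+: league ERA divided by player ERA, times 100."""
    return 100 * lg_era / era


def era_index(era, lg_era):
    """The 'consistent' index: player ERA divided by league ERA (lower is better)."""
    return 100 * era / lg_era


def era_plus_guy(era, lg_era):
    """Guy's version: (2 - ERA/lgERA) times 100 (bigger is better)."""
    return 100 * (2 - era / lg_era)


LG_ERA = 4.00               # hypothetical league ERA
good, bad = 2.00, 6.00      # 50% better and 50% worse than league

for label, fn in [("original", era_plus_original),
                  ("index", era_index),
                  ("Guy", era_plus_guy)]:
    values = [fn(good, LG_ERA), fn(bad, LG_ERA)]
    average = sum(values) / len(values)
    print(f"{label:>8}: good={values[0]:.1f} bad={values[1]:.1f} average={average:.1f}")
```

With equal innings, those two pitchers combine to exactly the league ERA. The index and Guy's version both average to 100, while the original ERA+ averages to about 133, which is the simple-average problem described above.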

Glove-slap: Colin.

(20) Comments • 2010/03/26 • SabermetricsDataStatistical_Theory

Thursday, March 04, 2010

Step by Step Gameday data extractor app

By Tangotiger, 07:22 PM

From the man who brought you the injury database.

(0) Comments • • SabermetricsData

Wednesday, March 03, 2010

Sean Forman is taking your suggestions

By Tangotiger, 03:18 PM

You can post it here or there.

- I reiterated to him that I like Guy’s suggestion of ERA+ as 2 - ERA/lgERA.  It keeps the bigger-is-better so that his readers aren’t shocked, while maintaining the symmetry of what an ERA+ of 50 and 150 should be.  It sounds like Sean might go for it.
- I’ll second the request for a split by DP opps. 
- I’ll add that I want to see a split by SF opps. 
- Times facing opponent should have 1,2,3,4+, not just 1,2,3+.
- Where’s FIP? 
- Also, add in bbFIP.
- Do away with OPS+ and make it RC+…
- ...with RC based on Linear Weights, not the basic version that is 30 years dated

(15) Comments • 2010/03/24 • SabermetricsData

Monday, March 01, 2010

Exporting data into Excel

By Tangotiger, 04:29 PM

Someone asked, so here you go:

1. http://tangotiger.net/scout/index5.php (list of all players)
2. select all data in table (do NOT grab the headers)
3. ctrl-c
4. go to excel
5. this is important… RIGHT-CLICK, PasteSpecial, Text
6. done.  480 records, perfectly formatted

(0) Comments • • SabermetricsData

Tuesday, February 23, 2010

Minor league run environments

By Tangotiger, 10:21 AM

Good data across the board from Justin, including BaseRuns.
