Tuesday, January 11, 2011
Minor League Splits Database
Available! From Jeff.
I of course have a soft spot for anyone who gives away stuff that they spent loads of time doing, and Jeff here rises way up my list for doing what he is doing.
This is from Geoff:
=============================
I love stats, although I don’t think I’m particularly original when it comes to sabermetric ideas. For months I’ve been looking for a way I could contribute to the community, and I thought this would be a perfect opportunity to put some of my coding and database knowledge to work. I’ve been continually frustrated by the hassle of trying to compare player data from sources that use different ID systems. I saw a couple efforts had been started to bring ID systems into some kind of crosswalk, but it seems like they all either stalled or didn’t have the columns I wanted. To that end, I’ve started a project on Github called MLB Rosetta that I’m hoping will become a universal crosswalk of ID systems. I’ve currently pulled in 101,000 players courtesy of a file Brian Cartwright sent me. I’ve augmented that file with data from Ted Turocy’s player register project and Tom Tango’s initial Export Map project. Currently I have player IDs from MLBAM, BIS, BP, BDB, BBRef, Westbay, and NPB.
Here are a few notes about the players table:
id - the MLB Advanced Media ID number of the player. I’ve opted to always use this as the unique identifier, so unless MLB “knows” about the player (they have played in MLB, MiLB, WBC, other international exhibition, etc.), they won’t be included in the database. I’m going to establish a second table to track Japanese, Korean, and NCAA players that aren’t yet in MLBAM’s database, but to start I wanted a table where MLB would be the ID provider of record. There are currently several players who were erroneously assigned two MLBAM IDs (Bryce Harper, Ervin/Elvin Santana...), and I’m trying to build a current list of dupes with some help from Brian and Tom.
bis_id, bis_milb_id - the ID numbers used by Baseball Information Solutions, who provide data to Fangraphs. Players who have appeared in the majors have bis_id numbers, and players who have only appeared in the minor leagues have IDs from an alternative system that I’m assuming is also BIS, which I’m calling bis_milb_id. You can recognize these minor league IDs because they start with two letters.
baseball_prospectus_id - ID numbers for Baseball Prospectus player cards. Some of these numbers may be wrong, although I’ve taken pains to ensure that they are correct for current players. Since BP has a predictable system for ID numbers (last name, birthdate, incrementing letter), in any case where there are no potential conflicts (two players with the same last name and birthdate) I’ve gone ahead and populated the ID numbers for players who do not yet have IDs. BP actually seems to have better data on some players from the 1800s than does MLB (many of MLB’s players from that century are missing first names, etc.).
japan_npb_id - ID numbers used by Nippon Pro Baseball, the major leagues of Japan. Anyone who played in Japan in 2010 who is known to MLB has an NPB ID, but this column is still very incomplete. If anyone has lists of players from 2007-2009, I’d love to add them in here.
current - I’ve tried to only mark players who are currently on 40-man rosters as current. I did this purely for my own sanity.
I think all the other columns on the table should be self-explanatory. I’ve checked to make sure that none of the columns currently have duplicates, so if you notice any, please let me know. The most current version of this table will always be posted at https://github.com/geoffharcourt/mlb_rosetta, and I’m hoping that anyone else who wants to contribute will do so (Github makes it easy!). If you have player ID data that you’d like to integrate (CBS, ESPN, STATS, I’m looking at you), please feel free to contribute. My next steps are to create a register of players outside of MLBAM’s database and to flesh out the error/dupes list. Special thanks to Brian, Ted, and Tom, whose previous work probably shaved a few weekends off of this first edition.
-Geoff geoff~harcourt~gmail~com https://github.com/geoffharcourt/mlb_rosetta
===============================
Data file is also here:
http://www.tangotiger.net/files/mlb_rosetta.sql.zip
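To make the crosswalk idea concrete, here is a minimal Python sketch using an in-memory SQLite database. Only the players column names (id, bis_id, baseball_prospectus_id, japan_npb_id) come from Geoff’s notes. The two stats tables, the column types, and every ID and value below are made up for illustration, and the real mlb_rosetta.sql file may be laid out differently.

```python
import sqlite3

# Non-MLBAM ID columns named in Geoff's notes; used as a whitelist below.
ID_COLUMNS = ("bis_id", "baseball_prospectus_id", "japan_npb_id")


def build_demo_db():
    """Create a toy crosswalk plus two hypothetical stats tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE players (
            id INTEGER PRIMARY KEY,       -- MLBAM ID, the key of record
            bis_id INTEGER,
            baseball_prospectus_id TEXT,
            japan_npb_id TEXT
        );
        -- Hypothetical sources, each keyed by a different ID system.
        CREATE TABLE bis_keyed_stats (bis_id INTEGER, pa INTEGER);
        CREATE TABLE bp_keyed_stats (baseball_prospectus_id TEXT, pa INTEGER);

        -- All IDs and values are placeholders, not real players.
        INSERT INTO players VALUES (1001, 11, 'bp-a', NULL);
        INSERT INTO players VALUES (1002, 12, 'bp-b', 'npb-x');
        INSERT INTO players VALUES (1003, NULL, 'bp-c', NULL);
        INSERT INTO bis_keyed_stats VALUES (11, 600), (12, 450);
        INSERT INTO bp_keyed_stats VALUES ('bp-a', 598), ('bp-b', 450), ('bp-c', 20);
    """)
    return conn


def compare_sources(conn):
    """Line up both sources on the MLBAM id via the crosswalk."""
    return conn.execute("""
        SELECT p.id, a.pa, b.pa
        FROM players p
        JOIN bis_keyed_stats a ON a.bis_id = p.bis_id
        JOIN bp_keyed_stats b
          ON b.baseball_prospectus_id = p.baseball_prospectus_id
        ORDER BY p.id
    """).fetchall()


def find_duplicates(conn, column):
    """Return (value, count) pairs for IDs that appear more than once."""
    if column not in ID_COLUMNS:
        raise ValueError(f"unknown ID column: {column}")
    return conn.execute(
        f"SELECT {column}, COUNT(*) FROM players "
        f"WHERE {column} IS NOT NULL "
        f"GROUP BY {column} HAVING COUNT(*) > 1"
    ).fetchall()


conn = build_demo_db()
print(compare_sources(conn))
for col in ID_COLUMNS:
    print(col, find_duplicates(conn, col))
```

The inner joins drop any player who lacks an ID in either system (player 1003 here), which is exactly the kind of gap a crosswalk makes visible.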
Note to Lucas: don’t worry about the low correlation. You will get a low correlation on HR/PA if you look at that month-to-month as well, or OBP if you look at that week-to-week. Correlations are completely 100% useless unless you also report the number of trials per player.
I cringe at every reporting of r that is not accompanied with the number of trials.
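A quick simulation shows why the number of trials matters. This is only a sketch under made-up assumptions: 300 hypothetical players whose true on-base talent is spread evenly between .280 and .380, with each PA an independent coin flip. The same talent spread produces a very different r depending on how many PA go into each sample.

```python
import random


def pearson_r(xs, ys):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def observed_rate(true_rate, trials, rng):
    """Simulate `trials` PA and return the observed on-base rate."""
    return sum(rng.random() < true_rate for _ in range(trials)) / trials


def split_half_r(trials, n_players=300, seed=1):
    """Correlation between two independent samples of `trials` PA per player."""
    rng = random.Random(seed)
    # Assumed talent distribution: uniform between .280 and .380.
    talents = [rng.uniform(0.28, 0.38) for _ in range(n_players)]
    first = [observed_rate(t, trials, rng) for t in talents]
    second = [observed_rate(t, trials, rng) for t in talents]
    return pearson_r(first, second)


for trials in (25, 150, 600):
    print(f"{trials:4d} PA per sample: r = {split_half_r(trials):.2f}")
```

The players are equally talented in every run; only the sample size changes, so an r reported without the number of trials says very little by itself.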
Kincaid gives it to you, including the SQL. I didn’t check to see how he handles the Babe Ruth issue.
Overall, he does it very similarly to the way I do it.
A couple of weeks ago I posted my results (zip). If someone wants to compare my results to Kincaid’s and report the differences, that would be lovely. Maybe I’ll post my SQL as well.
To read the first line:
- BEFORE a home run is hit, the average base/out state has a run expectancy of .526 runs
- DURING a home run, 1.60 runs are scored
- AFTER a home run, the run expectancy is .328 runs
Linear Weights is simply DURING + AFTER - BEFORE, and that’s 1.398 runs: basically the popular and static 1.40 runs that we’ve come to associate with the home run. The “error” includes the fielder’s choice (where all runners are safe), which is why the starting state has such a high run expectancy (when you have runners on base, your starting state is high… this is apparent with the IBB). You’ll note that the HR is in a slightly lower run expectancy environment, meaning a disproportionate number of HR are hit with the bases empty or with 2 outs. The same applies to the strikeout, which is why the average strikeout isn’t as costly as we’d think, and why, from 1993-2010, the run value of a strikeout matches that of the regular out.
For the “Runs”, think of it as the number of RBIs for each of those events (if they awarded an RBI for every event).
  LWTS  RE_START  RE_END  Runs        N  EVENT_CD  SHORTNAME_TX  LONGNAME_TX
 1.398     0.526   0.328  1.60    80271        23  HR            Homerun
 1.059     0.540   0.947  0.65    14868        22  3B            Triple
 0.773     0.538   0.874  0.44   138138        21  2B            Double
 0.527     0.629   0.924  0.23    32213        18  ROE           Error
 0.471     0.543   0.786  0.23   459946        20  1B            Single
 0.384     0.616   0.967  0.03      303        17  XI            Interference
 0.347     0.571   0.887  0.03    25682        16  HBP           Hit By Pitch
 0.322     0.514   0.814  0.02   235199        14  NIBB          Nonintentional Walk
 0.176     0.743   0.919  0.00    17624        15  IBB           Intentional Walk
-0.294     0.538   0.214  0.03  1429486         2  Out           Generic Out
-0.296     0.510   0.213  0.00   480203         3  K             Strikeout
 0.267     0.719   0.835  0.15     4632        10  PB            Passed Ball
 0.265     0.703   0.805  0.16    21801         9  WP            Wild Pitch
 0.254     0.673   0.777  0.15     2889        11  BK            Balk
 0.182     0.593   0.757  0.02    42039         4  SB            Stolen Base
 0.124     0.468   0.591  0.00      441         5  DI            Defensive Indifference
-0.270     0.657   0.355  0.03     8809         8  PK            Pickoff
-0.463     0.613   0.146  0.00    14857         6  CS            Caught Stealing
-0.473     0.724   0.226  0.03      916        12  OA            Other Advance
Source data: courtesy of Retrosheet, the sabermetric sliced bread
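The DURING + AFTER - BEFORE arithmetic can be checked against the table with a few lines of Python. The rows below are copied from the batting events in the table. The tolerance is my own choice: the published LWTS were presumably computed from unrounded inputs, so recomputing from the rounded columns can be off by a few thousandths (the HR row gives 1.402 rather than 1.398).

```python
# (event, RE_START, RE_END, Runs, published LWTS), copied from the table.
ROWS = [
    ("HR",   0.526, 0.328, 1.60,  1.398),
    ("3B",   0.540, 0.947, 0.65,  1.059),
    ("2B",   0.538, 0.874, 0.44,  0.773),
    ("ROE",  0.629, 0.924, 0.23,  0.527),
    ("1B",   0.543, 0.786, 0.23,  0.471),
    ("XI",   0.616, 0.967, 0.03,  0.384),
    ("HBP",  0.571, 0.887, 0.03,  0.347),
    ("NIBB", 0.514, 0.814, 0.02,  0.322),
    ("IBB",  0.743, 0.919, 0.00,  0.176),
    ("Out",  0.538, 0.214, 0.03, -0.294),
    ("K",    0.510, 0.213, 0.00, -0.296),
]

# Assumed rounding slack: Runs is shown to 2 decimals, the rest to 3.
TOLERANCE = 0.007


def linear_weight(re_start, re_end, runs):
    """Run value of an event: runs scored DURING, plus run expectancy AFTER, minus BEFORE."""
    return runs + re_end - re_start


for event, re_start, re_end, runs, published in ROWS:
    recomputed = linear_weight(re_start, re_end, runs)
    ok = abs(recomputed - published) < TOLERANCE
    print(f"{event:5s} recomputed {recomputed:+.3f}  published {published:+.3f}  match={ok}")
```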
Finally. Finally! Want to know the best wOBA, for 2006-2010, in high-leverage situations? Pujols (naturally), Ryan Howard, Carlos Beltran, and then ARod. In the same situations, Ryan Howard has 0.43 HR per FB! ARod is at 0.31.
Who pads his stats in low-leverage situations? Well, that’s ARod.
Great stuff!
David, are you ready for a challenge? Let us select the stat (say wOBA), and a split “category” (say leverage), and then show the wOBA by the splits (low, medium, high). Or say select K%, and then a split category (runners), and then show us that. And so on.
In any case, the splits leaderboards are already awesome.
http://tangotiger.net/retrosheet/reports/hands.xls (Excel)
http://tangotiger.net/retrosheet/reports/hands.csv (Text)
This will give you the total number of PA by batting hand and by pitching hand. It also breaks it down by batting hand v opposing pitching hand, and by pitching hand v opposing batting hand.
I then determine whether each batter was a switch hitter.
Data from 1950-2010, courtesy of Retrosheet, our sabermetric sliced bread.
I can never get enough of people who roll up their sleeves, and then share their building blocks. This is really no different than me teaching my kids to share their toys. And that’s what we’ve got here… toys. What you do with these toys, well, that’s the value-added you have. But, these building blocks? No reason to keep them all to yourself. Sobchak has proven himself to be fantastic in sharing, so, kudos for continuing to supply us with his toys.
I have made a few more changes, notably:
- all headers now sortable
- added a team history page
- you get the Indis for players as both pitchers and nonpitchers, like with Greg Maddux, Tim Wallach, or Babe Ruth, with their non-dominant role shaded in red
I’ll eventually add AB, H, HR, BB, SB, wOBA, birth date, etc. Just not right now.
Very cool. I like that the inning is preserved. I like the pitcher breakdowns by inning as well.
Glove-slap: B.
Finally. After so many years of waiting for someone to come up with a “my stats” feature, à la My Yahoo’s customized news and sports settings, Fangraphs has the Custom Dashboard.
I’ll give you four reasons why this is smart:
1. The user gets exactly what he wants
2. You as the website get a list of email addresses of users
3. You as the website now know exactly what stats your readers like
4. I had a 4th one, but it has now slipped my mind. I think faster than I type usually, and sometimes, I type faster than I think. Neither is helpful.
I was on Sean’s case like crazy regarding ERA+, as many of you know. Basically, while every other index stat in the world did the value of the metric divided by the “average”, ERA+ did the league average divided by the ERA of the player. In effect, instead of ER per IP, it was doing IP per ER. What made it worse is when people started to use this in calculations, using it for simple averages etc. The math did not work out.
I had proposed that he do it the consistent way, which would mean that someone who gives up runs at half the league-average rate shows as 50, rather than 200. Sean was rightfully concerned that people are used to “bigger is better”, and so that would come as a sticker shock.
Guy proposed something very simple: 2 - ERA/lgERA then times 100. This way, what would look like 50 for me would show up as 150. And the top end is 200 in the Guy method (or 0 in my method). And Sean did just that.
I was relentless about such a seemingly small thing. But it was important for the symmetry of 50 and 150 to hold. Either my method or Guy’s method would have done that (and the original version of ERA+ did not hold to that). Kudos to Sean for being good enough to take the brunt of my esotericness (esotericality?). For all the crap I gave him about it, he deserves his kudos just as much.
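Here is a small Python sketch of the three versions, to show the symmetry and the averaging problem. The function names are mine, and the two pitchers (ERAs of 2.00 and 6.00 over equal innings in a 4.00 league) are hypothetical.

```python
def era_plus_original(era, lg_era):
    """Original ERA+: league average divided by the player's ERA, times 100."""
    return 100 * lg_era / era


def era_index(era, lg_era):
    """The 'consistent' index: player's ERA divided by league average, times 100."""
    return 100 * era / lg_era


def era_plus_guy(era, lg_era):
    """Guy's proposal: 2 - ERA/lgERA, then times 100."""
    return 100 * (2 - era / lg_era)


LG_ERA = 4.00
PITCHERS = {"half the league rate": 2.00, "1.5x the league rate": 6.00}

for label, era in PITCHERS.items():
    print(
        f"{label:22s}"
        f" original={era_plus_original(era, LG_ERA):6.1f}"
        f" index={era_index(era, LG_ERA):6.1f}"
        f" guy={era_plus_guy(era, LG_ERA):6.1f}"
    )


def simple_average(func):
    """Naive average of a metric across the two hypothetical pitchers."""
    values = [func(era, LG_ERA) for era in PITCHERS.values()]
    return sum(values) / len(values)


# Over equal innings the pair combine to a 4.00 ERA, i.e. exactly league average.
print("average of original ERA+:", round(simple_average(era_plus_original), 1))
print("average of Guy's version:", round(simple_average(era_plus_guy), 1))
```

With equal innings the two pitchers combine to a league-average ERA, so a simple average ought to come out at 100. Guy’s version (150 and 50) and the plain index (50 and 150) both do; the original (200 and about 67) does not.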
Glove-slap: Colin.
From the man who brought you the injury database.
You can post it here or there.
- I reiterated to him that I like Guy’s suggestion of ERA+ as 2 - ERA/lgERA. It keeps the bigger-is-better so that his readers aren’t shocked, while maintaining the symmetry of what an ERA+ of 50 and 150 should be. It sounds like Sean might go for it.
- I’ll second the request for a split by DP opps.
- I’ll add that I want to see a split by SF opps.
- Times facing opponent should have 1,2,3,4+, not just 1,2,3+.
- Where’s FIP?
- Also, add in bbFIP.
- Do away with OPS+ and make it RC+…
- ...with RC based on Linear Weights, not the basic version that is 30 years dated
Someone asked, so here you go:
1. http://tangotiger.net/scout/index5.php (list of all players)
2. select all data in table (do NOT grab the headers)
3. ctrl-c
4. go to excel
5. this is important… RIGHT-CLICK, PasteSpecial, Text
6. done. 480 records, perfectly formatted
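If you would rather script the copy-and-paste, the same idea (take the data cells, skip the headers, end up with tab-separated text) can be sketched in Python with the standard library. This is only an illustration: it assumes the page is a plain HTML table with headers in th cells and data in td cells, which is an assumption about the page above, and the sample HTML and its values are placeholders.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of <td> cells, one list per <tr>; <th> cells are ignored."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None outside a row
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:  # header-only rows have no <td> cells, so skip them
                self.rows.append(self._row)
            self._row = None
            self._cell = None


def table_to_tsv(html):
    """Return the table's data rows as tab-separated lines."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    return "\n".join("\t".join(row) for row in parser.rows)


# Placeholder markup standing in for the real page.
SAMPLE_HTML = """
<table>
<tr><th>Player</th><th>Rating</th></tr>
<tr><td>Player A</td><td>4</td></tr>
<tr><td>Player B</td><td>2</td></tr>
</table>
"""

print(table_to_tsv(SAMPLE_HTML))
```

Tab-separated text pastes into Excel as one cell per value, which is the same result the PasteSpecial, Text step gives you.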
Good data across the board from Justin, including BaseRuns.