THE BOOK--Playing The Percentages In Baseball


Wednesday, January 05, 2011

Geoff Harcourt’s ID project

By Tangotiger, 12:31 PM

This is from Geoff:

=============================

I love stats, although I don’t think I’m particularly original when it comes to sabermetric ideas. For months I’ve been looking for a way I could contribute to the community, and I thought this would be a perfect opportunity to put some of my coding and database knowledge to work. I’ve been continually frustrated by the hassle of trying to compare player data from sources that use different ID systems. I saw that a couple of efforts had been started to bring ID systems into some kind of crosswalk, but they all seem to have either stalled or lacked the columns I wanted. To that end, I’ve started a project on Github called MLB Rosetta that I’m hoping will become a universal crosswalk of ID systems. I’ve currently pulled in 101,000 players courtesy of a file Brian Cartwright sent me. I’ve augmented that file with data from Ted Turocy’s player register project and Tom Tango’s initial Export Map project. Currently I have player IDs from MLBAM, BIS, BP, BDB, BBRef, Westbay, and NPB.
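To make the idea of a crosswalk concrete, here is a minimal Python sketch. It is not code from the project. The column names `id`, `bis_id`, and `baseball_prospectus_id` come from the table notes in this post; the helper `join_sources`, every ID value, and the stat names are invented for illustration.

```python
# Hypothetical crosswalk rows: one dict per player, one key per ID system.
# All ID values are made up for illustration.
crosswalk = [
    {"id": 1001, "bis_id": "501", "baseball_prospectus_id": "bp-a"},
    {"id": 1002, "bis_id": "502", "baseball_prospectus_id": "bp-b"},
]


def join_sources(crosswalk, source_a, col_a, source_b, col_b):
    """Merge stats from two sources that use different ID systems.

    source_a maps IDs from crosswalk column col_a to a dict of stats,
    and source_b does the same for col_b. Returns {id: merged stats}
    for players found in both sources.
    """
    merged = {}
    for row in crosswalk:
        a = source_a.get(row.get(col_a))
        b = source_b.get(row.get(col_b))
        if a is not None and b is not None:
            merged[row["id"]] = {**a, **b}
    return merged


# Invented stats keyed by each source's own ID system.
bis_stats = {"501": {"woba": 0.350}}
bp_stats = {"bp-a": {"warp": 3.1}, "bp-b": {"warp": 1.0}}

combined = join_sources(
    crosswalk, bis_stats, "bis_id", bp_stats, "baseball_prospectus_id"
)
print(combined)
```

Only the player present in both sources ends up in the result, keyed by the shared `id`.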

Here are a few notes about the players table:

id - the MLB Advanced Media ID number of the player. I’ve opted to always use this as the unique identifier, so a player is included in the database only if MLB “knows” about them (they have played in MLB, MiLB, the WBC, another international exhibition, etc.). I’m going to establish a second table to track Japanese, Korean, and NCAA players who aren’t yet in MLBAM’s database, but to start I wanted a table where MLB would be the ID provider of record. There are currently several players who were erroneously assigned two MLBAM IDs (Bryce Harper, Ervin/Elvin Santana...), and I’m trying to build a current list of dupes with some help from Brian and Tom.

bis_id, bis_milb_id - the ID numbers used by Baseball Information Solutions, who provide data to Fangraphs. Players who have appeared in the majors have bis_id numbers, and players who have only made minor league appearances have IDs from an alternative system that I’m assuming is also BIS, which I’m calling bis_milb_id. You can recognize these minor league ID numbers because they start with two letters.
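The two-letter rule is easy to apply mechanically. The sketch below is an illustration only: the function name `bis_id_kind` and the sample values are made up, and it assumes that any non-empty value that does not start with two letters is a major league bis_id, which the post does not actually state.

```python
def bis_id_kind(value):
    """Guess which BIS column a raw ID value belongs in.

    Returns "bis_milb_id" if the value starts with two letters,
    "bis_id" for any other non-empty value (an assumption), and
    None for a missing or empty value.
    """
    if value is None:
        return None
    text = str(value).strip()
    if not text:
        return None
    if len(text) >= 2 and text[:2].isalpha():
        return "bis_milb_id"
    return "bis_id"


# Invented sample values, not real player IDs.
for raw in ["ab12345", 4567, "a1", ""]:
    print(repr(raw), "->", bis_id_kind(raw))
```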

baseball_prospectus_id - ID numbers for Baseball Prospectus player cards. Some of these numbers may be wrong, although I’ve taken pains to ensure that they are correct for current players. Since BP has a predictable system for ID numbers (last name, birthdate, incrementing letter), in any case where there are no potential conflicts (two players with the same last name and birthdate) I’ve gone ahead and populated the ID numbers for players who do not yet have IDs. BP actually seems to have better data on some players from the 1800s than does MLB (many of MLB’s players from that century are missing first names, etc.).
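The "no potential conflicts" test can be sketched as follows. This is not the project's code, and it deliberately stops short of building the ID string, since the exact BP format is not spelled out here. The field names `last_name` and `birth_date`, the helper `unambiguous_players`, and the sample players are all hypothetical.

```python
from collections import Counter


def unambiguous_players(players):
    """Return players whose (last name, birthdate) pair is unique.

    Only these are safe candidates for a predicted ID. Any pair shared
    by two or more players is a potential conflict and is left out so
    it can be resolved by hand.
    """
    def key(player):
        return (player["last_name"].lower(), player["birth_date"])

    counts = Counter(key(p) for p in players)
    return [p for p in players if counts[key(p)] == 1]


# Invented players for illustration.
players = [
    {"first_name": "Al", "last_name": "Smith", "birth_date": "1980-04-01"},
    {"first_name": "Bo", "last_name": "Smith", "birth_date": "1980-04-01"},
    {"first_name": "Cy", "last_name": "Jones", "birth_date": "1985-07-15"},
]

safe = unambiguous_players(players)
print([p["first_name"] for p in safe])
```

The two Smiths share a last name and birthdate, so only the third player is returned.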

japan_npb_id - ID numbers used by Nippon Professional Baseball, the major leagues of Japan. Anyone who played in Japan in 2010 who is known to MLB has an NPB ID, but this column is still very incomplete. If anyone has lists of players from 2007-2009, I’d love to add them in here.

current - I’ve tried to mark only players who are currently on 40-man rosters as current. I did this purely for my own sanity.

I think all the other columns on the table should be self-explanatory. I’ve checked to make sure that none of the columns currently have duplicates, so if you notice any, please let me know. The most current version of this table will always be posted at https://github.com/geoffharcourt/mlb_rosetta, and I’m hoping that anyone else who wants to contribute will do so (Github makes it easy!). If you have player ID data that you’d like to integrate (CBS, ESPN, STATS, I’m looking at you), please feel free to contribute. My next steps are to create a register of players outside of MLBAM’s database and to flesh out the error/dupes list. Special thanks to Brian, Ted, and Tom, whose previous work probably shaved a few weekends off of this first edition.
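For anyone who wants to repeat the duplicate check on their own copy of the table, one possible approach is sketched below. It is an assumption-laden illustration, not the check that was actually run: it uses Python's built-in sqlite3 module with a small in-memory stand-in for the players table, and the helper `find_duplicates` and all row values are invented. The posted .sql file may need adjusting before it loads into SQLite.

```python
import sqlite3


def find_duplicates(conn, column, table="players"):
    """Return (value, count) rows for non-NULL values that appear more than once."""
    # Table and column names cannot be bound as query parameters,
    # so accept only plain identifiers before formatting them in.
    if not (column.isidentifier() and table.isidentifier()):
        raise ValueError("unexpected table or column name")
    query = (
        f"SELECT {column}, COUNT(*) FROM {table} "
        f"WHERE {column} IS NOT NULL "
        f"GROUP BY {column} HAVING COUNT(*) > 1 "
        f"ORDER BY {column}"
    )
    return conn.execute(query).fetchall()


# A tiny stand-in table with made-up rows; the real table has more columns.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE players ("
    "id INTEGER PRIMARY KEY, bis_id TEXT, baseball_prospectus_id TEXT)"
)
conn.executemany(
    "INSERT INTO players VALUES (?, ?, ?)",
    [(1, "501", "bp-a"), (2, "502", "bp-b"), (3, "502", None), (4, None, None)],
)

print(find_duplicates(conn, "bis_id"))
print(find_duplicates(conn, "baseball_prospectus_id"))
```

Missing IDs are stored as NULL and are skipped, so players without an ID in a given system are not reported as duplicates of each other.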

-Geoff geoff~harcourt~gmail~com https://github.com/geoffharcourt/mlb_rosetta

===============================

Data file is also here:
http://www.tangotiger.net/files/mlb_rosetta.sql.zip

(12) Comments • 2011/05/05 • SabermetricsData