THE BOOK--Playing The Percentages In Baseball


Monday, February 01, 2010

The Universal Player ID and Biographical Data project

By Tangotiger, 02:57 PM

All those who are extremely annoyed at all the databases, files, and systems out there that use their own player IDs, please raise your hand.  Yes, me too.  It bothers me to no end.

I will propose the following: let’s make MLBAM ID the universal ID, and let’s create a mapping table against that ID for every source we are interested in (Retro, BDB, STATS, BIS, etc).  The reason for the MLBAM ID to be the universal ID is that they have all the players we are interested in.  Indeed, at some point, I can see MLBAM being interested in college and high school players, so this would be an easy way to expand their pool.  Japanese players?  Well, ok, you got a point, but let’s consolidate everything else first.

I will post a file soon that has all the IDs that I have created in partnership with MLBAM.  They supplied me with their IDs, and their maps to Retro and BDB.  I then updated the Retro and BDB IDs that were wrong.  I then added my own mappings of STATS and BIS.  And I will post that file.  Ideally, you guys then validate the data.  After we get that done, the fun starts: validating biographical data.

While the BDB has bio data for some 17,000 players, there are over 80,000 players in the MLBAM file.  For all of you guys who want to help sabermetrics but are afraid or intimidated, this is the grunt work that needs to be done.  All I can say is: help me. 

UPDATE:
You will see something like this:
MLBAM_ID,retro_id,bdb_id,stats_id,bis_id,source_id
110015,abbop001,abbotpa01,4543,1061,292

That’s the MLBAM ID, Retrosheet ID, BDB ID, STATS ID, BIS ID.

The “source id” is for my own use: it tracks where I got the data from.  “292” tells me the data started with “source 4”, and was then updated with “source 32” and “source 256” (4 + 32 + 256 = 292).  I cobbled together 15 different sources.
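
If the source id is a sum of power-of-two flags, as the 4 + 32 + 256 = 292 example suggests, it can be split back into its parts. This is only a sketch under that assumption; the function name `decode_source_id` is made up for illustration and is not part of the posted file.

```python
def decode_source_id(source_id):
    """Return the power-of-two flags that add up to source_id, smallest first.

    Assumes each source is identified by a distinct power of two
    (1, 2, 4, ...), which is inferred from the example and not confirmed.
    """
    flags = []
    bit = 1
    while bit <= source_id:
        if source_id & bit:  # this flag is present in the sum
            flags.append(bit)
        bit <<= 1
    return flags


print(decode_source_id(292))  # [4, 32, 256]
```

The list comes back in ascending order, which only shows which sources contributed, not the order in which the updates were applied.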

Anyway, download it, link it with your own data sources, and report any problems.  This is version 0.1.  It is not complete.
EXPORT_ID_MAP.zip
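
For anyone wondering what “link it with your own data sources” might look like, here is a minimal sketch using the column names from the header line shown above. The name of the file inside the zip is not stated here, so the example reads the sample row from a string instead. The local records, the helper names `build_lookup` and `attach_mlbam_id`, and the ID `zzzzz999` are all hypothetical.

```python
import csv
import io

# Header and sample row copied from the post; a real run would open the
# extracted file instead of this string.
MAP_CSV = """MLBAM_ID,retro_id,bdb_id,stats_id,bis_id,source_id
110015,abbop001,abbotpa01,4543,1061,292
"""


def build_lookup(map_file, key_column):
    """Index the mapping rows by one ID column, skipping rows with a blank key."""
    lookup = {}
    for row in csv.DictReader(map_file):
        key = row[key_column].strip()
        if key:
            lookup[key] = row
    return lookup


def attach_mlbam_id(records, lookup, key_field):
    """Split records into those that link to an MLBAM ID and those that do not."""
    linked, unmatched = [], []
    for rec in records:
        row = lookup.get(rec[key_field])
        if row is None:
            unmatched.append(rec)
        else:
            linked.append({**rec, "MLBAM_ID": row["MLBAM_ID"]})
    return linked, unmatched


# Hypothetical local data keyed by Retrosheet ID; the second ID is made up.
my_records = [
    {"retro_id": "abbop001", "note": "in the map"},
    {"retro_id": "zzzzz999", "note": "not in the map"},
]

by_retro = build_lookup(io.StringIO(MAP_CSV), "retro_id")
linked, unmatched = attach_mlbam_id(my_records, by_retro, "retro_id")
print(linked)
print(unmatched)
```

Whatever lands in `unmatched` is the list of problems worth reporting back.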


SabermetricsData
#1    Willie McTell      (see all posts) 2010/02/01 (Mon) @ 15:29

Seems to me that player id should always be database specific, an integer data type or maybe a universal id. The abstract id should never be part of the data. What would be nice would be a standardized player “name,” a cross database data element.


#2    Tangotiger      (see all posts) 2010/02/01 (Mon) @ 15:46

Willie: the MLBAM ID satisfies that.  It’s simply an autonum, and has no meaning.  It’s the way a key should be.  The same applies for STATS and BIS.  So, that’s not the issue.

The issue is when I have multiple data sources, how to link all the data.  You need the ID mapping table.  And if it doesn’t exist, you need to create it and update it.  It’s a pain beyond belief.


#3          (see all posts) 2010/02/01 (Mon) @ 15:50

Great Tango!

Broadly this project is important.

Practically it may be crucial when & how new players enter the MLBAM database, whether & when & how MLBAM will provide data on new players.

Of course, MLBAM & STATS & BIS are already in business and we have the public player ID sets now. So Tom’s cross-tab must be valuable even without benefit of smoothly working with MLBAM in the future.


#4    David Pinto      (see all posts) 2010/02/01 (Mon) @ 15:57

Let me know if I can help.


#5    Brian Cartwright      (see all posts) 2010/02/01 (Mon) @ 19:01

This is something I have already done a lot of work on, and would be glad to help.

My db has 18,078 mlbamIDs, although 206 of them are no longer in use (usually duplicate numbers for the same person).

My db starts in 1998, and I have about 35,000 players.

I have a nearly complete collection of college stats for 2002-2009, with some prior years, but they do not have IDs. This is another 40,000 or so players.

I have a complete set of the Westbay and Eng IDs for Japan, and the Korean IDs.

mlbamID covers all the Gameday leagues 2005 to present. This includes Mexico and Israel and the Caribbean winter leagues.

SABR’s milbID covers all of the US minors and majors back several decades. It includes US independent leagues and Japan, which mlbamID does not cover.


#6    Tangotiger      (see all posts) 2010/02/02 (Tue) @ 12:54

Brian: if we take it a bit at a time, can you send me:
mlbamid, retroid, bdbid

David Pinto: can you send me
retroid, bdbid, statsid, bisid
(and the mlbamid if you have it, but not necessary)

Thanks…


#7    Tangotiger      (see all posts) 2010/02/02 (Tue) @ 13:03

email is tom~tangotiger~net


#8          (see all posts) 2010/02/03 (Wed) @ 22:18

“Pain beyond belief” doesn’t even begin to describe it. If you can find anything of value in what I’ve prepared so far this year, help yourself.

More of a year-to-year thing for me, but maybe there’s something useful there.

Equally painful to me has been the unresponsiveness of organizations such as STATS to requests for data.  Hopefully this project will eliminate that roadblock.


#9          (see all posts) 2010/02/03 (Wed) @ 23:26

This is an outstanding endeavor.

I am only a couple of hours of work away from having a mapping of stats onto bdb - an update of the Xref_Stats table through 2009.  If David has this completed, it would be great to get.  If not, I can certainly finish mine and send it.


#10    Tangotiger      (see all posts) 2010/02/04 (Thu) @ 10:52

Scott, someone sent me something along those lines already.  If you haven’t started, you can hold off if you like.

I’m just waiting to hear from Brian and David first with their data.


#11    Tangotiger      (see all posts) 2010/02/05 (Fri) @ 17:50

At the top of this thread, I have added an “UPDATE”, including a zipped file with all the IDs matched.  Let me tell you: this was NOT fun!

Anyway, download it, link it to your own ID files, and report errors, or fix your own ID files.

Anyone who submitted an ID file, please RESUBMIT after you have fixed your errors.  Ideally, after 2 or 3 iterations, we should all be synched up, and I can publish the “final” zipped file with all the IDs perfectly matched.


#12    Josh      (see all posts) 2010/02/05 (Fri) @ 18:46

Let me just say, Tangotiger, you’re my hero.

I’ll submit any corrections I can find.


#13    Tangotiger      (see all posts) 2010/02/05 (Fri) @ 19:17

I forgot to note that these are showing duplicates based on conflicting data sources and I haven’t tried to resolve them yet:

bis_id mlbam1 mlbam2
18 297292 491703
49 112004 452035
263 110638 445216
936 233594 446248
1137 237796 425532
1479 400123 420944

stats_id mlbam1 mlbam2
6918 400123 420944
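
A check along these lines would surface such duplicates: group the MLBAM IDs by the other source’s ID and keep any ID that ends up with more than one. This is a sketch, not how the list above was actually produced. The function name `find_conflicts` is made up, and the sample pairs reuse the first two bis_id rows above plus the single clean row from the UPDATE.

```python
from collections import defaultdict


def find_conflicts(pairs):
    """Return {source_id: sorted MLBAM IDs} for IDs mapped to more than one MLBAM ID."""
    seen = defaultdict(set)
    for source_id, mlbam_id in pairs:
        seen[source_id].add(mlbam_id)
    return {sid: sorted(ids) for sid, ids in seen.items() if len(ids) > 1}


# (bis_id, mlbam_id) pairs taken from this thread.
bis_pairs = [
    (18, 297292),
    (18, 491703),
    (49, 112004),
    (49, 452035),
    (1061, 110015),  # maps to a single MLBAM ID, so it is not reported
]

print(find_conflicts(bis_pairs))
# {18: [297292, 491703], 49: [112004, 452035]}
```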


#14    Joe Arthur      (see all posts) 2010/02/05 (Fri) @ 22:36

your 2nd source for mlb id has the correct mapping to bis_id in all cases you list in #13. Also the 2nd source has the correct mapping to stats id 6918 (Mike Smith). [400123 should map to 6728 (Roy Smith)]. The others from source 1 must map to minor leaguers, since I don’t have those IDs at all in my own mapping…


#15    Peter Jensen      (see all posts) 2010/02/17 (Wed) @ 11:31

Tango - You may want to add another ID number to the map that you are creating.  B-Ref, which uses the BDB-Id for its major league data, uses a different form of ID for its minor league data, presumably inherited from the SABR minor league database project, which is the provider of minor league data to B-Ref.  Since the B-Ref minor league database has some biographical data on minor league players, such as their home town, draft position, and team, as well as the year-end stats, it is a worthwhile reference source to have a link to.


#16    Tangotiger      (see all posts) 2010/02/17 (Wed) @ 11:49

Actually, B-Ref does NOT use the BDBid.  You can check Kevin Youkilis and Jacque Jones.  However, in the MASTER table of the BDB, there IS the brID there, so that’s easy enough to add.

As for the minor league ID, I’ll ask Ted if he wants to give it to me.  (I think Ted is in charge of the minor league data.)


#17    joe arthur      (see all posts) 2010/02/17 (Wed) @ 13:46

I don’t know how accurate the cross references to the b-ref ID are in the master table of BDB. I found some inaccuracies in the retro_ids in that table however ...
lahmanid | bdb_retro | retroid
---------+-----------+---------
13176 | smitb113 | smitb107
13175 | smitb106 | smitb113
10524 | obrij102 | obrip101
9275 | mcgob001 | mcgob901
7595 | klemb001 | klemb901
6644 | hubbc001 | hubbc901
4766 | fuchj801 | fuche801
3684 | donnj102 | donnp101
3372 | dayj801 | day-j801
2436 | chyln001 | chyln901
1051 | bick801 | bicku801
616 | barla001 | barla901
18072 | kigem101 | kigem001
18110 | carfe001 | caffu101
18111 | glasn001 | glasn101
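
A comparison like the one above can be sketched as a keyed diff: index both sources by lahmanid and report the keys where the two Retrosheet IDs disagree. The function name `diff_ids` is hypothetical. The first three entries come from the table above, and the last one (lahmanid 1) is a made-up matching row, included only to show that agreeing rows are not reported.

```python
def diff_ids(left, right):
    """Return (key, left_id, right_id) for keys present in both maps whose IDs differ."""
    mismatches = []
    for key in sorted(left.keys() & right.keys()):
        if left[key] != right[key]:
            mismatches.append((key, left[key], right[key]))
    return mismatches


# lahmanid -> retro ID as stored in the BDB master table.
bdb_retro = {13176: "smitb113", 13175: "smitb106", 3372: "dayj801", 1: "aaaaa001"}
# lahmanid -> retro ID from the second source.
retroid = {13176: "smitb107", 13175: "smitb113", 3372: "day-j801", 1: "aaaaa001"}

for row in diff_ids(bdb_retro, retroid):
    print(row)
```

Keys that exist in only one source are ignored here; those would need a separate pass.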


#18    Ed D.      (see all posts) 2010/02/19 (Fri) @ 03:15

Here are a few recent vintage hitters missing from your .zip file (I think) ...

retro_id | bis_id | name
-------------------------------------
hills002 | 196 | shea hillenbrand
wrigr002 | 1103 | ron wright (one game only!)
mahom001 | 1544 | mike mahoney
whitm003 | 1720 | matt white
edwam001 | 1850 | mike edwards
johnb003 | 2731 | ben johnson
smitr004 | 407 | roy smith
butlb002 | 423 | brent butler

and I think this is one of your duplicates from post #13:

retro_id | bis_id | name
-------------------------------------
willj002 | 674 | jeff williams

You show his bis_id as 1137, but I think it should be 674.

-Ed D.


#19          (see all posts) 2010/02/24 (Wed) @ 10:12

You should coordinate this with MLBAM, as they’ve done some work doing these mappings already.  You may also need to decide how to handle DOB and name changes.


#20    Tangotiger      (see all posts) 2010/02/24 (Wed) @ 10:40

Chris, you may have missed this:

I will post a file soon that has all the IDs that I have created in partnership with MLBAM.  They supplied me with their IDs, and their maps to Retro and BDB.

I have a working relationship with the guys there, and we’re going to do our best to make sure we can get something unified out there.

