Monday, February 01, 2010
The Universal Player ID and Biographical Data project
All those who are extremely annoyed at all the databases, files, and systems out there that use their own player IDs, please raise your hand. Yes, me too. It bothers me to no end.
I propose the following: let’s make the MLBAM ID the universal ID, and let’s create a mapping table against that ID for every source we are interested in (Retro, BDB, STATS, BIS, etc.). The reason for making the MLBAM ID the universal ID is that they have all the players we are interested in. Indeed, at some point, I can see MLBAM being interested in college and high school players, which would be an easy way to expand their pool. Japanese players? Well, ok, you got a point, but let’s consolidate everything else first.
I will soon post a file that has all the IDs, which I put together in partnership with MLBAM. They supplied me with their IDs and their maps to Retro and BDB. I then corrected the Retro and BDB IDs that were wrong, and added my own mappings of STATS and BIS. Ideally, you guys then validate the data. After we get that done, the fun starts: validating biographical data.
While the BDB has bio data for some 17,000 players, there are over 80,000 players in the MLBAM file. For all of you guys who want to help sabermetrics but are afraid or intimidated, this is the grunt work that needs to be done. All I can say is: help me.
UPDATE:
You will see something like this:
MLBAM_ID,retro_id,bdb_id,stats_id,bis_id,source_id
110015,abbop001,abbotpa01,4543,1061,292
That’s the MLBAM ID, Retrosheet ID, BDB ID, STATS ID, and BIS ID.
The “source id” is for me, as it tracks where I got my data from. “292” just tells me that the data started with “source 4” and was then updated with “source 32” and “source 256”. I have 15 different sources that I cobbled together.
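If you are curious how a number like 292 breaks down, here is a minimal sketch. It assumes the source id is simply the sum of power-of-two source numbers (4 + 32 + 256 = 292), which is what the example suggests but is not spelled out above. The function name `decode_source_id` is made up for illustration.

```python
def decode_source_id(source_id: int) -> list[int]:
    """Split a combined source id into its power-of-two components.

    Assumption: each source is numbered with a distinct power of two and
    the stored value is their sum, so 292 = 4 + 32 + 256.
    """
    sources = []
    bit = 1
    while bit <= source_id:
        if source_id & bit:  # this source contributed to the row
            sources.append(bit)
        bit <<= 1
    return sources


print(decode_source_id(292))  # [4, 32, 256]
```

With 15 sources numbered this way, the largest single source number would be 16384, so the combined value still fits comfortably in an ordinary integer column.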
Anyway, download it, link it with your own data sources, and report any problems. This is version 0.1. It is not complete.
EXPORT_ID_MAP.zip
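For anyone unsure how to link the map with their own data, here is a rough sketch in Python using only the standard library. It reads the sample row shown above from an in-memory string; for the real thing you would open the CSV extracted from the zip instead (I am not assuming its file name here). The helper `build_lookup`, the list `my_rows`, and the ID `zzzzz999` are hypothetical and exist only for illustration.

```python
import csv
import io

# The header and sample row from this post, standing in for the real file.
SAMPLE = """MLBAM_ID,retro_id,bdb_id,stats_id,bis_id,source_id
110015,abbop001,abbotpa01,4543,1061,292
"""


def build_lookup(csv_file, key_column):
    """Map one source's player ID to the MLBAM ID."""
    lookup = {}
    for row in csv.DictReader(csv_file):
        key = row[key_column]
        if key:  # skip players that have no ID in this source
            lookup[key] = row["MLBAM_ID"]
    return lookup


retro_to_mlbam = build_lookup(io.StringIO(SAMPLE), "retro_id")

# Hypothetical rows from your own Retrosheet-keyed data.
# "zzzzz999" is a deliberately made-up ID that will not match.
my_rows = [{"retro_id": "abbop001"}, {"retro_id": "zzzzz999"}]

unmatched = []
for row in my_rows:
    row["MLBAM_ID"] = retro_to_mlbam.get(row["retro_id"])
    if row["MLBAM_ID"] is None:
        unmatched.append(row["retro_id"])

print(my_rows[0]["MLBAM_ID"])  # 110015
print(unmatched)               # ['zzzzz999']
```

The `unmatched` list is the useful part for validation: those are the IDs to report as problems.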
Seems to me that player id should always be database specific, an integer data type or maybe a universal id. The abstract id should never be part of the data. What would be nice would be a standardized player “name,” a cross database data element.