THE BOOK cover
The Unwritten Book is Finally Written!
An in-depth analysis of: The sacrifice bunt, batter/pitcher matchups, the intentional base on balls, optimizing a batting lineup, hot and cold streaks, clutch performance, platooning strategies, and much more.
Read Excerpts & Customer Reviews
If you are a media member and would like a review copy of The Book, please contact Kevin Cuddihy of Potomac Books.

Buy The Book from Amazon

MOST RECENT ARTICLES
MAIL : You ask | We say

Advanced


THE BOOK--Playing The Percentages In Baseball

<< Back to main

Monday, June 23, 2008

Mapping IDs

By Tangotiger, 01:38 PM

If you are looking to make a contribution to the world of sabermetrics, here is the perfect little project for you:  Create a mapping table of all player IDs out there.


  • Captain Crunch put out the CBS, MLB.com, NFBC (whatever that is), and BP ids. 
  • The BDB has its ID (formerly Lahman), Baseball-Almanac (Holtz), Retrosheet, and B-r.com.
  • Mike Fast (can’t find it right now) posted the Retrosheet, MLB.com mappings.  (Mike: I found a couple of tiny mistakes.  If you can post it somewhere, I’ll give you the errors.)

So, this is what I would like:
1. Post all your mappings somewhere

That’s it.  Someone, maybe me, will then merge all of them to come up with the (current) definitive list.

Ideally, other sites will be as bothered as the rest of us in terms of mapping everything, that they will contribute their mappings of the new players in the future to keep this up-to-date.  All those minor league IDs, college websites, Japanese websites, etc, can finally have everything linked up.  Basically the “universal ID”.  Is it possible?  Let’s see…


SabermetricsData
#1    Tangotiger      (see all posts) 2008/06/23 (Mon) @ 14:10

Oh, and if that’s any kind of success, then we can try parks, teams, and leagues.


#2    Rally      (see all posts) 2008/06/23 (Mon) @ 14:49

Every site I use to get stats in season lists the player by name rather than ID.  For example, go to Fangraphs, select a list, and export to excel.  If you want to work with this and use a player ID, you have to map it from the name to your player ID of choice.

A real pain, especially with multiple Alex Gonzali and proper names vs nicknames (Mike vs Michael).  When I do this, the first column shows the player name, but contains a link with the player ID in it - anyone know who to take that player ID out of the link and put it in a column?

Other sites (ex. Dougstats) do not have a player ID displayed.  Not sure how to get ID’s from Baseball-Ref either, without setting up a program to spyder every player page.

Probably not a big deal if you buy a stat feed from BIS or something, but a real pain for us freeloaders.


#3    Tangotiger      (see all posts) 2008/06/23 (Mon) @ 15:11

Rally, I posted this elsewhere, and will repeat here.  After you download the csv file:

=================================

In a text editor, do this:
search/replace
">
to
,"

(That isreplace two double-quotes and greater than signby comma and double-quote)

-
search/replace
</a>
to

(That isreplace that tag by nothing.)

-
search/replace
"<a href=""statss.aspx?playerid=
to

(That is, replace that string to nothing.)

- search/replace
&position=
to
,"
(That isreplace as a comma and double-quote)

In the first lineadd this text preceding "name"
"playerid"
,"pos",

You now have a properly formatted csv file.

====================================
Note my blog software converts the two-tick character into either opening or closing quotes.  I’m referring to the key right next to the Enter key (SHIFT single quote).


#4    Colin Wyers      (see all posts) 2008/06/23 (Mon) @ 15:46

Sounds like a fun project. Why can I see myself getting roped into this. (Do you think this should be part of the RetroSQL project, or something seperate?)

One thing that would be helpful is a list of where people want mappings for. Obviously any of the mappings already in the BDB would be included, so that covers:

* Retrosheet
* Lahman/BDB
* Baseball Reference
* Baseball Almanac

Ones than I can think of wanting:

* MLB/Elias
* CBS
* ESPN
* Sportsnet.ca (Has actual STATS, Inc. ZR chances, which I like)
* Fangraphs

Hardball Times doesn’t seem to have actual IDs, just last name and first name. You could still do a mapping, but it would look a little different.


#5    Tangotiger      (see all posts) 2008/06/23 (Mon) @ 16:09

I think I would eventually collapse it into RetroSQL.  Remember that the way I’m doing it there is to create a RetroLinks schema, so anything else that comes up here would get linked there.

Note that MLB/Elias should be termed MLB or MLB.com.  According to someone deeply involved with MLB.com, Elias has nothing to do with the ID.


#6    Colin Wyers      (see all posts) 2008/06/23 (Mon) @ 16:23

My preference would be to call it “MLBAM,” then, just because they’re ALL MLB id’s, in a sense.

IDs for non-MLB sources could come from:

* MinorLeagueSplits
* Baseball-Reference
* MiLB.com
* Fangraphs

Am I missing somewhere?


#7    Sang Park      (see all posts) 2008/06/23 (Mon) @ 17:00

I just posted the IDs we’ve collected from ESPN and Yahoo over at (click name) if anyone is interested.  I’m looking to add MLB ids soon, and if anyone has any other sites to collect ids from, please let me know and I can re-direct my rudimentary spider to collect those ids.


#8    Mike Fast      (see all posts) 2008/06/23 (Mon) @ 17:33

The MLB.com-to-Lahman cross-reference file I published was here:
http://mikefast.googlepages.com/elias_to_lahman.csv

I called the MLB.com ID an Elias ID because that’s what Joseph Adler called it in his Baseball Hacks book.  I had someone else (I forget who) confirm that it was an Elias ID, but they may have had the same reason for doing that as I did, for all I know.

Tango, I know you said that Cory said those were MLB.com’s ID’s and not Elias, but I’m curious whether they were Elias ID’s at some point in the past, or if anyone knows what Adler’s reason was for calling them Elias ID’s.  It sorta makes sense that they may have been Elias ID’s originally since Elias was the official statkeeper for MLB.


#9    Tangotiger      (see all posts) 2008/06/23 (Mon) @ 19:50

Post 7 was marked for moderation and is now available.


#10    MGL      (see all posts) 2008/06/23 (Mon) @ 19:56

And don’t forget to account for players who change their names, like Santiago Casilla (RHRP) of the A’s.  He used to be Jairo Garcia but changed his name (from a phony one to his real one, because he lied about his age) in 2006.


#11    Captain Crunch      (see all posts) 2008/06/23 (Mon) @ 21:04

Rally #2 ...

Yes, you can strip the IDs out of html links, but some are more work than others.  It involves Perl scripting. If you’re interested, let me know and I can provide you with some help.  I’m no Perl whiz, but I can (generally) make it work.  Sometimes I have to use a workaround.

btw, I have added the Fangraphs IDs to my files (click name)

Stripping the links is useful if you cannot download a file ... and not many sites (in my experience) include the numerical ID in a file. Another problem is sites that use asp to provide web pages.  Then you’ve got to trap the code with a sniffer and work from there.


#12          (see all posts) 2008/06/24 (Tue) @ 06:32

I forgot about name changes. How about Ian Oquendo to Ian Snell? It took baseballcube a little time to merge them into one player.

In my database, I have no name field in the ballparks table, only ID, location and other descriptive info. Then I have a seperate names table, which is linked to the IDs in a ballpark configuration table, which for each season, for seach ballpark ID, lists the name used, the version number I assigned for park factors, and the dimensions that season.

For players, the birth name (as best we know) could be in the persons table, with alternate names going into an alias table, which would then similarly be linked year by year, person ID by person ID, as to what name was used that season.

And I saw person, and not players, because RetroIDs are used for non playing umpires and managers. The persons table has name, birth, death, etc that every person has. The players table would have handedness, etc, that only are important for players, and not managers or umpires, but which allows a person to assume any of these roles at any time.


#13    Tangotiger      (see all posts) 2008/06/24 (Tue) @ 09:54

From Joe Arthur:

===========================================
The first spreadsheet has a second page with a few notes. I have cross-references through 2007 for most of the IDs mentioned by Colin in #4 of the Mapping IDs thread. I did the work sometime last winter.

http://spreadsheets.google.com/pub?key=psZsFBVYwUe1Xh5dEBfUOfQ
http://spreadsheets.google.com/pub?key=psZsFBVYwUe2aPC2R_I0NaA
http://spreadsheets.google.com/pub?key=psZsFBVYwUe14lH7xdl81bQ


#14    Colin Wyers      (see all posts) 2008/06/24 (Tue) @ 13:04

Here’s what I’m thinking, as far as a database setup.

ID_MAP_MLB - All the mappings for players in MLB.
ID_MAP_MINORS, ID_MAP_NCAA, ID_MAP_NPB, etc. - Mappings for other leagues.
ID_MAP_COMBO - To show us which players move between leagues, and what their IDs are.
ID_MAP_CHANGE - For players who change their names.

Does that work for everyone? Also, what do we want to use as a primary key for ID_MAP_MLB? Do we want to (shudder) create our own ID? Or use Baseball-Reference as the “default ID?”


#15    KJOK      (see all posts) 2008/06/24 (Tue) @ 14:59

A few semi-random points:

I think Ted Turocy has already done quite a bit of player ID mapping, so hopefully he’ll chime in.

The challenge is not so much creating a X-reference, it’s keeping the X-reference up to date.

If you’re going to create such a monster, I think it makes sense to do it ‘right’, and include the largest common pool of players - minor leaguers.  Which means including the SABR MILB DB ID’s.  If you’re going to go to this amount of effort, make it so the X-reference monster table can tie in ‘new’ datasets, such as High school attended, salaries, transactions, injuries, etc.


#16    Tangotiger      (see all posts) 2008/06/24 (Tue) @ 15:12

I agree with KJOK’s assessment that the bear is in terms of upkeep.  Whatever we do has to be as plug-in as possible.


#17    Colin Wyers      (see all posts) 2008/06/24 (Tue) @ 16:44

KJOK, is there any way we could get a copy of the master ID list for the Minor League Encyclopedia?


#18          (see all posts) 2008/06/24 (Tue) @ 16:54

This gets into RetroSQL territory...we need multiple IDs in a player table

RetroID provides links to the entire history of MLB, as do other IDs

I will be adding minor league data to my database, including play by play from 2005 forward. These use the MLBAM IDs. When a player reaches the majors and gets a RetroID, we need a preferably automated system to make the match.

MLBAM’s minor league xml files are on the same server and have the same file format as the major league files. There are fields that give a players birthdate and handedness as well as their IDs and name. I’m currently using Mike Fast’s parser, but it only enters ID and name in the players table. Havng the date of birth and handedness would go a long way towards creating a virtually unique key to do automated matching to a table with a different ID system.


#19          (see all posts) 2008/06/24 (Tue) @ 17:00

wrt #14:

I created my own ID# because I needed a common reference set of names that included everyone.  This enabled me to add a new set reasonably quickly - as this has to be done by the computer and not by me comparing player-by-player. My original attempts at this used the IDs from the Lahman db, but I had to create my own because I wanted a list of current players and the other db is only up to last year. The ID that I’ve created is irrelevant and can be discarded by anyone who downloads the file.

... and wrt #15 and #16

I’ve also worried about the energy that would be required for maintenance of the file that I’ve created.  I have to be able to do this programmatically and have work to do there as my Perl skills aren’t good enough.


#20    Colin Wyers      (see all posts) 2008/06/24 (Tue) @ 17:07

Also, if we’re going to go that far, might as well include NPB. The best resource I’ve found for that is here:

http://japanesebaseball.com/forum/thread.jsp?forum=17&thread=5709

and here:

http://japanesebaseball.com/data/index.jsp


#21    Ted Turocy      (see all posts) 2008/06/24 (Tue) @ 20:33

Yes, I have been trying to create such a cross-reference as part of the SABR Minor Leagues database.  We have a partial cross-reference among:

* SABR Minor Leagues Database IDs (which covers the largest set of players)
* SABR Person IDs (covers major league personnel only)
* Retrosheet IDs
* baseball-databank IDs
* MLBAM IDs
* Michael Westbay’s NPB database
* The integer minor league IDs baseball-reference.com uses for the minors portion of its site.

I’d have no problem making a dump of the concordance available for players where we have an ID other than the SABR minors ID.  That would cover, essentially, major league players plus recent (2005-) players in affiliated minors.  I assign SABR database IDs to all players as they debut, so the database contains all players who made their pro debut through yesterday (for affiliated leagues) and through earlier this month (for indy leagues).

Dumping the entire list of SABR minor league IDs is probably impractical, since we have 166,596 people in the database (as of this afternoon).  Many of these people we have only very sketchy information on, some people may have two records in the database, other cases we have playing records from multiple people assigned to the same person, and so on—so for these players, the SABR minor league IDs are extremely provisional.  But, for major leaguers and for players with MLBAM IDs, they are stable enough I would feel comfortable making them available.


#22    KJOK      (see all posts) 2008/06/24 (Tue) @ 22:11

WRT #17 - As Ted said, the SABR Minor League Database ID’s are really still in ‘alpha’ state, and ID’s for ‘older’ season players are not quite ready to be released.


#23    Bryan Donovan      (see all posts) 2008/06/30 (Mon) @ 00:16

I just want to note that the Hardball Times uses BIS (Baseball Info Solutions) IDs for player IDs, which is the same as Fangraph’s. 

I’ve written a crude script that does a pretty decent job of matching PitchFX (MLB) IDs to BIS IDs, and I think it could be adapted to other IDs as well.  The script basically compares two tables with player IDs and guesses which ones match.  It’s a rather strict guess, so if there are any players with the same name, etc., it won’t flag them as matches.

Here’s the script: (click name)


#24          (see all posts) 2008/06/30 (Mon) @ 06:52

I created my file with the aid of an Excel macro that I downloaded.  The List Compare Macro is available here


#25    Tangotiger      (see all posts) 2008/07/08 (Tue) @ 08:15

Post 23 was marked for moderation and is now open.


Page 1 of 1 pages


Name (required)
E-Mail (optional)
Website (optional)

<< Back to main