THE BOOK--Playing The Percentages In Baseball

Monday, April 11, 2011

Baseball ID Mapping File: updated daily

By Tangotiger, 11:25 AM

Great stuff from Ted and friends.  For all those who are already wasting your valuable time matching IDs and the like: stop doing that sh!t.  I can’t count the number of hours I’ve wasted doing that.  Must be in the dozens, if not past 100 already.  And if you’ve got a new dataset that you can contribute (one that Ted hasn’t already mapped… check to see if he has it), see if Ted can incorporate it.  This is a beautiful thing, and I’ve been transitioning all my mappings to his file.

I am pleased to announce that the Baseball ID Working Group is now publishing a register of professional players, which is updated daily. The Baseball ID Working Group is a consortium of data providers and analysts who are working together to publish a definitive register of basic identifying information on players, managers, and umpires throughout the history of professional baseball internationally, including a cross-reference table of major person identification systems.
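
(Ed note: to make the idea of a cross-reference table concrete, here is a minimal Python sketch of how such a table might be used to translate IDs from one system to another. The column names and ID values below are made up for illustration; the README.txt in the download describes the actual layout.)

```python
import csv
import io

# Hypothetical cross-reference rows. Column names and IDs are placeholders,
# not the real layout of the register (see README.txt for that).
XREF_CSV = """person_id,retro_id,mlbam_id
p0001,aaaa001,100001
p0002,bbbb001,100002
p0003,cccc001,
"""

def build_lookup(xref_text, from_col, to_col):
    """Map IDs in one system to IDs in another, skipping rows with a blank on either side."""
    lookup = {}
    for row in csv.DictReader(io.StringIO(xref_text)):
        if row[from_col] and row[to_col]:
            lookup[row[from_col]] = row[to_col]
    return lookup

retro_to_mlbam = build_lookup(XREF_CSV, "retro_id", "mlbam_id")
```

With a lookup like this, translating a dataset keyed on one ID system into another becomes a dictionary access per row instead of hand matching on names.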

The current (provisional) download site for this is:

http://balco.sabr.org/data/baseballid/

Each day’s release is dated in the filename; for convenience, the file baseballid-latest.zip always points to the most recent update.

Updates should happen approximately every morning. However, there is a manual approval process, so there may be an occasional day in which an update does not occur if I am not available to complete it.

These data are available under a Creative Commons non-commercial license for the benefit of the community.

There is a README.txt available which contains full details of what is present, and how to interpret/use it. Importantly, it also contains the names of several key contributors who have been instrumental in helping get this together.

Please direct enquiries to me offlist at dataczar (at) sabr (dot) org.

Ted


SabermetricsData
#1    MGL      (see all posts) 2011/04/11 (Mon) @ 12:40

Yes, this is fantastic!


#2    Peter      (see all posts) 2011/04/11 (Mon) @ 18:09

Fantastic! This deserves so much praise.

I created a very quick-and-dirty R script to update my MySQL database with the newest ID tables every night. It’s probably not relevant to all that many, but I figured some might find it useful, so I posted it--click my name.


#3          (see all posts) 2011/04/11 (Mon) @ 19:48

What I’d like to see along with this is standardized team and franchise three-letter identifiers.


#4    Tangotiger      (see all posts) 2011/04/11 (Mon) @ 20:06

I’m sure Ted can (or already does) incorporate this.  Consider these as entity or object IDs.


#5    Colin Wyers      (see all posts) 2011/04/11 (Mon) @ 20:53

Cliff,

The teams table in the Baseball Databank is a good step in this direction - that’ll net you Baseball Reference and Retrosheet team IDs (Retrosheet is what we use at BPro as well).

And just a note - I sent Ted our master player list from BPro, to incorporate into this as he sees fit.


#6    MGL      (see all posts) 2011/04/12 (Tue) @ 02:12

Retrosheet unfortunately decided in 2005 that Anaheim was going to be ALA. Everyone else uses LAA.  Retrosheet should have changed it, but I don’t think they did.  Otherwise retrosheet’s rules for team names are or should be the standard.  The code is the first three letters of the city.  If the city has two words, then it is the first letter of each word followed by A or N.  If the same city has a team in both the AL and the NL, then it is the first two letters followed by N or A.  I think that covers everything unless there is eventually a city with three words.  Then it could be the same as if it only had the first two words, or it could be the first letter of each word.  Plus, it could eventually come up that more than one team in the same league plays in the same city (like two NL teams in NY), in which case another rule would have to be made.
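
(Ed note: the naming rules described in this comment can be sketched as a small Python function. This is only an illustration of the rules as stated above, not Retrosheet's own procedure; the function name and the `shared_city` flag are made up here, and for cities with three or more words it arbitrarily takes the "treat it like the first two words" option.)

```python
def team_code(city, league, shared_city=False):
    """Build a three-letter team code from the rules described in the comment.

    league      -- "A" or "N"
    shared_city -- True when both the AL and the NL have a team in this city
    """
    if league not in ("A", "N"):
        raise ValueError("league must be 'A' or 'N'")
    words = city.upper().split()
    if not words:
        raise ValueError("city must not be empty")
    if len(words) >= 2:
        # Multi-word city: first letter of the first two words, then the league.
        return words[0][0] + words[1][0] + league
    if shared_city:
        # One-word city with a team in each league: two letters, then the league.
        return words[0][:2] + league
    # One-word city with a single team: first three letters.
    return words[0][:3]
```

For example, `team_code("Boston", "A")` gives `BOS`, `team_code("New York", "N")` gives `NYN`, and `team_code("Chicago", "A", shared_city=True)` gives `CHA`.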


#7    Peter Jensen      (see all posts) 2011/04/12 (Tue) @ 10:02

Retrosheet unfortunately decided in 2005 that Anaheim was going to be ALA. Everyone else uses LAA.  Retrosheet should have changed it, but I don’t think they did.  Otherwise retrosheet’s rules for team names are or should be the standard.

MGL - Retrosheet uses ANA (not ALA) for Anaheim because that’s what MLB uses in the files from which Retrosheet now gets its data.  I think everybody else uses LAA because that’s what they were using previously and it made no sense to change everything over.  Retrosheet, and MLB for the current minor leagues, should be the standard, although I think there is one duplication of a three-letter code between the minor and major leagues.  For defunct major league teams, Retrosheet also gives a three-character letter-and-number code for all the previous teams on its web site.  For defunct minor league teams, it would probably make sense to follow a similar standard.


#8    Brian Cartwright      (see all posts) 2011/04/12 (Tue) @ 11:38

mlbam has many duplicate three letter team codes between levels - they are only unique on each level, and that’s how they implement them. For example, colmlb is Colorado, colaaa is Columbus, Ohio.
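
(Ed note: since the three-letter code is only unique within a level, a lookup has to be keyed on the code and the level together. A minimal Python sketch using the two examples above; the assumption that every code is three letters followed by a level suffix is inferred from those two examples only.)

```python
# Keyed on (team code, level) because the code alone is ambiguous.
TEAMS = {
    ("col", "mlb"): "Colorado",
    ("col", "aaa"): "Columbus, Ohio",
}

def split_mlbam_code(code):
    """Split a combined code such as 'colmlb' into (team, level)."""
    code = code.lower()
    return code[:3], code[3:]

def team_name(code):
    """Look up a team by its combined code, e.g. 'colaaa'."""
    return TEAMS[split_mlbam_code(code)]
```

Here `team_name("colmlb")` returns Colorado and `team_name("colaaa")` returns Columbus, Ohio, even though both share the code `col`.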

I’ve blended mlbam’s team codes with SABR’s and have so far managed to find a three letter code for every pro team, foreign and domestic, starting in 1998. This week I had to deal with the Tucson Padres, which is technically an expansion team, but got lazy and went with the same ‘TUC’ of the old team there.


#9    Brian Cartwright      (see all posts) 2011/04/12 (Tue) @ 18:54

I just successfully downloaded the file using this Python script; it takes about 5 minutes for the 10 MB file.

Now I’ll write a DOS batch file that calls this script to get the file, unzips it, and loads the CSV files into MySQL.

import urllib
# Python 2: save the latest ID archive to a local path
urllib.urlretrieve( "http://balco.sabr.org/data/baseballid/baseballid-latest.zip", 'c:/mybbos/data/sabr/baseballid-latest.zip' )

The smiley face represents a single quote followed by a closing parenthesis.

(Ed note: I put a space and that takes care of the smiley issue.)
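
(Ed note: for anyone who would rather do the whole download, unzip, and load sequence in one Python 3 script instead of a batch file, here is a rough sketch. It uses the standard library's sqlite3 as a stand-in for MySQL, creates one all-text table per CSV file, and assumes the files are UTF-8 with a header row; none of that is confirmed by the README, so treat it as a starting point.)

```python
import csv
import io
import sqlite3
import urllib.request
import zipfile

URL = "http://balco.sabr.org/data/baseballid/baseballid-latest.zip"

def fetch_zip(url=URL):
    """Download the archive and return its raw bytes."""
    with urllib.request.urlopen(url) as response:
        return response.read()

def load_zip_into_db(zip_bytes, conn):
    """Create one table per CSV file in the archive and return the table names."""
    tables = []
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for name in archive.namelist():
            if not name.lower().endswith(".csv"):
                continue  # skip README.txt and anything else
            table = name.rsplit("/", 1)[-1][:-4]  # file name without ".csv"
            text = io.TextIOWrapper(archive.open(name), encoding="utf-8", newline="")
            reader = csv.reader(text)
            header = next(reader)
            columns = ", ".join('"%s" TEXT' % col for col in header)
            marks = ", ".join("?" for _ in header)
            # Rebuild the table from scratch on every run.
            conn.execute('DROP TABLE IF EXISTS "%s"' % table)
            conn.execute('CREATE TABLE "%s" (%s)' % (table, columns))
            conn.executemany('INSERT INTO "%s" VALUES (%s)' % (table, marks), reader)
            tables.append(table)
    conn.commit()
    return tables
```

A nightly job could then call `load_zip_into_db(fetch_zip(), sqlite3.connect("baseballid.db"))`. Swapping in MySQL would mean replacing the connection with a MySQL driver's and adjusting the quoting and placeholder style to match.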


#10    KJOK      (see all posts) 2011/04/13 (Wed) @ 19:25

“What I’d like to see along with this is standardized team and franchise three-letter identifiers.”

I maintain a ‘professional teams’ table (MLB, minors, foreign teams, Negro League teams, etc.), which is used for all of the SABR minor league data, but obviously I can’t control what Retrosheet or others have decided to use for MLB team identifiers.

There are currently almost 32,000 season/team entries.....


#11          (see all posts) 2011/04/13 (Wed) @ 21:55

Re: #5

Colin, that simplifies life. I never paid attention to the far end of the team table.

