Tuesday, October 31, 2006
Database Hacks
I recommend Joe Adler’s Baseball Hacks. This blog entry will be some of my hacks in using various data. Here’s one for the Lahman database:
Want to know who played the most games at a particular position?
SELECT yearID, teamID, Pos, Mid(Max(Str(1000+[g])+"_"+[playerid]),7) AS player
FROM Fielding
GROUP BY yearID, teamID, Pos;
This is in Access. You can do an equivalent one in any database. The key part is this:
Mid(Max(Str(1000+[g])+"_"+[playerid]),7)
The first thing it evaluates is:
Str(1000+[g])
What that does is creates a 4-digit number, and then converts it into a string. I then attach the playerid to the end of that string. This:
Max(Str(1000+[g])+"_"+[playerid])
just gets me the maximum value, which because the “1” is the first character in the string, cancels out for everyone, and the next significant character is the number of games. Attaching the playerid breaks ties, though you can insert whatever tie-breaker you want in-between.
The MID function simply strips out the playerid from the above result. Voila. Instant starting players for each team.
In the Lahman DB, some years have “OF” (so you only get one player here) and some have the three OF positions broken down. The “OF” only years is more cumbersome to handle.
Great book. I’ve gotten into the excel web queries, great for grabbing park factors and defensive data on a daily basis.
It doesn’t do much for getting minor league stats, as BA and MiLB format their stats in such a way to prevent you from using web queries.
I haven’t gotten into it yet, but there are chapters on how to use the retrosheet data. I wish I had the time to devote to that.