THE BOOK cover
The Unwritten Book is Finally Written!
An in-depth analysis of: The sacrifice bunt, batter/pitcher matchups, the intentional base on balls, optimizing a batting lineup, hot and cold streaks, clutch performance, platooning strategies, and much more.
Read Excerpts & Customer Reviews

Buy The Book from Amazon


SABR101 required reading if you enter this site. Check out the Sabermetric Wiki. And interesting baseball books.
MOST RECENT ARTICLES
MAIL : You ask | We say

Advanced


THE BOOK--Playing The Percentages In Baseball

<< Back to main

Tuesday, October 31, 2006

Database Hacks

By Tangotiger, 09:45 AM

I recommend Joe Adler’s Baseball Hacks.  This blog entry will be some of my hacks in using various data.  Here’s one for the Lahman database:


Want to know who played the most games at a particular position?

SELECT yearID, teamID, Pos, Mid(Max(Str(1000+[g])+"_"+[playerid]),7) AS player
FROM Fielding
GROUP BY yearID, teamID, Pos;

This is in Access.  You can do an equivalent one in any database.  The key part is this:
Mid(Max(Str(1000+[g])+"_"+[playerid]),7)

The first thing it evaluates is:
Str(1000+[g])

What that does is creates a 4-digit number, and then converts it into a string.  I then attach the playerid to the end of that string.  This:
Max(Str(1000+[g])+"_"+[playerid])
just gets me the maximum value, which because the “1” is the first character in the string, cancels out for everyone, and the next significant character is the number of games.  Attaching the playerid breaks ties, though you can insert whatever tie-breaker you want in-between.

The MID function simply strips out the playerid from the above result.  Voila.  Instant starting players for each team.

In the Lahman DB, some years have “OF” (so you only get one player here) and some have the three OF positions broken down.  The “OF” only years is more cumbersome to handle.

SabermetricsData
#1    Rally      (see all posts) 2006/10/31 (Tue) @ 11:56

Great book.  I’ve gotten into the excel web queries, great for grabbing park factors and defensive data on a daily basis. 

It doesn’t do much for getting minor league stats, as BA and MiLB format their stats in such a way to prevent you from using web queries.

I haven’t gotten into it yet, but there are chapters on how to use the retrosheet data.  I wish I had the time to devote to that.


#2    John Beamer      (see all posts) 2006/10/31 (Tue) @ 12:59

A review I wrote a few months ago at BtB.

http://www.beyondtheboxscore.com/story/2006/5/1/42244/87589


#3    John Walsh      (see all posts) 2006/11/02 (Thu) @ 07:06

Great hack! The problem of finding who had the most of just about anything always required (for me) two steps: first finding the max value and then searching for the name that went with it.

This sorting of a string that contains both the value and the name is an excellent idea.

Keep these coming, Tango.


#4    tangotiger      (see all posts) 2007/01/17 (Wed) @ 18:14

Here’s how to figure out a player’s primary position:

SELECT
f2.yearid,
f2.playerid,
Mid(Max(Str(1000+f2.GP)+"_"+f2.pos),7) AS PrimPos,
Max(f2.GP) AS GP_Pos, Sum(f2.GP) AS GP_sum,
Sum(1) AS numOfPos,
Iif(numofPos=1,"”,Mid(Min(Str(1000+f2.GP)+"_"+f2.pos),7)) AS LastPos

FROM

[select
yearid,
playerid,
pos,
sum(g) as GP

from fielding
group by yearid, playerid, pos
]. AS f2

GROUP BY f2.yearid, f2.playerid

As usual, this is in Access.  It would be very similar in any other language.  Remove the square brackets around the inner select and the period.  Use SUBSTR instead of MID.  That’s pretty much it for the other databases.

Anyway, the key here has already been discussed in the main blog entry. 

Also note a subselect.  The FROM clause expects a table, or anything that looks like a table.  So, that inner view simply gives us the GP on a player/year/pos level, so that if a player played on multiple teams in a year, then we just get back one record for him.  You could have skipped this part if you wanted to know the main position for each player/year/team/stint.  Note that a guy who plays on the same team as his first and third team in the same year, will get a different stint id.

So, with this, you now have a record of a player’s primary position.  Unfortunately, this makes the 1918 for Ruth a primary OF, not P.  So, here’s another version that you may prefer:

SELECT
f2.yearid,
f2.playerid,
IIf(GP_Pos>IP,PrimPos,"P") AS Prim,
Mid(Max(Str(1000+f2.GP)+"_"+f2.pos),7) AS PrimPos,
Max(f2.GP) AS GP_Pos,
Sum(f2.GP) AS GP_sum,
Sum(1) AS numOfPos,
Iif(numofPos=1,"”,Mid(Min(Str(1000+f2.GP)+"_"+f2.pos),7)) AS LastPos
Sum(f2.INN) AS IP

FROM

[select
yearid,
playerid,
pos,
sum(g) as GP,
round(sum(Iif(pos="P",INNouts/3,0)),0) as INN

from fielding

group by yearid, playerid, pos
]. AS f2

GROUP BY f2.yearid, f2.playerid

In this case, the subselect also keeps track of the number of innings thrown as a pitcher.  Then in the main select, I have a new field called PRIM, and in there, I simply make the primary position the greater between the innings thrown and the games played.  So, for Ruth 1918, with 166 IP as a pitcher, and 59 games as an OF, he becomes a primary P.

On the other hand, his 1919 (133 IP as pitcher, 111 games as OF), he’s still primarily a pitcher.  I like it, but many people may not.  That’s ok, because those people jsut need to tweak this condition:
GP_Pos>IP
to whatever they want, like:
GP_Pos*1.5>IP

The way I see it, an IP is about 4 batters faced.  And a fielding game is about 4 balls in play per position (more for SS/2B, less for LF/RF/1B). 

So, do what you want.

The last sql returns the following for Ruth (apologies for the formatting):

yearid playerid Prim PrimPos GP_Pos GP_sum  numOfPos LastPos IP

1914 ruthba01 P P 4 4 1 23
1915 ruthba01 P P 32 32 1 218
1916 ruthba01 P P 44 44 1 324
1917 ruthba01 P P 41 41 1 326
1918 ruthba01 P OF 59 92 3 1B 166
1919 ruthba01 P OF 111 133 3 1B 133
1920 ruthba01 OF OF 141 144 3 P 4
1921 ruthba01 OF OF 152 156 3 1B 9
1922 ruthba01 OF OF 110 111 2 1B 0
1923 ruthba01 OF OF 148 152 2 1B 0
1924 ruthba01 OF OF 152 152 1 0
1925 ruthba01 OF OF 98 98 1 0
1926 ruthba01 OF OF 149 151 2 1B 0
1927 ruthba01 OF OF 151 151 1 0
1928 ruthba01 OF OF 154 154 1 0
1929 ruthba01 OF OF 133 133 1 0
1930 ruthba01 OF OF 144 145 2 P 9
1931 ruthba01 OF OF 142 143 2 1B 0
1932 ruthba01 OF OF 128 129 2 1B 0
1933 ruthba01 OF OF 132 134 3 1B 9
1934 ruthba01 OF OF 111 111 1 0
1935 ruthba01 OF OF 26 26 1 0


#5    tangotiger      (see all posts) 2007/01/17 (Wed) @ 18:28

So, here are the batting stats, in all of baseball history, by year, by position.

Have fun!


#6    tangotiger      (see all posts) 2007/06/16 (Sat) @ 08:18

For those of you who have the PARKS database from Yahoo Group KJOKbaseball:

In Access:
Click QUERIES
Click NEW
Select DESIGN VIEW
Click OK

(You will be presented with SHOW TABLE.)

Click CLOSE
Click SQL

Copy/Paste this directly:
SELECT Year, TeamID, Sum(1) AS Num_Parks, Mid(Max(Str(90000+[GP_H]*10+[SEQ])+[ParkID]),7) AS Main_Park, Max(GP_H) AS Main_GP, Sum([GP_H])-Max([GP_H]) AS Other_GP
FROM Home_Main_Data
GROUP BY Year, TeamID;

Click SAVE Icon
Type Main_Home_Park
Click OK
Click VIEW Icon

You will see the following:
Year
TeamID
Num_Parks: Number of Home Parks for that team that year
Main_Park: The home park that team played the most often that year
Main_GP: Number of home games at that park
Other_GP: Number of home games at other parks


#7    tangotiger      (see all posts) 2007/06/16 (Sat) @ 08:32

This will give you the main home park, ever, for a team:

SELECT Team_Parks.teamid, Sum(1) AS Num_Parks, Mid(Max(Str(9000000000+[GP_H]*10000+[last_year])+[ParkID]),12) AS Main_Park, Max(Team_Parks.GP_H) AS Main_GP, Sum([GP_H])-Max([GP_H]) AS Other_GP, Mid(Max(Str(9000000000+[GP_H]*10000+[last_year])+[ParkID]),8,4) AS park_last_year, Max(Team_Parks.last_year) AS team_last_year
FROM [select teamid, parkid, sum(home_main_data.GP_H) as GP_H, min(year) as first_year, max(year) as last_year from Home_Main_Data group by teamid, parkid
]. AS Team_Parks
GROUP BY Team_Parks.teamid;


#8    Anthony      (see all posts) 2007/06/17 (Sun) @ 11:44

If you don’t mind a more novice question: how do you do year-to-year queries in Access (e.g., every batter who hit .350 in a season, and what they hit the next year)?


#9    tangotiger      (see all posts) 2007/06/17 (Sun) @ 17:21

Here’s a basic one:
SELECT b1.playerID, b1.yearID AS yr1, b2.yearID AS yr2, b1.AB AS ab1, b1.H AS h1, b2.AB AS ab2, b2.H AS h2
FROM Batting AS b1 INNER JOIN Batting AS b2 ON (b1.teamID = b2.teamID) AND (b1.playerID = b2.playerID)
WHERE (((b2.yearID)=[b1].[yearid]+1));

***

You have to worry a bit about the “stintid” because you can place for the same team twice in the same year.  That’ll mess you up a bit.  So, you should first create a query called BATTING2, with a group by on playerid,yearid (and teamid if you want).  And then use that in your FROM clause above.

Also, this is an inner join, meaning that if he does not play in year2, he won’t show up at all.  In this case, again a bit more complicated (use LEFT or RIGHT JOIN), among other things.


#10    Cliff Otto      (see all posts) 2008/02/23 (Sat) @ 02:17

>Here’s how to figure out a player’s primary position

I couldn’t get either of the two options you presented in the above section to work in Access and I don’t know SQL well enough to figure out why but it seems to hang on the IIF statement for LastPos.


#11    Tangotiger      (see all posts) 2008/02/29 (Fri) @ 16:51

Cliff/10: I’ll have to check my DB at home.

***

Preview of every section in Adler’s book:
http://www.oreilly.com/catalog/baseballhks/toc.html


#12    Tangotiger      (see all posts) 2008/03/12 (Wed) @ 13:21

Cliff, the problem is with the blog software.  It should look like:

""

But instead, it appears on my site as:

"”


#13    Tangotiger      (see all posts) 2008/04/02 (Wed) @ 10:27

I posted this elsewhere:

========================================
In Access, as in most of DB, you don’t need the “IF” or “CASE”.

In this case, you can do:

select player, -sum([hitvalue]=4) as HR
group by player

[hitvalue]=4 will evaluate as 0 (false) or -1 (true) in Access.

You’d have to play around other DBs to see how they handle it.


#14    Bill      (see all posts) 2009/05/08 (Fri) @ 17:47

Thanks for sharing this!

For anyone who is interested, here’s how to figure out a player’s primary position in MySQL:

SELECT
f2.yearid,
f2.playerid,
MID(MAX(CONCAT(1000 + CAST(f2.GP as CHAR(3)),"_",f2.pos)),6) AS PrimPos,
Max(f2.GP) AS GP_Pos, SUM(f2.GP) AS GP_sum,
IF(SUM(1) = 1,"”,MID(MIN(CONCAT(1000 + CAST(f2.GP as CHAR(3)),"_",f2.pos)),6)) AS LastPos
FROM
(SELECT
yearid,
playerid,
pos,
SUM(g) as GP
from fielding
group by yearid, playerid, pos) AS f2
GROUP BY f2.yearid, f2.playerid


#15          (see all posts) 2010/09/04 (Sat) @ 17:57

For anyone who has made Hack #27 work (completely):  when I run this (using the updated version at http://baseballsimulator.com/blog/2010/04/15/how-to-maintain-an-up-to-date-baseball-database/), it does not distinguish whether a player is a member of the home or away team.  The lines in save_to_db that read the team_flag, including this code:

($team->{team_flag}=="home" ? $home : $away)

appear to always return ‘true’ and only the home team name is inserted into the database table. I have to jump through hoops to add the actual team name to a player.  I don’t want to just read the player’s team from a current team list because I want to know the team that he was a member of on the day of that game.

If this is the wrong place to ask ... my apologies.  If anyone has the answer or would like more information please e-mail me (including something appropriate in the subject line).

TIA


#16    Kincaid      (see all posts) 2010/09/04 (Sat) @ 18:54

Tim, where it says “==” in that line, change that to “eq”:

($team->{team_flag} eq “home” ? $home : $away)

Using == means the script is comparing numeric values, and using eq means it is comparing string values.  Since the team_flag and “home” are string values, when the script sees “==”, it will just call them both 0 and say they are equal, no matter what they actually say.  So when the team_flag is away, it will basically do:

“away” == “home” ?
0 == 0 ?

which is true, so it will never skip to displaying the $away variable.


#17          (see all posts) 2010/09/04 (Sat) @ 19:25

Yup, that’s it. Thank you very much.


#18    joe arthur      (see all posts) 2010/09/05 (Sun) @ 13:59

That bug is in the original program printed in Adler’s Baseball Hacks and also in the downloadable code (and the same bug is in several other lines). They are not listed in the errata section for the book on the O’Reilly website.

O’Reilly released another Adler book with some baseball content earlier this year: “R in a Nutshell.” I haven’t used it extensively, but it is not error free either. The Nutshell books typically work better as references than as books from which to teach yourself, but Adler has an engaging style and interesting examples, and I’m sure the R book will be of interest to any neophyte-to-intermediate R users among the blog readership ...


Page 1 of 1 pages


Name (required)
E-Mail (optional; WILL be published)
Website (optional)

<< Back to main


Latest...

COMMENTS

Feb 11 18:07
Hero of the month: Brittney Baxter

Feb 11 17:59
MGL: Today on Clubhouse Confidential

Feb 11 17:58
Clutch analogy

Feb 11 16:48
Reader Mail of the Day: Why do we need X years of fielding data?  And what about outliers?

Feb 11 11:54
Who is Jeremy Lin?

Feb 11 10:29
Dwight Evans

Feb 11 02:12
Performance through the ages

Feb 10 23:01
For Your Soul

Feb 10 18:32
Moneyball at Villanova

Feb 10 17:00
Psst… wanna intern in Canada?