THE BOOK cover
The Unwritten Book is Finally Written!
An in-depth analysis of: The sacrifice bunt, batter/pitcher matchups, the intentional base on balls, optimizing a batting lineup, hot and cold streaks, clutch performance, platooning strategies, and much more.
Read Excerpts & Customer Reviews
If you are a media member and would like a review copy of The Book, please contact Kevin Cuddihy of Potomac Books.

Buy The Book from Amazon

MOST RECENT ARTICLES
MAIL : You ask | We say

Advanced


THE BOOK--Playing The Percentages In Baseball

<< Back to main

Tuesday, October 31, 2006

Database Hacks

By Tangotiger, 09:45 AM

I recommend Joe Adler’s Baseball Hacks.  This blog entry will be some of my hacks in using various data.  Here’s one for the Lahman database:


Want to know who played the most games at a particular position?

SELECT yearID, teamID, Pos, Mid(Max(Str(1000+[g])+"_"+[playerid]),7) AS player
FROM Fielding
GROUP BY yearID, teamID, Pos;

This is in Access.  You can do an equivalent one in any database.  The key part is this:
Mid(Max(Str(1000+[g])+"_"+[playerid]),7)

The first thing it evaluates is:
Str(1000+[g])

What that does is creates a 4-digit number, and then converts it into a string.  I then attach the playerid to the end of that string.  This:
Max(Str(1000+[g])+"_"+[playerid])
just gets me the maximum value, which because the “1” is the first character in the string, cancels out for everyone, and the next significant character is the number of games.  Attaching the playerid breaks ties, though you can insert whatever tie-breaker you want in-between.

The MID function simply strips out the playerid from the above result.  Voila.  Instant starting players for each team.

In the Lahman DB, some years have “OF” (so you only get one player here) and some have the three OF positions broken down.  The “OF” only years is more cumbersome to handle.

SabermetricsData
#1    Rally      (see all posts) 2006/10/31 (Tue) @ 11:56

Great book.  I’ve gotten into the excel web queries, great for grabbing park factors and defensive data on a daily basis. 

It doesn’t do much for getting minor league stats, as BA and MiLB format their stats in such a way to prevent you from using web queries.

I haven’t gotten into it yet, but there are chapters on how to use the retrosheet data.  I wish I had the time to devote to that.


#2    John Beamer      (see all posts) 2006/10/31 (Tue) @ 12:59

A review I wrote a few months ago at BtB.

http://www.beyondtheboxscore.com/story/2006/5/1/42244/87589


#3    John Walsh      (see all posts) 2006/11/02 (Thu) @ 07:06

Great hack! The problem of finding who had the most of just about anything always required (for me) two steps: first finding the max value and then searching for the name that went with it.

This sorting of a string that contains both the value and the name is an excellent idea.

Keep these coming, Tango.


#4    tangotiger      (see all posts) 2007/01/17 (Wed) @ 18:14

Here’s how to figure out a player’s primary position:

SELECT
f2.yearid,
f2.playerid,
Mid(Max(Str(1000+f2.GP)+"_"+f2.pos),7) AS PrimPos,
Max(f2.GP) AS GP_Pos, Sum(f2.GP) AS GP_sum,
Sum(1) AS numOfPos,
Iif(numofPos=1,"”,Mid(Min(Str(1000+f2.GP)+"_"+f2.pos),7)) AS LastPos

FROM

[select
yearid,
playerid,
pos,
sum(g) as GP

from fielding
group by yearid, playerid, pos
]. AS f2

GROUP BY f2.yearid, f2.playerid

As usual, this is in Access.  It would be very similar in any other language.  Remove the square brackets around the inner select and the period.  Use SUBSTR instead of MID.  That’s pretty much it for the other databases.

Anyway, the key here has already been discussed in the main blog entry. 

Also note a subselect.  The FROM clause expects a table, or anything that looks like a table.  So, that inner view simply gives us the GP on a player/year/pos level, so that if a player played on multiple teams in a year, then we just get back one record for him.  You could have skipped this part if you wanted to know the main position for each player/year/team/stint.  Note that a guy who plays on the same team as his first and third team in the same year, will get a different stint id.

So, with this, you now have a record of a player’s primary position.  Unfortunately, this makes the 1918 for Ruth a primary OF, not P.  So, here’s another version that you may prefer:

SELECT
f2.yearid,
f2.playerid,
IIf(GP_Pos>IP,PrimPos,"P") AS Prim,
Mid(Max(Str(1000+f2.GP)+"_"+f2.pos),7) AS PrimPos,
Max(f2.GP) AS GP_Pos,
Sum(f2.GP) AS GP_sum,
Sum(1) AS numOfPos,
Iif(numofPos=1,"”,Mid(Min(Str(1000+f2.GP)+"_"+f2.pos),7)) AS LastPos
Sum(f2.INN) AS IP

FROM

[select
yearid,
playerid,
pos,
sum(g) as GP,
round(sum(Iif(pos="P",INNouts/3,0)),0) as INN

from fielding

group by yearid, playerid, pos
]. AS f2

GROUP BY f2.yearid, f2.playerid

In this case, the subselect also keeps track of the number of innings thrown as a pitcher.  Then in the main select, I have a new field called PRIM, and in there, I simply make the primary position the greater between the innings thrown and the games played.  So, for Ruth 1918, with 166 IP as a pitcher, and 59 games as an OF, he becomes a primary P.

On the other hand, his 1919 (133 IP as pitcher, 111 games as OF), he’s still primarily a pitcher.  I like it, but many people may not.  That’s ok, because those people jsut need to tweak this condition:
GP_Pos>IP
to whatever they want, like:
GP_Pos*1.5>IP

The way I see it, an IP is about 4 batters faced.  And a fielding game is about 4 balls in play per position (more for SS/2B, less for LF/RF/1B). 

So, do what you want.

The last sql returns the following for Ruth (apologies for the formatting):

yearid playerid Prim PrimPos GP_Pos GP_sum  numOfPos LastPos IP

1914 ruthba01 P P 4 4 1 23
1915 ruthba01 P P 32 32 1 218
1916 ruthba01 P P 44 44 1 324
1917 ruthba01 P P 41 41 1 326
1918 ruthba01 P OF 59 92 3 1B 166
1919 ruthba01 P OF 111 133 3 1B 133
1920 ruthba01 OF OF 141 144 3 P 4
1921 ruthba01 OF OF 152 156 3 1B 9
1922 ruthba01 OF OF 110 111 2 1B 0
1923 ruthba01 OF OF 148 152 2 1B 0
1924 ruthba01 OF OF 152 152 1 0
1925 ruthba01 OF OF 98 98 1 0
1926 ruthba01 OF OF 149 151 2 1B 0
1927 ruthba01 OF OF 151 151 1 0
1928 ruthba01 OF OF 154 154 1 0
1929 ruthba01 OF OF 133 133 1 0
1930 ruthba01 OF OF 144 145 2 P 9
1931 ruthba01 OF OF 142 143 2 1B 0
1932 ruthba01 OF OF 128 129 2 1B 0
1933 ruthba01 OF OF 132 134 3 1B 9
1934 ruthba01 OF OF 111 111 1 0
1935 ruthba01 OF OF 26 26 1 0


#5    tangotiger      (see all posts) 2007/01/17 (Wed) @ 18:28

So, here are the batting stats, in all of baseball history, by year, by position.

Have fun!


#6    tangotiger      (see all posts) 2007/06/16 (Sat) @ 08:18

For those of you who have the PARKS database from Yahoo Group KJOKbaseball:

In Access:
Click QUERIES
Click NEW
Select DESIGN VIEW
Click OK

(You will be presented with SHOW TABLE.)

Click CLOSE
Click SQL

Copy/Paste this directly:
SELECT Year, TeamID, Sum(1) AS Num_Parks, Mid(Max(Str(90000+[GP_H]*10+[SEQ])+[ParkID]),7) AS Main_Park, Max(GP_H) AS Main_GP, Sum([GP_H])-Max([GP_H]) AS Other_GP
FROM Home_Main_Data
GROUP BY Year, TeamID;

Click SAVE Icon
Type Main_Home_Park
Click OK
Click VIEW Icon

You will see the following:
Year
TeamID
Num_Parks: Number of Home Parks for that team that year
Main_Park: The home park that team played the most often that year
Main_GP: Number of home games at that park
Other_GP: Number of home games at other parks


#7    tangotiger      (see all posts) 2007/06/16 (Sat) @ 08:32

This will give you the main home park, ever, for a team:

SELECT Team_Parks.teamid, Sum(1) AS Num_Parks, Mid(Max(Str(9000000000+[GP_H]*10000+[last_year])+[ParkID]),12) AS Main_Park, Max(Team_Parks.GP_H) AS Main_GP, Sum([GP_H])-Max([GP_H]) AS Other_GP, Mid(Max(Str(9000000000+[GP_H]*10000+[last_year])+[ParkID]),8,4) AS park_last_year, Max(Team_Parks.last_year) AS team_last_year
FROM [select teamid, parkid, sum(home_main_data.GP_H) as GP_H, min(year) as first_year, max(year) as last_year from Home_Main_Data group by teamid, parkid
]. AS Team_Parks
GROUP BY Team_Parks.teamid;


#8    Anthony      (see all posts) 2007/06/17 (Sun) @ 11:44

If you don’t mind a more novice question: how do you do year-to-year queries in Access (e.g., every batter who hit .350 in a season, and what they hit the next year)?


#9    tangotiger      (see all posts) 2007/06/17 (Sun) @ 17:21

Here’s a basic one:
SELECT b1.playerID, b1.yearID AS yr1, b2.yearID AS yr2, b1.AB AS ab1, b1.H AS h1, b2.AB AS ab2, b2.H AS h2
FROM Batting AS b1 INNER JOIN Batting AS b2 ON (b1.teamID = b2.teamID) AND (b1.playerID = b2.playerID)
WHERE (((b2.yearID)=[b1].[yearid]+1));

***

You have to worry a bit about the “stintid” because you can place for the same team twice in the same year.  That’ll mess you up a bit.  So, you should first create a query called BATTING2, with a group by on playerid,yearid (and teamid if you want).  And then use that in your FROM clause above.

Also, this is an inner join, meaning that if he does not play in year2, he won’t show up at all.  In this case, again a bit more complicated (use LEFT or RIGHT JOIN), among other things.


#10    Cliff Otto      (see all posts) 2008/02/23 (Sat) @ 02:17

>Here’s how to figure out a player’s primary position

I couldn’t get either of the two options you presented in the above section to work in Access and I don’t know SQL well enough to figure out why but it seems to hang on the IIF statement for LastPos.


#11    Tangotiger      (see all posts) 2008/02/29 (Fri) @ 16:51

Cliff/10: I’ll have to check my DB at home.

***

Preview of every section in Adler’s book:
http://www.oreilly.com/catalog/baseballhks/toc.html


#12    Tangotiger      (see all posts) 2008/03/12 (Wed) @ 13:21

Cliff, the problem is with the blog software.  It should look like:

""

But instead, it appears on my site as:

"”


#13    Tangotiger      (see all posts) 2008/04/02 (Wed) @ 10:27

I posted this elsewhere:

========================================
In Access, as in most of DB, you don’t need the “IF” or “CASE”.

In this case, you can do:

select player, -sum([hitvalue]=4) as HR
group by player

[hitvalue]=4 will evaluate as 0 (false) or -1 (true) in Access.

You’d have to play around other DBs to see how they handle it.


Page 1 of 1 pages


Name (required)
E-Mail (optional)
Website (optional)

<< Back to main


Latest...

COMMENTS

Nov 20 01:43
Sabermetric Moves of the 2009 Pre-Season

Nov 20 04:02
Nate Silver: hero to interviewers

Nov 20 02:01
My 1B is better than your 1B

Nov 20 00:26
MLB logo

Nov 19 23:03
NBA’s Marcel

Nov 19 19:13
Offense by position groups by decade

Nov 19 17:32
Changes in home run rates during the Retrosheet years

Nov 19 16:40
One Year and One Million Hits Later

Nov 19 16:22
Soria as a starter?

Nov 19 13:50
Response of a fired head coach