THE BOOK cover
The Unwritten Book is Finally Written!
An in-depth analysis of: The sacrifice bunt, batter/pitcher matchups, the intentional base on balls, optimizing a batting lineup, hot and cold streaks, clutch performance, platooning strategies, and much more.
Read Excerpts & Customer Reviews

Buy The Book from Amazon


SABR101 required reading if you enter this site. Check out the Sabermetric Wiki. And interesting baseball books.
MOST RECENT ARTICLES
MAIL : You ask | We say

Advanced


THE BOOK--Playing The Percentages In Baseball

<< Back to main

Thursday, November 13, 2008

Gameday Database

By Tangotiger, 12:16 PM

Courtesy of derigaz2, and found on the wiki.


SabermetricsData
#1    Matt      (see all posts) 2008/11/13 (Thu) @ 16:13

I’ve managed to import it, but might there be some readme (by you or otherwise) that might explain some of the tables more fully?

Also, has anyone else had a problem with null values in some of the pitch f/x columns?


#2          (see all posts) 2008/11/13 (Thu) @ 18:30

I downloaded the newest version of “Baseball on a Stick” the other day, will install it tonight and examine the schema. I had submitted my draft schema to them. Main thing that looks handy is they spider every GameDay file, instead of just some. This version is written in Python. The author says it’s easier than Perl, and considering I never really figured out Mike Fast’s Perl script, that may come in handy.


#3    Rally      (see all posts) 2008/11/13 (Thu) @ 22:35

Brian, do you have a link?  What is Baseball on a stick?


#4    Colin Wyers      (see all posts) 2008/11/13 (Thu) @ 23:01

Baseball On A Stick:

http://sourceforge.net/projects/baseballonastic/

I have it installed, but can’t get it to talk to my database (to be fair I only spent 2 minutes on the task); it looks like a great setup for someone that doesn’t already have MySQL installed. I also have the stuff Tango linked to here, and will look over it sometime this weekend, hopefully.

I currently use Mat Kovach’s parser and database schema. A lot of this stuff gets discussed on the RetroSQL list.


#5          (see all posts) 2008/11/14 (Fri) @ 02:21

I have the Fast and Kovach scripts, and have MySQL installed. I was able to modify the Kovach script to spider minor league data, but both are setup to download only certain files, as their schemas are geared more towards Pfx.

I want to have GameDay update a Retro database, but even though I get to do most of the programming at work, I couldn’t figure out enough of the Fast or Kavach scripts to get the extra xml files or parse the extra variables.

BBOS claims to download all the files, that’s step #1. It also claims to be configurable to your own schema. If I can figure that out, that’s #2 and I should be up and running.

If this works, I will start processing the minor league play by play for park factors, splits, fielding, baserunning, etc...I’ll be working on queries for awhile.


#6    Colin Wyers      (see all posts) 2008/11/14 (Fri) @ 03:47

For those looking to get started with setting up a baseball database and working with it, part one of my long-promised tutorial is finally up:

http://statspeak.net/2008/11/building-a-sabermetricians-workbench-part-i.html

I don’t get into any Gameday stuff, just the Baseball Databank (new release! tell your friends!).


#7    john      (see all posts) 2008/11/14 (Fri) @ 11:37

As far as the gameday stuff goes, I was wondering does anyone know how to write a query in MySQL that would give me pitches thrown on each count?  For instance, what im wanting to do is take a pitcher and say he threw this type of pitch on this count etc etc.

The problem is in the database, I just get end results.  That is, if a pitcher strikes a batter out on three pitches, I get three rows of data that says:

Balls Strikes
0 3
0 3
0 3

Is there a way to get the count from that?  I know Josh’s player cards have this info so there must be someway.


#8    4seamer      (see all posts) 2008/11/14 (Fri) @ 14:07

After a quick glance at the sql:

for those attempting to load the gameday data into a hosted MySQL server (most hosting companies require your login name in front of the database name), get a copy of sed (http://gnuwin32.sourceforge.net/packages/sed.htm) and use this command to change the database name in the schema to your desired datebase name:

sed -e ‘s/gameday/YOUR DATEBASE NAME HERE/g’ gd_07_08.sql > gd_07_08a.sql

Also, there is a call around line 720 that will probably cause this import to fail since on a Unix system you’ll need to be root which in a hosted environment isn’t going to happen.  I’ll look into it deeper later to see if there is a workaround.

john - I’ll load it later and see if I can build you a simple query.


#9    4seamer      (see all posts) 2008/11/14 (Fri) @ 15:59

matt - yeah, I only see 37,151 unique at bat id’s where the pitch type is not null.  I stopped dealing with it right there.


#10          (see all posts) 2008/11/14 (Fri) @ 16:02

Assuming that in your pitch table, you have fields for Pitcher_ID, Balls, Strikes, and Pitch_Type for each pitch…

then your query should group by Pitcher_ID, Balls and Strikes, and then split out the various value of Pitch_Type into their own variables, which are then summed, such as

SELECT
Pitches.Pitcher_ID,
Pitches.PitchHand,
Pitches.BatHand,
Pitches.Balls,
Pitches.Strikes,
Sum(IIF(Pitches.Pitch_Type="FA",1,0) AS FA,
Sum(IIF(Pitches.Pitch_Type="CB",1,0) AS CB,
Sum(IIF(Pitches.Pitch_Type="SL",1,0) AS SL,
Sum(IIF(Pitches.Pitch_Type="CH",1,0) AS CH
FROM
Pitches
GROUP BY
Pitches.Pitcher_ID,
Pitches.PitchHand,
Pitches.BatHand,
Pitches.Balls,
Pitches.Strikes

Make a copy and drop Pitcher_ID and it will give you the league totals that you can compare each individual to


#11    john      (see all posts) 2008/11/14 (Fri) @ 19:15

wow thanks.  That looks like it would work out.

Prehaps I should have asked this question beforehand but what I have been doing (and Im sure theres a better way) is querying a certain pitcher, getting all his pitches and exporting the data to Excel.  Afterwards, I been importing the data into SPSS and using K-means clustering for the pitch type.  Then I copy that K-Means column back into the Excel file.

I guess for this I would have to somehow import that column into my MYSQL database.  I’m not quite sure how to do that.  I’m not too savy on MySQL.  Im sure its simple.  How would I do this?


#12    john      (see all posts) 2008/11/14 (Fri) @ 19:19

Sorry for asking so many questions but will that above query work if this is what I have.

Pitcher strikes batter out 3 pitches.  So I get this

Pitcher ID Strikes Balls
276373 3 0
276373 3 0
276373 3 0

I was thinking that I would need the data to be:

Pitcher ID Strikes Balls
276373 1 0
276373 2 0
276373 3 0


#13    Rally      (see all posts) 2008/11/14 (Fri) @ 21:52

Colin, just want to thank you for that series.  I’ve been using Access & Excel for years.

I downloaded the MySQL database, and brought in Baseball Databank and Gameday.  It might take me awhile to get a hang of using them.

Once you get crunching, is MySQL faster than Access?


#14    Colin Wyers      (see all posts) 2008/11/15 (Sat) @ 02:55

We talking speed of execution or speed of writing queries?

As you move up in size, MySQL starts to really outperform Access in query speed, in my opinion. For simple queries in smaller tables, the difference probably isn’t noticable.

I find it quicker to write a query than to build one through a wizard, now that I have a grasp of a lot of the commands I use most frequently. I also have an utter rat’s nest of code I’ve written before than I can call upon - I’m in the process of cleaning some of that up in anticipation of sharing it, starting with the Marcels SQL.

Right now, the Marcels SQL can generate hitter projections from the BDB for 2009 in under 10 seconds. (The largest consumer of time is the table that filters out pitchers hitting, actually - that takes four seconds, although I’m not at all certain I’ve written that as elegantly as I can.) I’m sure that a system like CHONE would be quite a bit more complex than the simple Marcels, but I think you’d see some performance gains.

The big reason for using MySQL over Access for me is to be able to do research using Retrosheet PBP data. Access simply can’t handle a table that large.

That said, you can use both if you like. Using the MySQL ODBC Connector, I can use Access to query a MySQL database. It’s not quite as efficient, and I don’t think it’s entirely stable on very large tables (read: Retrosheet events).


#15    terpsfan101      (see all posts) 2008/11/15 (Sat) @ 03:10

MS Access works just fine in terms of speed with the BDB database. It begins to slow down when you work with Retrosheet data. The 2 GB file-size is Access’s biggest limitation. You can not build a complete PBP database using all of the Retrosheet seasons because of this. And Access is slow as molasses working with Retrosheet data. I’ve heard Tango say that Access works well with up to 1,000,000 records. Try using Access with 7,000,000 records.

Of course, none of you guys are mentioning how much memory your computer has, and how fast your processer is. This will affect the speed of your queries more than the program you are using will. My desktop only has 256 MB of memory, and my laptop only has 512 MB, so I end up wasting a lot of time waiting.


#16    Colin Wyers      (see all posts) 2008/11/15 (Sat) @ 03:28

I’m essentially stuck with the computing power I have, and I don’t think I have any more than you do, terpsfan. You can squeeze a lot of speed out of your database through smart query writing and by optimizing your tables ahead of time.

I posted this on the RetroSQL list a while back, just to give an idea of how amazing partitioning a large table can be, especially when memory is limited.

Here’s the test query:

SELECT e.OUTS_CT, e.START_BASES_CD, AVG(e.EVENT_RUNS_CT +
e.FATE_RUNS_CT) as RUNS
FROM retrosheet.events e
GROUP BY OUTS_CT, START_BASES_CD;

Which is a simple run expectancy generator by base-out state. I ran it over my entire Retrosheet database, which runs 8,290,901 rows long.

Non-partitioned:
(24 row(s)returned)
(309531 ms taken)

Partitioned:
(24 row(s)returned)
(109422 ms taken)

So we go from over 5 minutes to under 2 minutes. The more complex the query you’re running, the greater the benefits in speed as well. You will see serious boosts in speed when dealing with Retrosheet data if you use a properly-configured MySQL table.

(When you partition by year, which is what I do, you can also achieve massive performance gains by filtering down to a subset of years. The fantastic part is that you can restrict your query to just one year while you test it and get immediate feedback, and then when you’re sure you’ve written it correctly, you can let it rip on the entire dataset.)


#17          (see all posts) 2008/11/15 (Sat) @ 03:44

I’ve got 1 gb of ram and a 2 gb processor at home. At work it’a 4 gb of ram and a dual core 3 gb processor.

In MS Access, I have the Retrosheet, DBD, KJOK & Westbay Japanese DBs merged into one, except no events table from Retro. I import 10 years at a time as necessary and run make table queries on the subject at hand.

For my Oliver projections, currently I have about 24,000 batting lines for 2934 players. The query that normalizes the batting for level and park factors takes 3-4 seconds. However, to run projections I combine those into a single line for each season, and then weight each season for each season projected, which is around 200,000 records, and that takes about 4 minutes at home, 1.5 minutes at work.


#18    Rally      (see all posts) 2008/11/15 (Sat) @ 11:52

I’ve got 1 gig of ram and a 3 gb processor.

I’ve split the retrosheet stuff up by years and run one group at a time.


#19    terpsfan101      (see all posts) 2008/11/15 (Sat) @ 12:27

I know that partioning a hard-drive when you re-format it will increase the performance of a computer. Of course, you can only partition a hard-drive by re-formatting it. I partitioned my dad’s hard drive so all of the system files were at the beginning sectors of the drive. Files can be accessed quicker at the beginning sectors of the drive. After the system files, I partitioned a 4.38 GB space for a DVD image. Then, he could burn DVD’s at the maximum speed, and not have to worry about overflowing the buffer. The very last sectors of the drive were partioned for copies of his important files.

Partioning is very effective. I’m not suprised by Colin’s performance gains.

You also want to make sure you de-fragment your hard-drive regularly. Of course, make sure you are plugged into an uninterruptible power supply when you defragment. Another thing I do to free up memory is turn off unecessary Windows Services, like Automatic Updates and the Fax Service.

If your PC doesn’t have a lot of memory, you basically want to turn off/disable everything that you are not using when doing something that requires a lot of memory. This means closing your spyware/firewall programs and web browsers, and any other programs you have set to automatically load when Windows starts.


#20    Colin Wyers      (see all posts) 2008/11/15 (Sat) @ 12:43

Partitioning means something different for MySQL - you break the table down into smaller tables. Whenever you run a query, MySQL tries to load the entire table into memory (or at least the index), and so you churn the hard drive. By splitting the table up you are able to fit it into memory easier.

Indexing a table is another great way to improve performance, especially in the use cases we have, when you’re doing very few INSERTs and UPDATEs and a whole lot of SELECTs.


#21    terpsfan101      (see all posts) 2008/11/15 (Sat) @ 12:59

If you seperate a hard-drive into 5 partitions, it’s like having 5 seperate hard-drives. Each has a different letter C:/, D:/, ...

So partitioning a hard-drive is exactly like breaking down one large database table into smaller tables.


#22    Rally      (see all posts) 2008/11/15 (Sat) @ 23:44

I think I’ve found a problem with the Gameday database linked on the wiki, or else I did something wrong in setting it up.

If I wanted to look at velocity by pitcher, I’d go to the pitch table, which has all the pitch data and an atbat ID.  Then I look at the atbat table to find the pitcherID, the batter and pitcher fields do not contain player ID’s, instead it’s a 2 digit number (uniform #?)


#23    Rally      (see all posts) 2008/11/15 (Sat) @ 23:47

Never mind.  I’ve got to map that number to the player_team table, which then maps to the player table.


#24    David Pinto      (see all posts) 2008/11/15 (Sat) @ 23:57

Hi.  I’ve loaded the data into a MySQL database.  Does someone have a link to the definitions of the fields in the pitch table?  I get the idea that the the various X,Y, and Z fields are coordinates and speeds, but it’s not clear for what exactly, and which axis is which.

Thanks.


#25    Colin Wyers      (see all posts) 2008/11/16 (Sun) @ 00:19

Mike Fast’s glossary:

http://fastballs.wordpress.com/2007/08/02/glossary-of-the-gameday-pitch-fields/


#26    Rally      (see all posts) 2008/11/16 (Sun) @ 00:54

Frustrating to work with.  I can write a query, a pretty complex one actually, but can’t figure out how to save it.  I’m using the SQLyog program.


#27    SirKodiak      (see all posts) 2008/11/16 (Sun) @ 00:58

Dr. Alan Nathan’s glossary:
webusers.npl.uiuc.edu/~a-nathan/pob/tracking.htm

If you have Google’s SketchUp here is a model:
sketchup.google.com/3dwarehouse/details?mid=c0ce4fce69bdd0a3331a48af7bb20983


#28    Colin Wyers      (see all posts) 2008/11/16 (Sun) @ 01:10

I generally keep a scratchpad of a .txt file at hand for that - not the most convenient, I know.

For what you may be accustomed to in Access, try prefixing your query with:

CREATE VIEW view_name_here AS

And see what that does for you. It will file it under the view folder, not the table folder.


#29    Rally      (see all posts) 2008/11/16 (Sun) @ 01:27

Thanks, that worked.  This query will give you average fastball speed by pitcher:

create view Fastballs as
SELECT
player.playerid
, player.fname
, player.lname
, avg(pitch.start_speed)
, pitch.pitch_type
FROM
gameday.player_team
INNER JOIN gameday.player
ON (player_team.player$playerid = player.playerid)
INNER JOIN gameday.atbat
ON (player_team.player_team_id = atbat.pitcher)
INNER JOIN gameday.pitch
ON (atbat.atbatid = pitch.atbat$atbatid)
WHERE (pitch.pitch_type = “fa")
GROUP BY player.playerid, player.fname, player.lname;


#30    Sean      (see all posts) 2008/11/16 (Sun) @ 04:13

Ok, I’m new to mySQL and just downloaded all this.  Couple things:

Rally there is a small error in your query (I think), second to last line with the quotations. 

With Rally’s query is there a reason guys like Kershaw, R. Tejada, Liz didn’t show up for me?

Just playing around can you pull data like splitting up the data between starts and relief appearances?  And stuff like first 20 pitches, etc.?

Now to learn how to actually use mySQL.


#31    Rally      (see all posts) 2008/11/16 (Sun) @ 10:54

In the quotations, I’m selecting pitches that are coded as fastballs, “fa”.

You’re right that some players are missing.  There are a lot of pitches where start_speed is null, and also a lot of pitches that do have speed recorded but pitch_type is not coded (not sure when they implemented that, and I think Mike Fast found some errors in the way they did it.)

Change the WHERE to (start_speed is not null), and do a group by pitch_type and you’ll see a lot more data than my first query had.


#32    john      (see all posts) 2008/11/16 (Sun) @ 11:40

One thing you need to be aware of tho when it comes to the pitch_type column......gameday isnt very accurate when it comes to pitch type.  Fastballs I believe are pretty accurate but I’ve seen it mix up sliders/curves or sliders/changeups all the time.

Thats why you should really use a clustering algorthim to separate the pitches out, just to be more accurate.

For example the average speed of someone’s fastball isnt going to be accurate if gameday is lumping some 87mph changeups with it.


#33    Rally      (see all posts) 2008/11/16 (Sun) @ 13:05

I agree, if I were as deep into Pitch f/x as guys like Eric Seidman, Josh Kalk, and Mike Fast are (sorry if I missed anyone).

At this point I’m trying to learn mySql, while also getting a hang of the gameday structure.  Clustering algorithms are a pipe dream.

I have done some detective work and I believe this gameday database is incomplete.  I was looking for one of the guys my query came up with no data for, Jose Arredondo.  His playerID is 461766, which maps to 486 on the player_team table.  I could not find pitcher =486 on the pitch table.

I know he pitched on 5-26-08, pitched the 11th and 12th innings vs Detroit.  The inning ID’s are 47381 and 47383 - and those do not appear to be on the atbat table.


#34          (see all posts) 2008/11/16 (Sun) @ 14:20

Rally, one of the goals of getting us all on the same schema is that one of us can write a clustering algorithm and then share it with the rest.


#35    Colin Wyers      (see all posts) 2008/11/16 (Sun) @ 14:40

I do clustering with R; apparently the link to it got eaten by comment moderation. Try clicking on my name for a description. Automating it is the next step.


#36    Peter Jensen      (see all posts) 2008/11/16 (Sun) @ 14:57

Rally - Arrendondo is in the files I downloaded from Gameday for May 2008.  He also pitched on the 14th, 15th, and 18th, as well as the 26th. I downloaded the files directly from Gameday into Excel.  I don’t know why you can’t find him in the version you are using.


#37    john      (see all posts) 2008/11/16 (Sun) @ 15:31

I’ve been using SPSS to do K-means clustering which seems to give me good results.


#38    Rally      (see all posts) 2008/11/16 (Sun) @ 16:43

I downloaded the file linked to above.

I set up the mysql database, and restored from sql dump this file: sql_gd_07_08.zip

If you went to the same source, I have no idea why you’d have records that I don’t.


#39    Tangotiger      (see all posts) 2008/11/16 (Sun) @ 16:48

Post 27 was marked for moderation and is now live.

Colin also had a post marked for moderation, but he reposted successfully.


#40    Peter Jensen      (see all posts) 2008/11/16 (Sun) @ 17:12

Rally - I didn’t go to the linked file.  I downloaded the XML files directly from Gameday.  The information is in their XML files.  The problem is somewhere in the linked file or your downloading process.


#41    Jeremy      (see all posts) 2009/03/26 (Thu) @ 13:13

I believe I’m having the same problem as Rally. Has anyone diagnosed the bug?


#42    Jeremy      (see all posts) 2009/03/28 (Sat) @ 01:51

Tango,

Did the database work for you? Have you started doing pitch f/x analysis yet?


#43    Tangotiger      (see all posts) 2009/04/01 (Wed) @ 11:04

If anyone has experienced any problems and wants to offer a solution, please do so…


#44    Jeremy      (see all posts) 2009/04/27 (Mon) @ 21:27

Looks like a functional 2007-2008 gameday database has been posted at this url: http://www.wantlinux.net/2009/04/mysql-mlb-pitch-f-x-data/ Seems to be running smoothly for me. Can’t wait to start using it.


#45          (see all posts) 2009/04/28 (Tue) @ 13:51

My brother (the computer nerd of the two) and I are going to keep the Pitch F/X database up to date daily (between midnight and 4am MST) if the data is available.  If you have any concerns/comments/complaints just let us know on the download site or via email.


#46    john      (see all posts) 2009/07/14 (Tue) @ 13:38

Sorry to bump here but I’m trying to load Baseball on a Stick and I got a question

I went into DOS and did bbos.py -y 2009......that should give me all of 2009 data if I understand correctly.  I understand it should take awhile but shouldnt the files that it got from the gameday server show up on my computer somewhere?  I’m not getting any of them.


#47    Brian Cartwright      (see all posts) 2009/07/14 (Tue) @ 18:04

BBOS does not copy the Gameday .xml files to your computer.

It loads the data in the .xml files into memory, then writes the data in a MySQL database called ‘gameday’. From there queries can normalize and/or customize the data as you wish.


#48    Mike      (see all posts) 2009/07/15 (Wed) @ 12:49

Brian, you mentioned one of the benefits of everybody being on the same schema here is that one person can share a clustering algorithm with the rest.  Like Rally and others, I am just getting the hang of SQL, and this kind of algorithm is beyond my limited expertise.

Would anybody like to share a simple method to improve pitch-type classifications?


Page 1 of 1 pages


Name (required)
E-Mail (optional; WILL be published)
Website (optional)

<< Back to main


Latest...

COMMENTS

Feb 11 19:42
Who is Jeremy Lin?

Feb 11 19:33
Clutch analogy

Feb 11 19:12
Hero of the month: Brittney Baxter

Feb 11 17:59
MGL: Today on Clubhouse Confidential

Feb 11 16:48
Reader Mail of the Day: Why do we need X years of fielding data?  And what about outliers?

Feb 11 10:29
Dwight Evans

Feb 11 02:12
Performance through the ages

Feb 10 23:01
For Your Soul

Feb 10 18:32
Moneyball at Villanova

Feb 10 17:00
Psst… wanna intern in Canada?