THE BOOK cover
The Unwritten Book is Finally Written!
An in-depth analysis of: The sacrifice bunt, batter/pitcher matchups, the intentional base on balls, optimizing a batting lineup, hot and cold streaks, clutch performance, platooning strategies, and much more.
Read Excerpts & Customer Reviews
If you are a media member and would like a review copy of The Book, please contact Kevin Cuddihy of Potomac Books.

Buy The Book from Amazon

MOST RECENT ARTICLES
MAIL : You ask | We say

Advanced


THE BOOK--Playing The Percentages In Baseball

<< Back to main

Wednesday, April 16, 2008

Retrosheet Database, part 1

By Tangotiger, 02:12 PM

With taxes finally filed, and my (now) lack of interest in forecasting systems analysis behind me, I’ll be focusing on building a Retrosheet database.  I was envisioning releasing everything I do all at once.  But, I had second thoughts about that.  I might as well release things as I do them.

So, follow along with me, and you can build your database with me, and we can finally share our SQL code once we finish building it, since we’ll all be using the exact same names for everything.  This is what you have to do to start off:




1. Download CWEVENT

http://sourceforge.net/project/showfiles.php?group_id=57086

I’m using version 0.5.1, for Windows.



2. Download all the Retrosheet event files.  I use Internet Download Accelerator:
http://westbyte.com/ida/

I think they have a 30-day free trial.  I think I paid 30$ for the full version.  It suits my needs, because I do alot of spidering.  Baseball Hacks has some perl scripts to do the same thing.  You can also go to CNET’s Download.com for other free versions.  I’ve also used GetRight.com, and that one is free I think.

I can post the XML file that IDA uses for the list of files to download, this way you simply open up this XML file, and you will download exactly the files I download.

Placeholder for Tom



3. Execute CWEVENT.exe

Again, I have a batch file, and I’ll post that when I get home.  You simply execute the batch file, and you’ll get everything extracted for you.  Put this batch file in the same folder as the .ev* files.  Name it $cwevent.bat.  The $ simply sorts it to the top.  My cwevent.exe file is in the folder noted below.  You can create your own folder scheme.

REM CWEVENT
REM
REM First line includes the header flag 
(-n).  Decide whether you need it or not.
REM C:\Retrosheet\Ted_Chadwick\cwevent --f 0-96 -x 0-50 -y 2007 2007*.ev* > all2007.csv
C
:\Retrosheet\Ted_Chadwick\cwevent -f 0-96 -x 0-50 -y 2007 2007*.ev* > all2007.csv
C
:\Retrosheet\Ted_Chadwick\cwevent -f 0-96 -x 0-50 -y 2006 2006*.ev* > all2006.csv
C
:\Retrosheet\Ted_Chadwick\cwevent -f 0-96 -x 0-50 -y 2005 2005*.ev* > all2005.csv
C
:\Retrosheet\Ted_Chadwick\cwevent -f 0-96 -x 0-50 -y 2004 2004*.ev* > all2004.csv
C
:\Retrosheet\Ted_Chadwick\cwevent -f 0-96 -x 0-50 -y 2003 2003*.ev* > all2003.csv
C
:\Retrosheet\Ted_Chadwick\cwevent -f 0-96 -x 0-50 -y 2002 2002*.ev* > all2002.csv
C
:\Retrosheet\Ted_Chadwick\cwevent -f 0-96 -x 0-50 -y 2001 2001*.ev* > all2001.csv
C
:\Retrosheet\Ted_Chadwick\cwevent -f 0-96 -x 0-50 -y 2000 2000*.ev* > all2000.csv
C
:\Retrosheet\Ted_Chadwick\cwevent -f 0-96 -x 0-50 -y 1998 1998*.ev* > all1998.csv
C
:\Retrosheet\Ted_Chadwick\cwevent -f 0-96 -x 0-50 -y 1997 1997*.ev* > all1997.csv
C
:\Retrosheet\Ted_Chadwick\cwevent -f 0-96 -x 0-50 -y 1996 1996*.ev* > all1996.csv
C
:\Retrosheet\Ted_Chadwick\cwevent -f 0-96 -x 0-50 -y 1995 1995*.ev* > all1995.csv
C
:\Retrosheet\Ted_Chadwick\cwevent -f 0-96 -x 0-50 -y 1994 1994*.ev* > all1994.csv
C
:\Retrosheet\Ted_Chadwick\cwevent -f 0-96 -x 0-50 -y 1993 1993*.ev* > all1993.csv
C
:\Retrosheet\Ted_Chadwick\cwevent -f 0-96 -x 0-50 -y 1992 1992*.ev* > all1992.csv
C
:\Retrosheet\Ted_Chadwick\cwevent -f 0-96 -x 0-50 -y 1991 1991*.ev* > all1991.csv
C
:\Retrosheet\Ted_Chadwick\cwevent -f 0-96 -x 0-50 -y 1990 1990*.ev* > all1990.csv
C
:\Retrosheet\Ted_Chadwick\cwevent -f 0-96 -x 0-50 -y 1989 1989*.ev* > all1989.csv
C
:\Retrosheet\Ted_Chadwick\cwevent -f 0-96 -x 0-50 -y 1988 1988*.ev* > all1988.csv
C
:\Retrosheet\Ted_Chadwick\cwevent -f 0-96 -x 0-50 -y 1987 1987*.ev* > all1987.csv
C
:\Retrosheet\Ted_Chadwick\cwevent -f 0-96 -x 0-50 -y 1986 1986*.ev* > all1986.csv
C
:\Retrosheet\Ted_Chadwick\cwevent -f 0-96 -x 0-50 -y 1985 1985*.ev* > all1985.csv
C
:\Retrosheet\Ted_Chadwick\cwevent -f 0-96 -x 0-50 -y 1984 1984*.ev* > all1984.csv
C
:\Retrosheet\Ted_Chadwick\cwevent -f 0-96 -x 0-50 -y 1983 1983*.ev* > all1983.csv
C
:\Retrosheet\Ted_Chadwick\cwevent -f 0-96 -x 0-50 -y 1982 1982*.ev* > all1982.csv
C
:\Retrosheet\Ted_Chadwick\cwevent -f 0-96 -x 0-50 -y 1981 1981*.ev* > all1981.csv
C
:\Retrosheet\Ted_Chadwick\cwevent -f 0-96 -x 0-50 -y 1980 1980*.ev* > all1980.csv
C
:\Retrosheet\Ted_Chadwick\cwevent -f 0-96 -x 0-50 -y 1979 1979*.ev* > all1979.csv
C
:\Retrosheet\Ted_Chadwick\cwevent -f 0-96 -x 0-50 -y 1978 1978*.ev* > all1978.csv
C
:\Retrosheet\Ted_Chadwick\cwevent -f 0-96 -x 0-50 -y 1977 1977*.ev* > all1977.csv
C
:\Retrosheet\Ted_Chadwick\cwevent -f 0-96 -x 0-50 -y 1976 1976*.ev* > all1976.csv
C
:\Retrosheet\Ted_Chadwick\cwevent -f 0-96 -x 0-50 -y 1975 1975*.ev* > all1975.csv
C
:\Retrosheet\Ted_Chadwick\cwevent -f 0-96 -x 0-50 -y 1974 1974*.ev* > all1974.csv
C
:\Retrosheet\Ted_Chadwick\cwevent -f 0-96 -x 0-50 -y 1973 1973*.ev* > all1973.csv
C
:\Retrosheet\Ted_Chadwick\cwevent -f 0-96 -x 0-50 -y 1972 1972*.ev* > all1972.csv
C
:\Retrosheet\Ted_Chadwick\cwevent -f 0-96 -x 0-50 -y 1971 1971*.ev* > all1971.csv
C
:\Retrosheet\Ted_Chadwick\cwevent -f 0-96 -x 0-50 -y 1970 1970*.ev* > all1970.csv
C
:\Retrosheet\Ted_Chadwick\cwevent -f 0-96 -x 0-50 -y 1969 1969*.ev* > all1969.csv
C
:\Retrosheet\Ted_Chadwick\cwevent -f 0-96 -x 0-50 -y 1968 1968*.ev* > all1968.csv
C
:\Retrosheet\Ted_Chadwick\cwevent -f 0-96 -x 0-50 -y 1967 1967*.ev* > all1967.csv
C
:\Retrosheet\Ted_Chadwick\cwevent -f 0-96 -x 0-50 -y 1966 1966*.ev* > all1966.csv
C
:\Retrosheet\Ted_Chadwick\cwevent -f 0-96 -x 0-50 -y 1965 1965*.ev* > all1965.csv
C
:\Retrosheet\Ted_Chadwick\cwevent -f 0-96 -x 0-50 -y 1964 1964*.ev* > all1964.csv
C
:\Retrosheet\Ted_Chadwick\cwevent -f 0-96 -x 0-50 -y 1963 1963*.ev* > all1963.csv
C
:\Retrosheet\Ted_Chadwick\cwevent -f 0-96 -x 0-50 -y 1962 1962*.ev* > all1962.csv
C
:\Retrosheet\Ted_Chadwick\cwevent -f 0-96 -x 0-50 -y 1961 1961*.ev* > all1961.csv
C
:\Retrosheet\Ted_Chadwick\cwevent -f 0-96 -x 0-50 -y 1960 1960*.ev* > all1960.csv
C
:\Retrosheet\Ted_Chadwick\cwevent -f 0-96 -x 0-50 -y 1959 1959*.ev* > all1959.csv
C
:\Retrosheet\Ted_Chadwick\cwevent -f 0-96 -x 0-50 -y 1958 1958*.ev* > all1958.csv
C
:\Retrosheet\Ted_Chadwick\cwevent -f 0-96 -x 0-50 -y 1957 1957*.ev* > all1957.csv
C
:\Retrosheet\Ted_Chadwick\cwevent -f 0-96 -x 0-50 -y 1956 1956*.ev* > all1956.csv
C
:\Retrosheet\Ted_Chadwick\cwevent -f 0-96 -x 0-50 -y 1954 1954*.ev* > all1954.csv
C
:\Retrosheet\Ted_Chadwick\cwevent -f 0-96 -x 0-50 -y 1922 1922*.ev* > all1922.csv
C
:\Retrosheet\Ted_Chadwick\cwevent -f 0-96 -x 0-50 -y 1921 1921*.ev* > all1921.csv
C
:\Retrosheet\Ted_Chadwick\cwevent -f 0-96 -x 0-50 -y 1911 1911*.ev* > all1911.csv



4. Create your table.

This is the DDL for Oracle.  Other flavors of database systems will be very similar, with likely only the VARCHAR2 needing to be changed to VARCHAR or CHAR or STRING.  So, if someone is using MySQL, maybe you can post the equivalents for INTEGER and VARCHAR2.

Note: the names are 99% solid, and this is why I didn’t want to do this piecemeal, in case I change the names slightly.

Note2: I created an extra column, which is an autosequence field.  I might end up using that as my alternate key to the database.

DROP TABLE events;

CREATE TABLE events (
    
seq_events integer

,GAME_ID VARCHAR2(12)
,
AWAY_TEAM_ID VARCHAR2(3)
,
INN_CT INTEGER
,BAT_HOME_ID INTEGER
,OUTS_CT INTEGER
,BALLS_CT INTEGER
,STRIKES_CT INTEGER
,PITCH_SEQ_TX VARCHAR2(40)
,
AWAY_SCORE_CT INTEGER
,HOME_SCORE_CT INTEGER
,BAT_ID VARCHAR2(8)
,
BAT_HAND_CD VARCHAR2(1)
,
RESP_BAT_ID VARCHAR2(8)
,
RESP_BAT_HAND_CD VARCHAR2(1)
,
PIT_ID VARCHAR2(8)
,
PIT_HAND_CD VARCHAR2(1)
,
RESP_PIT_ID VARCHAR2(8)
,
RESP_PIT_HAND_CD VARCHAR2(1)
,
POS2_FLD_ID VARCHAR2(8)
,
POS3_FLD_ID VARCHAR2(8)
,
POS4_FLD_ID VARCHAR2(8)
,
POS5_FLD_ID VARCHAR2(8)
,
POS6_FLD_ID VARCHAR2(8)
,
POS7_FLD_ID VARCHAR2(8)
,
POS8_FLD_ID VARCHAR2(8)
,
POS9_FLD_ID VARCHAR2(8)
,
BASE1_RUN_ID VARCHAR2(8)
,
BASE2_RUN_ID VARCHAR2(8)
,
BASE3_RUN_ID VARCHAR2(8)
,
EVENT_TX VARCHAR2(100)
,
LEADOFF_FL VARCHAR2(1)
,
PH_FL VARCHAR2(1)
,
BAT_FLD_CD VARCHAR2(2)
,
BAT_LINEUP_ID INTEGER
,EVENT_CD INTEGER
,BAT_EVENT_FL VARCHAR2(1)
,
AB_FL VARCHAR2(1)
,
H_CD INTEGER
,SH_FL VARCHAR2(1)
,
SF_FL VARCHAR2(1)
,
EVENT_OUTS_CT INTEGER
,DP_FL VARCHAR2(1)
,
TP_FL VARCHAR2(1)
,
RBI_CT INTEGER
,WP_FL VARCHAR2(1)
,
PB_FL VARCHAR2(1)
,
FLD_CD INTEGER
,BATTEDBALL_CD VARCHAR2(1)
,
BUNT_FL VARCHAR2(1)
,
FOUL_FL VARCHAR2(1)
,
BATTEDBALL_LOC_TX VARCHAR2(5)
,
ERR_CT INTEGER
,ERR1_FLD_CD VARCHAR2(8)
,
ERR1_CD VARCHAR2(1)
,
ERR2_FLD_CD VARCHAR2(8)
,
ERR2_CD VARCHAR2(1)
,
ERR3_FLD_CD VARCHAR2(8)
,
ERR3_CD VARCHAR2(1)
,
BAT_DEST_ID INTEGER
,RUN1_DEST_ID INTEGER
,RUN2_DEST_ID INTEGER
,RUN3_DEST_ID INTEGER
,BAT_PLAY_TX VARCHAR2(8)
,
RUN1_PLAY_TX VARCHAR2(15)
,
RUN2_PLAY_TX VARCHAR2(15)
,
RUN3_PLAY_TX VARCHAR2(15)
,
RUN1_SB_FL VARCHAR2(1)
,
RUN2_SB_FL VARCHAR2(1)
,
RUN3_SB_FL VARCHAR2(1)
,
RUN1_CS_FL VARCHAR2(1)
,
RUN2_CS_FL VARCHAR2(1)
,
RUN3_CS_FL VARCHAR2(1)
,
RUN1_PK_FL VARCHAR2(1)
,
RUN2_PK_FL VARCHAR2(1)
,
RUN3_PK_FL VARCHAR2(1)
,
RUN1_RESP_PIT_ID VARCHAR2(8)
,
RUN2_RESP_PIT_ID VARCHAR2(8)
,
RUN3_RESP_PIT_ID VARCHAR2(8)
,
GAME_NEW_FL VARCHAR2(1)
,
GAME_END_FL VARCHAR2(1)
,
PR_RUN1_FL VARCHAR2(1)
,
PR_RUN2_FL VARCHAR2(1)
,
PR_RUN3_FL VARCHAR2(1)
,
REMOVED_FOR_PR_RUN1_ID VARCHAR2(8)
,
REMOVED_FOR_PR_RUN2_ID VARCHAR2(8)
,
REMOVED_FOR_PR_RUN3_ID VARCHAR2(8)
,
REMOVED_FOR_PH_BAT_ID VARCHAR2(8)
,
REMOVED_FOR_PH_BAT_FLD_CD VARCHAR2(8)
,
PO1_FLD_CD INTEGER
,PO2_FLD_CD INTEGER
,PO3_FLD_CD INTEGER
,ASS1_FLD_CD INTEGER
,ASS2_FLD_CD INTEGER
,ASS3_FLD_CD INTEGER
,ASS4_FLD_CD INTEGER
,ASS5_FLD_CD INTEGER
,EVENT_ID INTEGER
,HOME_TEAM_ID VARCHAR2(3)
,
BAT_TEAM_ID VARCHAR2(3)
,
FLD_TEAM_ID VARCHAR2(3)
,
BAT_LAST_ID INTEGER
,INN_NEW_FL VARCHAR2(1)
,
INN_END_FL VARCHAR2(1)
,
START_BAT_SCORE_CT INTEGER
,START_FLD_SCORE_CT INTEGER
,INN_RUNS_CT INTEGER
,GAME_PA_CT INTEGER
,INN_PA_CT INTEGER
,PA_NEW_FL VARCHAR2(1)
,
PA_TRUNC_FL VARCHAR2(1)
,
START_BASES_CD INTEGER
,END_BASES_CD INTEGER
,BAT_START_FL VARCHAR2(1)
,
PIT_START_FL VARCHAR2(1)
,
RUN1_FLD_CD VARCHAR2(2)
,
RUN1_LINEUP_CD INTEGER
,RUN1_ORIGIN_EVENT_ID INTEGER
,RUN2_FLD_CD VARCHAR2(2)
,
RUN2_LINEUP_CD INTEGER
,RUN2_ORIGIN_EVENT_ID INTEGER
,RUN3_FLD_CD VARCHAR2(2)
,
RUN3_LINEUP_CD INTEGER
,RUN3_ORIGIN_EVENT_ID INTEGER
,PA_BALL_CT INTEGER
,PA_INTENT_BALL_CT INTEGER
,PA_PITCHOUT_BALL_CT INTEGER
,PA_OTHER_BALL_CT INTEGER
,PA_STRIKE_CT INTEGER
,PA_CALLED_STRIKE_CT INTEGER
,PA_SWINGMISS_STRIKE_CT INTEGER
,PA_FOUL_STRIKE_CT INTEGER
,PA_OTHER_STRIKE_CT INTEGER
,EVENT_RUNS_CT INTEGER
,FLD_ID VARCHAR2(8)
,
BASE2_FORCE_FL VARCHAR2(1)
,
BASE3_FORCE_FL VARCHAR2(1)
,
BASE4_FORCE_FL VARCHAR2(1)
,
BAT_SAFE_ERR_FL VARCHAR2(1)
,
BAT_FATE_ID INTEGER
,RUN1_FATE_ID INTEGER
,RUN2_FATE_ID INTEGER
,RUN3_FATE_ID INTEGER
,FATE_RUNS_CT INTEGER
,ASS6_FLD_CD INTEGER
,ASS7_FLD_CD INTEGER
,ASS8_FLD_CD INTEGER
,ASS9_FLD_CD INTEGER
,ASS10_FLD_CD INTEGER


);

ALTER TABLE EVENTS
        ADD CONSTRAINT PK_EVENTS 
        PRIMARY KEY 
(GAME_IDEVENT_ID)
;

commit;

You also need to create a SEQUENCE:

CREATE SEQUENCE SEQ_EVENTS
START WITH 0
INCREMENT BY 1
MINVALUE 0
NOCACHE 
NOCYCLE 
NOORDER 
;



5. Load the table.

I’ll post this when I get home, but this is where we’ll get divergence.  Oracle has its own loader, so we’ll probably have to come up with a unique implementation for whatever you guys are using.

This is the control file you need:

OPTIONS ROWS=10000bindsize=6000000ERRORS=99999 )
LOAD DATA 
INFILE 
'C:\Retro2\data\events\parsed\all2007.csv' BADFILE 'C:\Retro2\data\events\parsed\all2007.bad' DISCARDFILE 'C:\Retro2\data\events\parsed\all2007.dsc'
INFILE 'C:\Retro2\data\events\parsed\all2006.csv' BADFILE 'C:\Retro2\data\events\parsed\all2006.bad' DISCARDFILE 'C:\Retro2\data\events\parsed\all2006.dsc'
INFILE 'C:\Retro2\data\events\parsed\all2005.csv' BADFILE 'C:\Retro2\data\events\parsed\all2005.bad' DISCARDFILE 'C:\Retro2\data\events\parsed\all2005.dsc'
INFILE 'C:\Retro2\data\events\parsed\all2004.csv' BADFILE 'C:\Retro2\data\events\parsed\all2004.bad' DISCARDFILE 'C:\Retro2\data\events\parsed\all2004.dsc'
INFILE 'C:\Retro2\data\events\parsed\all2003.csv' BADFILE 'C:\Retro2\data\events\parsed\all2003.bad' DISCARDFILE 'C:\Retro2\data\events\parsed\all2003.dsc'
INFILE 'C:\Retro2\data\events\parsed\all2002.csv' BADFILE 'C:\Retro2\data\events\parsed\all2002.bad' DISCARDFILE 'C:\Retro2\data\events\parsed\all2002.dsc'
INFILE 'C:\Retro2\data\events\parsed\all2001.csv' BADFILE 'C:\Retro2\data\events\parsed\all2001.bad' DISCARDFILE 'C:\Retro2\data\events\parsed\all2001.dsc'
INFILE 'C:\Retro2\data\events\parsed\all2000.csv' BADFILE 'C:\Retro2\data\events\parsed\all2000.bad' DISCARDFILE 'C:\Retro2\data\events\parsed\all2000.dsc'
INFILE 'C:\Retro2\data\events\parsed\all1998.csv' BADFILE 'C:\Retro2\data\events\parsed\all1998.bad' DISCARDFILE 'C:\Retro2\data\events\parsed\all1998.dsc'
INFILE 'C:\Retro2\data\events\parsed\all1997.csv' BADFILE 'C:\Retro2\data\events\parsed\all1997.bad' DISCARDFILE 'C:\Retro2\data\events\parsed\all1997.dsc'
INFILE 'C:\Retro2\data\events\parsed\all1996.csv' BADFILE 'C:\Retro2\data\events\parsed\all1996.bad' DISCARDFILE 'C:\Retro2\data\events\parsed\all1996.dsc'
INFILE 'C:\Retro2\data\events\parsed\all1995.csv' BADFILE 'C:\Retro2\data\events\parsed\all1995.bad' DISCARDFILE 'C:\Retro2\data\events\parsed\all1995.dsc'
INFILE 'C:\Retro2\data\events\parsed\all1994.csv' BADFILE 'C:\Retro2\data\events\parsed\all1994.bad' DISCARDFILE 'C:\Retro2\data\events\parsed\all1994.dsc'
INFILE 'C:\Retro2\data\events\parsed\all1993.csv' BADFILE 'C:\Retro2\data\events\parsed\all1993.bad' DISCARDFILE 'C:\Retro2\data\events\parsed\all1993.dsc'
INFILE 'C:\Retro2\data\events\parsed\all1992.csv' BADFILE 'C:\Retro2\data\events\parsed\all1992.bad' DISCARDFILE 'C:\Retro2\data\events\parsed\all1992.dsc'
INFILE 'C:\Retro2\data\events\parsed\all1991.csv' BADFILE 'C:\Retro2\data\events\parsed\all1991.bad' DISCARDFILE 'C:\Retro2\data\events\parsed\all1991.dsc'
INFILE 'C:\Retro2\data\events\parsed\all1990.csv' BADFILE 'C:\Retro2\data\events\parsed\all1990.bad' DISCARDFILE 'C:\Retro2\data\events\parsed\all1990.dsc'
INFILE 'C:\Retro2\data\events\parsed\all1989.csv' BADFILE 'C:\Retro2\data\events\parsed\all1989.bad' DISCARDFILE 'C:\Retro2\data\events\parsed\all1989.dsc'
INFILE 'C:\Retro2\data\events\parsed\all1988.csv' BADFILE 'C:\Retro2\data\events\parsed\all1988.bad' DISCARDFILE 'C:\Retro2\data\events\parsed\all1988.dsc'
INFILE 'C:\Retro2\data\events\parsed\all1987.csv' BADFILE 'C:\Retro2\data\events\parsed\all1987.bad' DISCARDFILE 'C:\Retro2\data\events\parsed\all1987.dsc'
INFILE 'C:\Retro2\data\events\parsed\all1986.csv' BADFILE 'C:\Retro2\data\events\parsed\all1986.bad' DISCARDFILE 'C:\Retro2\data\events\parsed\all1986.dsc'
INFILE 'C:\Retro2\data\events\parsed\all1985.csv' BADFILE 'C:\Retro2\data\events\parsed\all1985.bad' DISCARDFILE 'C:\Retro2\data\events\parsed\all1985.dsc'
INFILE 'C:\Retro2\data\events\parsed\all1984.csv' BADFILE 'C:\Retro2\data\events\parsed\all1984.bad' DISCARDFILE 'C:\Retro2\data\events\parsed\all1984.dsc'
INFILE 'C:\Retro2\data\events\parsed\all1983.csv' BADFILE 'C:\Retro2\data\events\parsed\all1983.bad' DISCARDFILE 'C:\Retro2\data\events\parsed\all1983.dsc'
INFILE 'C:\Retro2\data\events\parsed\all1982.csv' BADFILE 'C:\Retro2\data\events\parsed\all1982.bad' DISCARDFILE 'C:\Retro2\data\events\parsed\all1982.dsc'
INFILE 'C:\Retro2\data\events\parsed\all1981.csv' BADFILE 'C:\Retro2\data\events\parsed\all1981.bad' DISCARDFILE 'C:\Retro2\data\events\parsed\all1981.dsc'
INFILE 'C:\Retro2\data\events\parsed\all1980.csv' BADFILE 'C:\Retro2\data\events\parsed\all1980.bad' DISCARDFILE 'C:\Retro2\data\events\parsed\all1980.dsc'
INFILE 'C:\Retro2\data\events\parsed\all1979.csv' BADFILE 'C:\Retro2\data\events\parsed\all1979.bad' DISCARDFILE 'C:\Retro2\data\events\parsed\all1979.dsc'
INFILE 'C:\Retro2\data\events\parsed\all1978.csv' BADFILE 'C:\Retro2\data\events\parsed\all1978.bad' DISCARDFILE 'C:\Retro2\data\events\parsed\all1978.dsc'
INFILE 'C:\Retro2\data\events\parsed\all1977.csv' BADFILE 'C:\Retro2\data\events\parsed\all1977.bad' DISCARDFILE 'C:\Retro2\data\events\parsed\all1977.dsc'
INFILE 'C:\Retro2\data\events\parsed\all1976.csv' BADFILE 'C:\Retro2\data\events\parsed\all1976.bad' DISCARDFILE 'C:\Retro2\data\events\parsed\all1976.dsc'
INFILE 'C:\Retro2\data\events\parsed\all1975.csv' BADFILE 'C:\Retro2\data\events\parsed\all1975.bad' DISCARDFILE 'C:\Retro2\data\events\parsed\all1975.dsc'
INFILE 'C:\Retro2\data\events\parsed\all1974.csv' BADFILE 'C:\Retro2\data\events\parsed\all1974.bad' DISCARDFILE 'C:\Retro2\data\events\parsed\all1974.dsc'
INFILE 'C:\Retro2\data\events\parsed\all1973.csv' BADFILE 'C:\Retro2\data\events\parsed\all1973.bad' DISCARDFILE 'C:\Retro2\data\events\parsed\all1973.dsc'
INFILE 'C:\Retro2\data\events\parsed\all1972.csv' BADFILE 'C:\Retro2\data\events\parsed\all1972.bad' DISCARDFILE 'C:\Retro2\data\events\parsed\all1972.dsc'
INFILE 'C:\Retro2\data\events\parsed\all1971.csv' BADFILE 'C:\Retro2\data\events\parsed\all1971.bad' DISCARDFILE 'C:\Retro2\data\events\parsed\all1971.dsc'
INFILE 'C:\Retro2\data\events\parsed\all1970.csv' BADFILE 'C:\Retro2\data\events\parsed\all1970.bad' DISCARDFILE 'C:\Retro2\data\events\parsed\all1970.dsc'
INFILE 'C:\Retro2\data\events\parsed\all1969.csv' BADFILE 'C:\Retro2\data\events\parsed\all1969.bad' DISCARDFILE 'C:\Retro2\data\events\parsed\all1969.dsc'
INFILE 'C:\Retro2\data\events\parsed\all1968.csv' BADFILE 'C:\Retro2\data\events\parsed\all1968.bad' DISCARDFILE 'C:\Retro2\data\events\parsed\all1968.dsc'
INFILE 'C:\Retro2\data\events\parsed\all1967.csv' BADFILE 'C:\Retro2\data\events\parsed\all1967.bad' DISCARDFILE 'C:\Retro2\data\events\parsed\all1967.dsc'
INFILE 'C:\Retro2\data\events\parsed\all1966.csv' BADFILE 'C:\Retro2\data\events\parsed\all1966.bad' DISCARDFILE 'C:\Retro2\data\events\parsed\all1966.dsc'
INFILE 'C:\Retro2\data\events\parsed\all1965.csv' BADFILE 'C:\Retro2\data\events\parsed\all1965.bad' DISCARDFILE 'C:\Retro2\data\events\parsed\all1965.dsc'
INFILE 'C:\Retro2\data\events\parsed\all1964.csv' BADFILE 'C:\Retro2\data\events\parsed\all1964.bad' DISCARDFILE 'C:\Retro2\data\events\parsed\all1964.dsc'
INFILE 'C:\Retro2\data\events\parsed\all1963.csv' BADFILE 'C:\Retro2\data\events\parsed\all1963.bad' DISCARDFILE 'C:\Retro2\data\events\parsed\all1963.dsc'
INFILE 'C:\Retro2\data\events\parsed\all1962.csv' BADFILE 'C:\Retro2\data\events\parsed\all1962.bad' DISCARDFILE 'C:\Retro2\data\events\parsed\all1962.dsc'
INFILE 'C:\Retro2\data\events\parsed\all1961.csv' BADFILE 'C:\Retro2\data\events\parsed\all1961.bad' DISCARDFILE 'C:\Retro2\data\events\parsed\all1961.dsc'
INFILE 'C:\Retro2\data\events\parsed\all1960.csv' BADFILE 'C:\Retro2\data\events\parsed\all1960.bad' DISCARDFILE 'C:\Retro2\data\events\parsed\all1960.dsc'
INFILE 'C:\Retro2\data\events\parsed\all1959.csv' BADFILE 'C:\Retro2\data\events\parsed\all1959.bad' DISCARDFILE 'C:\Retro2\data\events\parsed\all1959.dsc'
INFILE 'C:\Retro2\data\events\parsed\all1958.csv' BADFILE 'C:\Retro2\data\events\parsed\all1958.bad' DISCARDFILE 'C:\Retro2\data\events\parsed\all1958.dsc'
INFILE 'C:\Retro2\data\events\parsed\all1957.csv' BADFILE 'C:\Retro2\data\events\parsed\all1957.bad' DISCARDFILE 'C:\Retro2\data\events\parsed\all1957.dsc'
INFILE 'C:\Retro2\data\events\parsed\all1956.csv' BADFILE 'C:\Retro2\data\events\parsed\all1956.bad' DISCARDFILE 'C:\Retro2\data\events\parsed\all1956.dsc'
INFILE 'C:\Retro2\data\events\parsed\all1954.csv' BADFILE 'C:\Retro2\data\events\parsed\all1954.bad' DISCARDFILE 'C:\Retro2\data\events\parsed\all1954.dsc'
INFILE 'C:\Retro2\data\events\parsed\all1922.csv' BADFILE 'C:\Retro2\data\events\parsed\all1922.bad' DISCARDFILE 'C:\Retro2\data\events\parsed\all1922.dsc'
INFILE 'C:\Retro2\data\events\parsed\all1921.csv' BADFILE 'C:\Retro2\data\events\parsed\all1921.bad' DISCARDFILE 'C:\Retro2\data\events\parsed\all1921.dsc'
INFILE 'C:\Retro2\data\events\parsed\all1911.csv' BADFILE 'C:\Retro2\data\events\parsed\all1911.bad' DISCARDFILE 'C:\Retro2\data\events\parsed\all1911.dsc'
INTO TABLE "EVENTS"

FIELDS TERMINATED BY ','
OPTIONALLY ENCLOSED BY '"' AND '"'
TRAILING NULLCOLS

  
(GAME_ID
AWAY_TEAM_ID
INN_CT
BAT_HOME_ID
OUTS_CT
BALLS_CT
STRIKES_CT
PITCH_SEQ_TX
AWAY_SCORE_CT
HOME_SCORE_CT
BAT_ID
BAT_HAND_CD
RESP_BAT_ID
RESP_BAT_HAND_CD
PIT_ID
PIT_HAND_CD
RESP_PIT_ID
RESP_PIT_HAND_CD
POS2_FLD_ID
POS3_FLD_ID
POS4_FLD_ID
POS5_FLD_ID
POS6_FLD_ID
POS7_FLD_ID
POS8_FLD_ID
POS9_FLD_ID
BASE1_RUN_ID
BASE2_RUN_ID
BASE3_RUN_ID
EVENT_TX
LEADOFF_FL
PH_FL
BAT_FLD_CD
BAT_LINEUP_ID
EVENT_CD
BAT_EVENT_FL
AB_FL
H_CD
SH_FL
SF_FL
EVENT_OUTS_CT
DP_FL
TP_FL
RBI_CT
WP_FL
PB_FL
FLD_CD
BATTEDBALL_CD
BUNT_FL
FOUL_FL
BATTEDBALL_LOC_TX
ERR_CT
ERR1_FLD_CD
ERR1_CD
ERR2_FLD_CD
ERR2_CD
ERR3_FLD_CD
ERR3_CD
BAT_DEST_ID
RUN1_DEST_ID
RUN2_DEST_ID
RUN3_DEST_ID
BAT_PLAY_TX
RUN1_PLAY_TX
RUN2_PLAY_TX
RUN3_PLAY_TX
RUN1_SB_FL
RUN2_SB_FL
RUN3_SB_FL
RUN1_CS_FL
RUN2_CS_FL
RUN3_CS_FL
RUN1_PK_FL
RUN2_PK_FL
RUN3_PK_FL
RUN1_RESP_PIT_ID
RUN2_RESP_PIT_ID
RUN3_RESP_PIT_ID
GAME_NEW_FL
GAME_END_FL
PR_RUN1_FL
PR_RUN2_FL
PR_RUN3_FL
REMOVED_FOR_PR_RUN1_ID
REMOVED_FOR_PR_RUN2_ID
REMOVED_FOR_PR_RUN3_ID
REMOVED_FOR_PH_BAT_ID
REMOVED_FOR_PH_BAT_FLD_CD
PO1_FLD_CD
PO2_FLD_CD
PO3_FLD_CD
ASS1_FLD_CD
ASS2_FLD_CD
ASS3_FLD_CD
ASS4_FLD_CD
ASS5_FLD_CD
EVENT_ID
HOME_TEAM_ID
BAT_TEAM_ID
FLD_TEAM_ID
BAT_LAST_ID
INN_NEW_FL
INN_END_FL
START_BAT_SCORE_CT
START_FLD_SCORE_CT
INN_RUNS_CT
GAME_PA_CT
INN_PA_CT
PA_NEW_FL
PA_TRUNC_FL
START_BASES_CD
END_BASES_CD
BAT_START_FL
PIT_START_FL
RUN1_FLD_CD
RUN1_LINEUP_CD
RUN1_ORIGIN_EVENT_ID
RUN2_FLD_CD
RUN2_LINEUP_CD
RUN2_ORIGIN_EVENT_ID
RUN3_FLD_CD
RUN3_LINEUP_CD
RUN3_ORIGIN_EVENT_ID
PA_BALL_CT
PA_INTENT_BALL_CT
PA_PITCHOUT_BALL_CT
PA_OTHER_BALL_CT
PA_STRIKE_CT
PA_CALLED_STRIKE_CT
PA_SWINGMISS_STRIKE_CT
PA_FOUL_STRIKE_CT
PA_OTHER_STRIKE_CT
EVENT_RUNS_CT
FLD_ID
BASE2_FORCE_FL
BASE3_FORCE_FL
BASE4_FORCE_FL
BAT_SAFE_ERR_FL
BAT_FATE_ID
RUN1_FATE_ID
RUN2_FATE_ID
RUN3_FATE_ID
FATE_RUNS_CT
ASS6_FLD_CD
ASS7_FLD_CD
ASS8_FLD_CD
ASS9_FLD_CD
ASS10_FLD_CD

SEQ_EVENTS "SEQ_EVENTS.NEXTVAL"
)

And this is the batch file that will execute the load:

"c:\oracle\ora92\bin\sqlldr" userid=myname/mypass control="C:\Retrosheet\chadwick\control\events.ctl"

Where myname and mypass is the username and password of your database user.



This is what my Windows directory looks like, if you want to set it up exactly as I do:

C:
   
Retro2
      data
         events
            parsed
            unzipped
            zipped
   Retrosheet
      chadwick
         control
         programs
         sql
      Ted_Chadwick



No guarantee that I won’t change things around, nor that I’ll update anything I do.  This is all as-is, with no promise of support.  Basically, you get to intermittently look over my shoulder.

SabermetricsData
#1          (see all posts) 2008/04/16 (Wed) @ 16:45

Why don’t you just do what was in Mark Adler’s book?  He provides code and the way he does it the database will stay up to date each year and there are less manual steps.  Or at least use it as a starting point.  He also used the simpler for beginners and free MySQL.


#2    Tangotiger      (see all posts) 2008/04/16 (Wed) @ 16:59

Joe Adler.  I’ve hyped Joe’s book several times in my blog, and recommend it to anyone.

***

This blog entry is a result of this hypotethetical exchange:

Fan: “Tango, can I see your work?”
Me: “Sure.”

It does *not* include this exchange:

Fan: “Uh, great.  Now that you opened it up for me, can you change your database, and make your processes more automated?  And make sure it updates annually.”
Me: “Sure.”


#3    Colin Wyers      (see all posts) 2008/04/16 (Wed) @ 18:16

Looks great so far. I use MySQL at home, and so when I get off work tonight I should be able to provide anyone that wants it with the “conversions.”


#4    Tangotiger      (see all posts) 2008/04/16 (Wed) @ 21:31

I made updates to the main blog entry.  I might just post links to the files instead of embedding the text within the blog entry.


#5    Tangotiger      (see all posts) 2008/04/16 (Wed) @ 22:30

I noticed a couple of fields I made a string, but they should be integers.  Such is the curse of my releasing beta-instructions.

I suggest that if you load your database, only do it with the 2007 data.  Things will change.


#6    Matt Mitchell      (see all posts) 2008/04/16 (Wed) @ 23:00

Tom,

Since I clearly don’t know everything, I’m assuming you’re using a Mac or other UNIX-based system, correct? Just want to clear before I comment with the Windows-based solutions I’m working with, though most of them appear to be in Joe’s book in some form.


#7    Colin Wyers      (see all posts) 2008/04/16 (Wed) @ 23:04

He’s using batch files, so that means he’s using DOS/Windows. He’s just using Oracle, a heavy-duty database far beyond the reach (or needs, really) of most of us mortals. That said, everything he uses to generate the CSV files with Chadwick will apply to any database system (even Access, if you’d like.) It’s getting those CSV files into the database that won’t work as-is on other systems.


#8    tangotiger      (see all posts) 2008/04/16 (Wed) @ 23:13

I’m using Windows, though all this should be portable to *NIX.

There are basic free versions of Oracle available, like Oracle Express.


#9    Colin Wyers      (see all posts) 2008/04/16 (Wed) @ 23:27

Apparently so.

“Oracle Database XE can be installed on any size host machine with any number of CPUs (one database per machine), but XE will store up to 4GB of user data, use up to 1GB of memory, and use one CPU on the host machine.”

If all you’re using it for is one Retrosheet database, I can’t see you running into the 4gb limit. If you have multiple CPUs or more than a gig of RAM, you may notice a performance decline, I suppose.


#10    Matt Mitchell      (see all posts) 2008/04/17 (Thu) @ 00:46

Thanks for the clarification. I too, like Colin, am using MySQL to try and load the event files. I also am using the Perl scripts from Baseball Hacks, available from the book’s website. (I’m still illiterate as to adding a link here, so feel free to do so for me).

Colin, what challenges did you run into with MySQL?


#11    Colin Wyers      (see all posts) 2008/04/17 (Thu) @ 01:57

Matt: I used BEVENT and a different set of definitions. I’ve done my best to duplicate Tango’s database definition in MySQL. Here goes:

CREATE TABLE events (
SEQ_EVENTS INTEGER NOT NULL UNIQUE KEY AUTO_INCREMENT
,GAME_ID VARCHAR(12)
,AWAY_TEAM_ID VARCHAR(3)
,INN_CT INTEGER
,BAT_HOME_ID INTEGER
,OUTS_CT INTEGER
,BALLS_CT INTEGER
,STRIKES_CT INTEGER
,PITCH_SEQ_TX VARCHAR(40)
,AWAY_SCORE_CT INTEGER
,HOME_SCORE_CT INTEGER
,BAT_ID VARCHAR(8)
,BAT_HAND_CD VARCHAR(1)
,RESP_BAT_ID VARCHAR(8)
,RESP_BAT_HAND_CD VARCHAR(1)
,PIT_ID VARCHAR(8)
,PIT_HAND_CD VARCHAR(1)
,RESP_PIT_ID VARCHAR(8)
,RESP_PIT_HAND_CD VARCHAR(1)
,POS2_FLD_ID VARCHAR(8)
,POS3_FLD_ID VARCHAR(8)
,POS4_FLD_ID VARCHAR(8)
,POS5_FLD_ID VARCHAR(8)
,POS6_FLD_ID VARCHAR(8)
,POS7_FLD_ID VARCHAR(8)
,POS8_FLD_ID VARCHAR(8)
,POS9_FLD_ID VARCHAR(8)
,BASE1_RUN_ID VARCHAR(8)
,BASE2_RUN_ID VARCHAR(8)
,BASE3_RUN_ID VARCHAR(8)
,EVENT_TX VARCHAR(100)
,LEADOFF_FL VARCHAR(1)
,PH_FL VARCHAR(1)
,BAT_FLD_CD VARCHAR(2)
,BAT_LINEUP_ID INTEGER
,EVENT_CD INTEGER
,BAT_EVENT_FL VARCHAR(1)
,AB_FL VARCHAR(1)
,H_CD INTEGER
,SH_FL VARCHAR(1)
,SF_FL VARCHAR(1)
,EVENT_OUTS_CT INTEGER
,DP_FL VARCHAR(1)
,TP_FL VARCHAR(1)
,RBI_CT INTEGER
,WP_FL VARCHAR(1)
,PB_FL VARCHAR(1)
,FLD_CD INTEGER
,BATTEDBALL_CD VARCHAR(1)
,BUNT_FL VARCHAR(1)
,FOUL_FL VARCHAR(1)
,BATTEDBALL_LOC_TX VARCHAR(5)
,ERR_CT INTEGER
,ERR1_FLD_CD VARCHAR(8)
,ERR1_CD VARCHAR(1)
,ERR2_FLD_CD VARCHAR(8)
,ERR2_CD VARCHAR(1)
,ERR3_FLD_CD VARCHAR(8)
,ERR3_CD VARCHAR(1)
,BAT_DEST_ID INTEGER
,RUN1_DEST_ID INTEGER
,RUN2_DEST_ID INTEGER
,RUN3_DEST_ID INTEGER
,BAT_PLAY_TX VARCHAR(8)
,RUN1_PLAY_TX VARCHAR(15)
,RUN2_PLAY_TX VARCHAR(15)
,RUN3_PLAY_TX VARCHAR(15)
,RUN1_SB_FL VARCHAR(1)
,RUN2_SB_FL VARCHAR(1)
,RUN3_SB_FL VARCHAR(1)
,RUN1_CS_FL VARCHAR(1)
,RUN2_CS_FL VARCHAR(1)
,RUN3_CS_FL VARCHAR(1)
,RUN1_PK_FL VARCHAR(1)
,RUN2_PK_FL VARCHAR(1)
,RUN3_PK_FL VARCHAR(1)
,RUN1_RESP_PIT_ID VARCHAR(8)
,RUN2_RESP_PIT_ID VARCHAR(8)
,RUN3_RESP_PIT_ID VARCHAR(8)
,GAME_NEW_FL VARCHAR(1)
,GAME_END_FL VARCHAR(1)
,PR_RUN1_FL VARCHAR(1)
,PR_RUN2_FL VARCHAR(1)
,PR_RUN3_FL VARCHAR(1)
,REMOVED_FOR_PR_RUN1_ID VARCHAR(8)
,REMOVED_FOR_PR_RUN2_ID VARCHAR(8)
,REMOVED_FOR_PR_RUN3_ID VARCHAR(8)
,REMOVED_FOR_PH_BAT_ID VARCHAR(8)
,REMOVED_FOR_PH_BAT_FLD_CD VARCHAR(8)
,PO1_FLD_CD INTEGER
,PO2_FLD_CD INTEGER
,PO3_FLD_CD INTEGER
,ASS1_FLD_CD INTEGER
,ASS2_FLD_CD INTEGER
,ASS3_FLD_CD INTEGER
,ASS4_FLD_CD INTEGER
,ASS5_FLD_CD INTEGER
,EVENT_ID INTEGER
,HOME_TEAM_ID VARCHAR(3)
,BAT_TEAM_ID VARCHAR(3)
,FLD_TEAM_ID VARCHAR(3)
,BAT_LAST_ID INTEGER
,INN_NEW_FL VARCHAR(1)
,INN_END_FL VARCHAR(1)
,START_BAT_SCORE_CT INTEGER
,START_FLD_SCORE_CT INTEGER
,INN_RUNS_CT INTEGER
,GAME_PA_CT INTEGER
,INN_PA_CT INTEGER
,PA_NEW_FL VARCHAR(1)
,PA_TRUNC_FL VARCHAR(1)
,START_BASES_CD INTEGER
,END_BASES_CD INTEGER
,BAT_START_FL VARCHAR(1)
,PIT_START_FL VARCHAR(1)
,RUN1_FLD_CD VARCHAR(2)
,RUN1_LINEUP_CD INTEGER
,RUN1_ORIGIN_EVENT_ID INTEGER
,RUN2_FLD_CD VARCHAR(2)
,RUN2_LINEUP_CD INTEGER
,RUN2_ORIGIN_EVENT_ID INTEGER
,RUN3_FLD_CD VARCHAR(2)
,RUN3_LINEUP_CD INTEGER
,RUN3_ORIGIN_EVENT_ID INTEGER
,PA_BALL_CT INTEGER
,PA_INTENT_BALL_CT INTEGER
,PA_PITCHOUT_BALL_CT INTEGER
,PA_OTHER_BALL_CT INTEGER
,PA_STRIKE_CT INTEGER
,PA_CALLED_STRIKE_CT INTEGER
,PA_SWINGMISS_STRIKE_CT INTEGER
,PA_FOUL_STRIKE_CT INTEGER
,PA_OTHER_STRIKE_CT INTEGER
,EVENT_RUNS_CT INTEGER
,FLD_ID VARCHAR(8)
,BASE2_FORCE_FL VARCHAR(1)
,BASE3_FORCE_FL VARCHAR(1)
,BASE4_FORCE_FL VARCHAR(1)