THE BOOK cover
The Unwritten Book is Finally Written!
An in-depth analysis of: The sacrifice bunt, batter/pitcher matchups, the intentional base on balls, optimizing a batting lineup, hot and cold streaks, clutch performance, platooning strategies, and much more.
Read Excerpts & Customer Reviews

Buy The Book from Amazon


SABR101 required reading if you enter this site. Check out the Sabermetric Wiki. And interesting baseball books.
MOST RECENT ARTICLES
MAIL : You ask | We say

Advanced


THE BOOK--Playing The Percentages In Baseball

<< Back to main

Wednesday, April 16, 2008

Retrosheet Database, part 1

By Tangotiger, 02:12 PM

With taxes finally filed, and my (now) lack of interest in forecasting systems analysis behind me, I’ll be focusing on building a Retrosheet database.  I was envisioning releasing everything I do all at once.  But, I had second thoughts about that.  I might as well release things as I do them.

So, follow along with me, and you can build your database with me, and we can finally share our SQL code once we finish building it, since we’ll all be using the exact same names for everything.  This is what you have to do to start off:




1. Download CWEVENT

http://sourceforge.net/project/showfiles.php?group_id=57086

I’m using version 0.5.1, for Windows.



2. Download all the Retrosheet event files.  I use Internet Download Accelerator:
http://westbyte.com/ida/

I think they have a 30-day free trial.  I think I paid 30$ for the full version.  It suits my needs, because I do alot of spidering.  Baseball Hacks has some perl scripts to do the same thing.  You can also go to CNET’s Download.com for other free versions.  I’ve also used GetRight.com, and that one is free I think.

I can post the XML file that IDA uses for the list of files to download, this way you simply open up this XML file, and you will download exactly the files I download.

Placeholder for Tom



3. Execute CWEVENT.exe

Again, I have a batch file, and I’ll post that when I get home.  You simply execute the batch file, and you’ll get everything extracted for you.  Put this batch file in the same folder as the .ev* files.  Name it $cwevent.bat.  The $ simply sorts it to the top.  My cwevent.exe file is in the folder noted below.  You can create your own folder scheme.

REM CWEVENT
REM
REM First line includes the header flag 
(-n).  Decide whether you need it or not.
REM C:\Retrosheet\Ted_Chadwick\cwevent --f 0-96 -x 0-50 -y 2007 2007*.ev* > all2007.csv
C
:\Retrosheet\Ted_Chadwick\cwevent -f 0-96 -x 0-50 -y 2007 2007*.ev* > all2007.csv
C
:\Retrosheet\Ted_Chadwick\cwevent -f 0-96 -x 0-50 -y 2006 2006*.ev* > all2006.csv
C
:\Retrosheet\Ted_Chadwick\cwevent -f 0-96 -x 0-50 -y 2005 2005*.ev* > all2005.csv
C
:\Retrosheet\Ted_Chadwick\cwevent -f 0-96 -x 0-50 -y 2004 2004*.ev* > all2004.csv
C
:\Retrosheet\Ted_Chadwick\cwevent -f 0-96 -x 0-50 -y 2003 2003*.ev* > all2003.csv
C
:\Retrosheet\Ted_Chadwick\cwevent -f 0-96 -x 0-50 -y 2002 2002*.ev* > all2002.csv
C
:\Retrosheet\Ted_Chadwick\cwevent -f 0-96 -x 0-50 -y 2001 2001*.ev* > all2001.csv
C
:\Retrosheet\Ted_Chadwick\cwevent -f 0-96 -x 0-50 -y 2000 2000*.ev* > all2000.csv
C
:\Retrosheet\Ted_Chadwick\cwevent -f 0-96 -x 0-50 -y 1998 1998*.ev* > all1998.csv
C
:\Retrosheet\Ted_Chadwick\cwevent -f 0-96 -x 0-50 -y 1997 1997*.ev* > all1997.csv
C
:\Retrosheet\Ted_Chadwick\cwevent -f 0-96 -x 0-50 -y 1996 1996*.ev* > all1996.csv
C
:\Retrosheet\Ted_Chadwick\cwevent -f 0-96 -x 0-50 -y 1995 1995*.ev* > all1995.csv
C
:\Retrosheet\Ted_Chadwick\cwevent -f 0-96 -x 0-50 -y 1994 1994*.ev* > all1994.csv
C
:\Retrosheet\Ted_Chadwick\cwevent -f 0-96 -x 0-50 -y 1993 1993*.ev* > all1993.csv
C
:\Retrosheet\Ted_Chadwick\cwevent -f 0-96 -x 0-50 -y 1992 1992*.ev* > all1992.csv
C
:\Retrosheet\Ted_Chadwick\cwevent -f 0-96 -x 0-50 -y 1991 1991*.ev* > all1991.csv
C
:\Retrosheet\Ted_Chadwick\cwevent -f 0-96 -x 0-50 -y 1990 1990*.ev* > all1990.csv
C
:\Retrosheet\Ted_Chadwick\cwevent -f 0-96 -x 0-50 -y 1989 1989*.ev* > all1989.csv
C
:\Retrosheet\Ted_Chadwick\cwevent -f 0-96 -x 0-50 -y 1988 1988*.ev* > all1988.csv
C
:\Retrosheet\Ted_Chadwick\cwevent -f 0-96 -x 0-50 -y 1987 1987*.ev* > all1987.csv
C
:\Retrosheet\Ted_Chadwick\cwevent -f 0-96 -x 0-50 -y 1986 1986*.ev* > all1986.csv
C
:\Retrosheet\Ted_Chadwick\cwevent -f 0-96 -x 0-50 -y 1985 1985*.ev* > all1985.csv
C
:\Retrosheet\Ted_Chadwick\cwevent -f 0-96 -x 0-50 -y 1984 1984*.ev* > all1984.csv
C
:\Retrosheet\Ted_Chadwick\cwevent -f 0-96 -x 0-50 -y 1983 1983*.ev* > all1983.csv
C
:\Retrosheet\Ted_Chadwick\cwevent -f 0-96 -x 0-50 -y 1982 1982*.ev* > all1982.csv
C
:\Retrosheet\Ted_Chadwick\cwevent -f 0-96 -x 0-50 -y 1981 1981*.ev* > all1981.csv
C
:\Retrosheet\Ted_Chadwick\cwevent -f 0-96 -x 0-50 -y 1980 1980*.ev* > all1980.csv
C
:\Retrosheet\Ted_Chadwick\cwevent -f 0-96 -x 0-50 -y 1979 1979*.ev* > all1979.csv
C
:\Retrosheet\Ted_Chadwick\cwevent -f 0-96 -x 0-50 -y 1978 1978*.ev* > all1978.csv
C
:\Retrosheet\Ted_Chadwick\cwevent -f 0-96 -x 0-50 -y 1977 1977*.ev* > all1977.csv
C
:\Retrosheet\Ted_Chadwick\cwevent -f 0-96 -x 0-50 -y 1976 1976*.ev* > all1976.csv
C
:\Retrosheet\Ted_Chadwick\cwevent -f 0-96 -x 0-50 -y 1975 1975*.ev* > all1975.csv
C
:\Retrosheet\Ted_Chadwick\cwevent -f 0-96 -x 0-50 -y 1974 1974*.ev* > all1974.csv
C
:\Retrosheet\Ted_Chadwick\cwevent -f 0-96 -x 0-50 -y 1973 1973*.ev* > all1973.csv
C
:\Retrosheet\Ted_Chadwick\cwevent -f 0-96 -x 0-50 -y 1972 1972*.ev* > all1972.csv
C
:\Retrosheet\Ted_Chadwick\cwevent -f 0-96 -x 0-50 -y 1971 1971*.ev* > all1971.csv
C
:\Retrosheet\Ted_Chadwick\cwevent -f 0-96 -x 0-50 -y 1970 1970*.ev* > all1970.csv
C
:\Retrosheet\Ted_Chadwick\cwevent -f 0-96 -x 0-50 -y 1969 1969*.ev* > all1969.csv
C
:\Retrosheet\Ted_Chadwick\cwevent -f 0-96 -x 0-50 -y 1968 1968*.ev* > all1968.csv
C
:\Retrosheet\Ted_Chadwick\cwevent -f 0-96 -x 0-50 -y 1967 1967*.ev* > all1967.csv
C
:\Retrosheet\Ted_Chadwick\cwevent -f 0-96 -x 0-50 -y 1966 1966*.ev* > all1966.csv
C
:\Retrosheet\Ted_Chadwick\cwevent -f 0-96 -x 0-50 -y 1965 1965*.ev* > all1965.csv
C
:\Retrosheet\Ted_Chadwick\cwevent -f 0-96 -x 0-50 -y 1964 1964*.ev* > all1964.csv
C
:\Retrosheet\Ted_Chadwick\cwevent -f 0-96 -x 0-50 -y 1963 1963*.ev* > all1963.csv
C
:\Retrosheet\Ted_Chadwick\cwevent -f 0-96 -x 0-50 -y 1962 1962*.ev* > all1962.csv
C
:\Retrosheet\Ted_Chadwick\cwevent -f 0-96 -x 0-50 -y 1961 1961*.ev* > all1961.csv
C
:\Retrosheet\Ted_Chadwick\cwevent -f 0-96 -x 0-50 -y 1960 1960*.ev* > all1960.csv
C
:\Retrosheet\Ted_Chadwick\cwevent -f 0-96 -x 0-50 -y 1959 1959*.ev* > all1959.csv
C
:\Retrosheet\Ted_Chadwick\cwevent -f 0-96 -x 0-50 -y 1958 1958*.ev* > all1958.csv
C
:\Retrosheet\Ted_Chadwick\cwevent -f 0-96 -x 0-50 -y 1957 1957*.ev* > all1957.csv
C
:\Retrosheet\Ted_Chadwick\cwevent -f 0-96 -x 0-50 -y 1956 1956*.ev* > all1956.csv
C
:\Retrosheet\Ted_Chadwick\cwevent -f 0-96 -x 0-50 -y 1954 1954*.ev* > all1954.csv
C
:\Retrosheet\Ted_Chadwick\cwevent -f 0-96 -x 0-50 -y 1922 1922*.ev* > all1922.csv
C
:\Retrosheet\Ted_Chadwick\cwevent -f 0-96 -x 0-50 -y 1921 1921*.ev* > all1921.csv
C
:\Retrosheet\Ted_Chadwick\cwevent -f 0-96 -x 0-50 -y 1911 1911*.ev* > all1911.csv



4. Create your table.

This is the DDL for Oracle.  Other flavors of database systems will be very similar, with likely only the VARCHAR2 needing to be changed to VARCHAR or CHAR or STRING.  So, if someone is using MySQL, maybe you can post the equivalents for INTEGER and VARCHAR2.

Note: the names are 99% solid, and this is why I didn’t want to do this piecemeal, in case I change the names slightly.

Note2: I created an extra column, which is an autosequence field.  I might end up using that as my alternate key to the database.

DROP TABLE events;

CREATE TABLE events (
    
seq_events integer

,GAME_ID VARCHAR2(12)
,
AWAY_TEAM_ID VARCHAR2(3)
,
INN_CT INTEGER
,BAT_HOME_ID INTEGER
,OUTS_CT INTEGER
,BALLS_CT INTEGER
,STRIKES_CT INTEGER
,PITCH_SEQ_TX VARCHAR2(40)
,
AWAY_SCORE_CT INTEGER
,HOME_SCORE_CT INTEGER
,BAT_ID VARCHAR2(8)
,
BAT_HAND_CD VARCHAR2(1)
,
RESP_BAT_ID VARCHAR2(8)
,
RESP_BAT_HAND_CD VARCHAR2(1)
,
PIT_ID VARCHAR2(8)
,
PIT_HAND_CD VARCHAR2(1)
,
RESP_PIT_ID VARCHAR2(8)
,
RESP_PIT_HAND_CD VARCHAR2(1)
,
POS2_FLD_ID VARCHAR2(8)
,
POS3_FLD_ID VARCHAR2(8)
,
POS4_FLD_ID VARCHAR2(8)
,
POS5_FLD_ID VARCHAR2(8)
,
POS6_FLD_ID VARCHAR2(8)
,
POS7_FLD_ID VARCHAR2(8)
,
POS8_FLD_ID VARCHAR2(8)
,
POS9_FLD_ID VARCHAR2(8)
,
BASE1_RUN_ID VARCHAR2(8)
,
BASE2_RUN_ID VARCHAR2(8)
,
BASE3_RUN_ID VARCHAR2(8)
,
EVENT_TX VARCHAR2(100)
,
LEADOFF_FL VARCHAR2(1)
,
PH_FL VARCHAR2(1)
,
BAT_FLD_CD VARCHAR2(2)
,
BAT_LINEUP_ID INTEGER
,EVENT_CD INTEGER
,BAT_EVENT_FL VARCHAR2(1)
,
AB_FL VARCHAR2(1)
,
H_CD INTEGER
,SH_FL VARCHAR2(1)
,
SF_FL VARCHAR2(1)
,
EVENT_OUTS_CT INTEGER
,DP_FL VARCHAR2(1)
,
TP_FL VARCHAR2(1)
,
RBI_CT INTEGER
,WP_FL VARCHAR2(1)
,
PB_FL VARCHAR2(1)
,
FLD_CD INTEGER
,BATTEDBALL_CD VARCHAR2(1)
,
BUNT_FL VARCHAR2(1)
,
FOUL_FL VARCHAR2(1)
,
BATTEDBALL_LOC_TX VARCHAR2(5)
,
ERR_CT INTEGER
,ERR1_FLD_CD VARCHAR2(8)
,
ERR1_CD VARCHAR2(1)
,
ERR2_FLD_CD VARCHAR2(8)
,
ERR2_CD VARCHAR2(1)
,
ERR3_FLD_CD VARCHAR2(8)
,
ERR3_CD VARCHAR2(1)
,
BAT_DEST_ID INTEGER
,RUN1_DEST_ID INTEGER
,RUN2_DEST_ID INTEGER
,RUN3_DEST_ID INTEGER
,BAT_PLAY_TX VARCHAR2(8)
,
RUN1_PLAY_TX VARCHAR2(15)
,
RUN2_PLAY_TX VARCHAR2(15)
,
RUN3_PLAY_TX VARCHAR2(15)
,
RUN1_SB_FL VARCHAR2(1)
,
RUN2_SB_FL VARCHAR2(1)
,
RUN3_SB_FL VARCHAR2(1)
,
RUN1_CS_FL VARCHAR2(1)
,
RUN2_CS_FL VARCHAR2(1)
,
RUN3_CS_FL VARCHAR2(1)
,
RUN1_PK_FL VARCHAR2(1)
,
RUN2_PK_FL VARCHAR2(1)
,
RUN3_PK_FL VARCHAR2(1)
,
RUN1_RESP_PIT_ID VARCHAR2(8)
,
RUN2_RESP_PIT_ID VARCHAR2(8)
,
RUN3_RESP_PIT_ID VARCHAR2(8)
,
GAME_NEW_FL VARCHAR2(1)
,
GAME_END_FL VARCHAR2(1)
,
PR_RUN1_FL VARCHAR2(1)
,
PR_RUN2_FL VARCHAR2(1)
,
PR_RUN3_FL VARCHAR2(1)
,
REMOVED_FOR_PR_RUN1_ID VARCHAR2(8)
,
REMOVED_FOR_PR_RUN2_ID VARCHAR2(8)
,
REMOVED_FOR_PR_RUN3_ID VARCHAR2(8)
,
REMOVED_FOR_PH_BAT_ID VARCHAR2(8)
,
REMOVED_FOR_PH_BAT_FLD_CD VARCHAR2(8)
,
PO1_FLD_CD INTEGER
,PO2_FLD_CD INTEGER
,PO3_FLD_CD INTEGER
,ASS1_FLD_CD INTEGER
,ASS2_FLD_CD INTEGER
,ASS3_FLD_CD INTEGER
,ASS4_FLD_CD INTEGER
,ASS5_FLD_CD INTEGER
,EVENT_ID INTEGER
,HOME_TEAM_ID VARCHAR2(3)
,
BAT_TEAM_ID VARCHAR2(3)
,
FLD_TEAM_ID VARCHAR2(3)
,
BAT_LAST_ID INTEGER
,INN_NEW_FL VARCHAR2(1)
,
INN_END_FL VARCHAR2(1)
,
START_BAT_SCORE_CT INTEGER
,START_FLD_SCORE_CT INTEGER
,INN_RUNS_CT INTEGER
,GAME_PA_CT INTEGER
,INN_PA_CT INTEGER
,PA_NEW_FL VARCHAR2(1)
,
PA_TRUNC_FL VARCHAR2(1)
,
START_BASES_CD INTEGER
,END_BASES_CD INTEGER
,BAT_START_FL VARCHAR2(1)
,
PIT_START_FL VARCHAR2(1)
,
RUN1_FLD_CD VARCHAR2(2)
,
RUN1_LINEUP_CD INTEGER
,RUN1_ORIGIN_EVENT_ID INTEGER
,RUN2_FLD_CD VARCHAR2(2)
,
RUN2_LINEUP_CD INTEGER
,RUN2_ORIGIN_EVENT_ID INTEGER
,RUN3_FLD_CD VARCHAR2(2)
,
RUN3_LINEUP_CD INTEGER
,RUN3_ORIGIN_EVENT_ID INTEGER
,PA_BALL_CT INTEGER
,PA_INTENT_BALL_CT INTEGER
,PA_PITCHOUT_BALL_CT INTEGER
,PA_OTHER_BALL_CT INTEGER
,PA_STRIKE_CT INTEGER
,PA_CALLED_STRIKE_CT INTEGER
,PA_SWINGMISS_STRIKE_CT INTEGER
,PA_FOUL_STRIKE_CT INTEGER
,PA_OTHER_STRIKE_CT INTEGER
,EVENT_RUNS_CT INTEGER
,FLD_ID VARCHAR2(8)
,
BASE2_FORCE_FL VARCHAR2(1)
,
BASE3_FORCE_FL VARCHAR2(1)
,
BASE4_FORCE_FL VARCHAR2(1)
,
BAT_SAFE_ERR_FL VARCHAR2(1)
,
BAT_FATE_ID INTEGER
,RUN1_FATE_ID INTEGER
,RUN2_FATE_ID INTEGER
,RUN3_FATE_ID INTEGER
,FATE_RUNS_CT INTEGER
,ASS6_FLD_CD INTEGER
,ASS7_FLD_CD INTEGER
,ASS8_FLD_CD INTEGER
,ASS9_FLD_CD INTEGER
,ASS10_FLD_CD INTEGER


);

ALTER TABLE EVENTS
        ADD CONSTRAINT PK_EVENTS 
        PRIMARY KEY 
(GAME_IDEVENT_ID)
;

commit;

You also need to create a SEQUENCE:

CREATE SEQUENCE SEQ_EVENTS
START WITH 0
INCREMENT BY 1
MINVALUE 0
NOCACHE 
NOCYCLE 
NOORDER 
;



5. Load the table.

I’ll post this when I get home, but this is where we’ll get divergence.  Oracle has its own loader, so we’ll probably have to come up with a unique implementation for whatever you guys are using.

This is the control file you need:

OPTIONS ROWS=10000bindsize=6000000ERRORS=99999 )
LOAD DATA 
INFILE 
'C:\Retro2\data\events\parsed\all2007.csv' BADFILE 'C:\Retro2\data\events\parsed\all2007.bad' DISCARDFILE 'C:\Retro2\data\events\parsed\all2007.dsc'
INFILE 'C:\Retro2\data\events\parsed\all2006.csv' BADFILE 'C:\Retro2\data\events\parsed\all2006.bad' DISCARDFILE 'C:\Retro2\data\events\parsed\all2006.dsc'
INFILE 'C:\Retro2\data\events\parsed\all2005.csv' BADFILE 'C:\Retro2\data\events\parsed\all2005.bad' DISCARDFILE 'C:\Retro2\data\events\parsed\all2005.dsc'
INFILE 'C:\Retro2\data\events\parsed\all2004.csv' BADFILE 'C:\Retro2\data\events\parsed\all2004.bad' DISCARDFILE 'C:\Retro2\data\events\parsed\all2004.dsc'
INFILE 'C:\Retro2\data\events\parsed\all2003.csv' BADFILE 'C:\Retro2\data\events\parsed\all2003.bad' DISCARDFILE 'C:\Retro2\data\events\parsed\all2003.dsc'
INFILE 'C:\Retro2\data\events\parsed\all2002.csv' BADFILE 'C:\Retro2\data\events\parsed\all2002.bad' DISCARDFILE 'C:\Retro2\data\events\parsed\all2002.dsc'
INFILE 'C:\Retro2\data\events\parsed\all2001.csv' BADFILE 'C:\Retro2\data\events\parsed\all2001.bad' DISCARDFILE 'C:\Retro2\data\events\parsed\all2001.dsc'
INFILE 'C:\Retro2\data\events\parsed\all2000.csv' BADFILE 'C:\Retro2\data\events\parsed\all2000.bad' DISCARDFILE 'C:\Retro2\data\events\parsed\all2000.dsc'
INFILE 'C:\Retro2\data\events\parsed\all1998.csv' BADFILE 'C:\Retro2\data\events\parsed\all1998.bad' DISCARDFILE 'C:\Retro2\data\events\parsed\all1998.dsc'
INFILE 'C:\Retro2\data\events\parsed\all1997.csv' BADFILE 'C:\Retro2\data\events\parsed\all1997.bad' DISCARDFILE 'C:\Retro2\data\events\parsed\all1997.dsc'
INFILE 'C:\Retro2\data\events\parsed\all1996.csv' BADFILE 'C:\Retro2\data\events\parsed\all1996.bad' DISCARDFILE 'C:\Retro2\data\events\parsed\all1996.dsc'
INFILE 'C:\Retro2\data\events\parsed\all1995.csv' BADFILE 'C:\Retro2\data\events\parsed\all1995.bad' DISCARDFILE 'C:\Retro2\data\events\parsed\all1995.dsc'
INFILE 'C:\Retro2\data\events\parsed\all1994.csv' BADFILE 'C:\Retro2\data\events\parsed\all1994.bad' DISCARDFILE 'C:\Retro2\data\events\parsed\all1994.dsc'
INFILE 'C:\Retro2\data\events\parsed\all1993.csv' BADFILE 'C:\Retro2\data\events\parsed\all1993.bad' DISCARDFILE 'C:\Retro2\data\events\parsed\all1993.dsc'
INFILE 'C:\Retro2\data\events\parsed\all1992.csv' BADFILE 'C:\Retro2\data\events\parsed\all1992.bad' DISCARDFILE 'C:\Retro2\data\events\parsed\all1992.dsc'
INFILE 'C:\Retro2\data\events\parsed\all1991.csv' BADFILE 'C:\Retro2\data\events\parsed\all1991.bad' DISCARDFILE 'C:\Retro2\data\events\parsed\all1991.dsc'
INFILE 'C:\Retro2\data\events\parsed\all1990.csv' BADFILE 'C:\Retro2\data\events\parsed\all1990.bad' DISCARDFILE 'C:\Retro2\data\events\parsed\all1990.dsc'
INFILE 'C:\Retro2\data\events\parsed\all1989.csv' BADFILE 'C:\Retro2\data\events\parsed\all1989.bad' DISCARDFILE 'C:\Retro2\data\events\parsed\all1989.dsc'
INFILE 'C:\Retro2\data\events\parsed\all1988.csv' BADFILE 'C:\Retro2\data\events\parsed\all1988.bad' DISCARDFILE 'C:\Retro2\data\events\parsed\all1988.dsc'
INFILE 'C:\Retro2\data\events\parsed\all1987.csv' BADFILE 'C:\Retro2\data\events\parsed\all1987.bad' DISCARDFILE 'C:\Retro2\data\events\parsed\all1987.dsc'
INFILE 'C:\Retro2\data\events\parsed\all1986.csv' BADFILE 'C:\Retro2\data\events\parsed\all1986.bad' DISCARDFILE 'C:\Retro2\data\events\parsed\all1986.dsc'
INFILE 'C:\Retro2\data\events\parsed\all1985.csv' BADFILE 'C:\Retro2\data\events\parsed\all1985.bad' DISCARDFILE 'C:\Retro2\data\events\parsed\all1985.dsc'
INFILE 'C:\Retro2\data\events\parsed\all1984.csv' BADFILE 'C:\Retro2\data\events\parsed\all1984.bad' DISCARDFILE 'C:\Retro2\data\events\parsed\all1984.dsc'
INFILE 'C:\Retro2\data\events\parsed\all1983.csv' BADFILE 'C:\Retro2\data\events\parsed\all1983.bad' DISCARDFILE 'C:\Retro2\data\events\parsed\all1983.dsc'
INFILE 'C:\Retro2\data\events\parsed\all1982.csv' BADFILE 'C:\Retro2\data\events\parsed\all1982.bad' DISCARDFILE 'C:\Retro2\data\events\parsed\all1982.dsc'
INFILE 'C:\Retro2\data\events\parsed\all1981.csv' BADFILE 'C:\Retro2\data\events\parsed\all1981.bad' DISCARDFILE 'C:\Retro2\data\events\parsed\all1981.dsc'
INFILE 'C:\Retro2\data\events\parsed\all1980.csv' BADFILE 'C:\Retro2\data\events\parsed\all1980.bad' DISCARDFILE 'C:\Retro2\data\events\parsed\all1980.dsc'
INFILE 'C:\Retro2\data\events\parsed\all1979.csv' BADFILE 'C:\Retro2\data\events\parsed\all1979.bad' DISCARDFILE 'C:\Retro2\data\events\parsed\all1979.dsc'
INFILE 'C:\Retro2\data\events\parsed\all1978.csv' BADFILE 'C:\Retro2\data\events\parsed\all1978.bad' DISCARDFILE 'C:\Retro2\data\events\parsed\all1978.dsc'
INFILE 'C:\Retro2\data\events\parsed\all1977.csv' BADFILE 'C:\Retro2\data\events\parsed\all1977.bad' DISCARDFILE 'C:\Retro2\data\events\parsed\all1977.dsc'
INFILE 'C:\Retro2\data\events\parsed\all1976.csv' BADFILE 'C:\Retro2\data\events\parsed\all1976.bad' DISCARDFILE 'C:\Retro2\data\events\parsed\all1976.dsc'
INFILE 'C:\Retro2\data\events\parsed\all1975.csv' BADFILE 'C:\Retro2\data\events\parsed\all1975.bad' DISCARDFILE 'C:\Retro2\data\events\parsed\all1975.dsc'
INFILE 'C:\Retro2\data\events\parsed\all1974.csv' BADFILE 'C:\Retro2\data\events\parsed\all1974.bad' DISCARDFILE 'C:\Retro2\data\events\parsed\all1974.dsc'
INFILE 'C:\Retro2\data\events\parsed\all1973.csv' BADFILE 'C:\Retro2\data\events\parsed\all1973.bad' DISCARDFILE 'C:\Retro2\data\events\parsed\all1973.dsc'
INFILE 'C:\Retro2\data\events\parsed\all1972.csv' BADFILE 'C:\Retro2\data\events\parsed\all1972.bad' DISCARDFILE 'C:\Retro2\data\events\parsed\all1972.dsc'
INFILE 'C:\Retro2\data\events\parsed\all1971.csv' BADFILE 'C:\Retro2\data\events\parsed\all1971.bad' DISCARDFILE 'C:\Retro2\data\events\parsed\all1971.dsc'
INFILE 'C:\Retro2\data\events\parsed\all1970.csv' BADFILE 'C:\Retro2\data\events\parsed\all1970.bad' DISCARDFILE 'C:\Retro2\data\events\parsed\all1970.dsc'
INFILE 'C:\Retro2\data\events\parsed\all1969.csv' BADFILE 'C:\Retro2\data\events\parsed\all1969.bad' DISCARDFILE 'C:\Retro2\data\events\parsed\all1969.dsc'
INFILE 'C:\Retro2\data\events\parsed\all1968.csv' BADFILE 'C:\Retro2\data\events\parsed\all1968.bad' DISCARDFILE 'C:\Retro2\data\events\parsed\all1968.dsc'
INFILE 'C:\Retro2\data\events\parsed\all1967.csv' BADFILE 'C:\Retro2\data\events\parsed\all1967.bad' DISCARDFILE 'C:\Retro2\data\events\parsed\all1967.dsc'
INFILE 'C:\Retro2\data\events\parsed\all1966.csv' BADFILE 'C:\Retro2\data\events\parsed\all1966.bad' DISCARDFILE 'C:\Retro2\data\events\parsed\all1966.dsc'
INFILE 'C:\Retro2\data\events\parsed\all1965.csv' BADFILE 'C:\Retro2\data\events\parsed\all1965.bad' DISCARDFILE 'C:\Retro2\data\events\parsed\all1965.dsc'
INFILE 'C:\Retro2\data\events\parsed\all1964.csv' BADFILE 'C:\Retro2\data\events\parsed\all1964.bad' DISCARDFILE 'C:\Retro2\data\events\parsed\all1964.dsc'
INFILE 'C:\Retro2\data\events\parsed\all1963.csv' BADFILE 'C:\Retro2\data\events\parsed\all1963.bad' DISCARDFILE 'C:\Retro2\data\events\parsed\all1963.dsc'
INFILE 'C:\Retro2\data\events\parsed\all1962.csv' BADFILE 'C:\Retro2\data\events\parsed\all1962.bad' DISCARDFILE 'C:\Retro2\data\events\parsed\all1962.dsc'
INFILE 'C:\Retro2\data\events\parsed\all1961.csv' BADFILE 'C:\Retro2\data\events\parsed\all1961.bad' DISCARDFILE 'C:\Retro2\data\events\parsed\all1961.dsc'
INFILE 'C:\Retro2\data\events\parsed\all1960.csv' BADFILE 'C:\Retro2\data\events\parsed\all1960.bad' DISCARDFILE 'C:\Retro2\data\events\parsed\all1960.dsc'
INFILE 'C:\Retro2\data\events\parsed\all1959.csv' BADFILE 'C:\Retro2\data\events\parsed\all1959.bad' DISCARDFILE 'C:\Retro2\data\events\parsed\all1959.dsc'
INFILE 'C:\Retro2\data\events\parsed\all1958.csv' BADFILE 'C:\Retro2\data\events\parsed\all1958.bad' DISCARDFILE 'C:\Retro2\data\events\parsed\all1958.dsc'
INFILE 'C:\Retro2\data\events\parsed\all1957.csv' BADFILE 'C:\Retro2\data\events\parsed\all1957.bad' DISCARDFILE 'C:\Retro2\data\events\parsed\all1957.dsc'
INFILE 'C:\Retro2\data\events\parsed\all1956.csv' BADFILE 'C:\Retro2\data\events\parsed\all1956.bad' DISCARDFILE 'C:\Retro2\data\events\parsed\all1956.dsc'
INFILE 'C:\Retro2\data\events\parsed\all1954.csv' BADFILE 'C:\Retro2\data\events\parsed\all1954.bad' DISCARDFILE 'C:\Retro2\data\events\parsed\all1954.dsc'
INFILE 'C:\Retro2\data\events\parsed\all1922.csv' BADFILE 'C:\Retro2\data\events\parsed\all1922.bad' DISCARDFILE 'C:\Retro2\data\events\parsed\all1922.dsc'
INFILE 'C:\Retro2\data\events\parsed\all1921.csv' BADFILE 'C:\Retro2\data\events\parsed\all1921.bad' DISCARDFILE 'C:\Retro2\data\events\parsed\all1921.dsc'
INFILE 'C:\Retro2\data\events\parsed\all1911.csv' BADFILE 'C:\Retro2\data\events\parsed\all1911.bad' DISCARDFILE 'C:\Retro2\data\events\parsed\all1911.dsc'
INTO TABLE "EVENTS"

FIELDS TERMINATED BY ','
OPTIONALLY ENCLOSED BY '"' AND '"'
TRAILING NULLCOLS

  
(GAME_ID
AWAY_TEAM_ID
INN_CT
BAT_HOME_ID
OUTS_CT
BALLS_CT
STRIKES_CT
PITCH_SEQ_TX
AWAY_SCORE_CT
HOME_SCORE_CT
BAT_ID
BAT_HAND_CD
RESP_BAT_ID
RESP_BAT_HAND_CD
PIT_ID
PIT_HAND_CD
RESP_PIT_ID
RESP_PIT_HAND_CD
POS2_FLD_ID
POS3_FLD_ID
POS4_FLD_ID
POS5_FLD_ID
POS6_FLD_ID
POS7_FLD_ID
POS8_FLD_ID
POS9_FLD_ID
BASE1_RUN_ID
BASE2_RUN_ID
BASE3_RUN_ID
EVENT_TX
LEADOFF_FL
PH_FL
BAT_FLD_CD
BAT_LINEUP_ID
EVENT_CD
BAT_EVENT_FL
AB_FL
H_CD
SH_FL
SF_FL
EVENT_OUTS_CT
DP_FL
TP_FL
RBI_CT
WP_FL
PB_FL
FLD_CD
BATTEDBALL_CD
BUNT_FL
FOUL_FL
BATTEDBALL_LOC_TX
ERR_CT
ERR1_FLD_CD
ERR1_CD
ERR2_FLD_CD
ERR2_CD
ERR3_FLD_CD
ERR3_CD
BAT_DEST_ID
RUN1_DEST_ID
RUN2_DEST_ID
RUN3_DEST_ID
BAT_PLAY_TX
RUN1_PLAY_TX
RUN2_PLAY_TX
RUN3_PLAY_TX
RUN1_SB_FL
RUN2_SB_FL
RUN3_SB_FL
RUN1_CS_FL
RUN2_CS_FL
RUN3_CS_FL
RUN1_PK_FL
RUN2_PK_FL
RUN3_PK_FL
RUN1_RESP_PIT_ID
RUN2_RESP_PIT_ID
RUN3_RESP_PIT_ID
GAME_NEW_FL
GAME_END_FL
PR_RUN1_FL
PR_RUN2_FL
PR_RUN3_FL
REMOVED_FOR_PR_RUN1_ID
REMOVED_FOR_PR_RUN2_ID
REMOVED_FOR_PR_RUN3_ID
REMOVED_FOR_PH_BAT_ID
REMOVED_FOR_PH_BAT_FLD_CD
PO1_FLD_CD
PO2_FLD_CD
PO3_FLD_CD
ASS1_FLD_CD
ASS2_FLD_CD
ASS3_FLD_CD
ASS4_FLD_CD
ASS5_FLD_CD
EVENT_ID
HOME_TEAM_ID
BAT_TEAM_ID
FLD_TEAM_ID
BAT_LAST_ID
INN_NEW_FL
INN_END_FL
START_BAT_SCORE_CT
START_FLD_SCORE_CT
INN_RUNS_CT
GAME_PA_CT
INN_PA_CT
PA_NEW_FL
PA_TRUNC_FL
START_BASES_CD
END_BASES_CD
BAT_START_FL
PIT_START_FL
RUN1_FLD_CD
RUN1_LINEUP_CD
RUN1_ORIGIN_EVENT_ID
RUN2_FLD_CD
RUN2_LINEUP_CD
RUN2_ORIGIN_EVENT_ID
RUN3_FLD_CD
RUN3_LINEUP_CD
RUN3_ORIGIN_EVENT_ID
PA_BALL_CT
PA_INTENT_BALL_CT
PA_PITCHOUT_BALL_CT
PA_OTHER_BALL_CT
PA_STRIKE_CT
PA_CALLED_STRIKE_CT
PA_SWINGMISS_STRIKE_CT
PA_FOUL_STRIKE_CT
PA_OTHER_STRIKE_CT
EVENT_RUNS_CT
FLD_ID
BASE2_FORCE_FL
BASE3_FORCE_FL
BASE4_FORCE_FL
BAT_SAFE_ERR_FL
BAT_FATE_ID
RUN1_FATE_ID
RUN2_FATE_ID
RUN3_FATE_ID
FATE_RUNS_CT
ASS6_FLD_CD
ASS7_FLD_CD
ASS8_FLD_CD
ASS9_FLD_CD
ASS10_FLD_CD

SEQ_EVENTS "SEQ_EVENTS.NEXTVAL"
)

And this is the batch file that will execute the load:

"c:\oracle\ora92\bin\sqlldr" userid=myname/mypass control="C:\Retrosheet\chadwick\control\events.ctl"

Where myname and mypass is the username and password of your database user.



This is what my Windows directory looks like, if you want to set it up exactly as I do:

C:
   
Retro2
      data
         events
            parsed
            unzipped
            zipped
   Retrosheet
      chadwick
         control
         programs
         sql
      Ted_Chadwick



No guarantee that I won’t change things around, nor that I’ll update anything I do.  This is all as-is, with no promise of support.  Basically, you get to intermittently look over my shoulder.

SabermetricsData
#1          (see all posts) 2008/04/16 (Wed) @ 16:45

Why don’t you just do what was in Mark Adler’s book?  He provides code and the way he does it the database will stay up to date each year and there are less manual steps.  Or at least use it as a starting point.  He also used the simpler for beginners and free MySQL.


#2    Tangotiger      (see all posts) 2008/04/16 (Wed) @ 16:59

Joe Adler.  I’ve hyped Joe’s book several times in my blog, and recommend it to anyone.

***

This blog entry is a result of this hypotethetical exchange:

Fan: “Tango, can I see your work?”
Me: “Sure.”

It does *not* include this exchange:

Fan: “Uh, great.  Now that you opened it up for me, can you change your database, and make your processes more automated?  And make sure it updates annually.”
Me: “Sure.”


#3    Colin Wyers      (see all posts) 2008/04/16 (Wed) @ 18:16

Looks great so far. I use MySQL at home, and so when I get off work tonight I should be able to provide anyone that wants it with the “conversions.”


#4    Tangotiger      (see all posts) 2008/04/16 (Wed) @ 21:31

I made updates to the main blog entry.  I might just post links to the files instead of embedding the text within the blog entry.


#5    Tangotiger      (see all posts) 2008/04/16 (Wed) @ 22:30

I noticed a couple of fields I made a string, but they should be integers.  Such is the curse of my releasing beta-instructions.

I suggest that if you load your database, only do it with the 2007 data.  Things will change.


#6    Matt Mitchell      (see all posts) 2008/04/16 (Wed) @ 23:00

Tom,

Since I clearly don’t know everything, I’m assuming you’re using a Mac or other UNIX-based system, correct? Just want to clear before I comment with the Windows-based solutions I’m working with, though most of them appear to be in Joe’s book in some form.


#7    Colin Wyers      (see all posts) 2008/04/16 (Wed) @ 23:04

He’s using batch files, so that means he’s using DOS/Windows. He’s just using Oracle, a heavy-duty database far beyond the reach (or needs, really) of most of us mortals. That said, everything he uses to generate the CSV files with Chadwick will apply to any database system (even Access, if you’d like.) It’s getting those CSV files into the database that won’t work as-is on other systems.


#8    tangotiger      (see all posts) 2008/04/16 (Wed) @ 23:13

I’m using Windows, though all this should be portable to *NIX.

There are basic free versions of Oracle available, like Oracle Express.


#9    Colin Wyers      (see all posts) 2008/04/16 (Wed) @ 23:27

Apparently so.

“Oracle Database XE can be installed on any size host machine with any number of CPUs (one database per machine), but XE will store up to 4GB of user data, use up to 1GB of memory, and use one CPU on the host machine.”

If all you’re using it for is one Retrosheet database, I can’t see you running into the 4gb limit. If you have multiple CPUs or more than a gig of RAM, you may notice a performance decline, I suppose.


#10    Matt Mitchell      (see all posts) 2008/04/17 (Thu) @ 00:46

Thanks for the clarification. I too, like Colin, am using MySQL to try and load the event files. I also am using the Perl scripts from Baseball Hacks, available from the book’s website. (I’m still illiterate as to adding a link here, so feel free to do so for me).

Colin, what challenges did you run into with MySQL?


#11    Colin Wyers      (see all posts) 2008/04/17 (Thu) @ 01:57

Matt: I used BEVENT and a different set of definitions. I’ve done my best to duplicate Tango’s database definition in MySQL. Here goes:

CREATE TABLE events (
SEQ_EVENTS INTEGER NOT NULL UNIQUE KEY AUTO_INCREMENT
,GAME_ID VARCHAR(12)
,AWAY_TEAM_ID VARCHAR(3)
,INN_CT INTEGER
,BAT_HOME_ID INTEGER
,OUTS_CT INTEGER
,BALLS_CT INTEGER
,STRIKES_CT INTEGER
,PITCH_SEQ_TX VARCHAR(40)
,AWAY_SCORE_CT INTEGER
,HOME_SCORE_CT INTEGER
,BAT_ID VARCHAR(8)
,BAT_HAND_CD VARCHAR(1)
,RESP_BAT_ID VARCHAR(8)
,RESP_BAT_HAND_CD VARCHAR(1)
,PIT_ID VARCHAR(8)
,PIT_HAND_CD VARCHAR(1)
,RESP_PIT_ID VARCHAR(8)
,RESP_PIT_HAND_CD VARCHAR(1)
,POS2_FLD_ID VARCHAR(8)
,POS3_FLD_ID VARCHAR(8)
,POS4_FLD_ID VARCHAR(8)
,POS5_FLD_ID VARCHAR(8)
,POS6_FLD_ID VARCHAR(8)
,POS7_FLD_ID VARCHAR(8)
,POS8_FLD_ID VARCHAR(8)
,POS9_FLD_ID VARCHAR(8)
,BASE1_RUN_ID VARCHAR(8)
,BASE2_RUN_ID VARCHAR(8)
,BASE3_RUN_ID VARCHAR(8)
,EVENT_TX VARCHAR(100)
,LEADOFF_FL VARCHAR(1)
,PH_FL VARCHAR(1)
,BAT_FLD_CD VARCHAR(2)
,BAT_LINEUP_ID INTEGER
,EVENT_CD INTEGER
,BAT_EVENT_FL VARCHAR(1)
,AB_FL VARCHAR(1)
,H_CD INTEGER
,SH_FL VARCHAR(1)
,SF_FL VARCHAR(1)
,EVENT_OUTS_CT INTEGER
,DP_FL VARCHAR(1)
,TP_FL VARCHAR(1)
,RBI_CT INTEGER
,WP_FL VARCHAR(1)
,PB_FL VARCHAR(1)
,FLD_CD INTEGER
,BATTEDBALL_CD VARCHAR(1)
,BUNT_FL VARCHAR(1)
,FOUL_FL VARCHAR(1)
,BATTEDBALL_LOC_TX VARCHAR(5)
,ERR_CT INTEGER
,ERR1_FLD_CD VARCHAR(8)
,ERR1_CD VARCHAR(1)
,ERR2_FLD_CD VARCHAR(8)
,ERR2_CD VARCHAR(1)
,ERR3_FLD_CD VARCHAR(8)
,ERR3_CD VARCHAR(1)
,BAT_DEST_ID INTEGER
,RUN1_DEST_ID INTEGER
,RUN2_DEST_ID INTEGER
,RUN3_DEST_ID INTEGER
,BAT_PLAY_TX VARCHAR(8)
,RUN1_PLAY_TX VARCHAR(15)
,RUN2_PLAY_TX VARCHAR(15)
,RUN3_PLAY_TX VARCHAR(15)
,RUN1_SB_FL VARCHAR(1)
,RUN2_SB_FL VARCHAR(1)
,RUN3_SB_FL VARCHAR(1)
,RUN1_CS_FL VARCHAR(1)
,RUN2_CS_FL VARCHAR(1)
,RUN3_CS_FL VARCHAR(1)
,RUN1_PK_FL VARCHAR(1)
,RUN2_PK_FL VARCHAR(1)
,RUN3_PK_FL VARCHAR(1)
,RUN1_RESP_PIT_ID VARCHAR(8)
,RUN2_RESP_PIT_ID VARCHAR(8)
,RUN3_RESP_PIT_ID VARCHAR(8)
,GAME_NEW_FL VARCHAR(1)
,GAME_END_FL VARCHAR(1)
,PR_RUN1_FL VARCHAR(1)
,PR_RUN2_FL VARCHAR(1)
,PR_RUN3_FL VARCHAR(1)
,REMOVED_FOR_PR_RUN1_ID VARCHAR(8)
,REMOVED_FOR_PR_RUN2_ID VARCHAR(8)
,REMOVED_FOR_PR_RUN3_ID VARCHAR(8)
,REMOVED_FOR_PH_BAT_ID VARCHAR(8)
,REMOVED_FOR_PH_BAT_FLD_CD VARCHAR(8)
,PO1_FLD_CD INTEGER
,PO2_FLD_CD INTEGER
,PO3_FLD_CD INTEGER
,ASS1_FLD_CD INTEGER
,ASS2_FLD_CD INTEGER
,ASS3_FLD_CD INTEGER
,ASS4_FLD_CD INTEGER
,ASS5_FLD_CD INTEGER
,EVENT_ID INTEGER
,HOME_TEAM_ID VARCHAR(3)
,BAT_TEAM_ID VARCHAR(3)
,FLD_TEAM_ID VARCHAR(3)
,BAT_LAST_ID INTEGER
,INN_NEW_FL VARCHAR(1)
,INN_END_FL VARCHAR(1)
,START_BAT_SCORE_CT INTEGER
,START_FLD_SCORE_CT INTEGER
,INN_RUNS_CT INTEGER
,GAME_PA_CT INTEGER
,INN_PA_CT INTEGER
,PA_NEW_FL VARCHAR(1)
,PA_TRUNC_FL VARCHAR(1)
,START_BASES_CD INTEGER
,END_BASES_CD INTEGER
,BAT_START_FL VARCHAR(1)
,PIT_START_FL VARCHAR(1)
,RUN1_FLD_CD VARCHAR(2)
,RUN1_LINEUP_CD INTEGER
,RUN1_ORIGIN_EVENT_ID INTEGER
,RUN2_FLD_CD VARCHAR(2)
,RUN2_LINEUP_CD INTEGER
,RUN2_ORIGIN_EVENT_ID INTEGER
,RUN3_FLD_CD VARCHAR(2)
,RUN3_LINEUP_CD INTEGER
,RUN3_ORIGIN_EVENT_ID INTEGER
,PA_BALL_CT INTEGER
,PA_INTENT_BALL_CT INTEGER
,PA_PITCHOUT_BALL_CT INTEGER
,PA_OTHER_BALL_CT INTEGER
,PA_STRIKE_CT INTEGER
,PA_CALLED_STRIKE_CT INTEGER
,PA_SWINGMISS_STRIKE_CT INTEGER
,PA_FOUL_STRIKE_CT INTEGER
,PA_OTHER_STRIKE_CT INTEGER
,EVENT_RUNS_CT INTEGER
,FLD_ID VARCHAR(8)
,BASE2_FORCE_FL VARCHAR(1)
,BASE3_FORCE_FL VARCHAR(1)
,BASE4_FORCE_FL VARCHAR(1)
,BAT_SAFE_ERR_FL VARCHAR(1)
,BAT_FATE_ID INTEGER
,RUN1_FATE_ID INTEGER
,RUN2_FATE_ID INTEGER
,RUN3_FATE_ID INTEGER
,FATE_RUNS_CT INTEGER
,ASS6_FLD_CD INTEGER
,ASS7_FLD_CD INTEGER
,ASS8_FLD_CD INTEGER
,ASS9_FLD_CD INTEGER
,ASS10_FLD_CD INTEGER
,PRIMARY KEY (GAME_ID, EVENT_ID)
);

I seriously need to get to bed - I’ll try and post something on how to load the data in MySQL tomorrow.


#12    joe arthur      (see all posts) 2008/04/17 (Thu) @ 07:34

For an Oracle database, defining the numeric fields as date type PLS_INTEGER(1) or PLS_INTEGER(2) is preferable to “INTEGER”, which supposedly is treated by Oracle as an alias for NUMBER(38) - much bigger than needed and not actually an integer. In mysql, the equivalent is smallint(1) or smallint(2).


#13    Tangotiger      (see all posts) 2008/04/17 (Thu) @ 09:14

My problem with the smallint is when you start to do sum(myCol).  With 8 million records in the database, the datatype of the smallint may carry through to the sum column.

I should look into making the database leaner, though.

***

Colin: the memory requirement looks ambiguous, but you will not suffer any kind of degradation if you have too much memory.  As for the multiple CPU, that is an Oracle “honor system” requirement to limit the use of the database to single non-power users.  As for the 4GB, you will get pretty close with the Retro database.  With 8 million records, you just need 500 bytes per record to reach that limit.  I think I’m pretty close to that, though I’ll have to check at home.  Of course, you don’t have to load the entire set, or you can go to MySQL or the other free ones out there that don’t have such a restriction.


#14    Anthony      (see all posts) 2008/04/17 (Thu) @ 10:31

Regarding the database size, can Access handle something this large, or will I need to get MySQL?


#15    Tangotiger      (see all posts) 2008/04/17 (Thu) @ 10:45

I have found that when you get passed 1 million records, Access starts to choke.  In the current years, and season is almost 200,000 records.

On the other hand, Access DOES allow you to connect through an ODBC connection to other databases (easy to setup).  So, you can use the wonderful front-end of Access, to get to your other databases.


#16    Tangotiger      (see all posts) 2008/04/17 (Thu) @ 11:01

I was a little sloppy with FLD_xx

Any field that ends with… means…
FLD_ID: his 8 character player id, as a text string
FLD_CD: numeric fielding position

So, it looks like I have it right in some places, and wrong in others.  If CWEVENT is spitting out a player id, then the field should end with FLD_ID as a VARCHAR2(8).  If CWEVENT is spitting out a position, then the field should end with FLD_CD as an INTEGER.

I’ll correct that tonight.

***

You will note a certain standard in the field names.  The last two characters all end with one of:
_CD: a code
_CT: a counter, number
_FL: a switch, flag
_ID: an identifier (names, teams, etc)
_TX: a string, text

Based on this, we know basically what the field is about.

The set of characters directly preceding the suffix is the “object”.  So,
BAT_ID
RESP_BAT_ID
REMOVED_FOR_PH_BAT_ID

All refer to the actual batter’s player ID.  The text preceding “BAT_ID” gives it additional context.  So, the first field is strictly the batter.  The second field is the “responsible” batter (for cases in mid-PA switches).  And the third field is the batter who was removed (not the pinch hitter).

Similarly:
BASE1_RUN_ID
BASE2_RUN_ID
BASE3_RUN_ID

Here we are talking about the player id of a runner.  The BASE1 refers to the runner on first base.

If you think that some of my names are not consistent, please speak up.  Once we’re done here, this will be pretty much set in stone.  Changing field names is not something that we like to do.

The full description of each field can be obtained by running CWEVENT with the -h flag.


#17    tangotiger      (see all posts) 2008/04/17 (Thu) @ 21:39

These are the corrections to the datatype:
ALTER TABLE EVENTS
MODIFY (ERR1_FLD_CD INTEGER )
MODIFY (ERR2_FLD_CD INTEGER )
MODIFY (ERR3_FLD_CD INTEGER )
MODIFY (BAT_FLD_CD INTEGER )
MODIFY (RUN1_FLD_CD INTEGER )
MODIFY (RUN2_FLD_CD INTEGER )
MODIFY (RUN3_FLD_CD INTEGER )

The field names are correct.  If you already loaded the table, you may be prevented from changing the datatype.  The easiest way out of the mess is to start from scratch.

***

The Retrosheet schema takes up 3.9 GB in Oracle.  Add in the sysadmin schemas and the temp schema, and you’ll be above the 4GB limit.  I intend to create additional tables, so look for a database that can handle more than 5 or 6 GB.


#18    tangotiger      (see all posts) 2008/04/17 (Thu) @ 22:04

ALTER TABLE EVENTS
RENAME COLUMN RUN1_LINEUP_CD to RUN1_LINEUP_ID
RENAME COLUMN RUN2_LINEUP_CD to RUN2_LINEUP_ID
RENAME COLUMN RUN3_LINEUP_CD to RUN3_LINEUP_ID


#19    Tangotiger      (see all posts) 2008/04/29 (Tue) @ 20:29

I’ll be taking my posts here:

Until I’ve got something more production-ready, at which point I’ll make an announcement here.

I don’t think my thread here was the right format.


#20    tangotiger      (see all posts) 2008/05/05 (Mon) @ 22:38

I updated the wiki:

http://www.tangotiger.net/wiki/index.php?title=Retrosheet_Database


#21    terpsfan101      (see all posts) 2009/11/30 (Mon) @ 15:49

I wasn’t sure what thread I should post this. I noticed an error with the Chadwick program when processing plays where 99 (unknown fielding credit) is listed in the event text. Chadwick mistakengly assigns a putout and an assist to the rightfielder (pos #9) on these plays. It also mistakengly assigns a 9 in the FLD_CD column when the BAT_EVENT_FL is set to T.


#22    Tangotiger      (see all posts) 2009/11/30 (Mon) @ 15:51

Great job.  I will alert Ted, and check this out myself tonight.


#23    Tangotiger      (see all posts) 2009/11/30 (Mon) @ 15:59

I posted your comment to Retrolist, and I also said:

“We discussed this last year, and when I process this data, I ignore anything in the putout/assist fields on the “99” plays.  So, I don’t know if this is an “error” so much as a design decision (of CWEVENT and BEVENT).  If I remember right, a value was needed so the counts are correct.  And for whatever reason, the choice was limited to 1-9.”


#24    terpsfan101      (see all posts) 2009/11/30 (Mon) @ 18:52

The fielding portion of a few event text’s would start with 99, like a catch and baserunner kill by the rightfielder. I wonder if the unknown fielding credit flag is set to T for these plays. If not, then you would want to make sure you use the unknonw fielding credit flag to find these plays, not just the event text, as the event text would show a few false positives for those plays where the fielding credit starts with 99.


#25    john      (see all posts) 2009/12/15 (Tue) @ 16:23

Ok.  I’ve done this last year and I was trying to update it this year and running into issues.

1.  I’m using Mysql
2.  I’ve downloaded the cwevent,cwgame files etc etc.
3.  I downloaded the event files
4.  I unzipped them
5.  I have in the parse folder all2005 to all2009.csv, games2005 to games2009.csv, and subs2005 to subs2009.csv.

6. I used Colins events table to create the table.

thats about where I’m at.  I’m having issues loading the csv files into Mysql.  Do I need to import the files one by one?  Whats the difference between all2005, games2005 and subs2005?  Do I need them all?  Do I need any other table but “events”?  Any help would be greatly appreciated.


Page 1 of 1 pages


Name (required)
E-Mail (optional; WILL be published)
Website (optional)

<< Back to main


Latest...

COMMENTS

Feb 08 15:23
When is a life entity considered a person?

Feb 08 15:14
New PECOTA

Feb 08 14:44
When to purposefully lose the lead

Feb 08 13:49
The will of the people?

Feb 08 11:43
Is Nate Silver alot more certain than he lets on?

Feb 08 09:02
Forecaster’s Challenge: 2012?

Feb 08 07:43
For Your Soul

Feb 08 02:00
Batman, the webslinger?

Feb 08 01:22
Why I’d Bet on My Model (and Against My Instincts)

Feb 07 20:05
Golfers “playing through”