THE BOOK cover
The Unwritten Book is Finally Written!
An in-depth analysis of: The sacrifice bunt, batter/pitcher matchups, the intentional base on balls, optimizing a batting lineup, hot and cold streaks, clutch performance, platooning strategies, and much more.
Read Excerpts & Customer Reviews
If you are a media member and would like a review copy of The Book, please contact Kevin Cuddihy of Potomac Books.

Buy The Book from Amazon

MOST RECENT ARTICLES
MAIL : You ask | We say

Advanced


THE BOOK--Playing The Percentages In Baseball

<< Back to main

Monday, November 17, 2008

BDB Database (MS Access)

By Tangotiger, 01:01 PM

Here you go.


SabermetricsData
#1    TangoTiger      (see all posts) 2008/11/17 (Mon) @ 16:19

I have updated the shell DB scripts.  The timestamp shows:
17-Nov-2008 12:08

I uploaded it 15:08ET


#2    terpsfan101      (see all posts) 2008/11/17 (Mon) @ 16:53

Can you include a script that determines a player’s primary position?


#3    Tangotiger      (see all posts) 2008/11/17 (Mon) @ 17:28

I’ve posted this on this blog.  Do a search for “Hacks”, and you should get it.  All you need to do is cut/paste it.


#4    terpsfan101      (see all posts) 2008/11/17 (Mon) @ 17:51

Thanks for the DB shell and the Hacks. What is the G_Old field in the batting table?


#5    Rally      (see all posts) 2008/11/17 (Mon) @ 17:59

They added a field G_batted, which looks to me as a game where the player appeared in a batting order (pretty much anyone except the AL pitcher) regardless of whether he actually batted.

Not sure yet where Herb Washington’s pinch running fits in.  I’m guessing they looked at a new method to determine games played, and threw the old games played total at the end.  It would be interesting to see what players these differ for.


#6    Tangotiger      (see all posts) 2008/11/17 (Mon) @ 18:07

The underlying data is distributed via baseball-databank and you should join the Yahoo group that supports it.  Details of any changes were listed there last week.

***

Btw, the league Runs per game seems to be 4.70, and the wOBA is .330.  Second number sounds a bit low, but it doesn’t include RBOE.


#7    terpsfan101      (see all posts) 2008/11/17 (Mon) @ 18:27

Tango,

I have been keeping up with the BDB update. I haven’t seen a message so far that says what the G_OLD column is.

Off topic, I was going to wait until Retrosheet releases 2008 data, to submit ROE, XI, GIDP Opp, Picked-off, and batted-ball data for batters. For pitchers, I have GIDP, 2B, 3B, ROE, SF and a few other categories. I also have complete SH allowed data for pitchers dating back to the early 1920’s. Do you think it is a good idea for me to wait until Retrosheet releases 2008 data before I submit it?


#8    Tangotiger      (see all posts) 2008/11/17 (Mon) @ 20:32

My guess is that better to have it with the 2008 data.


#9    terpsfan101      (see all posts) 2008/11/17 (Mon) @ 23:11

Yes, I will wait until the next Retrosheet release. Hopefully some more missing data will get filled in for seasons prior to 1974. The AL PBP data is actually complete dating back to 1963.

Thanks again for the database shell.


#10    Tangotiger      (see all posts) 2008/11/18 (Tue) @ 17:56

I uploaded a csv file with the primary position of all players.  Load it into a table, and you are all set to go.


#11    terpsfan101      (see all posts) 2008/11/18 (Tue) @ 17:58

Excellent!


#12    Tangotiger      (see all posts) 2008/11/19 (Wed) @ 10:53

I should also explain how I did the primary position:

1. First figured out if he was a pitcher or non-pitcher, each season.  That was easy except for Ruth.

2. If a non-pitcher, figured if he was a catcher or non-catcher.  (Min 50% of innings as a catcher)

3. If he was a non-catcher, figured if he was an IF/OF or a 1B/DH.

4. If he was an IF/OF, figured if he was an IF or OF.  If he was a 1B/DH, figured if he was a 1B or DH.

5.  If he was an IF, figured if he was a SS, 2B, or 3B.  If he was an OF, figured if he was a CF, RF, or LF.

I know that’s not how most people do it, but that’s the way it really work.  For example, if a guy plays 40 games at 2B, 30 in LF and 30 in RF, why should that be any different than 40 games at 2B and 60 games in LF?  As you guys know, I consider LF/RF completely interchangeable, much like I would consider a LW/RW in hockey interchangeable, and a WR that lines up on the left side or the right side.

You could have a crazy situation where you have these number of games per position:
40 C
10 2B
10 SS
10 3B
10 LF
10 CF
10 RF

And my algorithm will actually spit out “SS”.  Yes, I agree, silly.  But, my method gives out less silly results than the standard methodology.

In reality, I don’t know what the actual (rather than theoretical) worst classifications are based on my methodology versus the standard.  I would guess my way is better.


#13    terpsfan101      (see all posts) 2008/11/28 (Fri) @ 07:20

Tango,

How did your algorithm determine the primary position of player’s who played on multiple teams during the same season?

I still can’t devise a query that identifies players with multiple stints on the same team during the same season. Would you know how to do this?

Retrosheet released 2008 data yesterday, and updated some pre-1974 seasons, so I am going to re-compile my statistics later today.


#14    terpsfan101      (see all posts) 2008/12/16 (Tue) @ 20:46

Here are the categories I have compiled so far from Retrosheet data. Let me know if I should include something that is not listed here. Also let me know if you think certain categories are unnecessary, or can be combined with other categories.

Batting

Retrosheet Plate Appearances
Catcher’s Interference
Reached on Error
Reached on Fielder’s Choice No Out Recorded
Pickoff (Excludes Pickoff CS and Pickoff Errors where pitcher or catcher did not record an assist)
Pickoff CS
Pickoff Error Advance (Excludes Pickoff CS and Pickoff Errors where the baserunner was actually picked off (pitcher or catcher credited with an assist))
Balk Advance (Lead Runner, except with runners on 1st and 3rd where credit was given to the runner on 1st).
GIDP Opportunities (Runner on 1st, less than 2 out)
GB Outs
FB Outs
LD Outs
Bunt Outs
Bunts
Bunt Hits
Bunt Reached on Error/Fielder’s Choice (Excludes Sacrifices)
Bunt SO
Bunt SAC Attempts (Dan Fox’s defintion of a sacrifice attempt)
Reached on Sac Bunt

Pitching

Retrosheet IP Outs
Retrosheet BFP
Catcher’s Interference
Doubles
Triples
GIDP
Reached on Error
Reached on Fielder’s Choice No Out Recorded
Stolen Bases
Caught Stealing
Pickoff
Pickoff CS
Pickoff Error (Excluding plays where pitcher/catcher were not listed as the first error)
Passed Balls
GIDP Opportunities
GB Outs
FB Outs
LD Outs
Bunt Outs
SH
Reached on Sac Bunt
SF

Fielding

Catcher Pickoffs, Pickoff CS, and Pickoff Errors
More Fielding stats to come in the future.


#15    terpsfan101      (see all posts) 2008/12/16 (Tue) @ 23:18

I would like to add infield singles. This would be a mess to try and figure out. Some years Retrosheet doesn’t have data for who fielded the ball. Other years, they have hit location data, but I’m not sure what zones to use for IF singles. Some years Retrosheet doesn’t have the fielder code and hit location data. I know Rally figures IF hits for his TotalZone. Maybe he could give me some advice here.


#16    Tangotiger      (see all posts) 2008/12/17 (Wed) @ 08:05

"How did your algorithm determine the primary position of player’s who played on multiple teams during the same season?”

I group by player, year.

“I still can’t devise a query that identifies players with multiple stints on the same team during the same season. Would you know how to do this? “

select playerid,yearid,teamid,sum(1) as num_of_stints
from batting
group by playerid,yearid,teamid
having num_of_stints > 1


#17    terpsfan101      (see all posts) 2008/12/17 (Wed) @ 15:36

Tango, Thanks for the stint query. I’ve been trying to figure that out for 3 months.


#18    Tangotiger      (see all posts) 2008/12/17 (Wed) @ 15:49

The one good thing about Access is that they make it very easy to do queries of queries. 

I STRONGLY suggest that when you run queries, make them do AS LITTLE as you can, like the above.  For example, your VERY FIRST query should be to group the data in the batting table by eliminating stints.

So, you should do:
select playerid,yearid,teamid,sum(ab) as ab, sum(h) as h, and so on…
from batting
group by playerid,yearid,teamid

You call this query BATTING2

You can also add sum(1) as num_of_stints to the list in the select statement, so you know how many records you collapsed into one.

Then, you can have BATTING2_YEAR

And in this case, you can either select from BATTING2 or from BATTING and make your group by as:
group by playerid,yearid

And so on.  If you select from BATTING2, then you can also include:
sum(1) as num_of_teams
So you know how many teams he played for that season (while not double-counting STINTS).

The larger point is that you should use the visual query from Access, because it is a FANTASTIC way to learn.


#19    Rally      (see all posts) 2008/12/17 (Wed) @ 16:23

That’s opposite how I do Access queries.  I try to do as much as possible, because otherwise I’ll mess things up- change one thing in a query and then 4-5 queries that use that one are affected.


#20    Tangotiger      (see all posts) 2008/12/17 (Wed) @ 16:55

Rally: I meant for beginners, so that they can see what’s going on and how thing behave.  Once you are good, then, certainly, load up as much as you can.


#21    terpsfan101      (see all posts) 2008/12/17 (Wed) @ 17:13

Yes, Tango’s method was how I used to do queries in Access when I first started out. At the time, I wasn’t aware that you could combine multiple steps into a single query. Once I became more familiar with Access, I realized that I could start combining queries. The great thing about Access is that you can create a new field in a query, and then reference that field in the same query.

I would certainly recommend creating seperate tables for the Hitting, Fielding, and Pitching tables in the BDB database, that combine the multiple stint players on the same team in the same year. All you have to do is group by yearID, playerID, and teamID.

One thing you need to be careful about when combining data from a table is null values. For instance, in Access “3 + NULL = NULL”. I get around this by filling in zeroes for the null values. However, there is probably a more efficient way to get around the null values.


#22    Rally      (see all posts) 2008/12/17 (Wed) @ 17:24

I gave mySQL a try when Colin put up a good post on using that, but I had some trouble (some of it my computer) and went back to Access.  In mySQL, there is a coalesce function that handles null values, but it doesn’t work in Access (perhaps there’s a function but it’s named differently?).  I just find Access to be very user friendly.

Tango’s right for beginners.  For me, I’ve got queries that link so many tables together the screen looks like a spiderweb.


#23    Tangotiger      (see all posts) 2008/12/17 (Wed) @ 17:26

Nz(HBP)

That turns nulls into zeroes.  That’s the equivalent of:

coalesce(HBP,0)


#24    Tangotiger      (see all posts) 2008/12/17 (Wed) @ 17:27

In Access, when you open up a query, right-click in the field name field, and you can click “build”, and then “built-in functions”.  That gives you the entire list of things you can do.


#25    Rally      (see all posts) 2008/12/17 (Wed) @ 17:39

Sweet.  Thanks.


#26    terpsfan101      (see all posts) 2008/12/17 (Wed) @ 19:43

I know these are garbage stats, but I have added Holds, Blown Saves, Inherited Runners, and Inherited Runners Scored to the list in #14.

I will take a look at adding some sort of baserunning category that deals with baserunner advancement on singles and doubles. I guess I could figure Baserunner Holds for outfielders while I am compiling the baserunning information.


#27    terpsfan101      (see all posts) 2008/12/19 (Fri) @ 01:52

I am also re-compiling the Postseason Batting, Pitching, and Fielding tables from the Retrosheet Postseason Event files (1903-2008). There is missing information all over the place in the postseason tables. For instance, there is no fielding data from 2005-2008.


#28          (see all posts) 2008/12/19 (Fri) @ 09:09

I just posted an article at StatSpeak http://statspeak.net/2008/12/retrosheet-game-logs.html which examines the two different versions of games tables, the ones from the Retrosheet game logs, and those generated by cwgame from the event files.

I offer some advice on how to load, modify and use both tables, with links to Tango’s DDL code for the cwgame output, and my own DDL for the Retro game logs.

Comments and corrections welcome, both here and at StatSpeak.


#29    terpsfan101      (see all posts) 2008/12/19 (Fri) @ 16:41

I actually used cwgame for the first time last night. Since there are no postseason gamelogs, I needed to use cwgame to figure out Wins, Losses, and Saves for pitchers, and Defensive GS for position players. I used Tango’s SQL that he provides on his Retrosheet Database wiki page to create the table. In Access, you need to change “varchar” to “text” in order to use Tango’s SQL’s.

The main difference between the Retrosheet gamelogs and those generated from cwgame is the number of fields. The gamelog files have 160+ fields, although many of them are redundant.


#30    terpsfan101      (see all posts) 2008/12/26 (Fri) @ 07:20

The data that I promised here should be ready by the end of the weekend. I keep getting side-tracked with a million other things that I am working on simultaneously. I will post the data to the BDB Yahoo Group in a zip file (even though I prefer the RAR format for archives). RAR has a much higher compression rate than the zip format, and the RAR format also stores recovery information in case the archive is damaged.


#31          (see all posts) 2009/01/12 (Mon) @ 11:38

Tango,

I sent you a zip file with the preliminary data additions that I have compiled for the Baseball Databank databse. When you find some free time, I would like to get your input on the data additions before I submit them to the Yahoo group.


Page 1 of 1 pages


Name (required)
E-Mail (optional)
Website (optional)

<< Back to main


Latest...

COMMENTS

Jul 04 20:59
Mapping IDs

Jul 04 01:40
BPro Idol

Jul 03 01:39
sUZR v bUZR

Jul 02 21:15
Batting Order and the pitcher

Jun 30 07:22
NHL draft analysis and spreadsheet 1994-2009

Jun 30 04:14
The Poz goes FJM on Harold Reynolds’ a$$ - gather around the kids

Jun 30 00:11
Blogosphere Question of the Day, 06/24; OR Why should OPS die?

Jun 27 16:04
Loss aversion in golf

Jun 26 16:30
Donald Fehr

Jun 26 14:04
Barry Code