THE BOOK cover
The Unwritten Book is Finally Written!
An in-depth analysis of: The sacrifice bunt, batter/pitcher matchups, the intentional base on balls, optimizing a batting lineup, hot and cold streaks, clutch performance, platooning strategies, and much more.
Read Excerpts & Customer Reviews

Buy The Book from Amazon


SABR101 required reading if you enter this site. Check out the Sabermetric Wiki. And interesting baseball books.
MOST RECENT ARTICLES
MAIL : You ask | We say

Advanced


THE BOOK--Playing The Percentages In Baseball

<< Back to main

Monday, November 17, 2008

BDB Database (MS Access)

By Tangotiger, 01:01 PM

Here you go.


SabermetricsData
#1    TangoTiger      (see all posts) 2008/11/17 (Mon) @ 16:19

I have updated the shell DB scripts.  The timestamp shows:
17-Nov-2008 12:08

I uploaded it 15:08ET


#2    terpsfan101      (see all posts) 2008/11/17 (Mon) @ 16:53

Can you include a script that determines a player’s primary position?


#3    Tangotiger      (see all posts) 2008/11/17 (Mon) @ 17:28

I’ve posted this on this blog.  Do a search for “Hacks”, and you should get it.  All you need to do is cut/paste it.


#4    terpsfan101      (see all posts) 2008/11/17 (Mon) @ 17:51

Thanks for the DB shell and the Hacks. What is the G_Old field in the batting table?


#5    Rally      (see all posts) 2008/11/17 (Mon) @ 17:59

They added a field G_batted, which looks to me as a game where the player appeared in a batting order (pretty much anyone except the AL pitcher) regardless of whether he actually batted.

Not sure yet where Herb Washington’s pinch running fits in.  I’m guessing they looked at a new method to determine games played, and threw the old games played total at the end.  It would be interesting to see what players these differ for.


#6    Tangotiger      (see all posts) 2008/11/17 (Mon) @ 18:07

The underlying data is distributed via baseball-databank and you should join the Yahoo group that supports it.  Details of any changes were listed there last week.

***

Btw, the league Runs per game seems to be 4.70, and the wOBA is .330.  Second number sounds a bit low, but it doesn’t include RBOE.


#7    terpsfan101      (see all posts) 2008/11/17 (Mon) @ 18:27

Tango,

I have been keeping up with the BDB update. I haven’t seen a message so far that says what the G_OLD column is.

Off topic, I was going to wait until Retrosheet releases 2008 data, to submit ROE, XI, GIDP Opp, Picked-off, and batted-ball data for batters. For pitchers, I have GIDP, 2B, 3B, ROE, SF and a few other categories. I also have complete SH allowed data for pitchers dating back to the early 1920’s. Do you think it is a good idea for me to wait until Retrosheet releases 2008 data before I submit it?


#8    Tangotiger      (see all posts) 2008/11/17 (Mon) @ 20:32

My guess is that better to have it with the 2008 data.


#9    terpsfan101      (see all posts) 2008/11/17 (Mon) @ 23:11

Yes, I will wait until the next Retrosheet release. Hopefully some more missing data will get filled in for seasons prior to 1974. The AL PBP data is actually complete dating back to 1963.

Thanks again for the database shell.


#10    Tangotiger      (see all posts) 2008/11/18 (Tue) @ 17:56

I uploaded a csv file with the primary position of all players.  Load it into a table, and you are all set to go.


#11    terpsfan101      (see all posts) 2008/11/18 (Tue) @ 17:58

Excellent!


#12    Tangotiger      (see all posts) 2008/11/19 (Wed) @ 10:53

I should also explain how I did the primary position:

1. First figured out if he was a pitcher or non-pitcher, each season.  That was easy except for Ruth.

2. If a non-pitcher, figured if he was a catcher or non-catcher.  (Min 50% of innings as a catcher)

3. If he was a non-catcher, figured if he was an IF/OF or a 1B/DH.

4. If he was an IF/OF, figured if he was an IF or OF.  If he was a 1B/DH, figured if he was a 1B or DH.

5.  If he was an IF, figured if he was a SS, 2B, or 3B.  If he was an OF, figured if he was a CF, RF, or LF.

I know that’s not how most people do it, but that’s the way it really work.  For example, if a guy plays 40 games at 2B, 30 in LF and 30 in RF, why should that be any different than 40 games at 2B and 60 games in LF?  As you guys know, I consider LF/RF completely interchangeable, much like I would consider a LW/RW in hockey interchangeable, and a WR that lines up on the left side or the right side.

You could have a crazy situation where you have these number of games per position:
40 C
10 2B
10 SS
10 3B
10 LF
10 CF
10 RF

And my algorithm will actually spit out “SS”.  Yes, I agree, silly.  But, my method gives out less silly results than the standard methodology.

In reality, I don’t know what the actual (rather than theoretical) worst classifications are based on my methodology versus the standard.  I would guess my way is better.


#13    terpsfan101      (see all posts) 2008/11/28 (Fri) @ 07:20

Tango,

How did your algorithm determine the primary position of player’s who played on multiple teams during the same season?

I still can’t devise a query that identifies players with multiple stints on the same team during the same season. Would you know how to do this?

Retrosheet released 2008 data yesterday, and updated some pre-1974 seasons, so I am going to re-compile my statistics later today.


#14    terpsfan101      (see all posts) 2008/12/16 (Tue) @ 20:46

Here are the categories I have compiled so far from Retrosheet data. Let me know if I should include something that is not listed here. Also let me know if you think certain categories are unnecessary, or can be combined with other categories.

Batting

Retrosheet Plate Appearances
Catcher’s Interference
Reached on Error
Reached on Fielder’s Choice No Out Recorded
Pickoff (Excludes Pickoff CS and Pickoff Errors where pitcher or catcher did not record an assist)
Pickoff CS
Pickoff Error Advance (Excludes Pickoff CS and Pickoff Errors where the baserunner was actually picked off (pitcher or catcher credited with an assist))
Balk Advance (Lead Runner, except with runners on 1st and 3rd where credit was given to the runner on 1st).
GIDP Opportunities (Runner on 1st, less than 2 out)
GB Outs
FB Outs
LD Outs
Bunt Outs
Bunts
Bunt Hits
Bunt Reached on Error/Fielder’s Choice (Excludes Sacrifices)
Bunt SO
Bunt SAC Attempts (Dan Fox’s defintion of a sacrifice attempt)
Reached on Sac Bunt

Pitching

Retrosheet IP Outs
Retrosheet BFP
Catcher’s Interference
Doubles
Triples
GIDP
Reached on Error
Reached on Fielder’s Choice No Out Recorded
Stolen Bases
Caught Stealing
Pickoff
Pickoff CS
Pickoff Error (Excluding plays where pitcher/catcher were not listed as the first error)
Passed Balls
GIDP Opportunities
GB Outs
FB Outs
LD Outs
Bunt Outs
SH
Reached on Sac Bunt
SF

Fielding

Catcher Pickoffs, Pickoff CS, and Pickoff Errors
More Fielding stats to come in the future.


#15    terpsfan101      (see all posts) 2008/12/16 (Tue) @ 23:18

I would like to add infield singles. This would be a mess to try and figure out. Some years Retrosheet doesn’t have data for who fielded the ball. Other years, they have hit location data, but I’m not sure what zones to use for IF singles. Some years Retrosheet doesn’t have the fielder code and hit location data. I know Rally figures IF hits for his TotalZone. Maybe he could give me some advice here.


#16    Tangotiger      (see all posts) 2008/12/17 (Wed) @ 08:05

"How did your algorithm determine the primary position of player’s who played on multiple teams during the same season?”

I group by player, year.

“I still can’t devise a query that identifies players with multiple stints on the same team during the same season. Would you know how to do this? “

select playerid,yearid,teamid,sum(1) as num_of_stints
from batting
group by playerid,yearid,teamid
having num_of_stints > 1


#17    terpsfan101      (see all posts) 2008/12/17 (Wed) @ 15:36

Tango, Thanks for the stint query. I’ve been trying to figure that out for 3 months.


#18    Tangotiger      (see all posts) 2008/12/17 (Wed) @ 15:49

The one good thing about Access is that they make it very easy to do queries of queries. 

I STRONGLY suggest that when you run queries, make them do AS LITTLE as you can, like the above.  For example, your VERY FIRST query should be to group the data in the batting table by eliminating stints.

So, you should do:
select playerid,yearid,teamid,sum(ab) as ab, sum(h) as h, and so on…
from batting
group by playerid,yearid,teamid

You call this query BATTING2

You can also add sum(1) as num_of_stints to the list in the select statement, so you know how many records you collapsed into one.

Then, you can have BATTING2_YEAR

And in this case, you can either select from BATTING2 or from BATTING and make your group by as:
group by playerid,yearid

And so on.  If you select from BATTING2, then you can also include:
sum(1) as num_of_teams
So you know how many teams he played for that season (while not double-counting STINTS).

The larger point is that you should use the visual query from Access, because it is a FANTASTIC way to learn.


#19    Rally      (see all posts) 2008/12/17 (Wed) @ 16:23

That’s opposite how I do Access queries.  I try to do as much as possible, because otherwise I’ll mess things up- change one thing in a query and then 4-5 queries that use that one are affected.


#20    Tangotiger      (see all posts) 2008/12/17 (Wed) @ 16:55

Rally: I meant for beginners, so that they can see what’s going on and how thing behave.  Once you are good, then, certainly, load up as much as you can.


#21    terpsfan101      (see all posts) 2008/12/17 (Wed) @ 17:13

Yes, Tango’s method was how I used to do queries in Access when I first started out. At the time, I wasn’t aware that you could combine multiple steps into a single query. Once I became more familiar with Access, I realized that I could start combining queries. The great thing about Access is that you can create a new field in a query, and then reference that field in the same query.

I would certainly recommend creating seperate tables for the Hitting, Fielding, and Pitching tables in the BDB database, that combine the multiple stint players on the same team in the same year. All you have to do is group by yearID, playerID, and teamID.

One thing you need to be careful about when combining data from a table is null values. For instance, in Access “3 + NULL = NULL”. I get around this by filling in zeroes for the null values. However, there is probably a more efficient way to get around the null values.


#22    Rally      (see all posts) 2008/12/17 (Wed) @ 17:24

I gave mySQL a try when Colin put up a good post on using that, but I had some trouble (some of it my computer) and went back to Access.  In mySQL, there is a coalesce function that handles null values, but it doesn’t work in Access (perhaps there’s a function but it’s named differently?).  I just find Access to be very user friendly.

Tango’s right for beginners.  For me, I’ve got queries that link so many tables together the screen looks like a spiderweb.


#23    Tangotiger      (see all posts) 2008/12/17 (Wed) @ 17:26

Nz(HBP)

That turns nulls into zeroes.  That’s the equivalent of:

coalesce(HBP,0)


#24    Tangotiger      (see all posts) 2008/12/17 (Wed) @ 17:27

In Access, when you open up a query, right-click in the field name field, and you can click “build”, and then “built-in functions”.  That gives you the entire list of things you can do.


#25    Rally      (see all posts) 2008/12/17 (Wed) @ 17:39

Sweet.  Thanks.


#26    terpsfan101      (see all posts) 2008/12/17 (Wed) @ 19:43

I know these are garbage stats, but I have added Holds, Blown Saves, Inherited Runners, and Inherited Runners Scored to the list in #14.

I will take a look at adding some sort of baserunning category that deals with baserunner advancement on singles and doubles. I guess I could figure Baserunner Holds for outfielders while I am compiling the baserunning information.


#27    terpsfan101      (see all posts) 2008/12/19 (Fri) @ 01:52

I am also re-compiling the Postseason Batting, Pitching, and Fielding tables from the Retrosheet Postseason Event files (1903-2008). There is missing information all over the place in the postseason tables. For instance, there is no fielding data from 2005-2008.


#28          (see all posts) 2008/12/19 (Fri) @ 09:09

I just posted an article at StatSpeak http://statspeak.net/2008/12/retrosheet-game-logs.html which examines the two different versions of games tables, the ones from the Retrosheet game logs, and those generated by cwgame from the event files.

I offer some advice on how to load, modify and use both tables, with links to Tango’s DDL code for the cwgame output, and my own DDL for the Retro game logs.

Comments and corrections welcome, both here and at StatSpeak.


#29    terpsfan101      (see all posts) 2008/12/19 (Fri) @ 16:41

I actually used cwgame for the first time last night. Since there are no postseason gamelogs, I needed to use cwgame to figure out Wins, Losses, and Saves for pitchers, and Defensive GS for position players. I used Tango’s SQL that he provides on his Retrosheet Database wiki page to create the table. In Access, you need to change “varchar” to “text” in order to use Tango’s SQL’s.

The main difference between the Retrosheet gamelogs and those generated from cwgame is the number of fields. The gamelog files have 160+ fields, although many of them are redundant.


#30    terpsfan101      (see all posts) 2008/12/26 (Fri) @ 07:20

The data that I promised here should be ready by the end of the weekend. I keep getting side-tracked with a million other things that I am working on simultaneously. I will post the data to the BDB Yahoo Group in a zip file (even though I prefer the RAR format for archives). RAR has a much higher compression rate than the zip format, and the RAR format also stores recovery information in case the archive is damaged.


#31          (see all posts) 2009/01/12 (Mon) @ 11:38

Tango,

I sent you a zip file with the preliminary data additions that I have compiled for the Baseball Databank databse. When you find some free time, I would like to get your input on the data additions before I submit them to the Yahoo group.


#32    terpsfan101      (see all posts) 2010/03/20 (Sat) @ 02:09

I finally finished work on the first version of an enhanced Baseball Databank/Lahman database. The project took me 3 months to complete. I originally thought it would take only a month or so. There is still a lot of stuff I want to eventually add, like linear weights, WAR, fielding runs, etc… , but those additions will have to wait. This database is just the raw baseball data. Eventually I’ll do a write-up, with table definitions, data sources, and discrepancies. This is the most complete and accurate baseball database around. I corrected hundreds (maybe even thousands) of errors and discrepancies found in the stock version of the Lahman/BDB database. I provided a database shell (MS Access). You’ll have to manually import the tables. Here’s the link:

http://www.mediafire.com/?omc4yizmugz


#33    Ed D.      (see all posts) 2010/03/20 (Sat) @ 17:41

terpsfan, thanks for sharing your work here.  As a heavy Access-only user (and one that is now less sheepish about it knowing that I’m not the only one), I actually wish that all databases were shared in this format (i.e., csv tables with optional shell).  Even though I’m always grateful, it can be frustrating when great resources like this are shared only in MySQL format (though I am lucky to know people that can readily convert them for me if I’m really desperate).

As a long time Lahman user, my only concern in “switching” is that with Lahman I KNOW the next version is going to be there every year.  I don’t think that anyone doubts your intent, but this may be one of those things that requires a few years of consistent updates (and probably a good dose of public scrutiny from the kind folks who read this blog) before it sticks.  Good luck!

-Ed D.


#34    terpsfan101      (see all posts) 2010/03/20 (Sat) @ 21:32

I plan on updating the DB at the end of every season. Since this DB is based upon the Lahman database, I’ll have to wait until they release there update. I’ll also have to wait until Retrosheet releases their event files, usually at the end of November. So look for an update in early December.


#35    Nick Steiner      (see all posts) 2010/04/22 (Thu) @ 21:33

Okay so I’m looking at Terpsfans database and it looks excellent - much more detailed than normal BDB. 

But is there anyway to load the tables without manually creating 30 different columns for each table?


#36    terpsfan101      (see all posts) 2010/04/23 (Fri) @ 06:52

If you have MS Access, you can use the database shell. I have all the tables set up, you just have to manually import them.

I have the next version of the database ready to go. I changed the primary position groupings from (player, year, lg) to (player, year). This way you can sum an individuals stats by year and keep the primary position. There were a few hundred players who changed leagues and had different primary positions for each league. Also, I added some more awards (hypothetical MVP’s, Cy Young, and ROY from the ESPN Encyclopedia) and tweaked a few park factors. I’ll post the link to the new version later tonight. This will most likely be the last version I post until the end of the season.


#37          (see all posts) 2010/04/23 (Fri) @ 08:30

I’m the opposite of Ed.  I wouldn’t mind seeing this is a sql format.


#38    terpsfan101      (see all posts) 2010/04/27 (Tue) @ 04:36

Up late, my mind is racing and I can’t fall asleep. I remembered that I forgot to post the link for the updated version of my modified BDB/Lahman database:

http://www.mediafire.com/?m4mirwmizmd

You can view the changes that I made in this version in post #36.

I know that a readme is definitely needed for this database. Until then, let me point out a few important things:

- The Reached Base on Error (RBOE) category includes both reaching on an error and reaching on a fielder’s choice where no out is recorded. Reaching on a sacrifice bunt is excluded from this category. I list Reaching on a Sacrifice bunt under the RBSH category.

- Individual batter strikeouts from 1897-1902 were not recorded officially. Through newspaper research, Johnathen Frankel was able to find approximately 85-90% of batter K’s from this time period. The batter strikeouts you see in the batting table from 1897-1902 were taken from his batter strikeout databses. I don’t have the link handy right now, but I will post it if anyone is interested.

- The PKO column is Pickoffs. It excludes Pickoff Caught Stealings and Pickoff Errors.

- The PKE column is Pickoff Error Advance’s. It excludes Pickoff Caught stealings and Pickoff Errors where the baserunner was actually picked off (pitcher or catcher credited with an assist).

- Individual pickoff errors do not reconcile on the batting and defenisve side. Why? Because, I only assign individual credit on the defensive side when the pitcher or catcher are responsible for the pickoff error. Sometimes, the infielder covering the base where the pickoff error occurs is given an error.

- Postseason stats from 1903-2009 were figured from the Retrosheet event files. I spent a lot of time researching 19th century postseason statistics. I added statistics for the Temple Cup (1894-1897) and the Chronicle Telegraph Cup of 1900.


#39    terpsfan101      (see all posts) 2010/05/21 (Fri) @ 23:57

In the last version of my database I forgot to calculate plate appearances in the batting table prior to 1952. This version fixes that:

http://www.mediafire.com/?newmuztnwww

Here are some more tidbits about my database:

- Batter strikeouts from 1897-1902 are pro-rated. I didn’t mention this in my last post. For instance, if Honus Wagner has 15 recorded strikeouts in 75 games, and he played in 150 games, then he is credited with 30 strikeouts. It would be better to pro-rate strikeouts using at-bats rather than games, but Johnathen Frankel (the source of the batter strikeout data) didn’t count at-bats.

- There has been some recent discussion about blown saves and holds in the RetroList yahoo group. I didn’t calculate these statistics myself. I took the data from Baseball Prospectus, along with the inherited runner data. I don’t know what criteria BP uses for holds and blown saves.

- missPA is the number of plate appearances for which no play-by-play data is available. For pitchers I used missIPouts and missBFP. These are the categories which were calculated from play-by-play data:

Batting: RBOE, PKO, PKCS, PKE, GIDPopp, RBSH, IBB (1952-1954).

Pitching: RBOE, 2B, 3B, GIDP, GIDPopp, RBSH, IBB (1952-1954), BS, HLD, IR, IS.

Fielding Pitchers: PB, SB, CS, PKCS, PKO, PKE

Fielding Catchers: WP, SB, CS, PKCS, PKO, PKE. I did keep track of missIPouts for catchers, but didn’t include them in the database.

Teams: RBOE, batPKO, batPKCS, batPKE, GIDPopp, RBSH, IBB (1952-1954), RBOEA, PKCS, PKO, PKE, IBBA (1952-1954), BS, HLD, IR, IS, GIDPoppA, RBSHA. All entries for 2BA, 3BA, GIDPA, SBA, and CSA are complete. They were taken from the game log files.

-In the teams table, GP means pitchers used (games + individual relief appearances). PADJ_S is the park adjustment for run-scoring that doesn’t include homeruns. Here is the formula:

PADJ_S = sqrt((PADJ_R - HR% * PADJ_HR) / (1 - HR%))

Where

HR% = ((HR + HRA)/(R + RA)) * 1.4


#40    Tangotiger      (see all posts) 2010/05/22 (Sat) @ 06:55

Why would you pro-rate?  Why not simply show “missG_K” to denote missing games for strikeouts, and let the user prorate if needed.  You can’t mix-and-match like you are doing.

Similarly, I disagree with any attempt to “calculate” PA or CS (as others have done) and make it an official value.  Create a new column for those (and prorated K if you like).

This is critical.


#41    terpsfan101      (see all posts) 2010/05/22 (Sat) @ 13:08

Tango, MLB now uses Pete Palmer’s database for it’s official statistics. I know he pro-rated a lot of the 19th century RBI totals and estimated BFP when it wasn’t available. MLB officially recognizes these pro-rated and estimated stats.

“Similarly, I disagree with any attempt to “calculate” PA or CS (as others have done) and make it an official value.”

PA is simply AB+BB+HBP+SH+SF+XI and none of the additional CS data (1912 AL, 1913 NL, 1916 20+SB, 1919 Reds, and 1940’s Dodgers and Giants) is estimated or pro-rated. I don’t understand this comment. Did you mean BFP instead of PA?


#42    Tangotiger      (see all posts) 2010/05/23 (Sun) @ 09:01

PA=BFP .  I think it’s silly that we use different terms for the same things.  It’s HR and HR for pitchers and hitters.


#43    terpsfan101      (see all posts) 2010/05/23 (Sun) @ 10:54

Yeah it is kind of silly. MLB.com lists BFP as PA. Anyway, thanks for the input.


Page 1 of 1 pages


Name (required)
E-Mail (optional)
Website (optional)

<< Back to main


Latest...

COMMENTS

Aug 31 15:28
Fans Scouting Report: Update

Sep 02 15:44
The two uncertainties of UZR

Sep 02 15:17
Mail: rWAR v fWAR

Sep 02 14:59
Roger Federer

Sep 02 14:59
It’s hard to beat the crowd (Vegas in this case) no matter how smart you think you are

Sep 02 14:57
Could Rob Dibble have been a comp for Strasburg?

Sep 02 14:15
WOWY Teachers

Sep 02 13:37
Who’s Waldo?

Sep 02 08:36
Team Elin

Sep 02 01:19
Can someone tell me why Trevor Hoffman is still allowed to pitch?