THE BOOK--Playing The Percentages In Baseball


Monday, November 30, 2009

Retrosheet: Invalid Pitch Codes - as per Clem Comly validation rules

By Tangotiger, 05:32 PM

I posted this to the Retro group, but presumably others here may find this interesting.

***

There are 553 pitch records that don’t conform to Clem’s rules, which he noted last year.  I did my best to program those rules, but it’s possible that I either made a mistake or misinterpreted Clem’s rules.

If you click on this link:
http://tangotiger.net/retrosheet/reports/invalid_pitch_codes.html

You will see the entire set, including the reason that the pitch code field is invalid.  For the 2009 season, there were only 19 invalid records, all of which failed for the same reason: the event was C/E2, so the last character in the pitch code sequence must be an N (and it wasn’t).

If I have made any mistake at all in my report, please let me know, so we can go over the rules and my code to make sure we got it right.  I think it’s very good that there are so few mistakes in the pitch code sequence to begin with.  It is, by far, the most labor-intensive of the fields to try to process, and with some 4+ million records that have that field filled, that’s a tiny error rate.

Tom

***

No one reported anything back, so I will presume that it at least passed the sniff test.  If somebody is processing the pitch sequences for the 2009 season from Gameday or other sources, I’d be interested to know if we can just tack on the “N” at the end of each sequence for those 19 records.


SabermetricsData
#1    Mike Fast      (see all posts) 2009/12/01 (Tue) @ 02:09

If somebody is processing the pitch sequences for the 2009 season from Gameday or other sources, I’d be interested to know if we can just tack on the “N” at the end of each sequence for those 19 records.

Perhaps replace the last pitch code with “N” rather than tacking on an additional pitch.  At least if I understand my Retrosheet pitch coding accurately.

Here are a few examples from the Gameday raw XML files:
http://gd2.mlb.com/components/game/mlb/year_2009/month_08/day_09/gid_2009_08_09_texmlb_anamlb_1/inning/inning_5.xml
http://gd2.mlb.com/components/game/mlb/year_2009/month_09/day_22/gid_2009_09_22_nyamlb_anamlb_1/inning/inning_3.xml
http://gd2.mlb.com/components/game/mlb/year_2009/month_07/day_21/gid_2009_07_21_anamlb_kcamlb_2/inning/inning_5.xml

MLBAM labels the final pitches as “Unknown Strike”.  Retrosheet appears to have done the same in many (though not all) of the cases in your list.  Presumably they should be recorded as “N” (interference) instead of “K” (unknown strike) by the Retrosheet code.


#2    Mike Fast      (see all posts) 2009/12/01 (Tue) @ 02:33

Btw, your list encompasses all 19 regular-season occurrences of catcher interference in 2009, so I don’t really have an example of “proper” coding from this year, if in fact the “N” code is what should be used.

I also find it interesting that Ryan Ludwick was involved in 3 of the catcher interference calls, Ellsbury 2 plus another in the playoffs, Ryan Freel 2, Hideki Matsui 2.  Is encouraging the catcher to interfere a repeatable skill on the part of the batter?

Carl Crawford, Edwin Encarnacion, Lyle Overbay, and Travis Hafner are other batters involved with 3+ catcher interference calls over the past 3 years.  Crawford is the leader with 6 (of 69 total calls in MLB during that time).


#3    Mike Fast      (see all posts) 2009/12/01 (Tue) @ 02:52

Also, certain catchers are more likely to offend.  Here are the catcher leaders over the last three years:

4 Brian McCann, Jeff Mathis, Gregg Zaun, Jorge Posada
3 Gerald Laird, Mike Napoli
2 Chris Coste, Chris Stewart, Eli Whiteside, Guillermo Quiroz, Ivan Rodriguez, Jamie Burke, John Buck, Miguel Montero, Paul Bako, Rob Bowen, Ronny Paulino
plus 25 other catchers with 1

And the batters’ leaders, which I sorta implied above but didn’t list explicitly:

6 Carl Crawford
4 Ryan Ludwick, Edwin Encarnacion, Hideki Matsui
3 Lyle Overbay, Jacoby Ellsbury, Travis Hafner
2 Andre Ethier, Darin Erstad, Miguel Tejada, Ryan Freel
plus 34 other batters with 1


#4    Tangotiger      (see all posts) 2009/12/01 (Tue) @ 08:08

Great stuff Mike!


#5    Kincaid      (see all posts) 2009/12/01 (Tue) @ 08:28

Dale Berra was the Babe Ruth of catcher interference.

I don’t know Clem’s rules, but I found a ball in play from a 2008 game labeled as ‘J’ once.  Not sure if that’s relevant to this thread at all.


#6    Tangotiger      (see all posts) 2009/12/01 (Tue) @ 10:11

Mike, the way to think of the “N” code is as a way to mark the PA as being completed.

So, an interference on the first pitch should be marked as:
KN

This shows that the pitch count = 1.  If we mark it only as
N
then there are no pitches thrown.

If we mark it only as
K
this means that the PA ended like that somehow.
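
The pitch-count reading of these three strings can be sketched in a few lines. This is only an illustration: the set of characters treated as "not a pitch" (N, the markers . > + *, and the pickoff throws 1 2 3) is an assumption about how one would count, and `pitch_count` is a hypothetical helper, not anything from Retrosheet's tools.

```python
# Assumption: these pitch-string characters do not represent a pitch delivered
# to the batter (the "no pitch" code N, markers, and pickoff throws).
NON_PITCH_CODES = set("N.>+*123")


def pitch_count(pitch_seq: str) -> int:
    """Count the characters in a pitch string that are treated as pitches."""
    return sum(1 for code in pitch_seq if code not in NON_PITCH_CODES)


# The three ways of coding a first-pitch interference discussed above.
for seq in ("KN", "N", "K"):
    print(seq, pitch_count(seq))
```

Under that assumption, "KN" and "K" both count one pitch, while "N" alone counts none, which is why "N" by itself cannot mark an interference that happened on a delivered pitch.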


#7    joe arthur      (see all posts) 2009/12/01 (Tue) @ 10:12

Tom,
one question and some comments.
Do you completely “reload” event files every year to pick up any corrections made for previously released seasons?

Message search doesn’t seem to be working very reliably at retrolist, so I cannot find the full history of the discussion there. It appears as though “Clem’s rules” may need to be consolidated from multiple comments, presumably across a couple of years. Can you post the assembled rules here?

I applaud you for trying to clean the data up, but I do think it is wrong to hide peculiarities in the data by silently correcting it, when there is the slightest doubt about the correction (as there is with sticking an N on the end of pitch sequences involving catcher’s interference). Programming tractability does not trump historical uncertainty.

Both David Smith and apparently Clem Comly have made statements about correct pitch coding which are out of sync with Retrosheet’s official definitions of pitch codes: http://www.retrosheet.org/eventfile.htm
including these
.  marker for play not involving the batter
N no pitch (on balks and interference calls)

David recently stated on retrolist that N should not be used for catcher’s interference ("since a pitch was thrown"), and you quote Clem as saying that “A pitch string can end without the completion of PA when the inning ends with a baserunner being put out or scoring the winning run (but won’t end with . or >)”

I shouldn’t read Clem’s mind, but it may be that he thinks of ‘.’ as a continuation symbol when the plate appearance is interrupted by a non-batting event. But that’s not the posted definition… I’d disagree that there’s anything wrong with the 1st 5 errors you report.

And to me, the code N always made sense as a ball delivered to the plate which was not counted because of a supervening event - a balk is called during the delivery; interference occurred (and is accepted) so that the ‘result’ of the batter’s swing is ignored. So only some balks deserve an ‘N’, and when a pitch sequence involving catcher’s interference ends with F for foul, it is possible that this records the swing which was interfered with. In this case the “correct” sequence might involve substituting N for the final F, rather than “adding a missing N.” Or it could be true that the interfered pitch was not recorded at all, and adding N is appropriate… If you can’t make a reliable correction from other evidence, I’d argue for leaving it alone. Certainly for 2009 we do have the evidence, but each play needs individual review.


#8    Tangotiger      (see all posts) 2009/12/01 (Tue) @ 10:12

Kincaid: if that’s the case, it must have been corrected, because it doesn’t appear in my list that I linked to.


#9    Tangotiger      (see all posts) 2009/12/01 (Tue) @ 10:20

Joe:

The discussion at Retrolist was around Mar 7-9, 2009.  You can check the archives there.  The consolidated rules were as I forwarded to the Retrolist group this morning (hasn’t posted yet), and I will repeat them here:

Let me try to summarize:

A batter can complete his PA by striking out in which case the last pitch could be CKLMOQST

A batter can complete his PA by walking in which case the last pitch could be BIPV

A batter can complete his PA by HBP in which case the last pitch would be H

A batter can complete his PA by reaching on C interference in which case the last pitch would be N

A batter can complete his PA by putting the ball in play in which case the last pitch could be XY

A batter cannot complete his PA if the last pitch is FR123 (or > or .)?

A pitch string can end without the completion of PA when it ends the game (rainout, etc.) which would be virtually any pitch code

A pitch string can end without the completion of PA when the inning ends with a baserunner being put out or scoring the winning run (but won’t end with . or >)

Above doesn’t count events during the PA (subs, play on baserunner) when the next event is for the same batter.

Clem

So, I’m not too worked up on what the rules should be.  But I am worked up on being able to write an algorithm to find inconsistencies.  If the rule is that “N” is used to end the PA for event_cd = 17, then that’s what I look for.

This list that I generated of the 553 invalid records is exhaustive.  I’d love for it to be 0 by either having the rules modified or the data modified.
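
For anyone who wants to try the same kind of check, here is a minimal sketch of the completed-PA rules from the summary above. It is not the code behind the report: the outcome labels and the `check_final_pitch` function are hypothetical, and you would map your own event codes (interference being event_cd = 17, as mentioned above) onto the labels. The rules about pitch strings that end without a completed PA are not modeled.

```python
# Allowed final pitch codes for each way a batter can complete his PA,
# taken from the summary above. The outcome labels are hypothetical names.
FINAL_PITCH_RULES = {
    "strikeout": set("CKLMOQST"),
    "walk": set("BIPV"),
    "hit_by_pitch": set("H"),
    "catcher_interference": set("N"),
    "ball_in_play": set("XY"),
}


def check_final_pitch(outcome, pitch_seq):
    """Return None if the record passes, else a string giving the reason."""
    if not pitch_seq:
        return None  # no pitch data recorded, so there is nothing to validate
    allowed = FINAL_PITCH_RULES.get(outcome)
    if allowed is None:
        return None  # outcome not covered by the completed-PA rules
    last = pitch_seq[-1]
    if last in allowed:
        return None
    expected = "".join(sorted(allowed))
    return f"{outcome}: last pitch code {last!r} is not one of {expected}"


# Hypothetical records: (outcome, pitch sequence).
records = [("catcher_interference", "BCK"), ("walk", "BBCB"), ("strikeout", "CSF")]
for outcome, seq in records:
    reason = check_final_pitch(outcome, seq)
    if reason is not None:
        print(seq, "->", reason)
```

Returning the reason rather than a plain true/false makes it easy to build a report that says why each record failed.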

***

And yes, I completely destroy and recreate my database every year.  I have my instructions that I coded in sequence, so that I know I can recreate it at will.


#10    joe arthur      (see all posts) 2009/12/01 (Tue) @ 12:10

Tom,

I had found the March 2009 discussion, but it clearly referred back to earlier discussion which I couldn’t pull up in the message search.

I reiterate two points: 1) I disagree with Clem’s last rule - a pitch string should be able to end with . or > when it corresponds to an inning ending play on a runner. 2) I agree with the rule that pitch strings involving catcher’s interference should end with N - that is an anomaly and you are correct to identify it as such. I disagree with your proposal that an N therefore should be tacked onto the existing pitch sequences automatically. I am agreeing with Mike Fast’s interpretation in comment 1 about what the pitch code N means and disagreeing with your interpretation of it.

And to the extent you think anomalies should be eliminated by changing data, when more than one emendation of the data is possible given the existing evidence, then I disagree. We should keep the anomalies. Then it is open to different researchers to make different judgments about how to handle them.


#11    Tangotiger      (see all posts) 2009/12/01 (Tue) @ 12:17

Regarding the N: since it is used for balk purposes as well (and therefore means pitchcount=0), then we cannot ALSO use it for interference calls on its own.  In that respect, you need a different code letter.

***

As for the anomalies, then just as we have UNKNOWN flags for outs and plays, then we should have one for pitch strings.  That’ll keep everyone happy.
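
One way such a flag could work, as a rough sketch only: leave the recorded pitch string untouched, mark the record, and list the possible corrections next to it (tacking on an N, or replacing the last code with N). The `PitchRecord` type, the `pitch_seq_unknown` field, and `flag_interference` are all hypothetical names made up for this illustration.

```python
from dataclasses import dataclass, field


@dataclass
class PitchRecord:
    pitch_seq: str                    # the pitch string exactly as recorded
    pitch_seq_unknown: bool = False   # hypothetical "UNKNOWN" flag for the string
    candidates: list = field(default_factory=list)  # possible corrections


def flag_interference(pitch_seq):
    """Flag a catcher's interference record whose pitch string lacks a final N."""
    rec = PitchRecord(pitch_seq)
    if pitch_seq and not pitch_seq.endswith("N"):
        rec.pitch_seq_unknown = True
        # Two readings from the discussion: append N, or replace the last code.
        rec.candidates = [pitch_seq + "N", pitch_seq[:-1] + "N"]
    return rec


rec = flag_interference("BCK")
print(rec.pitch_seq, rec.pitch_seq_unknown, rec.candidates)
```

The original data stays as is, and each researcher can pick whichever candidate fits the evidence for that play.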


#12    Tangotiger      (see all posts) 2009/12/01 (Tue) @ 12:18

Joe, my post 9 has the summary.  I wrote my algorithm based on those rules.  I don’t think the archives will give you much more insight.

