Monday, November 30, 2009
Retrosheet: Invalid Pitch Codes - as per Clem Comly validation rules
I posted this to the Retro group, but presumably others here may find this interesting.
***
There are 553 pitch records that don’t conform to the validation rules Clem noted last year. I did my best to program those rules, but it’s possible that I either made a mistake or misinterpreted them.
If you click on this link:
http://tangotiger.net/retrosheet/reports/invalid_pitch_codes.html
You will see the entire set, including the reason that the pitch code field is invalid. For the 2009 season, there were only 19 invalid records, all of which failed for the same reason: the event was C/E2, and the last character in the pitch code sequence must be an N (and it wasn’t).
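As a rough illustration of that one rule (not the code behind the report), a check might look like the sketch below. The function name and the sample pitch strings are made up for the example, and it assumes the event text begins with C/E2 and that an empty pitch field is simply skipped.

```python
def violates_interference_rule(event: str, pitches: str) -> bool:
    """Flag a C/E2 event whose pitch sequence does not end in N.

    Hypothetical helper for illustration only; it covers just the one
    rule described above, not the full set of validation rules.
    """
    if not pitches:
        # Assumption: records with no pitch data are not checked.
        return False
    return event.startswith("C/E2") and not pitches.endswith("N")


# Made-up sample sequences, not taken from the report.
print(violates_interference_rule("C/E2", "BCK"))  # True
print(violates_interference_rule("C/E2", "BCN"))  # False
print(violates_interference_rule("S7", "BCX"))    # False
```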
If I have made any mistake at all in my report, please let me know, so we can go over the rules and my code to make sure we got it right. I think it’s very good that there are so few mistakes in the pitch code sequence to begin with. It is, by far, the most labor-intensive of the fields to try to process, and with some 4+ million records that have that field filled, that’s a tiny error rate.
Tom
***
No one reported anything back, so I will presume that it at least passed the sniff test. If somebody is processing the pitch sequences for the 2009 season from Gameday or other sources, I’d be interested to know if we can just tack on the “N” at the end of each sequence for those 19 records.


Perhaps replace the last pitch code with “N” rather than tacking on an additional pitch. At least if I understand Retrosheet pitch coding accurately.
Here are a few examples from the Gameday raw XML files:
http://gd2.mlb.com/components/game/mlb/year_2009/month_08/day_09/gid_2009_08_09_texmlb_anamlb_1/inning/inning_5.xml
http://gd2.mlb.com/components/game/mlb/year_2009/month_09/day_22/gid_2009_09_22_nyamlb_anamlb_1/inning/inning_3.xml
http://gd2.mlb.com/components/game/mlb/year_2009/month_07/day_21/gid_2009_07_21_anamlb_kcamlb_2/inning/inning_5.xml
MLBAM labels the final pitches as “Unknown Strike”. Retrosheet appears to have done the same in many (though not all) of the cases in your list. Presumably, under the Retrosheet coding, they should be recorded as “N” (interference) instead of “K” (unknown strike).
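To make the difference between the two proposed fixes concrete, here is a small sketch. Both function names and the sample sequence are hypothetical, and it assumes the final character of the sequence is the pitch itself rather than some other marker.

```python
def append_n(pitches: str) -> str:
    """Tack an N onto the end, which adds one pitch to the sequence."""
    return pitches + "N"


def replace_last_with_n(pitches: str) -> str:
    """Swap the final code for N, keeping the pitch count unchanged.

    Assumes the last character is the final pitch code.
    """
    return pitches[:-1] + "N"


# Made-up sequence ending in K (unknown strike).
print(append_n("BCK"))             # BCKN
print(replace_last_with_n("BCK"))  # BCN
```

Appending changes the pitch count for the plate appearance, while replacing does not, which is why the choice matters for those 19 records.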