Sunday, December 23, 2007
The Physics of Why Baseballs Do What They Do
By SirKodiak, with related work by Mike Fast
That was written by SirKodiak, who is a regular poster here.
Good review, MGL!
That piece was written over a year and a half ago. It was intended for readers that play a game called Baseball Mogul. Its initial purpose was to clear some things up for players of the game, as many were asking questions about pitches. The game covers historical baseball from 1900 to the present, so some of the statements (such as fastball speeds) will be incorrect when moved into an exclusively current context. Some of the statements come merely from personal observation. So the context is important, as the article was not written as a sabermetric analysis, nor as an article for the general public.
In its current form it is certainly lacking as a sabermetric piece. I had actually planned to edit it for this crowd prior to Tango posting it, and still plan to edit it (probably not until the new year), but the reason for it being posted is to stimulate the discussion and study of pitch types. The hope is that if someone does not agree with something in it, they will show why it is wrong, what the right answer is, and how they came to that conclusion; then the article can be edited to show the new information, link to the work, and credit the article and the author.
There is no ego attached to the article. I have always asked for criticism of the article, even when it was just for the Mogul players. The hope is to make it correct, credit those that correct it, and spur on work in the area of pitches.
Hopefully it can do that.
As for the fastball speed, it depends when it’s measured, be it as it crosses the plate, or from 50 ft out.
IIRC, in the 2007 BJ Handbook, the top 10 fastball pitchers in the NL barely exceeded 90 mph, and I think in the AL the 10th place was 91 or 92. This is in stark contrast to the numbers we’ve been seeing from John Walsh.
Basically, it depends.
Tango brought up something I had never really considered before. What’s the best point to measure velocity? I’ve heard of the leaving the pitcher’s hand/crossing the plate thing before, but how much slower is the velocity of a fastball from release until it crosses the plate? And is this a number that varies among pitchers? Does this sort of thing really matter as long as we measure from the same point for everybody?
Good review, MGL!
Thanks, although I did not realize that I was writing a review!
Did I write something inaccurate or express an opinion that you disagree with? If so, feel free to correct me or dispute my opinion.
Tango is right that it depends on where (when) you measure the speed. IIRC, pitch f/x measures at 50 feet from the front of the plate. I assume that on TV they measure from around the same point or even closer to the pitcher.
There is no standard and it does not really make a difference. It would be nice of course if there were a standard. Given that we are all used to what we see on TV, I think the standard should probably be whatever they do. But, as long as we (researchers) know from where a pitch is being timed and what the reduction is per unit distance, we are fine.
From what I recall, at least one of the pitch f/x guys wrote several articles talking about how much speed a pitch loses on the way to the plate. I recall something like 2-5 mph, depending on initial speed (is it a fixed percentage, depending on the conditions?). It obviously depends on the density of the air. I don’t think it depends on the kind of pitch (spin), but I am not sure.
If the top 10 pitchers in the BJH barely exceeded 90 mph, then he is using some data that I am completely unaware of. On TV, the top pitchers are in the high 90’s and occasionally 100+. On pitch f/x they are also in at least the mid to high 90’s.
Tango, when you link to an article or other piece, at least give us an idea where it is coming from, who the author is, etc. Honestly, if Sir Kodiak was offended, I have to say that it was at least partially your responsibility. I guess I could have googled him or the article, but the least you could do is give us some reference when you link to something, especially if it is linked to your web pages. If I sound a little peeved, I am. I appreciate all the work you do, and I have mentioned this before, but sometimes I have no idea what you are talking about in your posts and links. Another one was your link/post on the “blog contest” (there are many others). It took me at least 10 minutes to figure out what the heck you were talking about. Please try to be clearer sometimes. Maybe it is just me and my old age (approaching 50).
And I apologize to Kodiak if I was too harsh in my comments.
Thanks, although I did not realize that I was writing a review!
“It is a little simplistic at best, and somewhat inaccurate at worst”
“I kind of lost interest in reading the rest of the article.”
“It might be a decent read for someone with little knowledge/experience in how pitches and pitchers work…”
sounded like a review to me. I guess I should have said something more like: “You are accurate, MGL, in your assessment of the article if it were meant for this audience, but the article was not put in context yet, nor was it meant for this audience as it was written. Here is the context it was written in and why I was interested in getting it out once I had updated and edited it for consumption of this audience.” Sorry about that. I could not be offended nor find your words harsh when I knew that you did not know (and could not know) the context in which the article was written.
I would dispute that elapsed time per game vs. errors is an accurate way to tell if defense is hurt by walks/wildness. It just seems to me that elapsed time just has too many variables that are not related to walks/wildness: Total plate appearances (by both teams), pitching changes, time between pitches, etc. Also, range is not represented. That being said, I must say my saying that wildness/walks can lead to lackadaisical play is purely based on observing defenders standing flat footed, whistling, staring into the stands, etc. in such situations, not statistical analysis.
That being said, I hope this article (at least in its updated/edited form) will stimulate the discussion and study of pitch types. If anyone does read the whole article, please keep the context I provided in post #3 in mind.
As far as average speed of fastballs goes, I think that it is a perfect time to bring up the importance of classifying pitches. To say what the average speed of a fastball is, you have to define fastball. Is it only 4-seamers? 4-seamers and 2-seamers? 4-seam fastballs, 2-seam fastballs, cut fastballs, split-finger fastballs, and sinking fastballs?
One note about the article - I’d classify changeup as faster than a curveball on that spectrum.
Regarding the “standard” of where in its path to clock a pitch… I think it’s going to have to be as close to the beginning as possible. Not for any mathematical reason, but because people like to see big numbers. Zumaya wouldn’t be as interesting if his fastball was measured at 94mph (when it crossed the plate). He hits triple digits, and I think that’s half the excitement. So even if pitch fx is currently sort of setting the standard at 50ft from the plate… there’s going to be some ballpark gun that aims 60ft from the plate, clocks Zumaya at 103 instead of 101 like pitch fx… and that’s what they’re going to talk about on sportscenter. Same reason we measure home runs in terms of how far they “would have” gone if they landed on flat ground - it’s just a bigger number than how far it actually went before someone caught it.
Actually, side note: Wakefield’s knuckler is around 69mph (ballpark gun) if I remember correctly. I swear his curveball is around 65mph (um, before it leaves the ballpark). I wonder if this means curveball should be lowest on the spectrum?
I think we should really measure velocity not at a specific point, but as distance from release point to home plate divided by time from release point to home plate. From some of the articles written on pitch f/x, I get the impression this is something that could be calculated (or at least from 50 feet from home plate, rather than release point).
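The distance-over-time idea above can be sketched in a few lines. This is only an illustration: the function name and the sample inputs (50 feet covered in 0.40 seconds) are made up for the example and are not PITCHf/x values.

```python
FEET_PER_MILE = 5280
SECONDS_PER_HOUR = 3600


def average_speed_mph(distance_ft, flight_time_s):
    """Average speed over a stretch of the pitch path, in mph.

    distance_ft: distance covered (e.g. from release, or from 50 ft, to the plate)
    flight_time_s: time taken to cover that distance, in seconds
    """
    if flight_time_s <= 0:
        raise ValueError("flight time must be positive")
    feet_per_second = distance_ft / flight_time_s
    return feet_per_second * SECONDS_PER_HOUR / FEET_PER_MILE


# Hypothetical inputs, chosen only to show the unit conversion.
example_mph = average_speed_mph(50.0, 0.40)
print(round(example_mph, 1))
```

An average taken this way will always sit below the speed at the start of the stretch and above the speed at the plate, so it is a different number from any single-point reading.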
One more thing, and I apologize that this is totally unrelated to baseball. Let’s say the average historical high temp in Boston is 42° in December. And let’s say that the weather person, using their best predictive abilities, predicts it’s going to be 50° on a given day in December. Does the idea of regression to the mean state that it’s more likely to be less than 50° than over 50°?
Mike, you are misunderstanding the regression to the mean concept. When you make predictions you use regression to the mean in your prediction, not after the fact.
For instance if the temp is 50deg one day what do you predict the next? Well if the average is 42deg then you may decide to regress the 50deg 50% to get 46deg forecast.
In any case, I’m not sure the example holds, as weather isn’t a random statistical phenomenon. You have very good visibility as to what type of weather is coming, which will cause you to alter your prediction from the straight mean.
What you may do is to say that when the type of weather system xyz passes through the temp is usually 55deg, however today the temp is 48deg, which is colder than usual so I’ll regress that 48deg to 50deg for tomorrow’s prediction ... or something like that.
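The two numeric examples in this comment can be worked through with a small helper. This is a sketch of the arithmetic only; the function names are invented here, and the regression fraction is something you would have to estimate, not something this code tells you.

```python
def regress_toward_mean(observed, mean, fraction):
    """Move `observed` toward `mean` by `fraction` (0 = not at all, 1 = all the way)."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    return observed + fraction * (mean - observed)


def implied_fraction(observed, mean, forecast):
    """How much regression a given forecast implies."""
    return (forecast - observed) / (mean - observed)


# First example: 50 degrees today, long-run average 42, regress 50%.
print(regress_toward_mean(50.0, 42.0, 0.5))

# Second example: system "xyz" usually means 55, today is 48, forecast 50.
print(round(implied_fraction(48.0, 55.0, 50.0), 3))
```

The first call reproduces the 46-degree forecast. The second shows that going from 48 to 50 against a 55-degree norm amounts to regressing about 29% of the way.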
Pitches lose roughly 10% of their velocity from release to the plate due to the force of drag. I am not aware that this varies from pitcher to pitcher, although it does vary slightly due to weather conditions, altitude, and the speed of the pitch (faster pitches experience a little more drag).
PITCHf/x measures velocity 50 feet from home plate. They standardized on this value (after trying 55, 40, and 45 feet) because it correlates best with radar gun readings that people are used to seeing.
As long as we’re measuring at the same point for everyone, it shouldn’t matter. The second-order differences are well within the noise margin on the measurement at this point.
Yes, another reason to measure as close to the release point as possible (50 feet is fine) is that weather and altitude will vary from park to park and day to day and the decrease from release to plate is not pitcher-specific (other than the speed of the pitch), as Mike said.
And yes, a changeup on the average is faster than a curve. I think one of the pitch f/x guys showed that as well. And anyone who pays keen attention to pitches on TV knows that. Obviously that varies with individual pitchers.
I don’t exactly know how weather forecasting works, but I would guess that the “error bars” around a forecast temperature (or any weather forecast item) are not symmetrical and that they are larger towards the mean for that day. Maybe not though. That does not mean, however, that the “mean forecast temperature” should be regressed toward the average for that day. If it did, then the forecaster would do that BEFORE he gave the forecast, right?
It is a little like in blackjack card counting though. If you are counting cards, there is always the likelihood that you made a mistake during the deck or shoe. Let’s say that you think the true count is around +2, which means that you should stand on a 12 versus a dealer 3. Basic strategy says to hit that of course. At an exact true count of 2 or more, you are supposed to stand. However, even if you think the true count is around +2, you should still hit it. That is because, if you made a mistake in your count, it is more likely that the real true count is closer to 0 than “higher than +2”. It might be the same thing with weather forecasting. Even if the conditions are such that in the past they meant a temp of 42 coming up, it might be that the weather forecaster’s chance of making a mistake in the direction of the temp being greater than 42 is larger than his chance of making a mistake in the direction of it being less than 42, if the average temp on that day is 50. He should probably give a forecast of something like 42.2, but he doesn’t. Or maybe even 43, but he sticks to his model without considering the asymmetrical chance of him making mistakes in his model or even his calculations (like a “typo”).
Yes, the relative speed spectrum will have to be adjusted. Splitters will have to be added, obviously to the left of changeups, but to the right of which pitch?
Another question I have is about the speed of curveballs. If curveballs are split between the overhand (12-6) variety and the more ‘slurvy’ variety, do they fall on opposite sides of the changeup? Perhaps 12-6 curves go on the right side of changeup and regular curveballs on the left? It seems so to me, but it may be purely observational bias as I tend to notice the 12-6 curves that have a bigger ‘loop’ more than others.
I think that overhand curves tend to be slower than slurvy ones, but overall, curves are slower than changeups.
The speed hierarchy (from closely watching BB for many, many years) is:
4 seam FB
2 seam FB
cutter
slider/splitter (I don’t know which is slower)
changeup
curve
Re: the confusion of the author of the piece. The link above did say “By SirKodiak”, so I thought that was sufficient. However, I have added a “Copyright Sir Kodiak” on the page itself, so that should clear that up.
I was referring to a little explanation as to the source of the piece, not just the name of the author.
The Mike Fast article linked to in the original post was very, very interesting.
I am not exactly sure why we need to have such a great naming system, although it does facilitate some kinds of research. However, if a researcher has all the data he needs on pitch speed and movement, he can classify the pitches any way he wants, or not at all, I suppose.
It’s a little like trying to classify every type of batted ball. I don’t think it is that necessary. What we really want for batted balls, which is NOT yet available, is hang time, height, distance, etc. It is not that important what to call them - liners, fliners, pop flies, etc. We already have the data we need on pitches, so I am not sure we need to obsess so much on classifying them. Any reasonable classification should suffice for most research. In fact, for some research we DON’T necessarily want to use pitcher-specific classifications.
I am a little troubled by the possibility of errors in the pitch f/x system itself. I think there needs to be more testing and QC on that.
It is interesting that when you are watching a game on TV, it is not that hard to figure out what kind of a pitch a pitcher throws. After the fact, it is a lot harder for a computer and a programmer to classify the pitches. There are some things that the human brain does a lot better than a computer. This is one of them.
The nomenclature just puts a “human face”. Otherwise, I agree with you.
Same with the batted ball information. The work by Greg Rybar[fill in rest with some combination of czyk… my wife is Polish, and I always kid that when I hear a name, how many k’s and z’s are there] in the THT08 Annual is an example of great work in this regard.
Here is Dan Fox’s take on why “it is useful to strive for classification.”
Briefly, he says:
1) classification is helpful because it gives us a common nomenclature to reference
and
2) pitchers already intentionally throw pitches of certain types, so there is a definite distinction being made
Interesting thread.
As to the speed hierarchy of pitches: after studying lots of pitches, it seems to me that changeups are definitely thrown harder than curves, even slurvy ones. In fact, the average changeup is really not much slower than the average slider.
Another comment: if we are measuring speed at (or near) release and if the only difference between the 4-seam and 2-seam fastballs is the orientation of the seams, shouldn’t those two pitches register at (virtually) the same speed? In other words, the 4-seamer experiences less drag (according to conventional wisdom, although I believe some laboratory measurements were unable to confirm it) than the 2-seamer, but if we measure speed at release point, the drag has not yet taken effect. Right?
Of course, 2-seamers may well be thrown differently (“turning it over”, or whatever), which could change the speed on release.
Regarding the need to classify pitches according to the traditional pitch types:
I actually believe this helps us to think about pitching. There are over 300,000 pitches in my pitch-f/x database and any method of breaking down that big number into subsets is worth considering. I think Dan Fox made an important point when he said the pitch type reveals pitcher intention—that is very different from batted balls, where the batter does not try to hit the ball with different spins, etc.
Research will surely reveal alternative methods for classifying pitches (using location, e.g., which has been rather neglected in most of the pitch-f/x research thus far—including my own). In the end, the “best” classification will depend on the goal of the analysis. If you’re studying pitcher/batter game theory, traditional pitch types might be best. If you’re studying BABIP, maybe location (or ball-strike count, or whatever) would be better.
In any case, there is a lot of work to do.
Yes, I agree that it is helpful in many ways (not the least of which is putting a face on it, as Tango says), but two things:
One, any classification system is going to be arbitrary in terms of the cutoff points. I assume that there are pitches that do not naturally fit into one cluster or another, no matter what algorithm we use. If a certain pitcher’s or even all pitchers’ pitches happen to be more clustered than others or than I think, all the more power - it makes it easy to classify them.
Two, it does not matter what we call pitches. We might as well call them the A, B, C, D, E, and F pitches. I think we are too hung up on whether a pitch is a slider or a cut fastball. It makes no difference. We are going to classify pitches according to spin/speed or whatever we want. I assume we will classify them according to intention, and it would be nice if we did, but that is never going to be exact even if we know for a fact what pitches a pitcher intends to throw and does throw on a deliberate basis.
The best we can do is classify all pitches from all pitchers into a number of categories based on the number of “intentional pitches” pitchers throw, as far as we know. This is going to be based on speed and spin (break) once we establish the categories.
For individual pitchers, we can do the same, but use as our categories the pitches that we think that pitcher intentionally throws. If we don’t know that from scouting reports or from the pitcher himself or some other individual, then we have to infer it from the relative speed and break of all that pitcher’s pitches, perhaps also comparing them to “known” pitches from other pitchers.
Let’s do that and get on with it. There is a lot of fantastic substantive research to be done. I would not get too bogged down in this categorization/nomenclature stuff, unless in the course of the research, it seems to become necessary.
On the other hand, if someone wants to devote their time to this issue, that is fine with me, as long as other guys are working on the good stuff.
Some things I want to know from the pitch f/x data (although some of it can’t be done until we get more data):
(Maybe we can start a thread on “a pitch f/x research wish list.” Tango?)
1) After surgery or injury, what happens to a pitcher on the average? Lose velocity? Lose command? Lose stamina?
2) What is the effect of velocity, sort of like age? Take each pitcher and look at the run value of a fastball at various velocities to see the average change in run value per mph of fastball.
3) Why some pitchers are high HR pitchers and others are low HR pitchers? Is it just that one pitches low in the zone and the other pitches high? Or is it a higher or lower percentage of “mistake” pitches in the middle of the zone? Is it that one naturally pitches high in the zone and the other pitches naturally low in the zone, or is it that one throws more “mistake pitches” up in the zone than the other one?
4) For pitchers who have good years and bad years, why is that? Is it fluctuation or is it something fundamental that they are doing differently, or can we even tell the difference?
5) Do older pitchers do different things fundamentally than younger ones? Same question for more versus less experienced pitchers.
6) Aging curve for fastball speed, control, etc. (need at least 2 years of data for that, unless we want to split up the year into 2 halves and call the second half an “older pitcher by 3 months”).
7) Do pitchers in general get tired as the season progresses, and what does that look like in terms of speed, location, spin, etc.?
8) Can we infer from the data why some fastballs, curves, changes, etc. are better than others?
9) What makes a pitcher with good stuff bad and one with mediocre stuff good?
10) How important is mixing up pitches at various counts? Against various batters? What is the balance between needing to mix up pitches and how good a pitcher’s stuff is (IOW, if you have an unhittable pitch even if the batter knows it is coming, like a Rivera cutter, do you need to throw anything else)?
11) Which pitchers are good and bad at mixing up pitches?
12) Which pitchers seem to be optimal in terms of mixing up pitches according to game theory?
13) What is the proper mix of pitches at various counts and game situations and which pitchers are better than others?
14) Can we improve a pitcher’s performance just by analyzing some of these things and telling him what he is doing wrong (such as throwing too many or too few of certain pitches or in certain locations in certain situations)?
I think there are many more issues I would love to see looked at. Personally, I would start with “all pitchers” as a whole and work from there.
Any complaints as to this for a ‘relative speed spectrum’?:
4-seam fastball > 2-seam fastball/sinker/cutter > slider > splitter > changeup > 3/4 curveball > 12-6 curveball
I was thinking that perhaps ‘relative spectrums’ based on data might be useful for (ideas from Mike Fast’s work above):
*spin direction
*spin magnitude
*tendency to throw to same -> opposite side batters
perhaps vertical pitch movement and horizontal pitch movement spectrums would be useful (both with gravity and without)?
perhaps location spectrums in various forms would be useful as well?
On page 400 of the BJ08 Handbook, the fastest fastball in the NL is Penny at 93.4. Number 10 is Hudson at 90.9. The 10th lowest is 87.4. So, it’s easy to see that the average FB in the NL is around 89mph.
In the AL (the better league), page 394 gives us 95.6 for King Felix, 92.9 for CC as #10, and the 10th lowest is 90.1. That would put the average FB in the AL at 91.5mph.
So, the average FB in MLB, according to this book is around 90mph.
Tango/26
Those look like starters only, right? Lots of relievers have faster FBs.
FWIW, I have Penny at 94.0, Hudson at 91.2, Felix at 97 and CC at 93.9. So, it looks like the pitch-f/x values are coming in around 1 mph higher than the BIS data (which I believe simply come from TV broadcast radar guns.)
Yes, min 162 IP.
Perhaps then it means their numbers are from 40ft out?
Tango/28
I suspect that BIS can’t tell you where the pitches are measured from. As John W said the speed data are from the radar guns.
These have inherent inaccuracies as shown here
http://www.hardballtimes.com/main/article/zoooomaya-and-speed-guns/
Also, it is impossible to determine the exact pick-up point precisely, as it depends on gun positioning, which varies (slightly) in each park. However, I think it is safe to assume it will pick up slightly closer to the plate than the pitch f/x data, which is typically readjusted to, say, 55ft.
For the record, the numbers I quoted for pitch-f/x in #27 are all adjusted to 50 ft.
A decent rule of thumb is that pitches lose 1 mph due to drag for every 5 or 6 feet that they travel.
The exact number depends on speed of the pitch, atmospheric conditions, etc., of course, but the rule of thumb will get you close if you want to adjust for pitch speeds measured at different distances.
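As a rough sketch of how that rule of thumb could be used to move a reading from one measurement distance to another: the function name is invented here, 5.5 feet per mph is simply the midpoint of "5 or 6 feet", and the linear form is only the rule of thumb, not a drag model.

```python
FEET_PER_MPH_LOST = 5.5  # assumed midpoint of the "5 or 6 feet" rule of thumb


def adjust_speed(speed_mph, measured_at_ft, target_ft, feet_per_mph=FEET_PER_MPH_LOST):
    """Estimate the speed at `target_ft` from a reading taken at `measured_at_ft`.

    Both distances are measured from home plate, so a larger distance is
    closer to release and gives a higher speed.
    """
    if feet_per_mph <= 0:
        raise ValueError("feet_per_mph must be positive")
    return speed_mph + (target_ft - measured_at_ft) / feet_per_mph


# A 94.0 mph reading taken 50 ft from the plate, restated at other distances.
for target in (55, 50, 40, 0):
    print(target, round(adjust_speed(94.0, 50, target), 1))
```

Run all the way to the plate, the linear rule takes roughly 9 mph off a 94 mph reading at 50 feet, which is in the same neighborhood as the "roughly 10%" figure quoted earlier in the thread.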
Of the 4 data points that compare the Walsh numbers to the BIS numbers, the gap is 0.3, 0.6, 1.0, and 1.4 mph. Since Walsh is basing all his numbers at exactly 50 feet, and if we use the drag conversion of Mike/31 (5.5ft = 1mph), that implies that BIS is using 48.35 ft, 46.7 ft, 44.5 ft, 42.3 ft, respectively, as the average distance for its pitchers.
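The arithmetic behind those four distances can be reproduced directly. The speeds are the ones quoted in the earlier comments; the variable and function names are made up for this sketch, and the calculation assumes the whole gap is due to measurement distance.

```python
PITCHFX_DISTANCE_FT = 50.0  # pitch-f/x speeds quoted in the thread are adjusted to 50 ft
FEET_PER_MPH = 5.5          # drag conversion used in the comment above

# (pitcher, pitch-f/x mph, BIS mph) as quoted earlier in the thread
readings = [
    ("Hudson", 91.2, 90.9),
    ("Penny", 94.0, 93.4),
    ("CC", 93.9, 92.9),
    ("Felix", 97.0, 95.6),
]


def implied_distance_ft(pitchfx_mph, other_mph):
    """Distance from the plate at which the slower reading would have been taken."""
    gap_mph = pitchfx_mph - other_mph
    return PITCHFX_DISTANCE_FT - gap_mph * FEET_PER_MPH


for name, pitchfx_mph, bis_mph in readings:
    gap = pitchfx_mph - bis_mph
    distance = implied_distance_ft(pitchfx_mph, bis_mph)
    print(f"{name}: gap {gap:.1f} mph, implied distance {distance:.2f} ft")
```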
Clearly, there is a bias in the TV/radar gun. At the very least, BIS needs to come up with a TV-radar park impact number. Since we know the real numbers from F/X, this should be a snap.
Tom/32,
Keep in mind that the pitch-f/x data is not perfect either, so we don’t really know the “real numbers”.
Mike has looked at this (and Josh Kalk, too) and found that pitch speeds can jump around an mph or two from start to start. We have also seen that some fraction of pitch-f/x pitches simply seem to be mis-tracked.
Plus, there is a raft of other possibilities for discrepancies in what I get and what Bill James gets. Our pitch classification methods will surely be different (I’m curious to know how BIS classifies pitches; I bet it’s based on a combination of pitch speed and scouting information, with a pinch of video-scout (isn’t that a great term?) observation thrown in).
There are so many variables here, I think we should all take these speed numbers with the appropriate “error bars”, which should probably be at least 1 mph, maybe 2.
Great point about pitch classifications. It’s kinda like saying how far do HRs go. The minimum distance is the fence, so you can actually have an increase in the number of HR by having a decrease in its average distance. (Say for example, you bring the fence in 10 feet. Now, all the warning-track fly balls become barely-HR, which sets the overall mean lower.)
If you misclassify the low-end fastballs as something else, all that you have left is the upper-end pitches. After all, you will never classify an upper-end speed pitch as anything other than a fastball.
Rather than relying on “fastball” speed, why not instead take the top 25% or top 50% of pitches thrown by speed, and call that a pitcher’s “fastball” speed. Or, more accurately, “top-end speed”. You can say “half the time, this guy’s pitches average 93.4 mph”.
This may also allow a fairer comparison between starters and relievers. Presuming that there’s less spread in fastball speed for each reliever (say, Mo throws his 94-96mph, while CC will throw his 90-96 mph), by focusing only on the top end, say the top 25% of all pitches thrown, then maybe Mo will be 95-96, while CC will be 94-96.
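A minimal sketch of the "top-end speed" idea, which ignores pitch labels entirely. The function name and the sample speeds are hypothetical, and truncating the pitch count is just one way to decide how many pitches make up the top fraction.

```python
def top_end_speed(speeds_mph, top_fraction=0.25):
    """Mean speed of the fastest `top_fraction` of pitches, regardless of pitch type."""
    if not speeds_mph:
        raise ValueError("need at least one pitch")
    if not 0.0 < top_fraction <= 1.0:
        raise ValueError("top_fraction must be in (0, 1]")
    fastest_first = sorted(speeds_mph, reverse=True)
    # Keep at least one pitch; truncate rather than round the count.
    keep = max(1, int(len(fastest_first) * top_fraction))
    top = fastest_first[:keep]
    return sum(top) / len(top)


# Hypothetical mix of pitch speeds for one pitcher (not real data).
sample = [96, 95, 94, 92, 90, 88, 84, 78]
print(top_end_speed(sample, 0.25))
print(top_end_speed(sample, 0.50))
```

Because it keys on the fastest readings, this measure leans hardest on exactly the pitches most affected by any park or game with inflated speed data.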
This is why I feel so strongly about defining the pitch categories, even if it is only within the individual studies/articles. When my article was written, it was for a game where ‘fastball’ was a distinct pitch, but in studies/articles we have no idea about what the author means exactly by fastball or any other pitch unless there is a definition. If a standardization is attempted, at least it may encourage authors to define pitches better, or perhaps give authors something to easily express their own definitions against.
John/22
According to Alan Nathan there should be a small effect on drag based on seam orientation, on the order of 1 mph or so. I have looked for this effect in the PITCHf/x data, and I have not been able to find it. Either my 4-seam/2-seam classification is off just enough to muddy the data, or the effect is smaller than believed. Separately, I do believe that the grip/release on the two-seamer makes it a little more difficult to throw the pitch quite as hard as with the 4-seam grip.
Tango/34
A big concern I’d have with that method is that it would expose a problem with the PITCHf/x data that tends to be minimized with other methods. There are a few parks (Seattle, Toronto) that had speed readings higher than other parks. The data had various problems early in the year. Taking average fastball speeds tends to minimize the effect of games with high outlier speeds for most pitchers, although even then its effect is noticeable for pitchers with lots of data in those parks. Taking just the top end puts too much emphasis on the shoddy speed data.
Josh Kalk has tried to correct for some of these data/park problems, but I’m not convinced he’s gotten there.
Take a look at Huston Street’s average fastball speed by game:
91, 92, 89, 91, 89, 91, 93, 90...out with injury...88, 92, 93, 89, 89, 90, 90, 91, 89, 90, 91, 91, 91, 89, 93, 93, 90, 91, 90, 90, 89, 91, 90, 93, 92, 91, 91, 88, 90, 90.
Are we to believe that his top fastball speed is 92-93? Or does he really top out at 91, and the 92-93 readings are from camera systems that were measuring too fast? I don’t know the answer.
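One way to look at that list is to tabulate it. The sketch below uses the per-game averages exactly as quoted above, split at the injury; the variable names are made up, and nothing here can say whether the high readings are real or a measurement artifact.

```python
from collections import Counter
from statistics import mean, median

# Per-game average fastball speeds as listed above, split at the injury.
before_injury = [91, 92, 89, 91, 89, 91, 93, 90]
after_injury = [
    88, 92, 93, 89, 89, 90, 90, 91, 89, 90,
    91, 91, 91, 89, 93, 93, 90, 91, 90, 90,
    89, 91, 90, 93, 92, 91, 91, 88, 90, 90,
]
all_games = before_injury + after_injury

# How many games came in at each average speed.
counts = Counter(all_games)
for speed in sorted(counts):
    print(speed, counts[speed])

games_92_plus = sum(1 for speed in all_games if speed >= 92)
print("games:", len(all_games))
print("mean:", round(mean(all_games), 2), "median:", median(all_games))
print("share of games at 92+:", round(games_92_plus / len(all_games), 2))
```

If the 92-93 games cluster in particular parks or dates, that would point toward the camera systems; if they are spread evenly, it looks more like ordinary game-to-game variation. The list alone does not carry park or date information, so it cannot settle that.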
Mike/37, re: Tango/34:
In that case, we should park-adjust the numbers, before applying the method I describe. It’ll at least reduce one of the uncertainties.
Tango/38
That is what Josh Kalk has done. I don’t believe it works, though, not with our data set from 2007.
There are not consistent “park factors” for the data. I and others observed that Seattle and Toronto were high, but I think Sportvision fixed that later in the year. Alan Nathan told me that they recalibrate the camera systems between home stands. Josh Kalk looked for that in the data and didn’t see it, but I’m convinced it’s there. I believe the assumption that there is a consistent park factor is an erroneous one.
Maybe the “park factor” assumption, even if it’s not correct, gets us most of the way there and would be useful, but I’m suspicious of doing something to the data that we know doesn’t fit the established facts. I have seen cases in Josh Kalk’s data where I believe his park factor adjustments have mucked up otherwise good data, and I haven’t seen an overwhelming number of cases where I believe it’s helped clean up the data significantly. Maybe 2:1 good:bad and mostly agnostic, but I don’t know if I’d call that trustworthy.
MGL/24
There’s a lot of good meat in your post which I am trying to digest.
Re, “I assume that there are pitches that do not naturally fit into one cluster or another, no matter what algorithm we use.” I do not find that statement to be generally true. I would say that for about 30% of pitchers, every single one of their pitches can be assigned very accurately to the correct cluster. For another 40% or so of pitchers, nearly every one of their pitches (~99%) can be assigned quite accurately to the correct cluster. For another 20% or so of pitchers, you can classify 90-95% of their pitches accurately to a particular cluster. Finally, there are maybe 10% of pitchers for whom it’s tough to differentiate two or more pitch types and for whom the accuracy of the clustering is difficult or impossible to get past 80% of their pitches. These are rough approximations based on my experience looking at the data, so I’m sure they’re not completely correct, but if you assume they are close enough for government work and add them up, you find that we can accurately classify about 96% of all pitches if we work at it.
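The "add them up" step is a weighted average. In the sketch below the 90-95% group is taken at its midpoint (92.5%), "every single one" is treated as 100%, the last group is taken at 80%, and each group of pitchers is assumed to throw a proportional share of all pitches. Those are assumptions made here for illustration.

```python
import math

# (share of pitchers, assumed fraction of their pitches classified correctly)
groups = [
    (0.30, 1.00),   # every pitch assigned accurately
    (0.40, 0.99),   # nearly every pitch (~99%)
    (0.20, 0.925),  # 90-95%, taken at the midpoint
    (0.10, 0.80),   # hard cases, taken at the 80% ceiling
]

# The shares should cover all pitchers.
assert math.isclose(sum(share for share, _ in groups), 1.0)

overall_accuracy = sum(share * accuracy for share, accuracy in groups)
print(f"{overall_accuracy:.1%}")
```

With those inputs the weighted figure comes out at about 96%, matching the estimate in the comment.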
Re “it does not matter what we call pitches.” I agree, but only to an extent. If we have accurately categorized them, it doesn’t matter what label we put on the category post-facto. However, if/when I take your advice before beginning analysis, I find that I miss things in the data. For example, John Smoltz throws both a split-finger pitch and a changeup. From a quick look at the data they will look like one pitch, and you can say, “Ah, he’s got a changeup. People say he has a splitter, too, but it looks like they move the same, so it doesn’t matter whether I call it a changeup or a splitter. Objectively they are much the same; therefore, the name is irrelevant.” This is pretty much what I did on my first pass analysis of Smoltz. However, when I took a closer look at the data, it became clear to me that the splitter was thrown 2-3 mph faster than the changeup and with a spin direction that was inclined by another 20-30 degrees or so. I wasn’t able to perfectly separate the two pitches, but I was able to find that they did have distinct characteristics, not just in speed/movement, but also in how Smoltz used them with the hitters.
Re classification “let’s do that and get on with it.” The million-dollar question is how to do that. I take it that you are saying you are agnostic about the method and the accuracy of the classification? Is 50% accuracy good enough? I think I could easily do that. Getting to 80% accuracy would require a lot more work but is certainly doable. Is that good enough? Josh Kalk is pretty close to that, although his method/data isn’t public. It would not be trivial for me to reproduce what he’s done, but I could try. My goal has been to get to the 90-95% accuracy level, but that’s taking a lot more work than I expected. I guess my question is, what is good enough? Do you want your fastball data to include 100% of four-seamers and two-seamers, 80% of cutters, 30% of splitters, 20% of changeups, and 10% of sliders? Or would you rather have a more tightly defined fastball data set that included 70% of four-seamers, 50% of two-seamers, 30% of cutters, 10% of splitters, and 5% of changeups? Is either of those data sets good enough to do studies about fastball velocity or pitch mix? Or do we need to improve our pitch classification methods until we can get fastball data that includes 99% of four-seamers and two-seamers and either 10% or 90% of cutters (depending on whether you want to exclude or include cutters in your fastball data)? I’ll deal with your 1-14 wish list in a separate post, but it seems to me that most of the items on the wish list depend on accurate pitch classification.
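To make the trade-off between those two fastball data sets concrete, here is a rough sketch that computes how pure each set is (share of pitches in it that really are four- or two-seamers) and how complete it is (share of all real four- and two-seamers it captures). The inclusion rates are the ones posed above; the pitch counts in `true_counts` are purely hypothetical and chosen only for illustration, and cutters are treated as non-fastballs here, which is itself a choice.

```python
# HYPOTHETICAL true pitch counts, for illustration only; not real data.
true_counts = {
    "four_seam": 5000, "two_seam": 2000, "cutter": 500,
    "splitter": 300, "changeup": 1200, "slider": 1500,
}

# Share of each true pitch type that ends up in the "fastball" data set,
# taken from the two scenarios described above.
loose = {"four_seam": 1.00, "two_seam": 1.00, "cutter": 0.80,
         "splitter": 0.30, "changeup": 0.20, "slider": 0.10}
tight = {"four_seam": 0.70, "two_seam": 0.50, "cutter": 0.30,
         "splitter": 0.10, "changeup": 0.05}

FASTBALLS = {"four_seam", "two_seam"}  # assumption: cutters excluded

def purity_and_recall(inclusion, counts, targets=FASTBALLS):
    """Return (purity, recall) of the fastball set for the given inclusion rates."""
    included = {p: counts[p] * inclusion.get(p, 0.0) for p in counts}
    total_in = sum(included.values())
    target_in = sum(included[p] for p in targets)
    target_all = sum(counts[p] for p in targets)
    return target_in / total_in, target_in / target_all

for name, rates in (("loose", loose), ("tight", tight)):
    purity, recall = purity_and_recall(rates, true_counts)
    print(f"{name}: purity={purity:.3f} recall={recall:.3f}")
```

Under these made-up counts the tighter set is cleaner but throws away a large share of real fastballs, and, because it keeps more four-seamers than two-seamers, it would also skew any velocity study toward the harder pitch. A different assumed pitch mix would change the numbers but not the shape of the trade-off.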
Some thoughts on the MGL/24 wish list:
1) Do we have injury data for pitchers? If we had that and good pitch classification, this one would be pretty easy to answer. I suspect at this point the comprehensive injury data is the stumbling block. Without it we are working anecdotally. Speaking of anecdotes, based on my analysis of Smoltz, one other thing that can happen when a pitcher is nursing an injury is that he can stop throwing or almost stop throwing a certain pitch (with Smoltz, the splitter; with Street, the changeup).
2) I don’t think we have enough data to do this by pitcher, but maybe we do. If we have 500-800 fastballs for a given pitcher, is that enough to separate them into 1 mph bins with 100-200 pitches in each bin? Do we need to control for the fact that pitchers lose velocity as a game progresses, and we know that hitters perform better the second and third times they see a pitcher, or is that part of what we’re looking for in the first place?
3) I have wanted to look at this and have not gotten around to it. By “mistake” pitches, do you mean all pitches in the middle of the zone? Someone else (SirKodiak?) mentioned looking for hanging breaking balls in the data. I haven’t figured out how to do this. Despite having some anecdotal identification by pitchers/catchers of a few hanging breaking balls, I was not able to identify any peculiarities for those pitches in terms of speed/movement.
4) This is an excellent question, and one we can hopefully begin to tackle next year.
8) One thing I have noticed is that if a pitch is not very good, a pitcher doesn’t use it very often. I guess that’s obvious, but it was an interesting observation for me.
9) How do you define stuff? Speed and movement? And then separate from that would be location, pitch selection, and consistency/command?
10-13) I’ll let Joe Sheehan handle these.
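On item 2, whether 500-800 fastballs are enough is easy to check once the speeds are in hand: bin them by velocity and see which bins clear a minimum sample size. A minimal sketch follows; the function names, the 1 mph default width, and the 100-pitch threshold are illustrative choices, not anything from an existing tool.

```python
import math
from collections import Counter

def bin_by_velocity(speeds_mph, width=1.0):
    """Count pitches per velocity bin; each key is the bin's lower edge in mph."""
    return Counter(math.floor(s / width) * width for s in speeds_mph)

def usable_bins(counts, min_pitches=100):
    """Keep only the bins with at least min_pitches pitches, sorted by speed."""
    return {edge: n for edge, n in sorted(counts.items()) if n >= min_pitches}

# Tiny made-up sample, just to show the shape of the output.
sample = [90.2, 90.9, 91.0, 91.5, 92.3]
print(usable_bins(bin_by_velocity(sample), min_pitches=2))
```

If a pitcher's fastballs span five or six mph, 500-800 pitches will tend to leave only the middle bins above 100, so the tails, which may be the most interesting part, would be the thinnest.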
MGL/24
Any truth to the rumor that you’re going to be tackling some of these yourself? [g]
Then we need a park-date factor (if not a park-date-hitter-timeUp), which certainly makes life more difficult.
Joe Sheehan has a new article out that deals with question #5 from MGL’s wish list in post #24.
http://baseballanalysts.com/archives/2008/01/old_man_river.php
And a followup by Sheehan, to control for the speed of the fastball, within each age group:
http://baseballanalysts.com/archives/2008/01/diff_groups.php
Tango, where does this come from and who is the author?
It is a little simplistic at best, and somewhat inaccurate at worst.
Most pitchers throw their fastball in the 86-89mph range. Ones that throw it 92mph+ are usually considered hard throwers, whereas some are fine pitchers with a 84mph fastball.
Most pitchers throw their fastballs a lot harder than that (86-89). I think that one of the pitch f/x guys had the average fastball around 91 or 92. There are almost zero “fine” pitchers who throw their fastball at 84. In fact, I am not sure there are any. Moyer may be the only one (other than a few oddball relievers, like Bradford) who throws that slow, and I don’t think he is “fine” anymore.
1) Not being ‘wild out of the strikezone’
This leads to walks, and walks are not only a free pass, but they don’t give the fielders a chance to help and the fielders tend to become lackadaisical.
I don’t think we know whether fielders “become lackadaisical” when pitchers are wild and/or walk a lot of batters. I think we suspect that that is NOT the case (at least according to one study that looked at elapsed time per game versus error rate), assuming that lackadaisical means that they make more errors and/or their range is reduced.
I kind of lost interest in reading the rest of the article.
It might be a decent read for someone with little knowledge/experience in how pitches and pitchers work…