THE BOOK--Playing The Percentages In Baseball


Friday, November 04, 2011

Is this line linear or polynomial?

By Tangotiger, 03:28 PM

y = -243.83x^4 + 478.68x^3 - 170.49x^2 + 14.134x

image


#1    ElBonte      (see all posts) 2011/11/04 (Fri) @ 15:37

I’ve yet to find a line that isn’t linear ;)

Sure looks like two linear segments to me, though.


#2    Tangotiger      (see all posts) 2011/11/04 (Fri) @ 15:39

I updated the image to remove the blue line, so as to not bias you.


#3    Crazy Crabbers      (see all posts) 2011/11/04 (Fri) @ 15:45

Well I would have to say that is polynomial.


#4    Tangotiger      (see all posts) 2011/11/04 (Fri) @ 15:49

Now, what if you focus on the part of the line between the .450 win% and the .650 win%?


#5          (see all posts) 2011/11/04 (Fri) @ 15:57

That’s a polynomial of degree 4, since it has a nonzero coefficient in front of the x^4. But it is a nice approximation to a linear function from .6 to 1.0.

However, if you widen the domain to something more like x=-3 to x=3, you’ll see it’s definitely a polynomial.

This is the same idea that drives Taylor polynomials/approximations.


#6    Tangotiger      (see all posts) 2011/11/04 (Fri) @ 16:02

So, can’t we just say that there’s a polynomial relationship between wins and salary, but at the points that we are interested in (win% .450 to .650), that it is about as close to a straight line.

And so, just accept that wins and dollars are linearly linked, at the points that we are interested in.

Furthermore, by “resetting” the scale, such that we subtract .400 win% from every point, then from +.000 (above reset point) to +.200 (above reset point), we have a straight line?

Therefore, drawing a line from +.000 to +.200, we have a straight line.  Hence, dollars to wins above replacement (WAR) are linear.


#7          (see all posts) 2011/11/04 (Fri) @ 16:03

Not sure what you’re really asking.  Obviously, the curve is a polynomial.  I would guess that it is probably a 4th order polynomial (:.  If you’re asking whether there is some region over which it is linear (i.e., slope is constant), it sure looks pretty linear between 0.6 and 1.0.  Most definitely not linear between 0.45 and 0.65.  But maybe I misunderstand your question.


#8          (see all posts) 2011/11/04 (Fri) @ 16:05

My message crossed Bo’s in flight.


#9          (see all posts) 2011/11/04 (Fri) @ 16:18

You could certainly make a case in favor of a linear approximation to the polynomial with the domain in question. You could also calculate the approximation error. Obviously, depending on how you calculate the linear approximation, your error may be skewed one way or another.

How did you develop this 4th order polynomial?


#10          (see all posts) 2011/11/04 (Fri) @ 16:24

I take back what I said.  When I plotted the function between 0.45 and 0.65, it looks reasonably linear.  As Bo said, if you plot any polynomial over a small enough range, it will look linear (except near a local maximum or minimum).  The quantitative question is, how small is small enough.


#11    Tangotiger      (see all posts) 2011/11/04 (Fri) @ 16:30

Right, exactly.

If the gap between a linear equation and a polynomial equation amounts to 1MM$ of error (over the domain we are interested in), then why the argument about WAR and salary not being exactly linear?

For all practical purposes, it is linear.


#12          (see all posts) 2011/11/04 (Fri) @ 16:35

Tom...I find that the difference between a linear fit and the polynomial over the domain 0.45-0.65 is in the range -0.88 to +0.88.  As long as that difference is not significant to you, then it’s ok to call it linear over that domain.  Or, more precisely, it is well approximated by linear.
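
The comment does not say how the line was fitted, so the following is only a sketch of one way to arrive at an error band like this. It assumes the line is chosen so the largest overshoot and undershoot are equal (take the chord from .450 to .650 and shift it by half of its largest gap to the curve), which works because the posted quartic bends one way only on that range. The function names are made up for illustration.

```python
# Coefficients of the curve from the original post:
# y = A4*x^4 + A3*x^3 + A2*x^2 + A1*x
A4, A3, A2, A1 = -243.83, 478.68, -170.49, 14.134


def curve(x):
    """Evaluate the posted quartic at x (Horner form)."""
    return (((A4 * x + A3) * x + A2) * x + A1) * x


def equal_error_line(lo, hi, steps=2000):
    """Return (slope, intercept, half_gap) for a line that splits the error evenly.

    Assumes the curve stays on or below its chord everywhere on [lo, hi].
    """
    slope = (curve(hi) - curve(lo)) / (hi - lo)
    # Largest vertical gap between the chord and the curve, found on a grid.
    gap = max(
        curve(lo) + slope * (x - lo) - curve(x)
        for x in (lo + (hi - lo) * i / steps for i in range(steps + 1))
    )
    half_gap = gap / 2
    # Shift the chord down by half the gap so errors are balanced.
    intercept = curve(lo) - slope * lo - half_gap
    return slope, intercept, half_gap


slope, intercept, half_gap = equal_error_line(0.45, 0.65)
xs = [0.45 + 0.20 * i / 2000 for i in range(2001)]
residuals = [curve(x) - (slope * x + intercept) for x in xs]
print(f"line: y = {slope:.2f}x + {intercept:.2f}")
print(f"residual range: {min(residuals):+.2f} to {max(residuals):+.2f}")
```

Under that assumption the band comes out symmetric and in the neighborhood of the figure quoted above. An ordinary least-squares line would give a lopsided band instead, smaller in the middle and larger at the ends.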


#13    berselius      (see all posts) 2011/11/04 (Fri) @ 16:38

around 0.6 to 0.7 the second and third derivatives vanish, so it is fair to say that it’s close to linear in that range.
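
For anyone who wants to check where the curvature actually vanishes, here is a small sketch that differentiates the posted quartic by hand and solves for the zeros of the second and third derivatives. With the coefficients as posted, it puts the zeros of y'' near 0.14 and 0.84 and the zero of y''' near 0.49, so the range quoted above may be better read as an eyeball estimate of where the curve looks straightest.

```python
import math

A4, A3, A2 = -243.83, 478.68, -170.49  # the linear term drops out of y''

# y''  = 12*A4*x^2 + 6*A3*x + 2*A2
# y''' = 24*A4*x + 6*A3
qa, qb, qc = 12 * A4, 6 * A3, 2 * A2
disc = math.sqrt(qb * qb - 4 * qa * qc)
second_derivative_roots = sorted(
    [(-qb + disc) / (2 * qa), (-qb - disc) / (2 * qa)]
)
third_derivative_root = -6 * A3 / (24 * A4)

print("y'' = 0 near x =", [round(r, 3) for r in second_derivative_roots])
print("y''' = 0 near x =", round(third_derivative_root, 3))
```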


#14          (see all posts) 2011/11/04 (Fri) @ 16:40

You can linearly approximate just about any function, it’s just a matter of how much approximation error you’re comfortable with.

Imagine the line y=x as an approximation to the function y=sin(x) (this is, in fact, where the Taylor series begins for sin(x)). From x=-.75 to x=.75 it’s a pretty good approximation, but you wouldn’t want to extend it much beyond that.

By the same token, you wouldn’t say that y=sin(x) is linear from -.75 to .75, but rather you’d say a linear approximation would suffice.
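
The sin(x) example is easy to put numbers on. This sketch samples the gap between y=x and y=sin(x) over a few ranges; the grid size and the list of ranges are arbitrary choices for illustration.

```python
import math


def max_error_of_line(limit, steps=1000):
    """Largest |sin(x) - x| for x in [-limit, limit], sampled on a grid."""
    return max(
        abs(math.sin(x) - x)
        for x in (-limit + 2 * limit * i / steps for i in range(steps + 1))
    )


for limit in (0.25, 0.75, 1.5, 3.0):
    print(f"|x| <= {limit}: max error {max_error_of_line(limit):.4f}")
```

The error grows quickly as the range widens, which is the whole point: the same line is a fine stand-in on a narrow range and a poor one on a wide range.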


#15    davemcgr      (see all posts) 2011/11/04 (Fri) @ 16:40

This reminds me of using OLS regression (as opposed to probit) when working with non-extreme proportions. So I’m going to have to agree, for practical purposes there doesn’t seem to be reason to treat it as something other than linear.

With that said, I find myself caring a lot more about how much a player should be paid rather than how much the market dictates. To me the question is really should salary have a linear relationship with WAR under a restricted range? And my gut says no.


#16          (see all posts) 2011/11/04 (Fri) @ 16:55

One also has to be careful with language and how someone is likely to interpret “linear.” Often, one interprets it as meaning, “if we double x, we will double y”.  That would be incorrect in the present context, since there is an offset (i.e., x=0 does not mean y=0).  Still, some people might misinterpret your meaning if you claim a linear relationship.


#17    Tangotiger      (see all posts) 2011/11/04 (Fri) @ 17:20

Right, exactly.  I think people don’t appreciate the “offset” and what it does to the relationship.

You can pretty much always create an equation by applying the offset, such that “double x” = “double y”.


#18          (see all posts) 2011/11/05 (Sat) @ 16:59

Tom/17:  To be more explicit: 
If y=ax+b, then (y-b)=ax.  So, doubling x will double y-b (b is the “offset”).
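
A quick numeric check of the offset point, using made-up values for a, b and x (they are not taken from the salary curve):

```python
def line(x, a, b):
    """A straight line with slope a and offset b."""
    return a * x + b


a, b = 5.0, 2.0   # hypothetical slope and offset
x = 3.0
y1 = line(x, a, b)
y2 = line(2 * x, a, b)

raw_ratio = y2 / y1                  # not 2, because of the offset
offset_ratio = (y2 - b) / (y1 - b)   # exactly 2 once the offset is removed
print(raw_ratio, offset_ratio)
```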


#19    pierre      (see all posts) 2011/11/05 (Sat) @ 18:16

tango/6- except WAR=0 at about, what, a winning % of .3?  So, over the domain WAR>0, the relationship between WAR and $ is not linear.  If I’m following.


#20    bruce      (see all posts) 2011/11/06 (Sun) @ 17:50

#19 - The main point is not the absolute shape of the curve. It is whether a linear approximation is good enough. The curve is linear except as it gets close to replacement level (winning = 30%). Replacement level has a fixed salary = MLB minimum, which is probably what forces the curve to curve.

If you’re worried about how Pittsburgh and Houston should be looking at free agents with strictly limited budgets then you could worry about the shape of the curve below 45%. Most people are not (I say this as a long-time Astros fan). If you’re thinking about teams who might have a chance at making the playoffs then you’re above 45% and linear is good enough.


#21    Colin Wyers      (see all posts) 2011/11/07 (Mon) @ 00:45

So, what are x and y?


#22    Arne      (see all posts) 2011/11/07 (Mon) @ 17:57

#21: “So, what are x and y?” I gather from #4 that x is win% and from #6 that y is salary, although it has not of course been explicitly defined in the original post. #6: “So, can’t we just say that there’s a polynomial relationship between wins and salary, but at the points that we are interested in (win% .450 to .650), that it is about as close to a straight line[?]” No.


#23    pierre      (see all posts) 2011/11/07 (Mon) @ 20:02

I don’t think we’re just interested in the range from .450 to .650. If I remember, WAR=0 is somewhere between .300 and .350.  So, the issue is the $/win relationship between 0 and 1 to 1.5 WAR.


#24    Tangotiger      (see all posts) 2011/11/08 (Tue) @ 08:12

At the player level, replacement level is around .400 win%.

Think of what a pitcher’s win% is if he was surrounded by average hitters and average fielders.

So, that’s why most pitchers that we have any interest in are in the .450 to .650 range.

And Alan is right about the error range.

So, if you are ok with a 1MM$ error range (i.e., 0.25 wins), then forcing a straight line rather than polynomial is an acceptable trade.

Personally, 0.25 wins of error is extremely acceptable to me.


#25    Pierre      (see all posts) 2011/11/08 (Tue) @ 10:22

Tango, Fangraphs says that a team of replacement players would win 48 games.  How do I reconcile .400 and 48 wins?  Is one right and the other wrong, or are the two statements not contradictory for some reason I don’t understand?  Thanks!


#26    weskelton      (see all posts) 2011/11/08 (Tue) @ 17:00

Echoing Colin’s question…

So, what are x and y?


#27    Tangotiger      (see all posts) 2011/11/08 (Tue) @ 17:45

Pierre:

I said this:
“At the player level, replacement level is around .400 win%. Think of what a pitcher’s win% is if he was surrounded by average hitters and average fielders.”

You said this:
“a team of replacement players would win 48 games”

***

x is the win% of your pitcher, and y is how much he would get paid

It’s just a reasonable illustration.


#28    Tangotiger      (see all posts) 2011/11/08 (Tue) @ 17:58

Pierre: basically, it works like this.

If you have a team of average players, and one replacement-level player, and that player is a full-time position player, that team will win .485 games.  If you have two full-time position players that are replacement-level, that team will win .470 games.  You have eight full-time position players, and that team will win .380 games.

(It’s not that simple, but this illustration works.)

On the flip-side, start with a team of average position players, but a team of replacement-level pitchers.  That team will win .410 games.

Finally, put the two together, a team of replacement-level position players and a team of replacement-level pitchers, and you will win close to .300 games.

Got it?
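
The arithmetic in this illustration can be written out as a short sketch. It takes the numbers above at face value and simply adds them: .015 of win% per replacement-level regular and .090 for a replacement-level pitching staff. The constant and function names are invented here, and the strictly additive model is the simplification already flagged above.

```python
AVERAGE_TEAM = 0.500
COST_PER_POSITION_PLAYER = 0.015  # .500 -> .485 for one replacement-level regular
COST_OF_PITCHING_STAFF = 0.090    # .500 -> .410 for a replacement-level staff


def team_win_pct(replacement_position_players=0, replacement_pitching=False):
    """Additive illustration: subtract each replacement-level piece from .500."""
    pct = AVERAGE_TEAM - COST_PER_POSITION_PLAYER * replacement_position_players
    if replacement_pitching:
        pct -= COST_OF_PITCHING_STAFF
    return pct


for n in (1, 2, 8):
    print(f"{n} replacement regulars: {team_win_pct(n):.3f}")
print(f"replacement pitching only: {team_win_pct(0, True):.3f}")
print(f"all replacement: {team_win_pct(8, True):.3f}")
print(f"48 wins in 162 games: {48 / 162:.3f}")
```

The all-replacement team lands a little under .300, in line with the 48-win figure (48/162) from the question, while any single player measured against an otherwise average team sits far higher.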


#29    Colin Wyers      (see all posts) 2011/11/08 (Tue) @ 18:31

“x is the win% of your pitcher, and y is how much he would get paid

It’s just a reasonable illustration.”

Okay, so what you have is a graph that’s essentially linear for pitchers with win percentages between .700 and 1.000. Yay? That covers, what, two pitcher seasons out of the past decade, for full-time starters? Maybe you quibble and say it starts to straighten out around .600 - that’s still a really small portion of the total pool of pitchers.

If you look at the totality of the graph above .400, it’s mostly linear with a slight curve. If you weight by where most of the pitchers actually occur (well over half of all full-time starters are gonna fall between .450 and .550) then it’s pretty obviously curved.


#30    Colin Wyers      (see all posts) 2011/11/08 (Tue) @ 18:40

If we crop the graph to focus on just the area where there are actual pitchers:

http://www.flickr.com/photos/42654229@N00/6327314740/

The curve appears far more pronounced once you strike out the range of about .750 to 1.000 where you don’t see any starting pitchers (in terms of true talent, at least) and few to no relievers, either.


#31    mettle      (see all posts) 2011/11/08 (Tue) @ 18:40

Does this mean I can be paid $100k to take a .250 pitcher?

***

Shrink your range enough and *Everything* is linear.


#32    Tangotiger      (see all posts) 2011/11/08 (Tue) @ 19:04

I said to focus on .450 to .650, and Alan showed that the error range from a linear line is under 1MM$.  That’s 0.25 wins.

That’s why I say even if it is curved (which it HAS TO BE because the polynomial line shows it to be), it’s ESSENTIALLY linear, which is my point.


#33    Colin Wyers      (see all posts) 2011/11/08 (Tue) @ 19:28

I don’t know what essentially means here. Is OPS essentially linear weights? Is Basic RC essentially the team run scoring process? Is (R+RBI-HR)/2 essentially batting runs?


#34    pierre      (see all posts) 2011/11/09 (Wed) @ 07:01

tango/27&28- yes, I understand.  So, is the part of the curve around .450 an accurate representation of the cost of replacing Kevin Kouzmanoff with Scott Sizemore (or some other example of going from 0 to 1-1.5 WAR)?  That is, why is .450 to .650 the range of interest?  I am wanting to think about it at the player level, as I suspect that the relationship is more or less linear above 1-1.5 WAR, but not below.


#35    Tangotiger      (see all posts) 2011/11/09 (Wed) @ 09:09

I already described what essentially means: an error of 0.25 wins between the true line and the linear approximation line.

weighted OPS would be essentially Linear Weights (probably).

Since virtually all MLB players are in the .400 to .700 level, that’s our universe.  I don’t particularly care about those guys on the bubble, those under .450.


#36          (see all posts) 2011/11/09 (Wed) @ 10:25

Where did the line come from?


#37    Tangotiger      (see all posts) 2011/11/09 (Wed) @ 11:08

Mike: it’s just a reasonable illustration, which is why I didn’t want to talk about what x and y really meant.

***

The point is to show that you can actually have a curved line, but at the point where we are interested in, a linear approximation is “close enough”, if you define “close enough” as an error of 0.25 wins.

This thread can be used in tandem with my other thread, where I ask what does x equal in this equation:
x + 0.7 = 3 + 3

People talk about premiums for the single superstar, but when it comes down to it, the wiggle room will basically be 0.25 wins.

So, you hear complaints that you can’t just add two players’ 3 + 3 WAR to match to two other guys at 5.3 + 0.7, but when you ask for a specific equivalent, you basically end up with a deviation of at most 0.25 wins.

Therefore, we can all acknowledge that it’s not a perfectly straight linear relationship, but if you can accept a 0.25 win deviation, then a straight line does the job.

But, I understand some people need to get the perfect polynomial line, and to them I say: good luck!

As I get older, I find that accepting a 0.25 win deviation frees me from constraints, and lets me move on to other stuff.


#38    Karl Frederick Gas      (see all posts) 2011/11/09 (Wed) @ 14:59

"The point is to show that you can actually have a curved line, but at the point where we are interested in, a linear approximation is “close enough”, if you define “close enough” as an error of 0.25 wins.” If this statement is true, then you can concoct a linear approximation for darned near anything. And, to go back to the use of “polynomial relationship” in #6 (and its implied existence elsewhere), you can also fit a polynomial rather well to any data set, so long as you’re using as many or more terms in your polynomial as you have points in your data set. This thread reminds me of the posting in which MGL asserted that you can increase the accuracy of your data reading by keeping all the digits from a decimal readout, insignificance of the figures notwithstanding. Another example, perhaps, of why true scientists and statisticians don’t take many practitioners of “sabermetrics” seriously (especially those who would reinvent the established fields to suit their own narrow ends).


#39          (see all posts) 2011/11/09 (Wed) @ 15:32

“Another example, perhaps, of why true scientists and statisticians don’t take many practitioners of “sabermetrics” seriously (especially those who would reinvent the established fields to suit their own narrow ends).”

There are “true scientists and statisticians” that are among the frequent contributors to this blog (I know there’s at least one Ph.D. in this very thread!)...I’m pretty sure they take sabermetrics (and practitioners thereof) very seriously.  This might be true of *some* “true scientists and statisticians,” but then it becomes a rather pointless statement.

Now let’s get back to focusing on the issue at hand…


#40    Gas      (see all posts) 2011/11/09 (Wed) @ 16:54

"There are “true scientists and statisticians” that are among the frequent contributors to this blog (I know there’s at least one Ph.D. in this very thread!)” Having a Ph.D. hardly qualifies one as a true scientist/statistician by itself. As far as focusing on the issue at hand, it is interesting that this thread commenced with a canned fit of some unspecified data to a 4th-order polynomial with an associated R^2. Wouldn’t a spline have done? Wouldn’t it also be maximally useful in this situation to pinpoint the domain of maximum nonlinearity? That would seem to be a more useful exercise in this context than the speculation on approximation sufficiency.


#41    Tangotiger      (see all posts) 2011/11/09 (Wed) @ 17:10

Gas: the usefulness of this exercise is exactly as I presented.  Feel free to create your own exercise on your own blog, and I’ll link to it.


#42    WanderingWinder      (see all posts) 2011/11/09 (Wed) @ 17:42

It all depends on what it is you’re modelling. For any finite set of data, there are, of course, an infinite number of equations to perfectly model that data. The first major issue is that some of those just don’t seem to intuitively make any sense. To take a physics example, instead of having energy go as velocity squared, you could, based on your 10,000 measurements, go with some fit to a 9999th order polynomial, or an exponential added to a polynomial, or a higher order polynomial, or… what have you. And there’s a good chance your 10001st data point will be way off, especially if it’s not close to the first 10000. So we use a straight second power because it’s simple (and really really close if not perfectly right), and we just somehow believe that things are simple (no real evidence that it’s this rather than complex, but I think it’s good intuition).

The other issue is what are you using it for. Because man, it’s going to take me a long time to calculate those 10,000 terms in my polynomial (well, I’ll use a computer, but then I have to program it, and I need it to do my computation). Linear I can pretty much do in my head. So another reason to go linear, based on the application.

So in this case, for quick, general considerations, and I know I’m in the range where the approximation is good, I’ll use linear. If I happen to be near a computer, or if I need extreme accuracy, or if I’m outside the valid range of my approximation, I’ll use something else. But this is also assuming I don’t have some outside reason to think it’s exponential or polynomial or linear or any particular kind of model, which, without knowing what it IS, I don’t.
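
The worry about exact-fit polynomials can be shown on a toy case. This sketch uses hypothetical data, not anything from the thread: ten measurements of y = x^2 at x = 0..9, each off by an alternating 0.05. It then runs the unique 9th-degree polynomial through all ten points (Lagrange interpolation) and asks it to predict the next point.

```python
def lagrange_eval(xs, ys, x):
    """Evaluate the unique degree-(n-1) polynomial through the n points at x."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total


# Hypothetical data: y = x^2 at x = 0..9 with a small alternating error.
NOISE = 0.05
xs = [float(i) for i in range(10)]
ys = [x * x + (NOISE if i % 2 == 0 else -NOISE) for i, x in enumerate(xs)]

# The exact-fit polynomial reproduces every measurement...
worst_in_sample = max(abs(lagrange_eval(xs, ys, x) - y) for x, y in zip(xs, ys))

# ...but is asked here to predict one step beyond the data.
next_x = 10.0
exact_fit_error = abs(lagrange_eval(xs, ys, next_x) - next_x ** 2)

print(f"worst error at the measured points: {worst_in_sample:.6f}")
print(f"error of the exact-fit polynomial at x=10: {exact_fit_error:.2f}")
print(f"error of plain y = x^2 on the measured points: {NOISE:.2f}")
```

The exact fit has zero error on the data it was built from and a very large error one step outside it, while the simple second-power model is never off by more than the measurement error.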


#43          (see all posts) 2011/11/09 (Wed) @ 19:33

There’s a whole bunch of theory on what’s meant by “good” in an approximation.  Tango is satisfied with maximum absolute residual, but it isn’t clear why that’s the right measure of the acceptability of his fit.  You need to carefully account for how the residual to your fit affects the results of whatever you want to do with your approximation.  In this case the residual is a 4th order polynomial and you want to be sure that whatever you do with your linear approximation, the 4th order residual doesn’t change the interpretation of that result.  Just measuring the absolute error isn’t mathematically sufficient to make that case.


#44    pierre      (see all posts) 2011/11/09 (Wed) @ 22:15

tango/37- my point, which I think I failed to make, is the following:  A 0 WAR player is at about .300, a 1 WAR player is at about .400.  The cost associated with moving from 0 WAR to 1 WAR is very low (imo).  Assuming the $/win curve is linear above 1 WAR, then you would not trade a 6 WAR player for 2 3 WAR players.  The math would be 3-1 + 3-1 = x-1; x=5.  So, 2 3 WAR players should buy you a 5 WAR player.  2 2 WAR players would equal a 3 WAR player, 2 4 WAR players a 7 WAR player, etc.
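
The arithmetic in this comment amounts to measuring everyone from a higher baseline before adding. A minimal sketch, where `baseline` is a hypothetical parameter standing for the WAR level at which salary value is assumed to start:

```python
def equivalent_single_player(wars, baseline=0.0):
    """WAR of the one player worth the same as the group,
    if value is linear in WAR above `baseline`."""
    return sum(w - baseline for w in wars) + baseline


print(equivalent_single_player([3, 3]))              # baseline at 0 WAR
print(equivalent_single_player([3, 3], baseline=1))  # baseline at 1 WAR
print(equivalent_single_player([2, 2], baseline=1))
print(equivalent_single_player([4, 4], baseline=1))
```

With the baseline at 0 the two 3 WAR players add up to 6, and with it moved to 1 they add up to 5, so the disagreement is entirely about where the baseline sits.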


#45    Tangotiger      (see all posts) 2011/11/10 (Thu) @ 00:13

First of all, I already said the baseline is close to .400, not .300.

In any case, if you believe what you are saying, then create a WAR such that the zero baseline actually represents the zero baseline (i.e., raise your replacement level higher).


#46    pierre      (see all posts) 2011/11/10 (Thu) @ 08:18

I meant no offense.  Just that I think the discussion about “scarcity” is really a discussion about the replacement level. 

And there’s probably no right answer to the question of where to set the replacement level.  A 1 WAR true talent player, say Chris Getz, may have very little trade value, and his team may be dying to replace him, but he’s not free, and finding someone better may not be that easy.


#47    Tangotiger      (see all posts) 2011/11/10 (Thu) @ 10:33

I didn’t take any offense, and I hope I didn’t convey that I took offense.

In any case, I don’t think people appreciate what 1 “WAR” means.  The WAR has ALREADY been baselined.  But, then people want to further baseline it (as you were trying to do).

Alternatively, people are treating one baselined version of WAR as useful for salary, and another baselined version of WAR as useful for trade value.  And I’m trying to point out it should be the identical baseline.

Once you decide on whatever baseline you want, then that baseline is what it is.  It’s used for salary and for trading.


#48    pierre      (see all posts) 2011/11/10 (Thu) @ 11:18

I think the exact same dynamic exists for both trades and salary.  $5mm/win may be a very good rule of thumb when you’re looking at Prince Fielder, but if your GM is paying $5mm for the Chris Getzes of the world, he’s not doing his job.

Nor am I saying the replacement level should be changed.  I’m eyeball-engineering 1 WAR as the level above which the relationships are additive/linear, but setting the baseline at 1 WAR would be totally arbitrary, and setting it at the level where players are completely fungible is not.


#49          (see all posts) 2011/11/10 (Thu) @ 13:03

Just to chime in here on the “true scientists and statisticians” bit:  Doing physics is not the same as doing math.  Math is precise; physics (contrary to popular belief) deals with approximations.  In fact, much of the art of doing good physics is really the art of knowing how to make good estimates and approximations.  I can speak with lots of experience that baseball physics is not the precise kind of physics that is taught in the textbooks.  It involves model-building and simplification.  For example, if one were to insist on understanding the flight of a baseball from first principles, one would not be able to do even the simplest of trajectory calculations.  So, we simplify by parametrizing our ignorance in a drag coefficient, which is really nothing more than a “fudge factor”.  But it provides a useful description of the aerodynamic effects, at some level of approximation.  As long as that level is acceptable for one’s purposes, then one has a useful tool for analyzing trajectories of baseballs. 

In the present context, the basic premise of Tango is that, over some limited range of “x values”, the corresponding “y values” are approximately linear, to an accuracy that is acceptable for his purposes (whatever those purposes may be).  Perhaps that accuracy will not be acceptable for the purposes of other people. 
Of course, with any approximation one must be careful not to overinterpret the results.  Or said more positively, to be sure the interpretation is made within the context of the accuracy of the approximation.  Also, one must be careful not to apply the approximation outside its range of validity (in Tango’s case, below x=0.45 or thereabouts).

End of rant!


#50    Tangotiger      (see all posts) 2011/11/10 (Thu) @ 15:11

Alan: excellent non-rant rant!

