
THE BOOK--Playing The Percentages In Baseball


Friday, May 09, 2008

Amount of regression: Here is an interesting question having to do with one of the other threads…

By , 04:27 AM

Maybe Tango and some of the other stat guys can take a stab at this.


In reference to the work I just did on hot and cold starts for pitcher (K-BB)/PA, I was thinking this:

Let’s say that we decide or determine that a certain performance or stat should be aggressively weighted by recency, for whatever reason.  Presumably, it is because players, with regard to that stat, tend to change their true talent level often, or by a significant amount, or both.  But it doesn’t matter why; let’s just assume that that is the case.

I mentioned in that thread that we have to use an “effective” sample size for regressive purposes when we weight a series of stats.  For example, if we have 100 PA in year X and 100 PA in year X+1, and we weight year X+1 100 times more than year X (essentially ignoring year X), we can’t include the 100 PA in our sample for regression purposes, can we?  No, of course not.  But this creates a conflict.  If we now have a much smaller sample to regress, due to our aggressive weighting, we effectively nullify the effect of that aggressive weighting! 

Let’s say that in 1996-2006, our player has a rate of 10 (for anything) in 5000 PA and in 2007, he has a rate of 20, in 100 PA.  And let’s also assume that the league average of this guy’s population is 5.  We are going to weight 2007 100 times more than 2006, so we are essentially ignoring all of the prior years.  So we now have an effective sample size of only 100, and we regress toward a league average of 5 and we come up with something LESS (depending on the regression equation of course) than this guy’s career average!  That can’t be right.

Even though you are weighting 2007 very heavily, you would have to regress this guy’s 2007 stats (the 20) toward something other than the league average of 5.  You have to somehow take into consideration the fact that the guy has had 10 prior years with a rate of 10, even though you are weighting the 100 PA in 2007 so heavily.

So the question is, if you have a guy with a lot of history, and you are aggressively weighting recent performance, what do you use for your sample size to do the regression, and what do you use for your mean to regress toward?

Let’s use a more practical example.  We are using a weighting system that weights each year 3 times the previous year.  A player has 500 PA in each of 3 straight years, and then 300 in year 4 and 300 in year 5.  For this particular stat, the regression equation we are going to use is 600/(PA+600).  The population mean is 5.  Our player has averaged 20 in the first 3 years (for 1500 PA total ), and 30 in year 4 and 5 (for 600 total PA).

How would we do a basic Marcel using weighting and regression?  I don’t see using mostly year 4 and 5, which we will end up doing because of the aggressive weighting, and then regressing only 600 or so PA toward 5 (the population mean).  That does not make sense as this guy has another 1500 PA at a rate of 20 that we are essentially ignoring.  Surely there must be another way.  We are essentially saying, “We are pretty sure that this guy is an above average player (with regard to this stat) since he is way above the population mean in 2100 PA,” but we also suspect that his true talent has changed after 2006 and 2007.  How do we handle both of these things?  If we just do a weighted average of all 5 years, and then regress toward the population mean using our “effective number of PA” (which will be close to only 600, the last 2 years of PA), we will come up with a bad answer because we will be regressing too much.  I have no problem with the weighted average, but what do we use for the number of PA in the regression equation?
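To make the conflict concrete, here is a minimal sketch of the naive calculation for the five-year example. It assumes the newest season gets a weight of 1 and each older season gets one third of the weight of the season after it; that normalization is an assumption, and a different one changes the effective PA. The function name `naive_marcel` and its parameters are hypothetical.

```python
# Naive weighted-then-regressed estimate for the five-year example.
# Assumption: the newest season has weight 1 and each older season gets
# 1/3 of the weight of the season after it.

def naive_marcel(seasons, step, league_mean, reg_const):
    """seasons: list of (pa, rate) ordered oldest to newest."""
    n = len(seasons)
    weights = [step ** (n - 1 - i) for i in range(n)]  # newest gets step**0 = 1
    eff_pa = sum(w * pa for w, (pa, _) in zip(weights, seasons))
    weighted_rate = sum(
        w * pa * rate for w, (pa, rate) in zip(weights, seasons)
    ) / eff_pa
    regress = reg_const / (eff_pa + reg_const)  # fraction moved toward league mean
    estimate = weighted_rate + regress * (league_mean - weighted_rate)
    return eff_pa, weighted_rate, estimate

seasons = [(500, 20), (500, 20), (500, 20), (300, 30), (300, 30)]
eff_pa, weighted_rate, estimate = naive_marcel(seasons, 1 / 3, 5, 600)
print(round(eff_pa, 1), round(weighted_rate, 2), round(estimate, 2))
```

Under these assumptions the estimate lands below the 20 the player averaged over his first 1500 PA, which is the objection raised in the post.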

#1    Tangotiger      (see all posts) 2008/05/09 (Fri) @ 10:56

If I may summarize for the benefit of those who may be missing it.

We are always giving a “weight” of 1 to the sample that we trust the most.  The other samples that we trust less get a percentage of that weight.  So, if you, for some reason, decide to trust 156 PA of someone who has a 39/2 K/BB ratio far more than his career, then you are weighting his 156 PA as “1” (total of 156) and his other 3000 (or whatever) career PA at “0.1” each, for a total of 456 PA.

However, someone with a more reasonable K/BB ratio would get a weight of “1” for his most recent performance, and perhaps “0.5” for the rest of his career, thereby giving him say 1656 PA. 

So, even though he’s got the exact same number of PA as Cliff Lee, he’s got more weight.

Normally, when we regress, we basically give each pitcher the same, say “league average of 250 PA”.  But, as we can see here, that doesn’t make any sense.
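The arithmetic in this summary can be sketched as follows. The 156 PA, 3000 PA, 0.1, 0.5 and 250 PA figures come from the comment; the function name `effective_pa` is hypothetical, and treating the 250 PA as the constant in a 250/(250+PA) regression is an assumption about what is meant.

```python
# Effective PA when the recent sample gets weight 1 and the rest of the
# career gets a smaller weight, using the figures from the summary.

def effective_pa(recent_pa, career_pa, career_weight):
    return recent_pa * 1.0 + career_pa * career_weight

aggressive = effective_pa(156, 3000, 0.1)    # recent form trusted heavily
conservative = effective_pa(156, 3000, 0.5)  # career trusted more

# Assumed regression fraction toward the league mean with a 250 PA constant.
for label, pa in (("aggressive", aggressive), ("conservative", conservative)):
    print(label, pa, round(250 / (250 + pa), 3))
```

Two pitchers with identical raw PA end up being regressed by different amounts purely because of the weighting choice.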

***

That’s the summary. 

I don’t know the answer.


#2    MGL      (see all posts) 2008/05/09 (Fri) @ 17:07

I thought about this a lot last night. I know there is a mathematical answer, as all of these things are just a basic Bayesian problem (what is the probability he is a true X, given what he has produced in Y number of trials, given the chance that his true talent has changed from A to B, and given the distribution of true talent in the population we think he comes from).

But I still cannot come up with a reasonable way to do a Marcel, the way we usually do it, given a heavy weighting for recent performance, and the example I give.

Tango, can you try and come up with an answer to the example I gave?  Come on, you are good at this stuff!

I’ve never seen you turn down a challenge like that!


#3    tangotiger      (see all posts) 2008/05/09 (Fri) @ 18:26

This is something Andy can do in his sleep I think.  But, you are right, it is somewhat challenging to think about, and try to make sense of.  Let me sleep on it…


#4    william      (see all posts) 2008/05/09 (Fri) @ 18:48

immediate guess as a method which might give a reasonable ballpark figure:

Use the amount of performance we are giving the heavy weighting to, regressed towards our previous estimate of his talent (i.e. all his previous performance regressed towards the league mean):

For the basic example, the 20 would be regressed an appropriate amount towards 9.9 (or whatever is appropriate).

Thinking some more, I’m fairly convinced this is not right (see below).
Sanity test: Suppose our regression equation is 100/(100+PA), league mean is 5, and we have a 200 PA sample at a rate of 10. Our standard result would be 8.333.

divide the sample into 2 100PA samples and apply the suggestion I made above.

We regress to an estimate of 7.5, then regress 10 halfway towards 7.5 to get 8.75.

Late Night statistics results in failure


#5    william      (see all posts) 2008/05/09 (Fri) @ 18:57

ah - the 2nd regression equation should be altered to 100/(200+PA), where the 200 is the 100 from the initial weighting (the population mean) plus the 100 from the first sample.


#6    MGL      (see all posts) 2008/05/09 (Fri) @ 20:45

I was going to say that I thought your initial idea was good.  Then I saw how your example did not work.  I don’t get your post #5.  The first regression puts us at 7.5, as you say.  Now we have another 100 PA at 10.  We want to regress the 10 towards the 7.5, in order to hopefully get 8.333.  That requires a regression of 67% of course, which implies a regression equation of 200/(PA+200) (I think that is what you meant).  But I am not sure where that regression equation comes from, when our original one was 100/(100+PA).  IOW, I am not sure how to derive that equation given different parameters.


#7    will      (see all posts) 2008/05/10 (Sat) @ 05:40

sorry - you are correct that the equation in #5 should be 200/(200+PA). The idea for altering the regression is that we are more confident in our estimate of his ability (of 7.5 in the example) after 100 PA than we were in our initial estimate of league average (5). Given that the initial regression equation is 100/(100+PA), we are implicitly giving our knowledge of league average a weight of 100 PA. The mean to be regressed to outside of the sample data is based on this amount.

Whereas for the second regression, the mean we are regressing to is based on our knowledge of the league mean (given a weight of 100PA) AND the first sample of 100PA. The regression equation then alters to be (100+100)/(100+100+PA).

More generally I would guess:

(A + x*y)/(A + x*y + PA)

where A is the number from our initial equation, y is the number of PA in the previous sample, and x is the weighting we are giving to them.

N.B. Take with healthy pinch of salt, this is pretty much guesswork
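The sanity test from #4 and the adjustment from #5 through #7 can be checked numerically. This is a minimal sketch for the undiscounted case only (weighting x = 1); the function name `regress` and its parameters are hypothetical, and whether the same update holds once older samples are discounted is exactly the open question in this thread.

```python
# Check of the sanity test: league mean 5, regression constant A = 100,
# 200 PA at a rate of 10, processed in one lump or as two 100 PA chunks.

def regress(rate, pa, prior_mean, prior_weight):
    """Shrink an observed rate toward prior_mean; prior_weight is in PA."""
    frac = prior_weight / (prior_weight + pa)
    return rate + frac * (prior_mean - rate)

A, LEAGUE = 100, 5

pooled = regress(10, 200, LEAGUE, A)

first = regress(10, 100, LEAGUE, A)               # estimate after the first chunk
fixed_const = regress(10, 100, first, A)          # constant left at A
updated_const = regress(10, 100, first, A + 100)  # prior now carries A + 100 PA

print(pooled, fixed_const, updated_const)
```

With the constant left at A the two-step answer drifts away from the pooled one; adding the first sample's PA to the prior weight brings the two back into agreement.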


#8    MGL      (see all posts) 2008/05/10 (Sat) @ 05:46

I don’t think you can alter the regression equation - only the variables within it.

The regression equation is based simply on the spread of talent in the population. That never changes.

Take that, as well, with a hefty pinch of salt!


#9    Peter Jensen      (see all posts) 2008/08/04 (Mon) @ 21:13

MGL - Using the regression model that I propose, where the population regressed toward is the player’s previous projection and the league population mean is ignored entirely, completely solves the problem you present here.


#10    MGL      (see all posts) 2008/08/04 (Mon) @ 23:55

Again, I’ll reserve comment or judgment until I see what you are talking about by giving me an example using the data I supplied in the other thread.


#11          (see all posts) 2008/08/06 (Wed) @ 08:05

I have been carefully considering comments on the other thread where Peter and MGL are debating this, and now I see that Peter has bumped up this older thread. I am currently working on a MLE and projection model, so I am paying close attention to the conversation.

Similar to the Marcel, I am weighting the current season’s counting stats at 1.00, and each previous season by a factor of 0.8. If a player has more than a handful of pro seasons, there is sufficient data that the effects of a regression are negligible. Vets who play every day (Rollins, Ichiro, Pierre) have their projections based on up to 3200 PA. Anecdotally browsing the data, around 800 PA appears sufficient for a projection which is closely consistent with later projections with a larger sample size (I will run tests to find the number of PAs where the error of obs-exp is less than a given tolerance).

However, using a factor of 0.8 means that 80% of the projection is based on previous years (assuming many previous years) and only 20% on the current year (that’s with a full current year). Projections change slowly from one year to the next. This may be a good thing. Using a factor of 0.50 will converge to 50% from current, 50% from prior.
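The 20% and 50% figures follow from the geometric weights. A minimal sketch, assuming every season has the same number of PA and ignoring the regression component (the function name `current_share` is hypothetical):

```python
# Share of total weight carried by the current season when every season has
# the same PA and each older season is discounted by a constant factor.

def current_share(factor, n_seasons):
    weights = [factor ** k for k in range(n_seasons)]  # k = seasons back
    return weights[0] / sum(weights)

for factor in (0.8, 0.5):
    print(factor, [round(current_share(factor, n), 3) for n in (1, 3, 5, 10, 30)])
```

As the number of seasons grows, the current season's share approaches 1 minus the factor, so the limit is only reached for players with long histories.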

Tango, do you have a link to any old threads which explain the 0.8? I know it’s the daily value expressed over 365 days, but I want to know the math which says it’s right. Or should we use intuition? The 0.8 is in a single line of sql code, so it’s no problem to modify and rerun.


#12    Tangotiger      (see all posts) 2008/08/06 (Wed) @ 09:25

The 0.80 was derived empirically, without much effort.

It really would be a very simple thing for someone to try to trial and error it himself.  Use this equation:
weight(year T): x^0
weight(year T-1): x^1
weight(year T-2): x^2
weight(year T-3): x^3
weight(year T-4): x^4
and so on

Add in a league average performance line for 240 PA.  (I know I’ve said 200 in the past, but I think I use 240.)

Then simply test the best forecast.

So, if we are at Apr 1, 2007, take all of Pujols’ stats in 2006 and weight them at 100%, all his 2005 stats and weight them at x=80%, his 2004 stats at 64% and so on.  Do that for all players.  Then correlate those stats to their 2007 performance lines.

Then, try x=75%, x=70%, x=85%, try with the league average PA of 200, 150, 300.

Try a whole bunch of things, and then see what comes out the closest.  You’ll probably find that 80%/240 works out the best.

For pitchers, I use 70%, and I forget how many IP I put in for the regression component.
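The trial-and-error procedure described above might be sketched like this. It is not Marcel itself: the function names are hypothetical, the player data are randomly generated purely to exercise the code, and the fit is scored by sum of squared errors rather than by the correlation mentioned in the comment.

```python
import itertools
import random

def forecast(history, x, ballast_pa, league_rate):
    """history: list of (pa, rate), most recent season first."""
    num = ballast_pa * league_rate  # league-average line added as ballast
    den = ballast_pa
    for years_back, (pa, rate) in enumerate(history):
        w = x ** years_back  # x^0 for year T, x^1 for T-1, and so on
        num += w * pa * rate
        den += w * pa
    return num / den

def grid_search(players, xs, ballasts, league_rate):
    """players: list of (history, actual_next_rate). Returns (sse, x, ballast)."""
    best = None
    for x, ballast in itertools.product(xs, ballasts):
        sse = sum((forecast(h, x, ballast, league_rate) - actual) ** 2
                  for h, actual in players)
        if best is None or sse < best[0]:
            best = (sse, x, ballast)
    return best

# Made-up players purely to exercise the code; real use needs real seasons.
random.seed(0)
players = []
for _ in range(200):
    talent = random.gauss(0.330, 0.025)
    history = [(random.randint(100, 650), random.gauss(talent, 0.030))
               for _ in range(4)]
    players.append((history, random.gauss(talent, 0.030)))

best = grid_search(players, [0.6, 0.7, 0.8, 0.9], [150, 200, 240, 300], 0.330)
print(best)
```

Because the toy data have no aging or talent change built in, whatever combination wins here says nothing about the real 80%/240 result; the point is only the shape of the search.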


#13    Colin Wyers      (see all posts) 2008/08/06 (Wed) @ 13:38

You, too, Brian? I suspect you’ll do better than me.

On that note, if a bit off topic: do any of you guys have an idea of how to handle the platoon split? For instance, let’s take Mike Fontenot, unassuming Cubs backup second baseman.

My Marcel forecast for Fontenot is .270/.347/.432 currently, .343 wOBA. The problem is that Fontenot is left handed, and has 389 PAs against RHP, versus 78 PAs against LHP. As a full-time player, Fontenot would presumably face more LHP, but I don’t know how to model that reality. Any ideas?


#14          (see all posts) 2008/08/06 (Wed) @ 15:51

This paper suggests HR totals are not normally distributed:

http://www.arthurdevany.com/webstuff/images/HomeRunHitting.pdf

Wouldn’t this suggest we could have more accurate HR projections if we used a bayesian approach instead of regressing to the mean for HRs?


#15    Tangotiger      (see all posts) 2008/08/06 (Wed) @ 16:17

Unless p=.500, I don’t see how anything can ever be normally distributed in its strictest sense.  That technicality aside, I think there should be very little skewness in any of the component rate stats.  For low rate stats like triples and HR, I even think they’d closely follow a Poisson distribution.

The Bayesian approach is always better, but I don’t see how if Teixeira is expected to hit 30.2 HR using RTTM, that the Bayes approach will give you anything other than something like 29.8 HR.

Marcel is intentionally kept simple in the hope that one of you more resourceful guys can trounce him.

I’ll also point to this “article” as an inspiration to anyone who wants to do the same for the component stats:
http://tangotiger.net/talent.html


#16    dq      (see all posts) 2008/08/06 (Wed) @ 18:39

Virtually all baseball stats are not equally distributed, and are right skewed, as the talent shows. If you take the BA for almost any season, there are many more players more than 1 SD below the mean than more than 1 SD above it. For example, for players with 200+ AB in 2007, the overall average is .277 with an SD of .031: 40 players were over .308, 63 players under .246. I tried it with each league since 2000, and it was skewed every year. Drop it to 100 AB, and the numbers become worse.


#17    dq      (see all posts) 2008/08/06 (Wed) @ 18:42

Tango,

In the regression you mention above (#12), should I take the age adjustment into account? If so, does it get changed because we are going so many years back?

Thanks


#18          (see all posts) 2008/08/06 (Wed) @ 21:38

OK, I wrote a query to test the variance at different levels of weighting, with and without regression.

A weight of 1.00 means using career totals to date, with no discounting of past seasons. 0.90 means 90% of the projection comes from past seasons, with more recent seasons weighted more heavily, down to 0.00, which means use last season only, a straight last-year-to-this-year comparison.

http://spreadsheets.google.com/ccc?key=pLg_vfW0QCD9LVK3sm2tVLg&hl=en

Best results in a column are marked in yellow. Anywhere from 0.6 to 0.8 makes very little difference; with regression, the majority are minimized at 0.7.


#19    tangotiger      (see all posts) 2008/08/06 (Wed) @ 23:05

Brian, we have to make a request to see your document.  Expect to see some emails…


#20    dq      (see all posts) 2008/08/06 (Wed) @ 23:07

I just sent a request to see it


#21    MGL      (see all posts) 2008/08/07 (Thu) @ 02:51

Colin, #13, there are many ways to do it, but you simply take his overall projection based on all of his historical stats, making sure that the mean you use to regress everything to is adjusted for his distribution of pitcher handedness faced.  IOW, let’s say that I have a lefty hitter who has only faced righty pitchers.  His weighted historical stats are .900 OPS. What do I regress that towards to establish a projection? Obviously all LHB facing only RHP.  Not all LHB in baseball.

So you come up with an overall projection for him, using his historical stats and the appropriate regression point. Now you take his projection and split it into “versus RHP and versus LHP”.  To do that, you need to do a projection for his platoon ratio or differential (or you can just assume that he has the same as the average LHB).  To do that, you use his sample (historical) platoon ratio and differential and then regress that toward the league average for LHB using his 78 PA versus LHP to determine how much to regress.

Since 78 is so few PA, you are going to regress his own platoon ratio almost all the way to the league mean, so you might as well just use the league mean.

So now you have his projection versus LHP and RHP.  Now you just weight those using whatever ratio of LHP/RHP he is going to face as a full-time player and voila!

As I said, there are other ways of doing it, which will get you to the same place.  That is just one of them, and the way that I do it.
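One way to read the steps in this comment is sketched below. Only the 389/78 PA counts and the .343 wOBA come from the thread; the observed and league platoon differentials, the regression constant for the differential, and the full-time share of LHP are made-up placeholders, and the function `platoon_projection` is hypothetical.

```python
# Sketch of the platoon steps described in the comment.

def platoon_projection(overall, share_lhp_faced, obs_diff, pa_vs_lhp,
                       league_diff, diff_reg_const, share_lhp_fulltime):
    """Rates are e.g. wOBA; diff = (rate vs RHP) - (rate vs LHP) for a LHB."""
    # Regress the observed platoon differential toward the league differential.
    keep = pa_vs_lhp / (pa_vs_lhp + diff_reg_const)
    diff = league_diff + keep * (obs_diff - league_diff)
    # Split the overall projection so it reproduces `overall` at the mix faced.
    vs_rhp = overall + share_lhp_faced * diff
    vs_lhp = overall - (1 - share_lhp_faced) * diff
    # Re-weight with the mix expected as a full-time player.
    full_time = (1 - share_lhp_fulltime) * vs_rhp + share_lhp_fulltime * vs_lhp
    return vs_rhp, vs_lhp, full_time

# Fontenot's PA counts and wOBA come from the thread; other inputs are made up.
share_faced = 78 / (389 + 78)
vs_rhp, vs_lhp, full_time = platoon_projection(
    overall=0.343, share_lhp_faced=share_faced, obs_diff=0.050,
    pa_vs_lhp=78, league_diff=0.030, diff_reg_const=1000,
    share_lhp_fulltime=0.25)
print(round(vs_rhp, 3), round(vs_lhp, 3), round(full_time, 3))
```

The sketch takes the overall projection as already regressed toward the appropriate mean for the mix of pitchers faced, as the comment requires; it does not perform that step itself.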

BTW, I have always said that because the distribution of talent (whatever talent you are projecting) is not nearly normal, using the full Bayesian method is going to be different and better than using a “x/(x+opp)” regression amount equation, since that equation assumes a normal distribution.

The full Bayesian is of course, “What percentage of all players have true HR rates of .001 per PA, .002, .003, ..., .098, .099, etc.”  Then you figure out what the chances are of each of those players doing whatever your player has done, and take the weighted average of those numbers.

For example, let’s say that there were only 3 talent levels of BA among major leaguers.  You are either a .200, a .250, or a .300 hitter.  50% of all players are .200 hitters, 30% are .250, and 20% are .300.

Let’s say that you have a player who hit .310 in 100 AB.

You use the binomial formula to figure out the chances of a true .200 (p = .2) hitter hitting exactly .310 in 100 AB, a true .250 hitter hitting exactly .310 in 100 AB, and the same for a true .300 hitter.

Let’s say that those numbers are .0001, .0002, and .0003.

Now the ratios of the probabilities of your .310 player being a true .200, .250, or .300 hitter are (.0001 * .5) : (.0002 * .3) : (.0003 * .2), or

.00005 : .00006 : .00006, or 5 : 6 : 6.  So there is a 29.4% chance he is a true .200 hitter, and a 35.3% chance each that he is a .250 or a .300 hitter, which means you have to call him a true .253 hitter, even though there is no such thing in our hypothetical.
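The three-level example can be reproduced in a few lines. This sketch uses the illustrative likelihoods quoted in the comment (.0001, .0002, .0003), which are placeholders and not actual binomial probabilities; the variable names are hypothetical.

```python
# Posterior over the three hypothetical talent levels, using the
# illustrative likelihoods quoted in the comment (not real binomial values).
priors = {0.200: 0.5, 0.250: 0.3, 0.300: 0.2}
likelihoods = {0.200: 0.0001, 0.250: 0.0002, 0.300: 0.0003}

joint = {p: priors[p] * likelihoods[p] for p in priors}  # prior x likelihood
total = sum(joint.values())
posterior = {p: joint[p] / total for p in joint}         # normalize to sum to 1
posterior_mean = sum(p * posterior[p] for p in posterior)

for p in posterior:
    print(p, round(posterior[p], 3))
print(round(posterior_mean, 3))
```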

The nice thing about using this true Bayesian method is that you can not only give the mean estimate but you can easily give other estimates like confidence intervals, the chance that he is a .250 hitter or a .200 hitter (or any other number if your distribution of true talent in the population is fairly smooth and not just consisting of 3 values as in my simple example).

Now if your distribution of true talent is smooth (you will have to assume it is smooth, since all you will have is some shape with a mean, median, and SD), in order to do the full Bayesian approach, like I did above when there were only 3 talent levels in the population, you will need a computer to pick out a few hundred (or thousands, or whatever you want - the more the merrier, especially if the distribution is weird-shaped) intervals of possible true talents, like .150, .160, .170, all the way up to .380 or so.  I don’t know how important it is to include true talent levels like .150 or .400, since we don’t really know whether they exist in the population or not.  It probably does not matter that much since the percentage of players in the population at those levels will be so small that the terms that include them will be almost zero.  For example, if a player hits .400 in 300 AB (like a Chipper), even though the “chances of a true .400 batter hitting .400” will be the highest number in your sequence, when you multiply that number by the percent of true .400 hitters in the population (.0001 maybe), that term will be negligible next to the other terms.

The tricky part about doing the full Bayesian method (of estimating true talent from sample performance) is getting the right distribution of true talent.  The rest is easy.  You are merely using the “chances of getting X successes in Y trials given a certain true p” as one of the multipliers in each of your terms, the other multiplier being the percent of that true p in your population as in the example above.
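For a smooth distribution, the same calculation runs over a grid of candidate talent levels with the binomial formula supplying the likelihood. In this sketch the prior is a bell shape centered on .265 with an SD of .025, chosen only so the code runs; any other shape can be substituted in the `prior` list, which is where the hard part described above lives. All names and numbers are hypothetical.

```python
import math

def grid_posterior(hits, ab, talents, prior):
    """Posterior over candidate true rates given `hits` in `ab` trials."""
    coef = math.comb(ab, hits)
    joint = [coef * p ** hits * (1 - p) ** (ab - hits) * w
             for p, w in zip(talents, prior)]
    total = sum(joint)
    return [j / total for j in joint]

# Hypothetical prior: bell-shaped around .265 with SD .025, on a .150-.400 grid.
talents = [0.150 + 0.001 * i for i in range(251)]
raw = [math.exp(-0.5 * ((p - 0.265) / 0.025) ** 2) for p in talents]
prior = [r / sum(raw) for r in raw]

post = grid_posterior(31, 100, talents, prior)  # a .310 hitter in 100 AB
mean = sum(p * w for p, w in zip(talents, post))

# The cumulative posterior gives interval endpoints, e.g. a central 90% range.
cum, lo, hi = 0.0, None, None
for p, w in zip(talents, post):
    cum += w
    if lo is None and cum >= 0.05:
        lo = p
    if hi is None and cum >= 0.95:
        hi = p
print(round(mean, 3), round(lo, 3), round(hi, 3))
```

The posterior list also answers questions such as the chance that the player is at least a .300 hitter, by summing the entries at or above that grid point.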

I am not sure how to derive that distribution of true talent.  Up until now, we have always assumed it was a normal curve with an easily computable mean and a SD which is fairly easy to compute (using the formula “the observed variance is the variance of true talent plus the random variance”).  But if it is not nearly normal, how do we come up with the exact distribution of talent?  I’ve never been able to solve that one.  If our estimate of the exact distribution of true talent is anything close to normal, then we defeat the purpose of doing a full Bayesian and we might as well use the simplified regression equation x/(x+opp), where opp is the number of opportunities, like the 100 AB above, and x is a constant gotten from analyzing the observed data, like 600 for BA or 200 for OPS, or whatever it is.
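The variance formula quoted above leads directly to the constant x. A minimal sketch for a binomial rate stat, assuming every player in the sample has roughly the same number of opportunities; the function name and the input numbers are hypothetical.

```python
# Regression constant from "observed variance = true variance + random variance",
# for a binomial rate stat where every player has roughly n opportunities.

def regression_constant(mean_rate, observed_sd, n):
    random_var = mean_rate * (1 - mean_rate) / n   # binomial noise at n opportunities
    true_var = observed_sd ** 2 - random_var       # what is left is talent spread
    if true_var <= 0:
        raise ValueError("observed spread is no larger than random noise")
    return mean_rate * (1 - mean_rate) / true_var  # x in x/(x + opp)

# Hypothetical inputs: league rate .270, observed SD .030 among 400 AB players.
x = regression_constant(0.270, 0.030, 400)
print(round(x), round(x / (x + 400), 2))
```

With this x, the fraction x/(x+opp) at opp = n equals random variance divided by observed variance, which is the share of the observed spread that is noise.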


#22          (see all posts) 2008/08/07 (Thu) @ 03:23

I didn’t see the “Publish” button - it should be OK now. I’ll check my inbox to approve any existing requests.


#23          (see all posts) 2008/08/07 (Thu) @ 05:44

Colin/#13 - I ran into a similar problem when dealing with park factors. I used Retro pbp, broken down by handedness of batter, among other things. So now I have factors for RHB, and factors for LHB. What do I do for switch hitters?

I agree with MGL, but this is how I would explain it - in either the park factor normalizations or the platoon splits, treat him as two different players, grouped by hand, when doing the calculations. As Fontenot only has 78 PA this year vs LHP, this is a good example for getting as much data as possible, including MLEs, to build the sample size. Regress, then combine the separate lines into one player, weighting by your best guess at the playing time vs each type of pitcher.

An example of why to regress to what you know about a player - to do the projection test, I added the regression formula to the projections query, but used data for a major league avg batter. There are pitchers in my batting table, and now they project as .240 hitters. I know they are pitchers, so therefore I know that they should be expected to hit .150, until the record proves otherwise.


#24    MGL      (see all posts) 2008/08/07 (Thu) @ 07:00

Brian, your posts got split up into 2 threads, I assume by accident.  I see that you partially explain what .9 or .7 means in terms of weighting, but your explanation is not complete, at least to me.  Please give an example.

Colin you definitely do NOT want to treat a player’s splits as 2 different data sets and regress each one separately, if that is what Brian is suggesting.

The reason is that one informs the other which allows you to not regress each one so much.  For example, let’s say that there were 100 different platoon splits, rather than just 2.  And let’s say that a player had 1 PA per platoon split and a total of 100 PA.  If you did them separately, you would end up using league means for each one and the player would have a league mean projection overall.  Obviously that is not right. Since he has 100 PA, you don’t want to use a league mean although obviously you will regress the sample performance in 100 PA a lot towards a mean, depending on what you are measuring of course.

For Fontenot, since he only has 78 PA versus LHP, you would end up regressing his BA (or whatever) in those 78 PA almost 100% toward the mean of all LHB versus LHP.  However, let’s say that his 500 or so PA versus RHP were very poor, in the .200 BA range.  Well, we suspect that he is a poor batter, given his BA versus RHP.  So you would want to use that information (that he is a poor batter) in projecting him versus LHP also.  The way to do that is the way that I describe.

So you definitely want to do it the way I describe, or some other way which is the equivalent (that gives you the exact same result).


#25          (see all posts) 2008/08/07 (Thu) @ 07:13

MGL, I see what you are saying. I am aware of this effect of accumulation of regressions. I will think about this again.

My projections don’t use regression until the final step, after the data has been weighted, etc.

If you collect as much data on the player as possible, then you can minimize the effect regression has on the final answer. Find more than 78 PAs by going into prior seasons and into minor league data.


#26    dq      (see all posts) 2008/08/08 (Fri) @ 09:21

Tango

15/21 - Is the reason that the normal distribution works that the population is really the opportunities times talent (Chart 6 of Tango’s Talent http://tangotiger.net/talent.html), and not the talent (Charts 1-4)?

Marcels (and most projections systems) is/are computing the mean by weighting it based on pa. (league ba is based on total ab) They are computing the variance by weighting the results by the number of pa or abs etc.

I “think” the reason Marcels works is because it really is testing Chart 6 - for which the rules/laws of normal distribution applies.

If you were to not weight the results by pa,and take a simple average, then I think you would run into whatever difficulties you get from a right-skewed population. Because of Chart 6, if you test based on talent times opportunity, you get a situation where you can use the normal distribution.


#27    Tangotiger      (see all posts) 2008/08/08 (Fri) @ 09:34

dq/26: agreed.

Lucky Marcel never bought into all that right-tail business, since he figured out that talent x playing time gives you normal-ish results.


#28    dq      (see all posts) 2008/08/08 (Fri) @ 09:44

Does that mean Marcels is really a chimp, since he doesnt have a tail, and not a monkey?


#29    MGL      (see all posts) 2008/08/08 (Fri) @ 10:28

I did not know that chimps have no tails and monkeys do, but now that I think about it…

I don’t think that is right!  When you are doing the regressions, you can’t assume, for example, that since a player played 500 PA per year, that he comes from a population of 500 PA per year players!  The reason he played 500 PA (or ONE of the reasons at least) is because he played well.  If a player played only 100 PA, it is because, most likely he played poorly, or is perceived as (or is historically) a poor player.

In doing the regressions (in fact, that is the whole point) you must assume that you have no idea whether a player is a good player or a poor one (other than if that information comes from something other than his stats), so you cannot use his playing time to make any inferences about his skill level, assuming that his playing time is significantly related to his prior stats!

Now, if playing time were based on scouting or something other than past performance, then we might be able to use playing time as a proxy for which population a player comes from.  Especially for established players - but that would only be IF, for example, a player gets a “reputation” for being full or part-time based on his first few years’ (and his minor league) performance, and then that reputation stays intact, regardless of how he performs after that.  For example, let’s say that we had 2 players with a .700 OPS after 3 years of MLB service.  Now they are relegated to part time duty.  One player posts an .850 OPS in the next 2 years and the other another .700.  If both players still get the same playing time in the next year, then we are OK.  If the .850 player gets more time, then we have problems Houston…


#30    Tangotiger      (see all posts) 2008/08/08 (Fri) @ 11:05

I don’t see any problem with regressing player’s performances based on the number of PA.  Puan Jierre for example would regress better than Cndy Ehavez or Bichael Mourn, even if we “know” all three are really the same.  That’s simply one mistake we accept in return for the better results we get in the other cases.

You could do the same for positions, like catcher or SS, so that if you have two identical hitters at SS and RF, the RF will regress better, even if we “know” that Rlex Aodriguez is not the prototypical SS, and in fact has the body type and mannerisms that would suit him to RF.

Now, if we had the Fans Scouting Report, and we have the height and weights and handedness, then we really wouldn’t need the player’s fielding position, since it’s crazy to regress ARod differently based on whether he is playing SS or 3B!

It’s crazy to regress Pierre or Chavez differently based on whether a manager is too stupid to tell the non-difference, and making Pierre a regular and Chavez a 4th OF.

Overall, however, you will almost certainly get a better result if you include past PA in the regression.


#31    dq      (see all posts) 2008/08/08 (Fri) @ 11:12

Chimps are apes, which have no tails. As a kid, I looked at volume A of the encyclopedia more than the other letters. (Im old, we used to have the bound set of encyclopedias) - I guess selective sampling makes me more knowledgeable about apes than zebras.

Im not saying the player came from a population of 500 pa players; Im saying the population you are testing is of plate appearances, not players.

We are taking the mean by summing all the pa * results / pa. So it is a mean of the plate appearances.

We are calculating variances and stdev by weighting by pa- which is the same as summing pa. So, I think it is the variance/stdev of plate appearances.

Im thinking we are using the TangoTalent graph 6 as our population.

If we were using just talent, then I dont think we should weight the population by pas - By doing that you are giving 3 times more weight to a 600 pa player than a 200 pa player - Im sure there is a statistical reason to weight the 600 pa player more due to confidence level, but it shouldnt be 3 times more.
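The distinction being drawn here, between statistics of plate appearances and statistics of players, can be sketched as below. The four-player sample is made up and the function names are hypothetical.

```python
import math

def weighted_stats(players):
    """players: list of (pa, rate). Returns PA-weighted mean and SD."""
    total_pa = sum(pa for pa, _ in players)
    mean = sum(pa * rate for pa, rate in players) / total_pa
    var = sum(pa * (rate - mean) ** 2 for pa, rate in players) / total_pa
    return mean, math.sqrt(var)

def simple_stats(players):
    """Every player counts once, regardless of playing time."""
    n = len(players)
    mean = sum(rate for _, rate in players) / n
    var = sum((rate - mean) ** 2 for _, rate in players) / n
    return mean, math.sqrt(var)

# Hypothetical sample: regulars hit better than part-timers.
players = [(600, 0.300), (600, 0.280), (200, 0.240), (200, 0.220)]
w_mean, w_sd = weighted_stats(players)
s_mean, s_sd = simple_stats(players)
print(round(w_mean, 3), round(w_sd, 4), round(s_mean, 3), round(s_sd, 4))
```

When playing time tracks performance, as in this toy sample, the PA-weighted mean sits above the simple average and the weighted spread is narrower, which is the effect under discussion.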

I am not addressing playing time versus talent at the moment, I am only trying to figure out why Marcels works, and if you can improve the monkey/chimp.


#32          (see all posts) 2008/08/08 (Fri) @ 14:34

MGL/21- can you use Bayes’ theorem for something like WAR.  Ex- say a player performs at 2 WAR over 500 PA and you know the distribution of WAR among players.  Could you use Bayes’ theorem in that example to find his true WAR?

