14 March 2017

Step 1: Find a Box. Step 2: Is Chris Davis in that Box?

Over the past few weeks, Patrick Dougherty and I have been throwing lineup optimization regression models at you. I introduced the decade-old lineup model that identifies run value on a positional basis, and then a new model that identifies runs batted in on a co-dependent positional basis. Patrick then did a thorough best fit of the new model and found that Chris Davis leading off was the best iteration of the starting nine we evaluated. However, Chris Davis is plainly an atypical leadoff hitter, so how atypical is he?

A best fit line on a scatter plot is an easy visual to understand. You can observe the range of data on the x-axis and on the y-axis. You have a decent handle on whether a new data point falls within that range or is an exceptional outlier. Intuitively, the more extreme an outlier a data point is, the more your concern rises about whether the model can realistically handle it.

David Freedman, a statistician, is known in some circles for his cheeky Conservation of Rabbits Principle. He states that in order "to pull a rabbit from a hat, a rabbit must first be placed into the hat." In other words, a model outcome that goes beyond what was put into the model should be seriously questioned, so let us explore Chris Davis as a leadoff hitter.

We shall ignore the seven, eight, and nine hitters.  A quick glance over them shows us that they are reasonable bottom third lineup hitters.  Chris Davis at the top of the order feels a bit more peculiar.  The model considers walk rate, strikeout rate, doubles rate, and home run rate.  Those metrics were the most relevant based on significance testing.

Chris Davis is projected by ZiPS to walk in 11.6% of his plate appearances. Of the 300 team entries for leadoff hitters over the past ten years, 15 are within 10% of Davis' projection. Overall, that rate would rank 17th best, on par with the excellent walk rates put forth in 2007 and 2008 by the Orioles' own Brian Roberts. A top ten percent walk rate certainly stretches the model, but it stays within the boundaries set by the data. With doubles, Davis is well within the range of the model's data with his 3.7% projected rate. That rate, however, is not very impressive among the data points in the data set: he would rank 248th out of 300.
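To put those ranks on a common scale, here is a minimal Python sketch that converts a rank into the share of the data set at or above it. The function name `top_share` is my own, and the only inputs are the ranks and the 300-entry count quoted above; it is an illustration, not the code behind the model.

```python
def top_share(rank: int, n: int) -> float:
    """Percent of the n entries that sit at or above the given rank (1 = best)."""
    if not 1 <= rank <= n:
        raise ValueError("rank must be between 1 and n")
    return 100.0 * rank / n


# Ranks quoted in the text, out of 300 leadoff entries.
walk_share = top_share(17, 300)      # Davis' projected walk rate
doubles_share = top_share(248, 300)  # Davis' projected doubles rate

print(f"Walk rate: top {walk_share:.1f}% of leadoff entries")
print(f"Doubles rate: top {doubles_share:.1f}% of leadoff entries")
```

A small share (walks) means the projection sits near the edge of the data but still inside it; a large share (doubles) means it sits comfortably in the lower-middle of the pack.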

Davis also has a projected 33.6% strikeout rate. That is off the model's radar. As noted, the model has 300 team entries, and the highest rate belongs to the 2016 Brewers at 26.5%. Davis' rate would be roughly 27% higher than that. Davis is also projected to have a very impressive 6.6% home run rate. That is about a 30% increase over the next closest number, which belongs to the 2016 Twins. With respect to these two metrics, we are in a region where the model is not well equipped to use the information.
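The strikeout comparison can be reproduced from the two figures quoted above. This is a small sketch, with a helper name (`percent_over_max`) of my own choosing, that reports how far a new value sits beyond the largest value the model was fit on; anything above zero means the model is being asked to extrapolate.

```python
def percent_over_max(value: float, observed_max: float) -> float:
    """Percent by which value exceeds observed_max (0.0 if it is inside the range)."""
    if observed_max <= 0:
        raise ValueError("observed_max must be positive")
    return max(0.0, 100.0 * (value / observed_max - 1.0))


# Figures quoted in the text: Davis' projected strikeout rate and the
# highest leadoff strikeout rate in the data set (2016 Brewers).
davis_k_rate = 33.6
max_k_rate = 26.5

excess = percent_over_max(davis_k_rate, max_k_rate)
print(f"Davis is {excess:.1f}% beyond the highest strikeout rate in the data")
```

Run on the strikeout numbers, the gap comes out a little under 30%. The same helper applied to his 11.6% walk rate would return zero for any data set whose maximum is at least that high, which is the difference between stretching the model and leaving its range entirely.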

What about the aforementioned Mark Trumbo?  For home runs, he is 10% over the extent of the data in the model.  His doubles are right smack dab in the middle.  His strikeout rate would be third worst in this dataset.  His walk rate would also be in the middle.  As a whole, we should feel more comfortable with Trumbo's projection as a leadoff man than Chris Davis', but both are so unconventional that a regression model like this might be extrapolating effects beyond where we should feel comfortable.

The lesson here really should extend beyond the exercise Patrick and I have been performing. It is important to understand causality and the limits on our ability to determine what exactly causes anything else. I would think we all agree that induction is useful for getting a better grasp on causation, but we must be transparent and acknowledge the uncertainty involved in our methods of induction.

When we put forward such unconventional answers in well-trodden fields, we must note that we have extended ourselves beyond practiced reality. True, this extrapolation may one day be shown to be correct, but for now it is more a leap of faith than any sober trust in our methods. And that is really the crux of it. When our universe is limited to what we have experienced, our intellectual foundation beyond that scope is weak. No, I do not think Trumbo or Davis is an ideal leadoff man, but I would suggest that it is a perfectly good hypothesis that they might well be.

With tens of millions of dollars at play, though, I doubt we will ever be able to fill in our data set.


Patrick Dougherty said...

One of my concerns with evaluating the validity of our lineup recommendations against 10 years of lineups is that most managers are going to hedge and follow accepted rules of thumb when setting batting orders. It wasn't that long ago that we might have looked at a batting order led by a low-average hitter and scoffed that he rarely recorded a hit or often failed to put the ball in play, all while ignoring his very high walk rate. That doesn't make us wrong or right. Like you said, Davis and Trumbo are probably not ideal leadoff men, but they might be fine leadoff men, and to dismiss them out of hand in this era of statistical innovation would probably be shortsighted.

Also, can a guy get some standard deviations here? I want to know what percentile Davis' batting rates fall in for leadoff men!

Steve Coyne said...

If Chris Davis bats leadoff, either he would tend to receive more fastballs, which he hits more often and with more authority, or pitchers might pitch around him more, resulting in even more walks (which would raise his high on-base average even more). If Chris Davis bats leadoff, he would average more plate appearances per game. With more plate appearances, he would accumulate more home runs, doubles, singles, and walks. With his typically higher strikeout rate, he would see more pitches (which tires the opposing pitcher faster and ultimately helps his team) and ground into fewer double plays. His excellent speed and baserunning allow him to score even more runs and ground into fewer double plays as well. Chris Davis is a very good clutch hitter: he often leads the team in driving in runs that tie the game or put his team ahead. Considering all of these points, I would fully support having Chris Davis as our leadoff hitter!

Roger said...

Another statistical blip worth considering is that the O's seem to have a lot of solo home runs, presumably because of poor OBP at the top. If batting Davis 4th is just going to produce a lot of solo home runs, then I would think that having him lead off would not reduce his RBI count that much (even if we don't consider RBIs to be a significant statistic), but it might increase his runs scored count. If the O's tend to hit HRs up and down the lineup, it would seem to make sense to put the high OBP HR hitters at the top (Davis, Machado, Smith) and the lower OBP HR hitters in the middle (Trumbo, Jones), and you could improve Davis' ability to drive in runs by putting some good OBP types at the bottom (Kim, Rickard).