28 February 2017

Mark Trumbo is the Ideal Orioles Leadoff Hitter

Mark Trumbo - Ideal Leadoff Hitter
In the early days of this current era of data science, one managerial choice that would cause ire was the batting order.  Modelers would use newly appreciated existing metrics, on base percentage and slugging, to regress lineup position against total team runs scored.  That was based on work by past luminaries in the field, such as Cyril Morong, Tom Tango, Ken Arneson, and Ryan Armbrust.  This enabled fans to figure out what was the best lineup.  You can use that tool for yourself here.

Here, we will use that approach to assess the Orioles.  As a proof of concept, we shall do something simple.  Let's assume that Welington Castillo, Chris Davis, Jonathan Schoop, Manny Machado, J.J. Hardy, Hyun-soo Kim, Adam Jones, Seth Smith, and Mark Trumbo would play every game and that their performance would be in line with 2017 ZIPS projections.  We can plug in their projected OBP and SLG to find out what lineup would be best for the Orioles.  The tool finds two lineups producing equal value and above all other lineups:

LF Hyun-soo Kim
1B Chris Davis or 3B Manny Machado
C Welington Castillo
DH Mark Trumbo
3B Manny Machado or 1B Chris Davis
2B Jonathan Schoop
CF Adam Jones
SS J.J. Hardy
RF Seth Smith

In general, parts of the lineup make sense and other areas are rather curious.  Kim leading off makes sense because the tool values leadoff men who do not make outs and, according to ZIPS, he will have a .370 OBP.  That sets the table for the batters following.  Davis or Machado following him makes sense because you want to maximize your chances of being able to score this OBP-abled Kim.  Castillo as the third hitter seems questionable, but this model acknowledges that league data shows that the third hitter in the lineup faces remarkably fewer RBI situations than hitters in the second, fourth, or fifth slots.  The rest of the lineup makes some traditional sense.  You can read up more about this kind of lineup optimization here.

Perhaps what is more interesting about the above lineup tool is that the difference between the projected best lineup and worst is 59 runs.  That difference of six to seven wins is large, but less so when you consider the worst fathomable lineups are something no manager would ever do.  Meanwhile, the best lineups are quite close to what we traditionally envision.  For instance, the worst projected lineup is one with Davis, Machado, and Trumbo filling out the final three slots in the batting order.  It would never occur to Buck to arrange his hitters like that.  So, the major take home message for all has been, in effect, lineup order rarely matters because a manager's lineup is usually incredibly similar to what this tool projects to be the best lineup.

Now, I think there are obvious problems with this tool.  By using a league wide population as a data set and then applying regression, we are assuming that each batter in each lineup position exists separate from other batters.  What I mean is that Manny Machado in this tool does not have Kim in front of him and Castillo behind him.  Machado, instead, follows the league average leadoff hitter and is followed by a league average third slot hitter.  This lack of connectivity between players is an issue.  Yes, ideas like lineup protection are poorly evidenced, but I am more referring to how hitter ability improves run scoring chances.  This makes sense.  If you have an elite OBP generator in front of you, your lineup position is potentially more productive than the league average lineup position.  Overall, that may have great impact.

With that in mind, I decided to create a new tool and run a different regression model.  This model did not rely on OBP or SLG.  Those metrics were strangely revolutionary over a decade ago, but they have their limitations.  They encapsulate a great deal of information, including different skills that may be useful in different scenarios.  Instead, I regressed event rates of walks, strikeouts, and various batted ball results against Runs Batted In minus Home Runs (based on the assumption that home run RBIs of the batter were lineup independent).  Each lineup position took into consideration the performance of that player, but also the players who bat before that player.  The data set I used was league wide and by team from 2007 to 2016.
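As a rough illustration of the data layout described above, here is a minimal Python sketch of how one regression row might be assembled for a single lineup slot.  The event names, the choice of three preceding batters, and the function names are all assumptions made for illustration; the model description above does not specify them.

```python
# Hypothetical event-rate keys. The post only says "walks, strikeouts, and
# various batted ball results" without listing the exact events.
EVENTS = ("bb", "k", "single", "double", "hr")


def build_row(lineup_rates, slot, n_prev=3):
    """Return the feature list for one lineup slot.

    lineup_rates: list of dicts (one per batting slot, in order) mapping
                  each event in EVENTS to a per-plate-appearance rate.
    slot:         zero-based batting slot of the hitter being modeled.
    n_prev:       how many preceding hitters to include (an assumed value).
    """
    n = len(lineup_rates)
    # The hitter's own event rates come first.
    row = [lineup_rates[slot][e] for e in EVENTS]
    # Then the rates of the hitters batting ahead of him, nearest first.
    for back in range(1, n_prev + 1):
        prev = lineup_rates[(slot - back) % n]  # wrap around the order
        row.extend(prev[e] for e in EVENTS)
    return row


def target(rbi, hr):
    """Regression target as described in the post: RBI minus home runs."""
    return rbi - hr
```

One row like this per team, per season, per lineup slot would then feed whatever regression routine is preferred.  The wraparound reflects that the leadoff hitter, after the first inning, bats behind the bottom of the order.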

Using this approach frees us from only considering a player by a context-free lineup position.  Once I developed the formulas for each batting position, I compared the expected runs to actual runs, which resulted in a trendline fit with an R² of 0.84.  I wondered how well the model would work if each lineup position was normalized, and that version wound up with an R² of 0.68.  In other words, accounting for lineup order was a major factor in improving the fit between expected runs and actual runs.
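For readers who want to check a fit statistic like the one quoted above, a trendline R-squared between expected and actual runs can be computed as the squared Pearson correlation.  This is a minimal sketch; the function name is my own and the sample numbers are made up, not the team data behind the figures in this post.

```python
def trendline_r2(expected, actual):
    """Squared Pearson correlation, i.e. the R^2 of a straight-line fit."""
    n = len(expected)
    mean_x = sum(expected) / n
    mean_y = sum(actual) / n
    # Co-variation of the two series and the variation of each one.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(expected, actual))
    sxx = sum((x - mean_x) ** 2 for x in expected)
    syy = sum((y - mean_y) ** 2 for y in actual)
    return (sxy * sxy) / (sxx * syy)


if __name__ == "__main__":
    # Made-up team run totals, purely to show the call.
    expected_runs = [650, 700, 720, 760, 810]
    actual_runs = [640, 715, 700, 775, 800]
    print(round(trendline_r2(expected_runs, actual_runs), 2))
```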

At this point, we can go back to that original dataset of nine Orioles hitters.  Remember, this is a concept piece, so we should not take this exercise as how many runs the Orioles will score or even that this lineup is universal and invulnerable to handedness.  Instead, we should merely look at this as a simple exercise to see where the different kinds of production appear to fit best using this lineup position model.

In this post, my limited coding know-how leaves me unable to create a computer program to figure out the best lineup.  Therefore, I decided to go about this using some knowledge about where certain players might fit best (until Patrick Dougherty finishes the build and runs the model, which will be a later post).  I began with the assumption that Chris Davis is ideally the cleanup hitter.  From there I took the other eight hitters to see who increased his value the most in the three spots ahead of him.  What I found is that Davis has the most expected RBIs if Seth Smith, Hyun-soo Kim, and Manny Machado batted in front of him.  He would stand to see 87 RBIs in addition to his 46 HR RBIs (over the course of 162 games played).
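For what it is worth, the exhaustive search I could not build here is conceptually small.  The Python sketch below tries every batting order and keeps the best one.  The scoring function is a toy stand-in that I am assuming for illustration (drive-in rate times the on-base rates of the three hitters ahead, with a small made-up discount for later slots); it is not the fitted model from this post, and the hitters and rates in the demo are invented.

```python
from itertools import permutations


def best_lineup(players, score):
    """Try every batting order and return (best_order, best_score)."""
    best_order, best_value = None, float("-inf")
    for order in permutations(players):
        value = score(order)
        if value > best_value:  # strict comparison: first order found wins ties
            best_order, best_value = order, value
    return best_order, best_value


def make_toy_score(on_base, drive_in):
    """Build a stand-in scoring function (NOT the post's fitted model).

    Each hitter is credited with his drive-in rate times the summed on-base
    rates of the three hitters ahead of him, wrapping around the order.
    The 0.02 per-slot discount is an assumed nod to fewer plate appearances
    lower in the order.
    """
    def score(order):
        n = len(order)
        total = 0.0
        for i, name in enumerate(order):
            ahead = sum(on_base[order[(i - k) % n]] for k in (1, 2, 3))
            total += (1.0 - 0.02 * i) * drive_in[name] * ahead
        return total
    return score


if __name__ == "__main__":
    # Made-up rates for five hypothetical hitters, not real projections.
    on_base = {"A": 0.37, "B": 0.34, "C": 0.33, "D": 0.31, "E": 0.30}
    drive_in = {"A": 0.08, "B": 0.12, "C": 0.15, "D": 0.18, "E": 0.10}
    order, value = best_lineup(list(on_base), make_toy_score(on_base, drive_in))
    print(order, round(value, 3))
```

With nine hitters there are 362,880 possible orders, which is slow by hand but manageable for a program as long as the scoring function is cheap.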

I then moved on to Manny Machado in the third slot, which goes against the rationale of the lineup optimization perspective that began this article.  While Smith and Kim were a good one-two punch before Machado, a little shifting around of names found a far better solution with minimal impact on Chris Davis' projected RBIs.  The result was fairly surprising in that the model appears to think that the best one through four for the Orioles is Trumbo, Smith, Machado, and Davis.  With great certainty, I can tell you that this model is the only thing on this Earth that has suggested that Trumbo should lead off.

Before revealing the rest of this "ideal" lineup, let me explain some things about run opportunities.  Trumbo leading off does make some sense in that each position in a lineup is greatly dependent on the abilities of those who come before the player.  For instance, if you are a cleanup hitter then you will not exactly want a great OBP player leading off.  Why?  A good OBP player leading off will let the inning go to the second and third hitters.  Past the first inning, that leadoff hitter stands a good chance of batting when the worst batters in the lineup have hit right in front of him and likely were turned into outs.  Those second and third hitters that follow the leadoff hitter will also become outs the majority of the time.  This means that there is a great chance of the inning ending and the cleanup hitter coming up as the first or second batter with no one on base.

That makes sense, right?  You want to maximize the batters on base immediately before your best base clearing hitter, but also isolate them enough from inferior hitters who rack up outs and put the base clearing hitter in scenarios where there is nothing on base to clear.

Well, the next question is why have such an extreme home run hitter batting first and not fifth to clean up what Davis cannot get to?  The reason against that is that Davis does two things really well: (1) knocking in base runners with a lot of home runs and (2) racking up a lot of strikeouts, which end innings.  This means Trumbo would have to contend with a player who will often clear the table by home run or strikeout.  With a strikeout, the inning ends or players do not move up a base.  That decreases run opportunities.  There is a logic there that the model is expressing.  It is possible that by putting a secondary base clearing threat at leadoff, you give him more plate appearances to knock himself in while also making the most of a poor situation in which the poor hitters at the bottom of the order rack up outs.

After some more tinkering, the final model projection is:

DH Mark Trumbo
RF Seth Smith
3B Manny Machado
1B Chris Davis
CF Adam Jones
2B Jonathan Schoop
LF Hyun-soo Kim
C Welington Castillo
SS J.J. Hardy

In the end, this lineup looks like a wholly reasonable lineup if the only thing you did was flip Trumbo and Jones.  That flip will often be made due to the belief in speed needing to be in the leadoff position, which might be a questionable conviction.  The Trumbo leadoff model suggests a 162 game production of 834 runs, while a Jones leadoff model nets 827 runs.  Seven runs, so not that big of a deal.

What is interesting is if one flips Seth Smith with Mark Trumbo.  A simple flip of the first two batters while leaving everyone else the same.  Run production drops from 834 to 797.  Thirty-seven runs.  That seems very drastic to me.  Very, very, very drastic.  In the traditional data model above, a flip of two players would result in a very minor change in run production.  Is that because it would literally result in a minor change of run production, or is it because the flip assumes all positions are context neutral to that position?

One other lineup to test would be this one: Kim/Smith/Machado/Davis/Trumbo/Jones/Schoop/Castillo/Hardy.  This is a very generic, normal lineup.  How is it viewed? 782 runs.  Here we have an "ideal" lineup generating 834 runs and a perfectly normal lineup getting dropped to 782 runs.  That spread is nearly equal to what the traditional model thinks the difference is between the best and worst lineups possible.
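To keep the comparisons straight, here is the arithmetic on the run totals quoted in this post, written as a short Python snippet.  The labels are my own shorthand; the numbers are the ones given in the text.

```python
# Projected 162-game run totals quoted in this post.
runs = {
    "Trumbo leadoff (ideal)": 834,
    "Jones leadoff": 827,
    "Smith and Trumbo flipped": 797,
    "Generic lineup": 782,
}

ideal = runs["Trumbo leadoff (ideal)"]

# Runs lost relative to the model's ideal order.
gaps = {name: ideal - total for name, total in runs.items()}

# Best-to-worst spread reported by the traditional OBP/SLG tool.
traditional_spread = 59

# Share of that spread covered by the ideal-versus-generic gap.
share = gaps["Generic lineup"] / traditional_spread

for name, gap in gaps.items():
    print(f"{name}: {gap} runs behind ideal")
print(f"Generic gap as share of traditional spread: {share:.2f}")
```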

It may well be that a useful lineup optimization tool needs to consider chained production, linking the players in the lineup into a greater entity, rather than treating each player's talent as independent of the others by assuming he is surrounded by league average talent and abilities.

I am unsure whether I truly believe this, but, after several days of hammering it, I am at a loss as to what I might not be considering.  Have we really neglected the importance of lineup construction because of a simple overly normalized lineup tool presented over a decade ago?

10 comments:

Roger said...

At least the last lineup, with Jones and Trumbo switched as you suggested, would be something that Buck might use. The first lineup you suggested (with Machado in second) is also not a bad idea to try. Considering the O's hit as they do (K's and Hr's) and there were an awful lot of solo shots last year, a statistically predicted lineup seems like a good way to get an instant boost in production.

Jacob W Smith said...

Given that your model is based on regression analyses of individual events, I'm guessing that it doesn't have a way of counting outs? Granted your model fit surprisingly well to the league run-scoring data, but it does seem like an obvious weakness and could explain how it finds something "optimal" with the team's highest-OBP player batting in the bottom third and a guy with a projected OBP of .307 leading off. The traditional model for lineup optimization basically just convolutes effectiveness with number of opportunities, admittedly ignoring correlated events (i.e., hits and walks might be much more similar outcomes for players batting behind Trumbo and Schoop, who don't spend a lot of time on first and second, than for players batting behind Kim). But you may have moved too far in the opposite direction and miss major game trends (i.e., you put high OBP guys at the top of the order and you will get more PAs).

Jon Shepherd said...

I think what the model says is that a player is more than his OBP and Kim's ability to get on base along with his other abilities are more important elsewhere.

The whole model is based on maximizing RBIs, which indirectly values players who score runs.

The out issue is indirectly included in the model.

Real weakness of the model is that Trumbo in the leadoff slot is a clear projection outside of the available data because teams do not bat a guy there.

Jacob W Smith said...

This was basically my interpretation too. But by handling everything on an individual basis the model inherently ignores long-term opportunity costs. Obviously Trumbo not getting on base impacts the guys batting behind him in the lineup and their opportunity to drive in runs. But it also shortens the game and ultimately costs everybody in the lineup a little more on top. If we assume that each lineup position has an equal chance to come up last in a game, swapping the guys at 1 and 7 in the order with .063 points of difference in OBP shortens the season by roughly 7 outs. That's not a huge number, but it's a few runs, and it highlights an inherent weakness of considering only individual contributions, particularly when the model is based on a regression analysis of a fairly small data set.

At first glance 10 seasons of 30 teams, thousands of individual player-seasons, feels like a big sample. But when you consider the number and range of the variables involved - including those that aren't included in the model - it really becomes quite small. Your fitting assumes that the model is reliable everywhere, but the number of guys with a specific statistical profile like Trumbo's is quite small. In fact, the sample of players from 2007-2016 with 500 PAs and within 5 points of Trumbo's projected AVG and OBP and 10 points of his projected SLG contains exactly 2015 Todd Frazier. When you further break it up into individual events you see that Frazier's HR rate definitely lags behind Trumbo's, while his doubles rate is substantially higher. If you expand the sample down to 0 PAs, you can add 2011 Chris Heisey, who is a better statistical match for Trumbo. So our reasonable comparison group for Trumbo is 1 or 2 guys, with 1 or 2 sets of teammates. RBI is a stat with huge error bars. How good of an idea can we possibly have how somebody with Trumbo's statistical profile really interacts with different types of players to produce runs? It's not just Trumbo in the leadoff spot that creates projection issues. A few thousand player seasons in a sport with massive statistical variance leave Trumbo and many others with a limited or nonexistent data set entirely.

After further thought, I'm not surprised that the regression fitting was good. If you add enough parameters you can make a model fit any data well. In such a situation it doesn't necessarily imply predictive power. Another big concern I have with this model is the apparent lack of consideration for SB/CS. Guys who steal bases - even with a success rate below the break-even point - tend to score more runs than statistically similar players who don't steal. By correlation, guys who bat behind them would tend to have more RBI. Again, given the scarcity of data, such factors could really skew the model.

It seems to me that you could obtain a similar type of model with much more predictive power through a Monte Carlo analysis initiating innings with random ordering of batters and statistical outcomes of at-bats and see how runs are scored. To do an even better job have a statistical distribution of "pitcher skills" - include logic to massage hit, walk, K, and HR rates in a manner consistent with the variability in pitching. From such an exercise you could ostensibly extract ordering of players that correlates with greatest run-scoring for this team specifically and with a much better-developed data set for the players involved. Obviously there are some people who will take issue with it not being grounded in a "real" data set, but most of those people aren't going to trust a lineup produced from a regression analysis either.

Jon Shepherd said...

My whole reply is maybe. I think it is an assumption that more PA over a year means more runs, but I think it is equally valid that ordering can produce more runs than having more players come to the plate. I recognize this goes against real baseball data conclusions, but lineup quality obviously has importance.

So, I think your point is possibly valid, but I do not find that the concept is more compelling. From a result standpoint, I comprehend the contention.

Jon Shepherd said...

Any desire to write up your concept for post here?

vilnius b. said...

Sounds like a great idea to me! Mr. Smith seems like he'd be a great contributor to this site (or any serious baseball site).
I'm not a statistician and it's been a long time since I've taken a stats course, but I'd certainly welcome seeing him contribute his analyses not only to the question of how best to order the lineup but to other baseball topics that involve statistical analysis.

Jacob W Smith said...

Jon, I can try to work on it, but I can't promise anything. In order to write anything truly interesting I'd have to run the Monte Carlo analysis. Unfortunately I'm a VERY rudimentary level coder, so I'm not 100% confident that I can build a baseball sim that will spit out sufficiently predictive data to be interesting. I should have time to work on something like that next week, so I'll try to think about what a basic sim might look like in the meantime.

A quick and dirty workaround might be to find close historical analogs to the players' projections and build several test lineups and use the WhatifSports simulation engine to test those against a few 2016 pitching staffs. If nothing else I might try that. Presumably the guys charging $13 a season to simulate baseball are doing a decent job of it. No way to do it for free on Strat...

Jon Shepherd said...

Email the Depot. ..I have a couple ideas.

KnightofGod said...

The Rangers used Brian Downing as their lead off hitter back in the day. It worked OK.