14 February 2017

Using Statcast to Project Isolated Power

Through my youth, there were basically only a few statistics one needed to know: home runs, batting average, runs batted in, stolen bases, and, only if you were a bit wonky and had time on your hands, doubles and walks.  My youth was largely dictated by what someone back in the late 1800s, who knew more about cricket than baseball, thought newspaper readers would want to know.  In the past 16 years, though, we have experienced a renaissance as fans and clubs have begun to use computers, statistics, and approaches developed in other fields to know more about the game.

Some of these new approaches required new technology.  One such approach is Statcast, which uses multiple cameras to track elements such as players, the baseball, and the bat.  This is combined with radar data, and the reward is a ton of data.  While one can use this data to evaluate pitchers, baserunners, and fielders, we will be using it in this column to discern ability in hitters.  Specifically, whether the exit velocity of a batted ball can be related to the power metric isolated power, and, then, whether one or two seasons of exit velocity data can be used to accurately project future performance.

Now, the first step in figuring out how useful these measurements might be is to compare them in season.  I only looked at players who had 300 plate appearances in both 2015 and 2016.  What we find is that average hit distance (forgive the error in the graphic below; it is average hit distance, not home run distance) and barrel rate correlate very strongly with isolated power in the same year (p < 0.01 for both variables).  This means that these two ways to measure velocity and contact quality are related in-season to isolated power.  The regression model connecting those measurements to isolated power was also significant (p < 0.01).
Below is a graph comparing expected ISO with ISO for 2015 with the accompanying R2 value:
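The in-season relationship described above can be sketched as a simple ordinary least squares fit.  The player values below are hypothetical stand-ins, not actual Statcast data, and the article does not publish its regression coefficients, so this is only an illustration of the approach:

```python
import numpy as np

# Hypothetical Statcast-style inputs for a few hitters:
# average hit distance (feet) and barrel rate (fraction of batted balls),
# alongside each hitter's same-season isolated power (ISO = SLG - AVG).
avg_hit_distance = np.array([215.0, 230.0, 198.0, 224.0, 240.0, 205.0])
barrel_rate      = np.array([0.060, 0.095, 0.035, 0.080, 0.120, 0.050])
iso              = np.array([0.150, 0.210, 0.095, 0.185, 0.260, 0.120])

# Ordinary least squares: ISO ~ b0 + b1 * distance + b2 * barrel_rate
X = np.column_stack([np.ones_like(iso), avg_hit_distance, barrel_rate])
coef, *_ = np.linalg.lstsq(X, iso, rcond=None)

x_iso = X @ coef          # expected ISO ("xISO") for each hitter
residuals = iso - x_iso   # each hitter's over/under-performance vs. the model
```

Fitting the model in one season and then applying the same coefficients to the next season's batted-ball data is what produces the xISO values discussed below.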

So all of this informs us that hit quality is connected to isolated power.  That should be obvious, but it is helpful to be able to see it here.  However, what we are really interested in is whether these values are meaningful from one year to the next.  In other words, is this simply a descriptive correlation or a predictive one?

We will do a very simple comparison: the R2 value for expected ISO from the model developed on 2015 data vs. 2016's actual ISO.  This simple comparison will help show whether the formula using Statcast measurements correlates better with next season's values than simply using the actual ISO from the year before.  The comparison between 2015 ISO and 2016 ISO is not shown, but its R2 was 0.5026.
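The comparison just described boils down to computing two R2 values against the same target season.  A minimal sketch, using hypothetical numbers in place of the real player data:

```python
import numpy as np

def r_squared(forecast, actual):
    """Squared Pearson correlation between a forecast and the actual values."""
    return float(np.corrcoef(forecast, actual)[0, 1] ** 2)

# Hypothetical paired seasons for a handful of hitters (not real data).
iso_2015   = np.array([0.150, 0.210, 0.095, 0.185, 0.260, 0.120])
x_iso_2015 = np.array([0.158, 0.205, 0.100, 0.195, 0.250, 0.110])  # stand-in for the 2015 model's output
iso_2016   = np.array([0.165, 0.195, 0.110, 0.200, 0.240, 0.105])

naive_r2 = r_squared(iso_2015, iso_2016)    # last year's ISO as the forecast
model_r2 = r_squared(x_iso_2015, iso_2016)  # model-based xISO as the forecast
```

If model_r2 beats naive_r2, the Statcast-based xISO carries information about next season beyond what last season's ISO already tells you, which is the claim being tested here.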

What we find is that the Statcast method improves the predictive capability by about 15%.  That is notable, but it is not earth shattering.  If your decision-making process was simply finding the players with strong ISO, then this technique would help, but it might take a decade or so for that edge to be seen through the noise.  In general, I do not find this to be much of a silver bullet.  That said, it may serve as a cautionary flag for some players, suggesting who should be expected to regress downward or upward.

Here is a list of players who the model thinks most underperformed.  In other words, who does this model think should have had a bigger 2016 than they actually did?

Player               2016 ISO   2016 xISO   Diff
Miguel Cabrera         .247       .295      .048
Josh Harrison          .105       .147      .042
Brandon Belt           .199       .239      .040
Howie Kendrick         .111       .149      .038
Kendrys Morales        .204       .242      .038
Buster Posey           .147       .184      .037
Albert Pujols          .189       .226      .037
Alex Gordon            .160       .197      .037
Adeiny Hechavarria     .075       .109      .034
Yonder Alonso          .114       .147      .033
Troy Tulowitzki        .189       .222      .033
Nick Markakis          .129       .161      .032
Mitch Moreland         .189       .220      .031
Yadier Molina          .120       .151      .031
Adam Jones             .171       .201      .030

One name that jumped out to me was Kendrys Morales.  He had a solid year last year, but the model thinks it should have been considerably better.  If the model better accounts for his talent, then we might see something closer to that expected isolated power.  It may well be that playing in Kansas City depressed his value a bit and that some of his hard-hit balls should have fallen in.  A different point of view would be that perhaps his isolated power was depressed because he is below average at converting singles into doubles.  That might explain why Pujols is up here as well.

Here is a list of players who the model thinks most overperformed:

Player               2016 ISO   2016 xISO   Diff
Brian Dozier           .278       .201      -.077
Nolan Arenado          .275       .224      -.051
Mookie Betts           .216       .170      -.046
Robinson Cano          .235       .190      -.045
Ryan Braun             .233       .189      -.044
Curtis Granderson      .228       .187      -.041
Jay Bruce              .256       .216      -.040
Edwin Encarnacion      .266       .229      -.037
Zack Cozart            .172       .137      -.035
Anthony Rizzo          .252       .217      -.035
Ben Zobrist            .174       .140      -.034
Carlos Santana         .239       .208      -.031
Didi Gregorius         .171       .140      -.031
Jose Bautista          .217       .187      -.030
Gregory Polanco        .205       .175      -.030
Josh Donaldson         .265       .235      -.030
Rougned Odor           .231       .201      -.030

I would have thought that the model would list speedster after speedster, guys who stretch singles into doubles.  That does not appear to be the case here.  Many of these players are rather plodding.  The closest Oriole to making this list is Jonathan Schoop, who comes in at -.024, which is not a good thing to hear given how uneven and somewhat underwhelming his season was last year.

This made me wonder, though, about how things change over time.  For instance, is the gap between actual batted ball production and expected batted ball production a skill?  Are over-producers always over-producers?  What was remarkable was that the average difference between 2016's gap and 2015's gap was .011.  The greatest difference was .046.  This suggests that there is some element that I am missing.  The ability to over- or under-produce appears to be repeatable, and therefore likely has something to do with a skill.  The next step is finding that skill.
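The repeatability check above can be sketched by comparing each hitter's residual (ISO minus xISO) across consecutive seasons.  The residual values here are hypothetical stand-ins, not the actual players' numbers:

```python
import numpy as np

# Hypothetical over/under-performance (ISO - xISO) for the same hitters
# in consecutive seasons.  If the residuals are stable year to year,
# over- or under-producing the model looks like a repeatable skill.
resid_2015 = np.array([ 0.030, -0.020,  0.015, -0.035,  0.025, -0.010])
resid_2016 = np.array([ 0.025, -0.015,  0.010, -0.030,  0.020, -0.005])

# Year-to-year correlation of the residuals...
year_to_year_r = float(np.corrcoef(resid_2015, resid_2016)[0, 1])

# ...and the average size of the change in each hitter's residual,
# analogous to the .011 average difference reported in the article.
mean_abs_change = float(np.abs(resid_2016 - resid_2015).mean())
```

A high correlation combined with a small average change would support the conclusion that the gap between ISO and xISO reflects something persistent about the hitter rather than noise.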


Richard Hershberger said...

"My youth was largely dictated by what someone back in the late 1800s who knew more about cricket than baseball thought newspaper readers would want to know."

Nope. That is the 1850s, when Batting Average started out as runs scored per game played. This makes perfect sense in cricket. Then there was a half century or so of discussion about what makes sense in baseball. Batting average settled fairly quickly as hits per at bat, but that includes a lot of hidden detail in the definitions of both "hit" and "at bat." In 1887, for example, they included bases on balls as hits. This was controversial, and only lasted one year. Then it was followed with over a century of "Look at those wacky 1887 batting averages! They sure didn't understand baseball back then!" Now, of course, we recognize this as On Base Percentage, and we recognize that it is a better measure than Batting Average. That 1887 idea doesn't seem so crazy, now.

The traditional stats came from a process. The answers weren't always right, and they were constrained by the fact that the math was all done by hand. But they weren't just the first thought that popped into mind.

Jon Shepherd said...

Yep, I wrote about it extensively in my reviews of But Didn't We Have Fun? and The Sabermetric Revolution. There was certainly a process, but it was not a strenuous one or codified in any serious way. I would argue, though, that they were exactly what popped into the head to explain what happened quickly in a newspaper. It was heavily influenced by statistics used in cricket.

Richard Hershberger said...

There is danger of mentally compressing the past. The Morris book is about the 1850s and '60s. The process of establishing what became the traditional stats started then, but didn't finish until the early 20th century. The conditions in which baseball was played in the 1880s were very different from those of the 1860s. Cricket was hugely influential on baseball in the 1850s, less so in the 1860s, and hardly at all in the 1880s. The tell is that the 1880s is when you start to see newspaper articles explaining cricket to their readers. It was still enough of a thing that the topic arose, but not so much of a thing that a writer could assume his readers were familiar with it.

Another trap is to make the story about Henry Chadwick, who did indeed start out writing about cricket before moving to the more remunerative field of baseball. The thing about Chadwick is that he was around for half a century, writing the whole time since this was his primary source of income. His actual importance is a moving target. Chadwick was a central figure in the 1860s. In the 1870s he was on the sidelines, but still an influential voice. He would make use of his access to the press and show up at meetings where rule changes were being discussed with his proposed version neatly printed up. He would then drop broad hints about how convenient it would be to simply vote in this version verbatim. In particular, he pushed hard for a tenth man, to play at right shortstop. By the 1880s he was widely regarded as something of an old fogy, prone to evoking eye-rolling. Then he managed the neat trick of transitioning into "elder statesman" status: the elderly uncle who sits at the head of the table at family gatherings, but who has no say in anything important. A lot of received early baseball history comes from him in this period, and he routinely exaggerated his own importance. His Hall of Fame plaque is a creative work of fiction. (Spalding, by the way, did the same thing, putting himself at the center of events of the 1870s.)

The upshot is that there is a tendency to conflate the story of early baseball stats and the story of Henry Chadwick, as if they were the same thing. This is substantially true in the 1860s and into the '70s, but not at all true by the 1880s onward.

Jon Shepherd said...

I guess I have trouble not seeing how his earlier influence set up what came next. I certainly respect your work and recognize you have looked into this period far more intensely than I have. I guess I would look to Chadwick's importance to statistics as I would look to the originally published Knickerbocker rules. They did not dictate what happened next but for anything to happen next they needed good reason to deviate. Casting a primary model is, I think, incredibly influential.

Again though I am aware of your work. And if a reader comes across this comment they should see you as a more expert voice on this and me as sitting in an informed peanut gallery.

Richard Hershberger said...

You make a fair point. The power to write the first draft is the power to set the terms of the final version, if only in outline. Baseball borrowed from cricket, via Chadwick, the understanding that its analytic stats would be averages taken from the raw stats. This is still true even with many advanced stats: ultimately a numerator and a denominator, with the discussion being what goes into those two slots. What I am reacting to is the notion that the traditional stats came about in a slapdash manner: Chadwick jots down a few notes during afternoon tea, and that is what we saw on the back of bubble gum cards when we were kids. This is unfair, untrue, and worst of all, uninteresting.

Scoring was passionately debated. Since baseball reporters were also scorers, official or not, and since baseball reporters had column inches to fill even in February, we have the record of these debates. It is fascinating to watch their thinking evolve.

By way of example, is Earned Run Average about giving credit to batters or assessing blame for pitchers? We understand it today to be about pitchers, but that wasn't immediately obvious in the late 1880s, as the idea was taking shape. So there was outrage over including bases on balls in earned runs. If you think it is about crediting the batter, and you don't think working a walk is a batting skill, then this outrage makes sense. Chadwick had little to do with this debate. He was a peripheral player by this time. And cricket has even less to do with the debate. The relevant concepts simply don't apply to cricket.

Jon Shepherd said...

I agree that we often reduce history down to "Great Men" and that the development of ideas and concepts is not like thunderclaps across silent spring meadows. In 100 years, the data science slant will likely be reduced to Bill James (as it often is even now) even though his vein was present before him and he now has no direct influence. The "advanced" data science field now is so far away from James and, honestly, he is someone whose contrarian ideas are now met similarly with eye rolls.

I get that.

However, I think it is difficult to separate out what he popularized and set the tone for. I would argue, though, that given that the current slant is more influenced by the scientific method, James was more thoroughly erased than Chadwick. James is a ghost to advanced statistics, while Chadwick's lineage is still present in far reduced terms.

To say baseball card stats are traced back to Chadwick's application of the familiar is true. To say those stats survived challenges is also true. The survival of those challenges, I think, does not detract from their genesis, which is more about the application of the familiar than about finding true measures of talent.

In that way, it seems our interest is not exactly the same.