The Data Science Behind Baseball Pitching Strategy



In every Major League Baseball (MLB) stadium across the country, special cameras are installed that track the flight of the ball from the pitcher’s hand through the strike zone. This pitch tracking system, known as PITCHf/x, made its debut in the 2006 MLB playoffs.

PITCHf/x tracks the ball’s position, speed, and break, records outcomes such as hits, strikes, and foul balls, and uses a machine learning algorithm to categorize the pitch type (e.g., four-seam fastball, changeup, curveball, slider). For the most part, this data is used to track statistics and generate the graphics you see during a live game broadcast, but it can also be used to analyze the subtle mental battles that take place between pitcher and hitter during every at-bat. Using PITCHf/x, let’s revisit the drama of the 2014 World Series and analyze the pitching strategy behind it.

The End of the 2014 World Series

With Game 7 of the World Series on the line, Kansas City Royals catcher Salvador Perez faced San Francisco Giants pitcher Madison Bumgarner. With two outs and the tying run on third base, Bumgarner threw six straight high fastballs, all but one of them out of the strike zone.


To put it mildly, this is an unusual strategy. Even in rock-paper-scissors, where there are fewer strategic choices available, choosing rock six times in a row is a pretty unusual thing to do. Only one of the six pitches was (barely) in the zone, and apart from differences in how far inside or outside the pitches were, there was hardly any variation in speed. Nonetheless, the approach proved successful when Perez hit a pop fly into foul territory that was caught by Pablo Sandoval to end the game. In the postgame interview, Bumgarner uncharacteristically revealed his strategy.

“I knew Perez was going to want to do something big,” Bumgarner said. “We tried to use that aggressiveness and throw our pitches up in the zone. It’s a little bit higher than high, I guess, and fortunately I was able to get some past him.”

In this case, Bumgarner knew Perez’s tendencies and was in tune with the psychology of the moment; Bumgarner masterfully used the knowledge against his opponent to minimize risk and get an out without throwing a “hittable” pitch in the strike zone.

The Game Within the Game

Pitching strategy is an underexplored application of PITCHf/x data. An at-bat is not merely a sequence of independent pitches but a strategic game between pitcher and batter, one whose history may extend back years.

Conventional wisdom says that pitchers are most effective when changing speeds and forcing the hitter to change his eye level by working inside, outside, and up and down in the strike zone. Overall, this seems to be true. The chart below shows the likelihood of getting a swinging strike based only on the change in speed from the preceding pitch. Ignoring all other factors, the best way to get a batter to swing and miss is to take something off, that is, to throw the pitch slower than the one before it.

[Figure: likelihood of a swinging strike by change in speed from the preceding pitch]
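A rough sketch of how a chart like this could be tabulated, assuming each pitch record carries its change in speed from the previous pitch and a swinging-strike flag. The record layout, the function name, and the 5 mph bin width are all assumptions for illustration, and the sample values are invented rather than taken from PITCHf/x.

```python
from collections import defaultdict

def whiff_rate_by_speed_change(pitches, bin_width=5.0):
    """Swinging-strike rate per speed-change bin.

    `pitches` is an iterable of (speed_delta_mph, swinging_strike) pairs;
    this layout is hypothetical. A negative delta means the pitch was
    slower than the one before it.
    """
    counts = defaultdict(lambda: [0, 0])  # bin start -> [whiffs, total]
    for delta, whiff in pitches:
        # Floor to the start of the bin, e.g. -7.2 mph lands in [-10, -5).
        bin_start = (delta // bin_width) * bin_width
        counts[bin_start][0] += int(whiff)
        counts[bin_start][1] += 1
    return {b: w / n for b, (w, n) in sorted(counts.items())}

# Invented toy data, only to show the shape of the input and output.
toy = [(-7.2, True), (-6.0, False), (-8.5, True), (0.5, False),
       (1.0, False), (2.0, True), (6.0, False), (7.5, False)]
rates = whiff_rate_by_speed_change(toy)
```

Binning is the simplest possible estimator; a smoothed curve would need more data per bin than this toy sample provides.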

Pitcher and batter each try to anticipate what the other expects: the hitter to guess the next pitch, the pitcher to defy that guess. Because of this strategic element, the game theory is extremely rich, and pitchers and batters presumably exploit all kinds of information to gain an edge. This information may include previous at-bats against a given adversary, previous pitches in the current at-bat, and the pitcher’s strategy against other hitters. In the Bumgarner vs. Perez at-bat, Bumgarner exploited the energy of the scenario along with a detailed understanding of Perez’s tendencies to put away a dangerous hitter with virtually no risk of giving up a hit.


In this repository, I created a data set useful for modeling pitching strategy. It pairs every pitch with the pitches that preceded it. I then built a model to show how pitches differ in their likelihood of getting a swinging strike.
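As a minimal sketch of that pairing step (not the repository’s actual code), the function below attaches the preceding pitch to each pitch within an at-bat. The dictionary keys `at_bat_id`, `pitch_num`, `pitch_type`, and `speed`, and the derived `prev_type` and `speed_delta` fields, are hypothetical names chosen for illustration.

```python
from itertools import groupby
from operator import itemgetter

def add_previous_pitch(pitches):
    """Pair each pitch with the one thrown just before it in the same at-bat.

    Each pitch is a dict with hypothetical keys: at_bat_id, pitch_num,
    pitch_type, speed. The first pitch of an at-bat has no setup pitch,
    so it produces no row.
    """
    ordered = sorted(pitches, key=itemgetter("at_bat_id", "pitch_num"))
    pairs = []
    for _, group in groupby(ordered, key=itemgetter("at_bat_id")):
        prev = None
        for pitch in group:
            if prev is not None:
                row = dict(pitch)  # copy so the input is left untouched
                row["prev_type"] = prev["pitch_type"]
                row["speed_delta"] = pitch["speed"] - prev["speed"]
                pairs.append(row)
            prev = pitch
    return pairs

# Invented records, deliberately out of order.
toy = [
    {"at_bat_id": 1, "pitch_num": 2, "pitch_type": "CH", "speed": 84.0},
    {"at_bat_id": 1, "pitch_num": 1, "pitch_type": "FF", "speed": 93.0},
    {"at_bat_id": 2, "pitch_num": 1, "pitch_type": "SL", "speed": 86.0},
]
pairs = add_previous_pitch(toy)
```

Resetting `prev` at each at-bat boundary matters: without it, the last pitch to one batter would be treated as the setup for the first pitch to the next.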

Probability Density of Swinging Strike (prob_k) Based On Pitch Location (px_last vs. pz_last)

[Figure: swinging-strike probability by pitch location, one panel per pitch type]

*Pitch Definitions: CH = Changeup, CU = Curveball, FF = Fourseam Fastball, SL = Slider

The above chart shows the probability density of a swinging strike for four different pitches based on the location of the pitch in righty vs. righty scenarios. Not surprisingly, there are big differences here: breaking balls are more effective low in the zone, whereas four-seam fastballs are effective high in the zone. All pitches show a kind of halo around the plate: swinging strikes on pitches in the heart of the plate are fairly uncommon, though they are more likely for changeups and curveballs, which makes sense given that these pitches rely more on speed changes than location to deceive the hitter.
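One simple way to approximate such a location map is to bin pitches into a grid and compute the swinging-strike rate in each cell. This is a stand-in for whatever smoothing the charts use, not a reproduction of it. The tuple layout, the half-foot cell size, and the `min_count` cutoff are assumptions; `px` and `pz` are taken to be horizontal and vertical location in feet.

```python
import math
from collections import defaultdict

def whiff_rate_grid(pitches, cell=0.5, min_count=1):
    """Swinging-strike rate per (pitch_type, column, row) grid cell.

    `pitches` holds (pitch_type, px, pz, swinging_strike) tuples, a
    hypothetical layout. Cells with fewer than `min_count` pitches are
    left out so that sparse cells do not show extreme rates.
    """
    tally = defaultdict(lambda: [0, 0])  # cell key -> [whiffs, total]
    for pitch_type, px, pz, whiff in pitches:
        key = (pitch_type, math.floor(px / cell), math.floor(pz / cell))
        tally[key][0] += int(whiff)
        tally[key][1] += 1
    return {k: w / n for k, (w, n) in tally.items() if n >= min_count}

# Invented toy data: two high fastballs, one mid-zone fastball, one low curve.
toy = [("FF", 0.1, 3.6, True), ("FF", 0.2, 3.7, False),
       ("FF", 0.1, 2.5, False), ("CU", -0.3, 1.2, True)]
grid = whiff_rate_grid(toy)
```

Raising `min_count` trades coverage for stability, which becomes important once the data is split further by setup pitch.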

But what is the effect of strategy? How does the story change when you factor in the setup pitch? A model can capture this by looking at the difference between the overall probability densities and the probability densities associated with a particular setup pitch.
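That comparison can be sketched as a subtraction: for each combination of setup pitch, final pitch, and location cell, take the swinging-strike rate given that setup and subtract the overall rate for the same final pitch and cell. The row layout and the function name below are hypothetical, and the cell label can be any hashable location bin.

```python
from collections import defaultdict

def setup_effect(rows):
    """Conditional-minus-overall swinging-strike rate.

    `rows` holds (prev_type, pitch_type, cell, swinging_strike) tuples,
    a hypothetical layout. The result maps (prev_type, pitch_type, cell)
    to the rate given that setup pitch minus the overall rate for that
    (pitch_type, cell).
    """
    overall = defaultdict(lambda: [0, 0])   # (pitch_type, cell) -> [whiffs, total]
    by_setup = defaultdict(lambda: [0, 0])  # (prev, pitch_type, cell) -> same
    for prev_type, pitch_type, cell, whiff in rows:
        for table, key in ((overall, (pitch_type, cell)),
                           (by_setup, (prev_type, pitch_type, cell))):
            table[key][0] += int(whiff)
            table[key][1] += 1
    effect = {}
    for (prev_type, pitch_type, cell), (w, n) in by_setup.items():
        ow, on = overall[(pitch_type, cell)]
        effect[(prev_type, pitch_type, cell)] = w / n - ow / on
    return effect

# Invented toy data: high fastballs set up by a changeup or another fastball.
toy = [("CH", "FF", "high", True), ("CH", "FF", "high", True),
       ("FF", "FF", "high", False), ("FF", "FF", "high", True)]
effect = setup_effect(toy)
```

A positive value means the setup pitch made the final pitch more likely to draw a swing and a miss than it is on average; a negative value means less likely.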

Strike Probability (prob_k) Based On Pitch Location (px_last vs pz_last) and Previous Pitch

[Figure: grid of strike probability by pitch location; rows are the preceding pitch, columns are the final pitch]

*Pitch Definitions: CH = Changeup, CU = Curveball, FF = Fourseam Fastball, SL = Slider

The columns of this grid represent the final pitch of a two-pitch sequence, while the rows represent the preceding pitch. Reading down a column shows how different setup pitches affect the likelihood of a strike for the pitch in that column. While these absolute levels are important (because a pitcher cares about throwing the most effective pitch available), it’s also interesting to look at the difference between the general behavior and the “set up” behavior, as it tells us something about the strategic game being played between hitter and pitcher.

Difference in Pitch Effectiveness Based on Pitch Location and Previous Pitch

[Figure: grid of differences in swinging-strike probability, given a setup pitch vs. overall]

*Pitch Definitions: CH = Changeup, CU = Curveball, FF = Fourseam Fastball, SL = Slider

In this view, the importance of speed becomes more apparent. The top row shows the effectiveness of a changeup as a setup pitch. In this row, the green areas show that pitches outside the zone are more effective when set up by a changeup. On the other hand, some of the red zones (which show a reduced probability of getting a swing and a miss) seem to be associated with not changing speeds, such as setting up a curveball with a changeup, and vice versa.

A similar approach works for different at-bat outcomes too, including called strikes. There are fewer called strikes overall, and as we’ll see, the overall evidence for setting up a batter for a called strike (“freezing them”) isn’t strong—at least, not on the basis of pitch selection alone.

Probability of Called Strike (prob_k) Based on Pitch Location (px_last vs pz_last) and Previous Pitch

[Figure: grid of called-strike probability by pitch location and previous pitch]

*Pitch Definitions: CH = Changeup, CU = Curveball, FF = Fourseam Fastball, SL = Slider

In this chart, aside from some obvious data sparsity issues, there’s not much difference within the columns. This means that, for called strikes, the previous pitch matters less than getting the ball into the halo around the edges of the strike zone. Interestingly, this is a case where changing locations (moving up and down or in and out) might be more important than pitch selection alone.

The Maestro

Looking at this data does shed a little light on how amazing Bumgarner’s approach was. Overall, firing off back-to-back heaters doesn’t seem to be a particularly good strategy. But in this particular case, it obviously was. It’s a credit to Bumgarner’s mastery of his craft, and perhaps a hint of some of the modeling challenges likely to come up in further work on this data!



Original work by: Isaac Laughlin, Data Science Instructor at Galvanize. Follow @lemonlaug
