Building an At-Bat Simulator – Baseball Data Science

Simulation modeling is one of my favorite forms of analysis. Understanding how a process operates under certain conditions has always fascinated me. What is likely to happen? What is the level of uncertainty? What is the plausible range of outcomes? Simulation can help us answer those important questions.

This post is different from my others: it walks through how to build an at-bat simulator in Python. We won’t go through every detail; instead, we will hit only the key parts. The full code can be found at this location.

After the library imports, the first section of code is a parent function, which accepts a series of probabilities (e.g., how often a pitcher surrenders a hit in a certain count or how often a batter swings in a given count).

Next, we have a child function, which simulates a pitch. This function inherits the probabilities passed to the parent function.
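As a minimal sketch of this parent/child arrangement, assume the child is a nested function (a closure) and that the probabilities arrive as dictionaries keyed by count. The names `simulate_at_bat`, `simulate_pitch`, `swing_probs`, and `pitcher_hit_probs` are hypothetical, and the numbers are made up.

```python
def simulate_at_bat(swing_probs, pitcher_hit_probs):
    """Parent function: receives every probability the at-bat needs."""

    def simulate_pitch(count):
        # The child takes only the count; the probability tables come
        # from the enclosing (parent) scope.
        return {
            "count": count,
            "swing_prob": swing_probs[count],
            "pitcher_hit_prob": pitcher_hit_probs[count],
        }

    # Illustrative only: look up the inputs for the first pitch.
    return simulate_pitch("0-0")


first_pitch = simulate_at_bat(
    swing_probs={"0-0": 0.30},        # made-up value
    pitcher_hit_probs={"0-0": 0.08},  # made-up value
)
print(first_pitch)
```

Nesting the child this way means it never needs the long argument list repeated; it simply reads whatever the parent was given.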

The function then makes two choices: which pitch is thrown (limited to fastball, change-up, and curveball) and whether the batter swings. Both choices are driven by the probabilities passed to the parent function.
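One way to make these two draws with the standard library is a weighted choice for the pitch and a single random number for the swing. This is a sketch under assumptions: the function name `choose_pitch_and_swing`, the dictionary layout, and the probabilities are illustrative, not taken from the original code.

```python
import random


def choose_pitch_and_swing(pitch_probs, swing_prob, rng):
    """Draw the pitch type, then decide whether the batter swings."""
    pitch_types = list(pitch_probs)
    weights = [pitch_probs[p] for p in pitch_types]
    # random.choices performs a weighted draw and returns a list.
    pitch = rng.choices(pitch_types, weights=weights, k=1)[0]
    # A swing is a single yes/no trial with probability swing_prob.
    swing = rng.random() < swing_prob
    return pitch, swing


rng = random.Random(42)  # seeded so the run is repeatable
pitch_probs = {"fastball": 0.6, "change-up": 0.2, "curveball": 0.2}  # made-up
pitch, swing = choose_pitch_and_swing(pitch_probs, swing_prob=0.3, rng=rng)
print(pitch, swing)
```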

The function subsequently determines if the batter records a hit. The hit probability is an average of the following: how often the pitcher surrenders a hit in the count, how often the pitcher surrenders a hit on the pitch thrown, how often the batter records a hit in the count, and how often the batter records a hit on the pitch thrown. The results of the pitch are recorded in a dataframe we can reference later.
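The averaging step can be sketched as follows, assuming an unweighted mean of the four rates and assuming a hit is only possible when the batter swings. A plain list of dictionaries stands in for the dataframe (`pandas.DataFrame(pitch_log)` would convert it); the function names and rates are hypothetical.

```python
import random


def hit_probability(pitcher_count, pitcher_pitch, batter_count, batter_pitch):
    """Unweighted mean of the four hit rates described above."""
    return (pitcher_count + pitcher_pitch + batter_count + batter_pitch) / 4


def record_pitch(log, count, pitch, swing, hit_prob, rng):
    """Decide whether the pitch is a hit and append the result to the log."""
    # Assumption: a hit is only possible when the batter swings.
    hit = swing and rng.random() < hit_prob
    log.append({"count": count, "pitch": pitch, "swing": swing, "hit": hit})
    return hit


p_hit = hit_probability(0.08, 0.10, 0.12, 0.06)  # made-up rates
pitch_log = []
record_pitch(pitch_log, "0-0", "fastball", True, p_hit, random.Random(1))
print(round(p_hit, 3), pitch_log)
```

With these made-up rates the hit probability works out to (0.08 + 0.10 + 0.12 + 0.06) / 4 = 0.09.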

We can now use the function described above to determine the outcome of the 0-0 pitch.

After the 0-0 pitch, the simulation needs to act on the result of the previous pitch, which is the purpose of the following code block. If the last pitch was a swing, the simulation determines whether the swing produced an out. If so, the at-bat ends. The at-bat also ends if the batter records a hit. In both cases, none of the “if” conditions that continue the at-bat is satisfied. If the batter has not recorded a hit and the swing has not produced an out, the simulation continues with another plausible count.
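The control flow described above can be sketched as a loop that stops on a hit or an out and otherwise advances the count. The count rules here are simplifying assumptions, not the original logic: a swing that is neither a hit nor an out adds a strike (capped at two, with strikeouts folded into `out_on_swing_prob`), and every taken pitch is a ball. All names are hypothetical.

```python
import random


def run_at_bat(simulate_pitch, out_on_swing_prob, rng):
    """Throw pitches until the at-bat ends; return (outcome, pitch log)."""
    balls, strikes = 0, 0
    log = []
    while True:
        result = simulate_pitch(balls, strikes)
        log.append(result)
        if result["hit"]:
            return "hit", log
        if result["swing"] and rng.random() < out_on_swing_prob:
            return "out", log
        # The at-bat continues, so move to the next count.
        if result["swing"]:
            # Assumption: a swing that is neither a hit nor an out is a
            # foul or a miss, adding a strike up to two.
            strikes = min(strikes + 1, 2)
        else:
            # Assumption: every taken pitch is a ball.
            balls += 1
            if balls == 4:
                return "walk", log


def take_every_pitch(balls, strikes):
    # Stub pitch function: the batter never swings and never gets a hit.
    return {"count": f"{balls}-{strikes}", "swing": False, "hit": False}


outcome, log = run_at_bat(take_every_pitch, out_on_swing_prob=0.5, rng=random.Random(0))
print(outcome, [row["count"] for row in log])
# walk ['0-0', '1-0', '2-0', '3-0']
```

Passing the pitch function in as an argument keeps the loop easy to test with a stub like `take_every_pitch` before wiring in the real pitch simulation.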

The simulation is executed within a main method after defining all of the probabilities needed for the parent function. The execution runs in a while loop, with results being appended to a dataframe.
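A sketch of that outer loop follows, with a deliberately trivial stand-in for the parent function so the example stays self-contained. `simulate_one_at_bat`, `run_simulations`, and the inputs are hypothetical, and a list of dictionaries stands in for the dataframe (`pandas.DataFrame(results)` would convert it in one step).

```python
import random


def simulate_one_at_bat(hit_prob, rng):
    # Placeholder for the full parent function: a single weighted coin flip.
    return "hit" if rng.random() < hit_prob else "out"


def run_simulations(n_sims, hit_prob, seed=None):
    """Run n_sims at-bats in a while loop and collect one row per at-bat."""
    rng = random.Random(seed)
    results = []
    sim_id = 0
    while sim_id < n_sims:
        outcome = simulate_one_at_bat(hit_prob, rng)
        results.append({"sim": sim_id, "outcome": outcome})
        sim_id += 1
    return results


if __name__ == "__main__":
    results = run_simulations(n_sims=1000, hit_prob=0.25, seed=7)  # made-up inputs
    hits = sum(row["outcome"] == "hit" for row in results)
    print(f"hit rate over {len(results)} at-bats: {hits / len(results):.3f}")
```

Collecting rows in a list and building the dataframe once at the end is usually faster than growing a dataframe one row at a time.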

Note that variable names within the parent and child functions shadow those in the outer scope, which is typically not desirable. However, since the simulation needs a large number of specific and well-defined arguments, doing so helps keep the structure easy to understand.

Pull down the code and run your own simulations!