Predicting Tomorrow’s Outcome – Baseball Data Science

Predicting the outcome of future baseball games is notoriously difficult. Even common approaches do only slightly better than a coin flip. If you always pick the home team as the winner, you’ll be correct about 53% of the time. The Vegas betting predictions are slightly better, with about 57% accuracy.

This post sets out to answer an ambitious question: Can we predict the result of a game using (almost) only data from the previous game?

Two notes before diving in. First, this post is different from my other projects in that it doesn’t include data visualizations. I wanted to use this project as an opportunity to lay the foundation for more complex machine learning, so I focused heavily on the scikit-learn API. Second, as this is an initial stab at solving this problem, I only worked with data for one team: the KC Royals. Future work on this topic will be substantially more robust.

About the Data
The dataset consisted of per-game statistics for the KC Royals from 2012 to 2016. The data included staple offensive and defensive statistics as well as items like location and game length. I did include a variable for the current opponent, but the predictive algorithms mostly rely on data from the previous game. Future work will likely include more variables for the game at hand.
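To make the "previous game" idea concrete, here is a minimal sketch of how lagged features could be built with pandas. The column names and the four sample rows are hypothetical, made up purely for illustration; they are not the actual dataset.

```python
import pandas as pd

# Hypothetical per-game log, one row per game (illustrative values only).
games = pd.DataFrame({
    "date": pd.to_datetime(["2016-04-03", "2016-04-05", "2016-04-08", "2016-04-09"]),
    "opponent": ["NYM", "NYM", "MIN", "MIN"],
    "triples": [0, 1, 0, 2],
    "walks": [3, 1, 4, 2],
    "win": [1, 0, 1, 1],
}).sort_values("date").reset_index(drop=True)

# Shift each stat down one row so every game only "sees" the previous game's numbers.
stat_cols = ["opponent", "triples", "walks"]
lagged = games[stat_cols].shift(1).add_prefix("prev_")

# Keep the current opponent and the target, then drop the first game,
# which has no previous game to draw from.
features = (
    pd.concat([games[["opponent", "win"]], lagged], axis=1)
    .dropna()
    .reset_index(drop=True)
)
print(features)
```

Sorting by date before shifting matters: if the rows are out of order, a "previous game" column can end up holding data from a later game.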

Feature Selection
The initial dataset included 110 potential predictor variables, which is quite a few for this research question. I used scikit-learn’s SelectKBest class to return the 20 most predictive variables. (Remember, I’m mostly using stats from the previous game as predictor variables.) Here’s what it returned:
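For readers who haven’t used SelectKBest, here is a minimal sketch of the selection step on synthetic data. The random matrix stands in for the 110 predictors, and the scoring function (f_classif, scikit-learn’s default) is an assumption on my part for this example rather than a record of the exact configuration used.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic stand-in: 800 "games" with 110 candidate predictors.
rng = np.random.default_rng(0)
n_games, n_features = 800, 110
X = rng.normal(size=(n_games, n_features))

# Make the win/loss target depend on the first three columns so there is signal to find.
noise = rng.normal(scale=0.5, size=n_games)
y = (X[:, 0] + X[:, 1] - X[:, 2] + noise > 0).astype(int)

# Score each feature against the target and keep the 20 highest-scoring ones.
selector = SelectKBest(score_func=f_classif, k=20)
X_top = selector.fit_transform(X, y)

# Column indices of the features that survived.
selected = selector.get_support(indices=True)
print(X_top.shape, selected)
```

Note that SelectKBest scores each feature on its own, so it can keep redundant features and miss ones that only matter in combination.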

Triples (offense)
Walks (offense)
Strikeouts (pitching)
Inherited runners (pitching)
Stolen bases (pitching)
Triples (defense)
Opponent pitcher throwing arm
Month: August
Previous game opponent: White Sox, Yankees, Cardinals, Blue Jays
Current game opponent: White Sox, Reds, Brewers, Twins, Yankees, Rays, and Blue Jays

The selection of certain teams suggests there might be some systemic advantage or disadvantage in these match-ups.

Grid Search and Model Performance
After using grid search to optimize hyperparameters, I ran four models and used five-fold cross-validation to evaluate performance. Here are the average F1 scores for each of the models, with “win” being the positive class in the binary classification:

Support Vector Machine: 0.69
Random Forest: 0.62
Logistic Regression: 0.62
K Nearest Neighbors: 0.48

As you can see, none of the models performed particularly well.
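The tune-then-evaluate workflow described above can be sketched roughly as follows. This is an illustration on synthetic data: the parameter grid, the scaling step, and the pipeline are assumptions chosen for the example, not the actual settings behind the scores listed.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic binary problem standing in for win/loss with 20 selected features.
X, y = make_classification(n_samples=300, n_features=20, n_informative=5, random_state=0)

# Scale features before the SVM; the pipeline keeps scaling inside each CV fold.
pipeline = make_pipeline(StandardScaler(), SVC())

# Hypothetical grid: a few values of C and two kernels.
param_grid = {"svc__C": [0.1, 1, 10], "svc__kernel": ["linear", "rbf"]}

# Grid search picks the combination with the best mean F1 across five folds.
search = GridSearchCV(pipeline, param_grid, scoring="f1", cv=5)
search.fit(X, y)

# Five-fold F1 for the tuned model; "win" (label 1) is the positive class.
scores = cross_val_score(search.best_estimator_, X, y, scoring="f1", cv=5)
mean_f1 = scores.mean()
print(search.best_params_, round(mean_f1, 2))
```

Scores measured on the same data used for tuning tend to be optimistic, which is why a held-out test set is still worth checking afterward.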

To note, the F1 score is the harmonic mean of precision and recall. In this case, precision answers the question, “of all the games we said were wins, how many actually were wins?” Recall answers the question, “of all the actual wins, how many did we find?”
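Here is a small worked example of those three metrics, using eight made-up games (1 = win, 0 = loss). The labels are invented for illustration only.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Made-up outcomes and predictions for eight games (1 = win, 0 = loss).
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred = [1, 0, 1, 1, 0, 1, 0, 0]

# Four games were predicted as wins and three of those were real wins.
precision = precision_score(y_true, y_pred)

# There were five real wins and the predictions caught three of them.
recall = recall_score(y_true, y_pred)

# F1 is the harmonic mean: 2 * precision * recall / (precision + recall).
f1 = f1_score(y_true, y_pred)
print(precision, recall, round(f1, 3))
```

Precision comes out to 3/4 and recall to 3/5, so F1 is 2 × 0.75 × 0.6 / 1.35, or roughly 0.667.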

As a final step, I used the SVM to predict on a held-out test set the algorithm had never seen, which returned an F1 score of 0.68. However, the straight-up accuracy of the model was just 0.52, slightly lower than the 53% you get by always picking the home team, as discussed above.
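A gap like this between F1 and accuracy is worth understanding. One hypothetical way it can arise is a model that leans heavily toward predicting “win.” The sketch below uses a synthetic 100-game test set and a classifier that predicts a win every time. It is an illustration only, not a claim about how the SVM actually behaved, since the confusion matrix isn’t reported here.

```python
from sklearn.metrics import accuracy_score, f1_score

# Synthetic test set: 52 wins and 48 losses (illustrative, not the real data).
y_true = [1] * 52 + [0] * 48

# A degenerate "model" that predicts a win for every game.
y_pred = [1] * 100

# Accuracy is just the share of wins in the test set.
accuracy = accuracy_score(y_true, y_pred)

# Recall is 1.0 and precision equals the win rate, so F1 = 2 * 0.52 / 1.52.
f1 = f1_score(y_true, y_pred)
print(round(accuracy, 2), round(f1, 2))
```

Because F1 ignores true negatives, it can look respectable even when a model gets almost no losses right, so it helps to read it alongside accuracy or a confusion matrix.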

What’s Next?
I plan to keep working on this topic. Future versions will train the model on more data – in terms of both features and observations. This first attempt, though, was useful for laying the groundwork and providing some initial benchmarks.