The data for this project came from the Lahman database, a respected record of historical baseball data, which I downloaded and placed in a local MySQL database. The database includes a hall of fame table with year-by-year ballot results. My final data set included 716 unique players, 137 of whom were inducted into the hall of fame (19%). The original data set was a bit larger (1,260 unique players, coaches, media members, and executives), though for this project I opted to focus only on position players. I also removed hall of famers who were selected for historical purposes rather than voted in.

Feature selection is the process of determining what variables to use in a machine learning model, whereas feature engineering involves using existing data to develop new, informative variables. In this particular problem, almost no variable is out of the question; the baseball writers who cast their ballots can survey as much information about players as they desire. For this problem, I considered the factors voters have *likely* put most weight on: awards, world series titles, hits, average, etc. Some of these variables are highly correlated (e.g. hits and average), though voters probably factor in both items. For instance, a voter likely views a hitter with 2,500 hits and a .300 average differently from a batter with 2,500 hits and a .280 average.

The table below displays the features used in this model. Though I am a fan of advanced statistics like WAR, such metrics have probably not guided voters’ decisions over the years.

Variable |
---|

Batting Hand |

Throwing Arm |

Walks |

Games Played |

Gold Gloves |

Hits |

Home Runs |

MVP Awards |

Runs |

RBIs |

Stolen Bases |

Fielding Assists |

Fielding Put Outs |

Errors |

Debut Decade |

First Year on Ballot |

Most Frequent Position |

Most Frequent Team |

All-Star Appearances |

On Base Percentage |

Slugging Percentage |

Batting Average |

Postseason Batting Average |

Postseason Games |

World Series Wins |

World Series Losses |

Implicated in the Mitchell Report or Suspended for Steroids |

The inclusion of a variable indicating whether or not a player was connected to steroids was crucial to help the model understand why, say, Barry Bonds has not been elected. To note, Gary Sheffield was not suspended or listed in the Mitchell Report, so he was not labeled as “implicated”, though he has been connected to steroids in recent years. (This was something I noticed after I had written most of this piece.)

I applied multiple machine learning techniques within two classes of models: classification and regression. The classification models focused on predicting if someone will or will not be inducted into the hall of fame. The regression models centered on predicting the average percentage of votes a player will receive. (A player can appear on the ballot for multiple years).

The results of the classification models were strong. In theory, a well-constructed model should perform well on such a problem. Hall of famers are, by definition, distinct.

The evaluation metric employed was ROC AUC, which summarizes the trade-off between the true positive rate and the false positive rate across classification thresholds. A score of 1.0 is the max. In this instance, a false positive occurs when the model says someone was elected but they actually were not. A false negative is the opposite, occurring when the model asserts that someone was not elected when, in fact, they were. The chart below displays the scores on the test set for different types of models. The test set is a hold-out set of data the model has not previously seen but for which we know the right answers, allowing us to evaluate performance.
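As a quick illustration of the metric (with made-up labels and predicted probabilities, not the actual model outputs), ROC AUC can be computed directly with scikit-learn:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# hypothetical hall-of-fame labels (1 = inducted) and model probabilities
y_true = np.array([1, 0, 1, 1, 0, 0, 0, 1])
y_prob = np.array([0.9, 0.2, 0.8, 0.4, 0.3, 0.6, 0.1, 0.7])

# AUC equals the probability that a randomly chosen inductee
# is scored higher than a randomly chosen non-inductee
auc = roc_auc_score(y_true, y_prob)  # 0.9375 here (15 of 16 inductee/non-inductee pairs ranked correctly)
```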

Model | ROC AUC Score on Test Set |
---|---|

Random Forest | 0.926 |

Extra Trees | 0.916 |

Gradient Boosting | 0.928 |

AdaBoost | 0.914 |

Logistic Regression | 0.901 |

Support Vector Machine (with polynomial kernel) | 0.898 |

Multi-Layer Perceptron | 0.899 |

In addition, I ran models where these classifiers voted on the outcome or were stacked together in an effort to improve predictive accuracy. In a voting classifier, all models vote on the outcome of a particular case, and those results are averaged. In a stacked model, a statistical model (called a meta-estimator) makes predictions based on the patterns of predicted probabilities produced by each model.
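A minimal sketch of both ideas in scikit-learn, on synthetic data rather than the actual hall-of-fame features (the base models shown are a subset of those in the table above):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier, RandomForestClassifier,
                              StackingClassifier, VotingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

base = [("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
        ("gb", GradientBoostingClassifier(random_state=0)),
        ("lr", LogisticRegression(max_iter=1000))]

# soft voting averages each model's predicted probabilities
vote = VotingClassifier(estimators=base, voting="soft").fit(X_tr, y_tr)

# stacking trains a meta-estimator on the base models' predictions
stack = StackingClassifier(estimators=base,
                           final_estimator=LogisticRegression(max_iter=1000)).fit(X_tr, y_tr)

vote_auc = roc_auc_score(y_te, vote.predict_proba(X_te)[:, 1])
stack_auc = roc_auc_score(y_te, stack.predict_proba(X_te)[:, 1])
```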

Those scores are below, but as we can see, these more advanced models did not improve performance. To note, all voting and stacking classifiers employed the models listed in the above table.

- Voting classifier with all models voting on the outcome: 0.924
- Stacked classifier using the best tuned models and gradient boosting as the meta-estimator: 0.737
- Stacked classifier using the best tuned models and logistic regression as the meta-estimator: 0.927
- Stacked classifier using un-tuned models and gradient boosting as the meta-estimator: 0.873
- Stacked classifier using un-tuned models and logistic regression as the meta-estimator: 0.910
- Stacked classifier using a logistic regression on columns with numeric data, gradient boosting on categorical columns, and random forest as the meta-estimator: 0.723

In the above bulleted list, you might have noticed the terms “tuned” and “un-tuned”. Machine learning models have hyperparameters, which are essentially knobs that can be turned to improve performance. For a stand-alone model, a data scientist almost always wants to discover the optimal combination of hyperparameters through a process called grid search. For stacking and voting, however, heavily tuning the base models may not yield better performance. In fact, it may hinder performance. In the above example, we see that the logistic regression meta-estimator performed better with tuned models, whereas the gradient boosting meta-estimator was stronger with un-tuned models.
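For reference, a bare-bones grid search in scikit-learn looks like the sketch below (synthetic data and an illustrative parameter grid, not the grid I actually searched):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=300, random_state=0)

# a few knobs to turn; real grids are usually larger
param_grid = {"n_estimators": [50, 100],
              "max_depth": [2, 3],
              "learning_rate": [0.05, 0.1]}

# exhaustively tries every combination, scoring each with cross-validated ROC AUC
search = GridSearchCV(GradientBoostingClassifier(random_state=0),
                      param_grid, scoring="roc_auc", cv=5)
search.fit(X, y)
best_params = search.best_params_
```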

At the end of the day, the stand-alone gradient boosting classifier had the strongest performance.

Whew! We’ve covered quite a bit of ground, but we’ve yet to review the regression results. As a reminder, the goal of the regression models was to predict the average percentage of votes a player will receive. This problem is more difficult. Instead of “yes” or “no”, the model must return a specific value.

The results of the regression models are OK but by no means outstanding. My focus was mostly on building a strong classification model, so I didn’t spend a ton of time squeezing out performance in the regression setting. The evaluation metric I used was root-mean-square error (RMSE), which essentially tells us how much the predictions differ from reality. In our case, an RMSE of 0.10 would mean that predictions are typically off by about 10 percentage points.
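RMSE is simple to compute by hand; here's the calculation on made-up vote shares (not the model's actual predictions):

```python
import numpy as np

# hypothetical actual vs. predicted average vote shares
actual = np.array([0.75, 0.10, 0.42, 0.05])
predicted = np.array([0.70, 0.18, 0.50, 0.02])

# square the errors, average them, then take the square root
rmse = np.sqrt(np.mean((actual - predicted) ** 2))  # ~0.064, i.e. off by ~6.4 points
```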

Below are the results for the different regression models that were applied to the data set.

To note, it is a little peculiar that lasso and elastic net returned the same RMSE, though I could not find a bug in my code to explain it.

Model | RMSE on the Test Set |
---|---|

Ridge | 8.3% |

Lasso | 9.3% |

Elastic Net | 9.3% |

Random Forest Regressor | 8.1% |

Stochastic Gradient Descent | 8.7% |

Gradient Boosting Regressor | 8.0% |

Ordinary Least Squares with polynomial transformation applied to the data | 9.2% |

One of the benefits of the gradient boosting classifier is that it produces feature importance scores. Below is a plot of the most important features. Read these as relative values. The variables with the top scores are not too surprising.
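Extracting and ranking those scores takes only a couple of lines once the model is fit; a sketch on synthetic data with hypothetical feature names:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# hypothetical feature names for illustration
feature_names = ["hits", "home_runs", "all_star_appearances", "mvp_awards", "batting_avg"]

X, y = make_classification(n_samples=300, n_features=5, random_state=0)
model = GradientBoostingClassifier(random_state=0).fit(X, y)

# importances are normalized to sum to 1, so read them as relative weights
ranked = sorted(zip(feature_names, model.feature_importances_),
                key=lambda pair: pair[1], reverse=True)
```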

Below is a web app that explores the model’s predicted probability for each player being elected. A value of greater than 0.50 means the model believes the player is a hall of famer.

Here are some of the notable instances where the model was wrong. Full results can be explored in the app.

- Bill Dahlen (an old-school player): 91% probability
- Steve Garvey: 81% probability
- Bob Meusel: 79% probability
- Dave Parker: 68% probability

- Willard Brown (played one year in the MLB but was mostly elected for Negro League play): 0.7% probability
- Jeff Bagwell: 11% probability
- Roy Campanella: 15% probability
- Phil Rizzuto: 33% probability

Here’s the big question: Is the model’s assessment correct, or are the voters correct? It’s likely that human biases play a role in who gets elected. Is this desirable? Should emotion play a role? Or would a model that only looks at the numbers be more fair?

I also leveraged this model to predict future hall of famers. Notable players are shown below.

This was one of my favorite baseball analytics projects. I found the model results to be fascinating and to bring forth the following question: Should machine learning play a role in the actual hall of fame voting?

I’ll leave that one up for discussion. Thanks for reading.

This blog applies a simulation model to investigate how *good* or *not so good* the Royals could be. For the six core, everyday players on the team, this model simulates 100 WAR values. These players are Salvador Perez, Eric Hosmer, Alcides Escobar, Mike Moustakas, Alex Gordon, and Lorenzo Cain. This methodology can help us see, for example, how likely it would be for these players to combine for 25+ WARs or even 30+ WARs. For those unfamiliar with the concept of WAR, it’s an acronym for Wins Above Replacement, a measure of how many wins a player adds over a replacement-level player.

The simulation model is built upon each player’s historical WAR mean and range. While useful, this model is not perfect, as it does not account for improvement or decline over time.
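A sketch of that simulation idea with numpy, using made-up WAR means and spreads for three of the players (the actual model's distributions will differ):

```python
import numpy as np

rng = np.random.default_rng(0)

# hypothetical historical WAR mean and standard deviation per player
players = {"Perez": (3.0, 1.0), "Hosmer": (1.5, 1.3), "Gordon": (3.5, 2.0)}

n_trials = 100
sims = {name: rng.normal(mu, sd, n_trials) for name, (mu, sd) in players.items()}

# combined WAR for each of the 100 trials, and how often it clears a threshold
combined = sum(sims.values())
prob_combined_10_plus = np.mean(combined >= 10)
```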

Before looking at the results, let’s play a little game. Can you guess the average of Hosmer’s simulated WAR values? You’ll get two guesses. *Be sure to click the ‘Submit guess’ button after entering your guess.*

Now, let’s dive into the results. The bubble chart below shows the range of positive simulated WAR values. Each bubble has the first letter of the player’s name, and the number represents the simulation trial (1-100).

That is a bit overwhelming. Let’s try a heatmap instead. We see that Salvy almost always falls between 2-4 WARs. Moustakas often has fewer than three WARs, which is heavily impacted by the struggles early in his career. Hosmer and Escobar often skew to the lower end of the spectrum, too. Gordon has several high marks due to some really strong WARs a few years ago. Likewise, Cain often lands squarely in the 3-4 WAR range.

Based on these results, it doesn’t seem likely this cohort could collectively produce 25+ WARs. A more likely amount is 15 WARs, which isn’t that impressive of a number for six players combined.

These results may seem surprising to some Royals fans. However, many advanced metrics have never been kind to players like Hosmer, Escobar, and Perez. Additionally, the model does not weight the age of the historical WAR values. We know that Moustakas has gotten better and Gordon has gotten worse recently. Future work should use a weighted WAR.

To close, for those interested, here are individual simulation plots to show the nuance for each player. Each line represents one of the simulation results.

This is intended to be a short post, mostly focusing on data viz.

Did the Royals’ strikeout-to-walk ratio improve over time? Well, no. It has actually increased. There goes that theory.

Did the team’s OBP and SLG improve? According to the visualization below, we see the answer is “yes.” For a closer look, click the Adjust Axis button. You can also zoom in further by double-clicking or by using your computer’s track pad.

Now, what about outbursts of home runs? *Click refresh on your browser window to play the animation of home run totals by game*. Anecdotally, we see some big spikes later in the season.

OK, I cheated on this next visualization and made it with ggplot in R instead of d3 like the others. Are the Royals hitting earlier or later in the count? According to this visualization, there doesn’t seem to be much difference.

Some of the hypotheses panned out…and some clearly didn’t. One of the things I love about data viz is its ability to quickly support a hypothesis (while flagging the need to confirm it mathematically) or refute it outright.

**About the Data and Approach**

The dataset comprises (almost) all 30K of Max Scherzer’s MLB pitches. The data includes the counts and sequences of all his pitches, which is what we need for a Markov chain, the chosen approach for this problem. A Hidden Markov Model (HMM) consists of observed states and the latent states that determine them. This modeling technique essentially outputs how states transition from one to another.

Here’s how an HMM works in this application of predicting the next pitch.

We first calculate transition probabilities, which is how often a certain type of pitch is followed by another type of pitch. For example, after throwing a fastball, there might be a 50% probability of throwing another fastball, a 25% probability of throwing a curveball, a 15% probability of throwing a slider, and a 10% probability of throwing a change-up.
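Transition probabilities fall out of a simple crosstab of each pitch against the next; here's the idea on a toy pitch sequence (invented for illustration, not Scherzer's real data):

```python
import pandas as pd

# toy pitch sequence: FB = fastball, CU = curveball, SL = slider, CH = change-up
pitches = ["FB", "FB", "CU", "FB", "SL", "FB", "FB", "CH", "FB", "CU", "FB"]

# pair every pitch with the one that followed it
pairs = pd.DataFrame({"prev": pitches[:-1], "next": pitches[1:]})

# each row gives P(next pitch | previous pitch) and sums to 1
transition = pd.crosstab(pairs["prev"], pairs["next"], normalize="index")
```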

We next calculate emission probabilities. Given a certain pitch, the emission probability returns the likelihood of a certain count (e.g. 3-2, 1-2). For example, given the pitch was a fastball, what is the probability of the situation being a 3-1 count?

Lastly, we use the Viterbi algorithm to find the most likely sequence of hidden states. In this instance, a hidden state would be a “pitch.” Essentially the idea is to see if, given the count and the predicted previous pitch, the Viterbi algorithm can reproduce the pitches thrown.
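For the curious, here is a miniature, hand-rolled Viterbi decode with two pitch types and a toy subset of counts; all of the probabilities are invented for illustration, not estimated from Scherzer's data:

```python
import numpy as np

states = ["FB", "CU"]                 # hidden states: pitch types
counts = ["0-0", "0-1", "1-2"]        # observed states: a toy subset of counts

start = np.array([0.7, 0.3])          # P(first pitch)
trans = np.array([[0.6, 0.4],         # P(next pitch | FB)
                  [0.5, 0.5]])        # P(next pitch | CU)
emit = np.array([[0.5, 0.4, 0.1],     # P(count | FB)
                 [0.1, 0.3, 0.6]])    # P(count | CU)

obs = [0, 2, 1]                       # observed count sequence: 0-0, 1-2, 0-1

# dynamic programming in log space to avoid underflow
log_start, log_trans, log_emit = np.log(start), np.log(trans), np.log(emit)
n, k = len(obs), len(states)
dp = np.zeros((n, k))
back = np.zeros((n, k), dtype=int)
dp[0] = log_start + log_emit[:, obs[0]]
for t in range(1, n):
    for j in range(k):
        scores = dp[t - 1] + log_trans[:, j]
        back[t, j] = int(np.argmax(scores))
        dp[t, j] = scores[back[t, j]] + log_emit[j, obs[t]]

# trace back the most likely hidden pitch sequence
path = [int(np.argmax(dp[-1]))]
for t in range(n - 1, 0, -1):
    path.append(int(back[t, path[-1]]))
decoded = [states[i] for i in reversed(path)]
```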

**Transition Probabilities**

Below are Scherzer’s pitch transition probabilities for fastball, change-up, curveball, slider, and “other”. (“Other” is a catchall for the small number of cutters and sinkers he throws). As we know, Scherzer goes to his fastball a lot. After any given pitch, he is pretty likely to follow up with a fastball.

**Emission Probabilities**

The chart below displays Scherzer’s pitch emission probabilities, which, given a certain pitch, return the likelihood of a certain count. For example, many of the dots on the fastball line are comparatively close together. However, the dots on the curveball line are more spread out. Therefore, given *just* a count, we can more easily predict whether or not the pitch was a curveball than if it was a fastball.

**Viterbi Algorithm**

I fed the Viterbi algorithm the sequence of counts for all of Scherzer’s pitches in his career. The algorithm uses the transition and emission probabilities to predict the sequence of pitches. (This approach isn’t perfect, as the last pitch in a game won’t necessarily be correlated to the first pitch in the following game, but this small glitch will only impact the first observation in a new game).

Well, the algorithm has pretty poor accuracy, under 50%, mostly predicting fastball in all scenarios. Some adjustments could likely be made to improve performance, though we *expect* to have pretty low accuracy on such a problem.

I wanted to analyze a broader set of speeches, but I could not locate an archive of them. The texts of the three above, though, were easily found via a Google search.

**Lexical Dispersion**

One of the more interesting elements of text mining is lexical dispersion (that is, where words appear in a corpus of text). The below lexical dispersion plots display when Griffey, Smoltz, and Maddux mentioned the names of their teams. To note, it’s interesting that Maddux never actually said “Braves”, instead always referring to the team as Atlanta.
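Lexical dispersion boils down to recording the token offsets where a target word appears; here is a tiny hand-rolled version (with an invented snippet standing in for the actual speech text):

```python
# invented tokenized snippet standing in for a speech
speech = ["thank", "you", "atlanta", "fans", "i", "love",
          "atlanta", "and", "chicago", "too", "atlanta"]
targets = ["atlanta", "chicago"]

# token positions per target word -- these offsets are what a dispersion plot draws
dispersion = {w: [i for i, tok in enumerate(speech) if tok == w] for w in targets}
```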

Lexical Dispersion for Ken Griffey Jr:

Lexical Dispersion for John Smoltz:

Lexical Dispersion for Greg Maddux:

**Uni-Grams**

Another staple of text mining is inspecting the most used words, which can help us quickly understand high-level themes of a corpus of text. Unsurprisingly, Griffey mentioned his dad a lot; Smoltz uttered a variant of “thanks” more than 40 times; Maddux had an emphasis on the word “first.”

**Bi-Grams**

Bi-grams are a set of two words used together. A HoF speech is a pretty small corpus of text, so there won’t be too many bi-grams. In fact, Maddux didn’t have enough bi-grams to really even make a plot. In the graphs below, we see bi-grams like “spring training” or “work hard” or “Tommy John.” Nothing too surprising. To note, words are “stemmed”, so items like “spring training” and “spring training’s” are classified as one term.

**Latent Dirichlet Allocation**

Lastly, Latent Dirichlet Allocation is a methodology that identifies underlying topics in a speech. Again, each HoF speech is not too large, so topics will not be the most robust in this setting. The tables below show the top words in the five identified topics for each speech. Some make sense, while others are scattered. For example, the words in Griffey’s Topic 2 go together fairly well, while those in Topic 3 really do not.

LDA results for Griffey:

LDA results for Smoltz:

LDA results for Maddux:

**About the Data**

Career lengths are measured in days (including off-seasons), and this dataset includes position players from 1985 – 2016. One of the benefits of survival analysis is that it can handle censored data. In this case, current players are censored – they haven’t yet retired. By marking these players as censored, we can still include their data in a survival model.

**Survival Curves**

Survival curves display the survival probability at different points in time. In the below survival curve of baseball career lengths, we see, for example, that few players remain after 6,000 days (about 16 years).
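The mechanics behind such a curve (the Kaplan-Meier estimate) are simple enough to compute by hand; a sketch on invented career lengths, with 0 marking a censored, still-active player:

```python
import numpy as np

durations = np.array([1200, 3000, 3000, 4500, 6000, 7500])  # career length in days
observed = np.array([1, 1, 0, 1, 1, 0])                     # 0 = censored (still active)

# at each event time, multiply survival by (1 - retirements / players still at risk)
surv = 1.0
curve = {}
for t in np.sort(np.unique(durations[observed == 1])):
    at_risk = int(np.sum(durations >= t))
    events = int(np.sum((durations == t) & (observed == 1)))
    surv *= 1 - events / at_risk
    curve[int(t)] = surv
```

Note how the two censored players still contribute to the risk sets before they drop out, which is exactly why censoring matters.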

We can also break out survival curves into classes of players. The plot below displays survival curves by switch-hitter, right-handed hitter, and left-handed hitter. Interestingly, switch-hitters retire at a much slower pace until the 4,000 day mark (green line). On the surface, this makes sense – switch-hitters tend to be good athletes, so it stands to reason they would possess longevity.

What about survival curves by how players hit *and* throw? That, too, can be done and is shown below. When we break out switch-hitters into how they throw (left-handed or right-handed), we get a pretty small sample size, so I don’t want to put a ton of stock into those curves (though they are interesting). We are able to clearly see that players who hit left-handed and throw right-handed have higher probabilities of lasting longer compared to lefty-lefty and righty-righty players. We also see that players who hit right-handed but throw left-handed, which is pretty rare, are typically quicker to retire.

What about some more intricate survival curves? Below are survival curves for career length by birth country and by height. Both are pretty messy, though we can see career lengths differ on these dimensions.

**Cox PH Regression**

Cox PH regression allows us to predict time-to-event variables. To more accurately assess the impact of multiple variables on career length, I built a Cox regression model with the following predictors:

- average salary (with all salaries being converted into 2016 dollars)
- birth country
- weight
- height
- right-handed or left-handed hitter
- right-handed or left-handed thrower
- age at time of MLB debut

I didn’t include traditional stats like batting average or home runs. It’s fairly obvious that good players last longer.
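To make the hazard-ratio machinery concrete, here is a hand-rolled Cox partial likelihood maximized by grid search on toy data (all values invented; the real model was fit with a proper survival library on the full predictor set):

```python
import numpy as np

# toy data: career length in days, event flag (1 = retired), one binary covariate
time = np.array([1000.0, 2500.0, 4000.0, 5500.0, 7000.0])
event = np.array([1, 1, 1, 1, 1])
x = np.array([0.0, 1.0, 0.0, 1.0, 1.0])  # e.g. 1 = right-handed thrower (hypothetical)

def log_partial_likelihood(beta):
    ll = 0.0
    for i in range(len(time)):
        if event[i]:
            risk = time >= time[i]  # everyone whose career is still ongoing at this event
            ll += beta * x[i] - np.log(np.sum(np.exp(beta * x[risk])))
    return ll

# crude grid search for the maximizing coefficient
betas = np.linspace(-3, 3, 601)
beta_hat = betas[np.argmax([log_partial_likelihood(b) for b in betas])]
hazard_ratio = np.exp(beta_hat)  # HR < 1 means the covariate is associated with longer careers
```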

Four variables are statistically significant: weight, height, throwing hand, and age at debut. Increases in both height and weight lead to increases in the hazard ratio, which indicates worse survival (i.e. shorter careers). That is, *on average*, bigger players have shorter careers, which makes sense in some respects. (However, this relationship is likely non-linear, and this linear approach surely smooths over interesting nuance). Likewise, increases in a player’s MLB debut age correspond to increases in the hazard ratio (i.e. shorter careers). This is a no-brainer, and it’s encouraging that the model didn’t miss this item. Interestingly, on average, throwing right-handed decreases the hazard ratio. That is, right-handed throwers have longer careers than left-handed throwers. The below survival curve bears this out.

Thoughts or questions? Leave them for me in the comments!

**Data Source**

Using Python, I ingested player data from a MySQL instance of the Lahman database and performed some simple wrangling. I prefer to conduct factor analysis in R, so I wrote the data to a .csv and loaded it into RStudio for analysis and visualization.

**Offensive Statistics**

A few statistical tests (e.g. looking at eigenvalues) revealed three factors to be optimal for the offensive data. The results are intuitive. Factor 1 corresponds to just plain good hitters – think doubles guys like George Brett and Roberto Clemente – with factors loading heavily on runs, hits, doubles, and RBIs. Factor 2 looks like lead-off guys – think players like Tim Raines – with factors loading heavily on games, at bats, stolen bases, and caught stealings. Lastly, Factor 3 corresponds to the power hitters, loading strongly on home runs and strikeouts.

We can also view the results as a path diagram, again showing which variables most strongly connect to our three factors (connections with low scores are eliminated for better visualization).

**Pitching Statistics**

What about pitching stats? Statistical tests showed that three factors are optimal for this data. These results are also fairly intuitive. Factor 1 looks like old-time starters (like Cy Young), with factors loading heavily on both wins and losses as well as complete games. Factor 2 appears to be new-school starters (like Randy Johnson), loading strongly on strikeouts and home runs. Lastly, Factor 3 clearly represents relievers, loading heavily on games and saves.

Again, we can also view the results as a path diagram (connections with low scores are eliminated for better visualization).

Pretty interesting, huh? I’ve always found factor analysis to be a useful methodology, and that again proved to be the case with HoF data.

This blog takes advantage of the ability to compare baseball players over time. Leveraging a framework often used to construct recommendation systems, I built a Python script based on standard, aggregated pitching stats that surfaced pitchers with similar seasons. (Standard stats are items like ERA, home runs allowed per game, walks issued per game, etc). Basically, the program “recommended” similar pitchers.
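The core of such a recommender is a similarity measure over stat vectors; a minimal cosine-similarity sketch with made-up, unstandardized numbers (in practice the stats should be scaled first, and the real script may use a different measure):

```python
import numpy as np

# hypothetical per-season stat vectors: [ERA, K/9, BB/9, HR/9]
seasons = {
    "Schilling 2001": np.array([2.98, 10.3, 1.1, 1.2]),
    "Scherzer 2016":  np.array([2.96, 11.2, 2.2, 1.4]),
    "Smoltz 1996":    np.array([2.94, 9.8, 2.0, 0.8]),
}

def cosine(a, b):
    # cosine similarity: 1.0 means the vectors point the same direction
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

target = seasons["Schilling 2001"]
sims = {name: cosine(target, vec) for name, vec in seasons.items()
        if name != "Schilling 2001"}
most_similar = max(sims, key=sims.get)
```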

For example, which pitchers had seasons most similar to Curt Schilling’s 2001 season? Well, we see 2002 Curt Schilling (93% similarity), 2016 Max Scherzer (89% similarity), and 1996 John Smoltz (88% similarity). What about Greg Maddux’s 1992 season? We don’t see too many of his contemporaries. In fact, 12 of the top 15 most similar seasons were prior to 1990 (including Dazzy Vance from 1928).

If we look at Roger Clemens’ 1997 season, we see an interesting mix of contemporaries and players from previous decades.

What about Randy Johnson’s dominant 2002 season? In the chart below, the more complete the circle (the closer it is to one), the more similar the season to Johnson’s 2002 performance. In addition to several contemporaries, we see similar seasons by Sandy Koufax and Jim Maloney (who won 23 games in 1963). However, well, Randy Johnson was most similar to earlier versions of himself.

Let’s look at a slightly more recent season, Zack Greinke’s performance in 2009. The table below displays a selection of player stats (re-scaled to be 0–100) for the seasons most similar to Greinke’s 2009 performance. Remember Kevin Brown?

Though there is overall similarity among pitchers, which can be more than 90% overall, not all players perform in lock step on all metrics. The chart below compares stats for the seasons most similar to Pedro Martinez in 2000, with each line representing a pitcher (top similarities: 2015 Clayton Kershaw, 90% similar; 2000 Randy Johnson, 87% similar; 2001 Randy Johnson, 86% similar). Though these seasons are more than 80% similar to Pedro in 2000, we don’t see too many lines that move in lock step. While these stats are in the *same range*, it’s interesting to note there is *variation within that range*.

Next up, we’ll apply some novel statistical methods to better understand players and teams. Stay tuned for items like Markov models and general additive models!

So, how bad have the Royals been in comparison to other teams? What exactly is wrong with them? This blog provides some answers to those questions, specifically looking at offensive metrics, though this piece is far from comprehensive.

(Note, the dataset I’m using denotes the Royals as KCA).

**The Royals don’t hit in horrible counts…but they don’t hit in great counts either**

The Royals are decently aggressive, ranking in the middle in terms of the percentage of plays ending on 0-0 counts.

The Royals still rank in the middle in terms of plays ending with an 0-2 count, and they rank low on the metric of 1-2 counts. Both are seemingly good outcomes.

But, there’s always a but, they rank much lower than several teams on plays ending with a hitter’s count. This is not so great.

**The Royals aren’t getting their lead-off runner on**

**They are only average in strikeout-to-walk ratio**

Again, the number isn’t terrible, but it isn’t great either. Are you seeing a pattern here? The Royals’ offense isn’t terrible across the board…just average in many ways, strong in almost no department, and pretty bad in a few areas (as we’ll see soon).

**The Royals aren’t hitting for power**

Well, this really isn’t a surprise. The Royals have never been a power team, and Kauffman isn’t the most hitter-friendly park. However, with the likes of Mike Moustakas, Eric Hosmer, Salvador Perez, and Brandon Moss, a ranking

**The Royals aren’t getting on base (duh), particularly against lefties**

The Royals have struggled mightily with getting on base, particularly against lefties. Their on base percentage against lefties is dead last, and they are also at the bottom in OBP against righties. (To note, I’m using a slightly different formula for OBP – it’s more of a broad success rate – so these numbers won’t align with those reported by traditional outlets).

**The Royals aren’t frequently hitting line drives**

I looked at the Royals in comparison to a set of four teams playing quite well so far – the Astros, Nats, Rockies, and Orioles. On average, the Royals hit fewer line drives than all of these teams. KC averages 5.6 per game, compared to the likes of Baltimore and Washington, who average 7+.

Clearly, the Royals have had their struggles. But as I pen this blog, the Royals are currently winning in the late innings. Perhaps a comeback is in the cards…though my head cannot quite ignore the numbers in this post.

This blog sets out to answer an ambitious question: Can we predict the result of a game using (almost) only data on the previous game?

To note, this blog is different from my other projects in that it doesn’t include data visualizations. I wanted to use this project as an opportunity to lay the foundation for more complex machine learning, so I heavily focused on the scikit-learn API. Secondly, as this is an initial stab at solving this problem, I only worked with data for one team: the KC Royals. Future work on this topic will be substantially more robust.

**About the Data**

The dataset consisted of per-game statistics from 2012 – 2016 for the KC Royals. The data included staple offensive and defensive statistics as well as items like location and game length. I did include a variable for the current opponent, but the predictive algorithms mostly focus on using data from the previous game. Future work will likely include more variables for the game at hand.

**Feature Selection**

The initial dataset included 110 potential predictor variables, which is quite a few for this research question. I leveraged scikit-learn’s SelectKBest to return the 20 most predictive variables. (Remember, I’m mostly using stats from the previous game as predictor variables). Here’s what it returned:

- Triples (offense)
- Walks (offense)
- Strikeouts (pitching)
- Inherited runners (pitching)
- Stolen bases (pitching)
- Triples (defense)
- Opponent pitcher throwing arm
- Month: August
- Previous game opponent: White Sox, Yankees, Cardinals, Blue Jays
- Current game opponent: White Sox, Reds, Brewers, Twins, Yankees, Rays, and Blue Jays

The selection of certain teams suggests there might be some systemic advantage or disadvantage in these match-ups.
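The SelectKBest step looks roughly like this (synthetic data with 30 features standing in for the real 110):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

X, y = make_classification(n_samples=200, n_features=30, n_informative=5, random_state=0)

# score every feature against the win/loss label, keep the top 20
selector = SelectKBest(score_func=f_classif, k=20).fit(X, y)
kept_indices = selector.get_support(indices=True)
X_reduced = selector.transform(X)
```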

**Grid Search and Model Performance**

After using grid search to optimize hyperparameters, I ran four models and used five-fold cross validation to evaluate performance. Here are average F1 scores for each of the models, with “win” being the positive class in the binary classification:

- Support Vector Machine: 0.69
- Random Forest: 0.62
- Logistic Regression: 0.62
- K Nearest Neighbors: 0.48

As you can see, none of the models performed particularly well.

To note, the F1 score is the harmonic mean of precision and recall. In this case, precision answers the question, “of all the games we predicted as wins, how many actually were wins?” Recall answers the question, “did we locate all the wins?”
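The arithmetic behind F1, on an invented confusion count:

```python
# toy counts: 8 wins correctly predicted (TP), 2 losses called wins (FP), 3 wins missed (FN)
tp, fp, fn = 8, 2, 3

precision = tp / (tp + fp)  # of predicted wins, the share that really were wins
recall = tp / (tp + fn)     # of actual wins, the share we found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean, ~0.762 here
```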

As a final step, I used the SVM to predict on a test set of data the algorithm had never seen, which returned an F1 score of 0.68. However, the straight-up accuracy of the model was just 0.52, slightly lower than the home vs. away benchmark discussed above.

**What’s Next?**

Additional work will be conducted on this topic. More data – in terms of features and observations – will be used to train the model. This first attempt, though, was useful for laying groundwork and providing some initial benchmarks.