Factor analysis is a technique for reducing data complexity. Factors are latent variables that represent underlying constructs in the data. An interesting baseball-related application is to run factor analysis on data for Hall of Famers: based on staple offensive and pitching statistics, can we find underlying classes of players in the Hall of Fame?
Using Python, I ingested player data from a MySQL instance of the Lahman database and performed some simple wrangling. I prefer to conduct factor analysis in R, so I wrote the data to a .csv file and loaded it into RStudio for analysis and visualization.
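For anyone who wants to try the ingest step, here is a minimal sketch of the idea rather than the exact queries used for this post. It assumes Lahman-style `Batting` and `HallOfFame` tables, uses an in-memory SQLite database as a stand-in for the MySQL instance so it runs without a server, and fills it with placeholder rows (the player IDs and numbers are made up).

```python
import csv
import io
import sqlite3

# Stand-in for the MySQL connection: an in-memory SQLite database with a
# Lahman-style layout. The rows below are placeholders, not real statistics.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Batting (playerID TEXT, yearID INTEGER, G INTEGER,
                          AB INTEGER, H INTEGER, HR INTEGER);
    CREATE TABLE HallOfFame (playerID TEXT, inducted TEXT);
    INSERT INTO Batting VALUES
        ('player_a', 1990, 150, 600, 180, 10),
        ('player_a', 1991, 140, 550, 160, 12),
        ('player_b', 1990, 100, 300, 70, 20),
        ('player_c', 1990, 160, 620, 190, 5);
    INSERT INTO HallOfFame VALUES
        ('player_a', 'Y'), ('player_b', 'N'), ('player_c', 'Y');
""")

# Sum season rows into career totals, keeping only inducted players.
query = """
    SELECT b.playerID AS playerID,
           SUM(b.G) AS G, SUM(b.AB) AS AB, SUM(b.H) AS H, SUM(b.HR) AS HR
    FROM Batting AS b
    JOIN HallOfFame AS h ON h.playerID = b.playerID
    WHERE h.inducted = 'Y'
    GROUP BY b.playerID
    ORDER BY b.playerID
"""
cursor = conn.execute(query)
header = [col[0] for col in cursor.description]
rows = cursor.fetchall()

# Write the career totals as CSV text. To save a file instead, replace the
# buffer with open("hof_batting.csv", "w", newline="") (hypothetical filename).
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(header)
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```

The key wrangling step is the aggregation: the season-level rows are collapsed into one row of career totals per inducted player, which is the shape a factor analysis expects.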
A few statistical checks (e.g., looking at eigenvalues) indicated that three factors are optimal for the offensive data. The results are intuitive. Factor 1 corresponds to just plain good hitters (think doubles guys like George Brett and Roberto Clemente), with runs, hits, doubles, and RBIs loading heavily on it. Factor 2 looks like lead-off guys (think players like Tim Raines), with games, at bats, stolen bases, and times caught stealing loading heavily. Lastly, Factor 3 corresponds to the power hitters, with strong loadings for home runs and strikeouts.
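One common way to "look at eigenvalues" is the Kaiser criterion: compute the eigenvalues of the correlation matrix and keep as many factors as there are eigenvalues above 1. The sketch below illustrates that rule on synthetic data built from three latent factors. It is only an assumption that a rule like this was among the checks; the actual analysis was done in R, and none of the numbers here come from the Hall of Fame data.

```python
import numpy as np

rng = np.random.default_rng(0)
n_players, n_factors, vars_per_factor = 500, 3, 3

# Synthetic data only: three independent latent factors, each driving
# three observed variables with a loading of 0.9 plus independent noise.
latent = rng.standard_normal((n_players, n_factors))
noise = rng.standard_normal((n_players, n_factors * vars_per_factor))
loading = 0.9
observed = (
    loading * np.repeat(latent, vars_per_factor, axis=1)
    + np.sqrt(1 - loading**2) * noise
)

# Eigenvalues of the correlation matrix, sorted largest first.
corr = np.corrcoef(observed, rowvar=False)
eigenvalues = np.sort(np.linalg.eigvalsh(corr))[::-1]

# Kaiser criterion: retain factors whose eigenvalue exceeds 1.
n_retained = int(np.sum(eigenvalues > 1.0))
print(eigenvalues.round(2))
print("factors retained:", n_retained)
```

Because the synthetic variables fall into three tightly correlated groups, three eigenvalues stand well above 1 and the rest sit near zero. Real data is rarely this clean, which is why a scree plot or parallel analysis is often consulted alongside the eigenvalue rule.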
We can also view the results as a path diagram, again showing which variables most strongly connect to our three factors (connections with low loadings are omitted for readability).
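Dropping the weak connections amounts to thresholding the loadings by absolute value. Here is a small sketch of that filter. The loadings are hypothetical numbers chosen for illustration, not the values from this analysis, and the 0.3 cutoff is an assumed value.

```python
# Hypothetical loadings for illustration only; these are not the values
# produced by the analysis in this post.
loadings = {
    "HR": {"F1": 0.25, "F2": 0.05, "F3": 0.85},
    "SB": {"F1": 0.10, "F2": 0.80, "F3": -0.20},
    "H": {"F1": 0.90, "F2": 0.35, "F3": 0.15},
}


def strong_edges(loadings, cutoff=0.3):
    """Return (variable, factor, loading) triples whose absolute loading meets the cutoff."""
    return [
        (variable, factor, value)
        for variable, by_factor in loadings.items()
        for factor, value in by_factor.items()
        if abs(value) >= cutoff
    ]


edges = strong_edges(loadings)
for variable, factor, value in edges:
    print(f"{factor} -> {variable}: {value:+.2f}")
```

Using the absolute value matters because a strongly negative loading is just as informative as a strongly positive one and should stay in the diagram.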
What about pitching stats? Statistical tests again showed that three factors are optimal for this data, and these results are also fairly intuitive. Factor 1 looks like old-time starters (like Cy Young), with wins, losses, and complete games all loading heavily on it. Factor 2 appears to be new-school starters (like Randy Johnson), with strong loadings for strikeouts and home runs. Lastly, Factor 3 clearly represents relievers, with games and saves loading heavily.
Again, we can also view the results as a path diagram (connections with low loadings are omitted for readability).
Pretty interesting, huh? I’ve always found factor analysis to be a useful methodology, and that again proved to be the case with HoF data.