A Review of Baseball Data Sources

Hello, readers! It’s been a while. I’ve taken a slight hiatus from blogging, but I have a few projects in the pipeline. To get back into the spirit of things, I thought I would share my thoughts on useful sources for baseball data science projects. I’ve been asked about this topic a couple of times recently, which inspired this short post.

The following table lists useful baseball data sources and the corresponding resources for each. It is by no means comprehensive. However, if you’re looking to start some baseball data science projects, this list should be a decent starting point.

Data Source | Description | Resources
Lahman Database | Aggregate year-by-year statistics dating back to the 1800s | Documentation
Retrosheet | Over 100 data points on each play of the MLB season | Field Descriptions, Download Script
PITCHf/x | Diagnostics on every pitch thrown by an MLB pitcher | Scraper
Baseball Reference | Aggregated statistics summarized in every possible way | Scraper
FanGraphs | All the sabermetric stats you could ever want | FanGraphs Leaderboard
mlbgame Python library | Game stats that can be ingested via Python | Documentation
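
To give a feel for what working with one of these sources can look like, here is a minimal sketch that computes season batting averages from rows shaped like the Lahman Database's batting table. The column names (playerID, yearID, AB, H) are my assumption about the CSV layout, so check them against the Lahman documentation. The sample rows are made up for illustration and are not real statistics, and the function name is hypothetical.

```python
import csv
import io
from collections import defaultdict

# Made-up rows in the assumed shape of the Lahman batting table.
# These are NOT real players or real statistics.
SAMPLE_BATTING_CSV = """playerID,yearID,teamID,AB,H
examp01,2001,AAA,300,90
examp01,2001,BBB,200,60
examp01,2002,BBB,400,100
examp02,2001,AAA,0,0
"""


def season_batting_averages(csv_file):
    """Return {(playerID, yearID): hits / at-bats}, summed across teams."""
    at_bats = defaultdict(int)
    hits = defaultdict(int)
    for row in csv.DictReader(csv_file):
        key = (row["playerID"], int(row["yearID"]))
        # A player traded mid-season can appear in several rows for one
        # year, so totals are accumulated before dividing.
        at_bats[key] += int(row["AB"])
        hits[key] += int(row["H"])
    # Batting average is undefined without at-bats, so skip those seasons.
    return {key: hits[key] / ab for key, ab in at_bats.items() if ab > 0}


averages = season_batting_averages(io.StringIO(SAMPLE_BATTING_CSV))
for (player, year), avg in sorted(averages.items()):
    print(f"{player} {year}: {avg:.3f}")
```

To run this against a real download, you would swap the in-memory sample for an opened file, for example `open("Batting.csv", newline="")`, where the file name is again an assumption about how the download is laid out.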