The site has been up for a couple of weeks now, and I wanted to get familiar with the process: coming up with relevant topics to search for in my data model, copying and pasting the results into this WordPress blog format, and writing some explanation of what it all means. It just so happened that the playoffs were under way, providing more fodder for searches. Now it is time to step back and provide background on what all this data means and whether I am just making it up. This post will be one of many because I won’t be able to organize all the information properly in one fell swoop. Eventually all of this will be put into a FAQ and/or an About page.
Why are you doing this?
This project began last April when I heard that Darwin Barney, the 2012 Gold Glove winner for the Chicago Cubs, was named that team’s most valuable player with a WAR=4.8. WAR is an acronym for Wins Above Replacement, and there are several variations from different sites. I referenced the WAR derived by baseball-reference.com. What does the number 4.8 mean? If WAR is related to “value”, that means Darwin Barney was worth more than Kyle Lohse, who went 16-3 with a 2.86 ERA over 211 innings pitched and had a WAR=4.3. Even though the Cubs lost more than 100 games that year, I didn’t think Barney was their best player, let alone better than Kyle Lohse, so I began to investigate.
First I attempted to analyze their mathematics and got lost. The more I read, the more confused I became. At the same time I was working on a data model for a networking project and got stuck. That model wasn’t right, all my entities seemed wrong, and I wondered how to make sense of it even though it was completely coded and operational. This can happen with the abstract nature of data models (see below). I couldn’t explain the model in my documentation. To clear my head I decided it might be fun to model something completely different, like baseball. I didn’t plan on coming up with a player value system; I just wanted to see if I could define entities and their relationships and create a framework to easily count stuff using simple scripts.
I cleaned the whiteboard and started to draw an Entity Relationship Diagram (see below) for baseball. Six months and over 7,000 lines of Perl code later, I have every MLB player since 1890 rated and ranked, as well as players from many seasons in many minor leagues. The mathematics have proofs, and those proofs are used as integrity checks that can tell if there are deficiencies in the source data or the code.
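To illustrate the idea of an integrity check, here is a minimal sketch in Python (not the Perl code behind this site). The identity it uses is my own assumption for illustration and is not necessarily one of the proofs in the model: across a closed set of games, hits credited to batters must equal hits charged to pitchers. The class names, field names, and numbers are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class BattingLine:
    player: str  # hypothetical player label
    hits: int


@dataclass
class PitchingLine:
    player: str  # hypothetical player label
    hits_allowed: int


def hits_balance(batting, pitching):
    """Return batter hits minus pitcher hits allowed.

    Zero means the two sides of the ledger agree; anything else
    points to a gap in the source data or a bug in the code.
    """
    total_hits = sum(line.hits for line in batting)
    total_allowed = sum(line.hits_allowed for line in pitching)
    return total_hits - total_allowed


# Made-up totals for a tiny closed league (not real statistics).
batting = [
    BattingLine("batter_a", 3),
    BattingLine("batter_b", 1),
    BattingLine("batter_c", 2),
]
pitching = [
    PitchingLine("pitcher_x", 4),
    PitchingLine("pitcher_y", 2),
]

if hits_balance(batting, pitching) != 0:
    print("integrity check failed: batting and pitching hits disagree")
```

A check like this says nothing about which record is wrong, only that something is, which is usually enough to know where to start looking.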
What did I learn? Ryan Dempster and Alfonso Soriano were the two best Cubs in 2012 (rank column is rank amongst all MLB pitchers and batters). The purpose for this table format will be discussed later. Darwin Barney played below average even though WAR had him highly valued.
What is a Data Model?
The cut-to-the-chase answer: a data model defines the scope of the data, its entities, and how those entities relate to each other. A properly defined data model allows various independent applications to share data seamlessly. It is used to define data requirements for software development and for business processes. Designers draw a “map” to keep track of the model, called an Entity Relationship Diagram (ERD), which displays the relationships between entities in a concise format. Data modeling is an entire field of software engineering, so I won’t go further than this brief introduction and definition.
What is the Scope of Data?
The scope of data defines what data the project requires. It exists to set limits; otherwise a project becomes boundless. The scope for this data model covers stats for every baseball player and every baseball event in every league in the history of the game. Stats can be anything from game attendance to salary to batting average. If you have an app that wants to look up a plumber in the yellow pages, that would be outside the scope of this project.
If you have an app that wants to look up the count Kirk Gibson had when he hit his two-run walk-off home run in game one of the 1988 World Series, that information is part of the scope of this data model. Just because data falls within this scope doesn’t mean it can be retrieved. The data must first be defined before it can be retrieved. Finding the count on Kirk Gibson’s HR requires a search through postseason event entities. Postseason events for 1988 are available at retrosheet.org on this page. I’ll get into more detail on the different kinds of data input later.
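A search like the one described above might be sketched as follows. This is a Python illustration under my own assumptions, not the actual code or schema: the `PlateAppearanceEvent` entity, its fields, and the identifiers are hypothetical, and the records are made up. It also shows the difference between data that is in scope and data that is actually defined: an event can exist without its count having been recorded.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PlateAppearanceEvent:
    game_id: str            # hypothetical game identifier
    batter_id: str          # hypothetical batter identifier
    result: str             # e.g. "HR", "K", "1B"
    balls: Optional[int]    # None when the count was never recorded
    strikes: Optional[int]


def count_on_event(events, game_id, batter_id, result):
    """Return (balls, strikes) for the first matching event.

    Returns None when no event matches, or when the matching
    event exists but its count is not defined in the data.
    """
    for ev in events:
        if (ev.game_id, ev.batter_id, ev.result) == (game_id, batter_id, result):
            if ev.balls is None or ev.strikes is None:
                return None
            return (ev.balls, ev.strikes)
    return None


# Made-up event records, not taken from any real game.
events = [
    PlateAppearanceEvent("game_001", "batter_a", "K", 1, 2),
    PlateAppearanceEvent("game_001", "batter_b", "HR", 2, 1),
    PlateAppearanceEvent("game_002", "batter_b", "HR", None, None),
]

print(count_on_event(events, "game_001", "batter_b", "HR"))
```

The second home run in the sample data is in scope but has no count, so the same call with `"game_002"` returns `None`.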
The scope of data for this project is rather large. The data model as currently defined is a small subset of the overall scope covering all of MLB from 1890 to present and many years of AAA, AA, A+, and some A minor leagues. Obviously individual events were not completely recorded throughout much of baseball’s history so even though they are part of this scope, they are ephemeral, never to be known.
What are Entities?
An entity represents an abstraction of a type of data or object. For example, a car can be considered an entity that represents all kinds of cars, separate from, say, trucks. An entity could instead be defined as “Road Vehicle”, which would encompass both cars and trucks. This gets people wondering where minivans fall, and the right questions get asked. Decisions like this should be made at the system level, in a model such as this.
All entities in the car example share many of the same characteristics, such as having wheels, seats, a dashboard, etc., each of which can be an entity in its own right. The more one travels down this rabbit hole, the more entities one encounters and must define. This process of definition ultimately breaks down a problem space into smaller and smaller problems to solve until you’re down to the bolts holding the engine in place. The resulting ERD can be used for many purposes downstream in the development cycle.
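The car example can be sketched in a few lines of Python. This is only an illustration of the idea, with names and attributes I made up: `RoadVehicle` is the broader entity, `Car` and `Truck` are narrower ones, and `Wheel` and `Seat` are entities in their own right that a vehicle is related to.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Wheel:
    diameter_inches: float


@dataclass
class Seat:
    position: str


@dataclass
class RoadVehicle:
    """The broad entity: anything with wheels and seats that drives on a road."""
    name: str
    wheels: List[Wheel] = field(default_factory=list)
    seats: List[Seat] = field(default_factory=list)


@dataclass
class Car(RoadVehicle):
    """A narrower entity; adds nothing of its own in this sketch."""


@dataclass
class Truck(RoadVehicle):
    """A narrower entity with one attribute a car does not need."""
    payload_lbs: float = 0.0


def total_wheels(vehicles):
    """Count wheels across any mix of road vehicles."""
    return sum(len(v.wheels) for v in vehicles)


car = Car("sedan", [Wheel(16.0) for _ in range(4)], [Seat("driver"), Seat("passenger")])
truck = Truck("pickup", [Wheel(18.0) for _ in range(6)], [Seat("driver")], payload_lbs=1500.0)

print(total_wheels([car, truck]))
```

Because `total_wheels` is written against `RoadVehicle`, it does not care whether a minivan ends up modeled as a car, a truck, or its own entity, which is the kind of decision the model is supposed to settle once.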
In this baseball data model I have defined entities such as players, teams, franchises, leagues, events, seasons, etc. Later I’ll post a proper entity relationship diagram that shows how the defined entities relate to each other. Applications can then use the resulting model to perform complex searches using simple code. Other applications that populate data into this model perform complex operations ahead of time to meet the definition of the model, relieving individual downstream applications of that task.
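As a rough sketch of what “complex searches using simple code” can look like, here is a small Python example. It is not the actual model: the entity names, the `Stint` relationship (a player on a team in a season), the fields, and all of the data are hypothetical stand-ins.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Player:
    player_id: str
    name: str


@dataclass(frozen=True)
class Team:
    team_id: str
    league: str


@dataclass(frozen=True)
class Stint:
    """Relationship entity: one player on one team in one season."""
    player_id: str
    team_id: str
    season: int
    games: int


def roster(stints, team_id, season):
    """Return the sorted player ids that appeared for a team in a season."""
    return sorted({s.player_id for s in stints
                   if s.team_id == team_id and s.season == season})


def games_by_team(stints, season):
    """Sum games played per team for one season."""
    totals = {}
    for s in stints:
        if s.season == season:
            totals[s.team_id] = totals.get(s.team_id, 0) + s.games
    return totals


# Made-up entities and relationships, not real players or teams.
players = {p.player_id: p for p in [Player("p1", "Player One"), Player("p2", "Player Two")]}
teams = {t.team_id: t for t in [Team("t1", "League A"), Team("t2", "League A")]}
stints = [
    Stint("p1", "t1", 2001, 100),
    Stint("p2", "t1", 2001, 50),
    Stint("p2", "t2", 2001, 40),   # p2 changed teams mid-season
    Stint("p1", "t1", 2002, 120),
]

for player_id in roster(stints, "t1", 2001):
    print(players[player_id].name)
```

Once the relationship is defined as its own entity, a mid-season trade is just two rows, and the counting code stays a one-line filter.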
That is all for now. Part 2 will come next, followed by an analysis of Japanese players transferring to MLB. Until then ….