Tag Archives: data modeling

Part 2 – Baseball Data Entities

Definition of Entities in Baseball

In the last post I defined the concept of an entity in data modeling now let’s identify and define various entities in baseball.   The following will list some of the major entities.  Let’s call the first entity PLAYER.

PLAYER – A PLAYER has three types;  batter, pitcher, and fielder.  Some players represent all three types, others less.  All players represent at least one type.  You may have noticed from many of the tables published here the last column of a certain type of table identifies that PLAYER as either PITCH or BAT which changes what the columns mean.

SEASON – A SEASON represents just that, a single year or complete set of games.  Recently I posted a table listing the best post season players in the history of MLB.  I chose to lump all post season games, all teams, and all post season players into a single SEASON.   A SEASON represents a pool of data to draw averages from.  Treating each post season year as its own SEASON would drastically reduce that pool of data thus distorting any kind of averages used to rate PLAYERs.  This entity will unfold as we get more into the model.

TEAM – A TEAM is ephemeral lasting only one season, thus, a TEAM is contained by a LEAGUE (see below).  In the modern era of MLB baseball a LEAGUE contains 30 TEAMs.  In earlier MLB years and in different leagues the number of TEAMs assigned to a LEAGUE will differ but the relationship remains.  The 2013 Boston Red Sox TEAM will never occur again.  Next year will be the the 2014 Boston Red Sox with different PLAYERs (but they don’t have to be).

LEAGUE – Like TEAM, a LEAGUE is also ephemeral whose composition changes from one SEASON to the next.  A SEASON contains LEAGUE.  This entity will contain league wide stats and averages.   LEAGUE contains TEAM.  For example, the 2013 MLB LEAGUE contains 30 TEAMs.  The 2013 AAA  Pacific Coast LEAGUE contains 16 TEAMs.  The 1931 MLB LEAGUE contains 16 TEAMs.  At this level I made a decision to not treat AL and NL as separate.   All analysis treats MLB as a single LEAGUE.   The process of defining a data model flushes out these kind of decisions early on instead of in the coding phase.

FRANCHISE – This entity represents the historical franchise.  A FRANCHISE will contain be related to many SEASONs TEAMs depending upon how long it has been in existence.   The tables in this model will use a 3 letter acronym using the nomenclature defined by retrosheet.org event data.  It should be intuitively obvious as to their meaning but soon I’ll put up a helper table, maybe popup, somewhere in case of confusion.   Historical trends relating to a particular FRANCHISE can be very useful and interesting.  Many minor league teams are assigned a FRANCHISE.

Update: Made two mistakes with this one.  An entity should only be contained by one other entity and there is no reason a FRANCHISE needs to contain TEAM when it’s contained by LEAGUE.  Instead of containment it’s a simple relationship between TEAM and FRANCHISE.  At first I put FRANCHISE containing SEASON which makes no sense.  This is why it’s useful to have a visual representation of these things.   In this data model I decided to   hard code this to the set of 30 modern FRANCHISES defined by MLB, many of which  date back to the turn of the century.   MLB FRANCHISEs wouldn’t make sense if this model was being used to keep track of a little league or some foreign professional baseball league however.  The three letter FRANCHISE tag appends to every player in every table with XYZ representing no MLB FRANCHISE.

GAME – A TEAM plays 162 GAMEs in a single season.  Two TEAMs play in each GAME.  GAMEs are contained by the SEASON and not TEAM.   In modern era of MLB there are 2430 total GAMES/SEASON.   This number will differ with different leagues.  Each GAME is related to exactly two TEAMs, one home, one away.  This entity also holds information such as attendance, length, location, etc.

EVENT – Any time something happens in a baseball game it generates an EVENT.  The most common EVENT is a Plate Appearance (PA) where a batter did something.  A non-Plate Appearance (NPA) is when something happens like a stolen base or pick off where a plate appearance is not recorded for the batter.  Other events types include substitution, changing innings, starting and ending games.  All  EVENT entities in this data model were derived from data curated by retrosheet.org.  In this model the SEASON contains all EVENTs that occurred.

RUN – This entity is generated any time a RUN scores.  There are usually three PLAYERs related to a RUN;   the pitcher who gave it up,  a batter who hit it in (not always), and a runner who scored (always).  An EVENT caused the RUN to happen and is related to the PLAYER who is related to the TEAM which wins or loses a GAME based upon how many RUNs they scored.  There are three types of RUNs;  Unearned, Earned, and Lucky.  The first two are well known in baseball and determined by official score keepers so as to not assign a RUN to a pitcher if that RUN happened through a fielding mistake.  One type of lucky RUNs occur when a RUN scored during a non-plate appearance, such as a wild pitch or some weird thing happening unrelated to the batter.  I call it lucky because it needs to be called something and for the most part, other than stealing home plate, they are all lucky for the team that gets the RUN.  The RUN still counts, the runner gets credit for the RUN, but the batter does not get credit for an RBI.

ERROR – An ERROR occurs when a fielder makes a mistake.  A fielder can be any player, both pitcher and batter.  ERRORs are assigned my a judgment call made by an official MLB score keeper.  ERRORs lead to unearned runs.  Since ERRORs are related to the player who committed them, unearned runs can also be assigned that player as well.

HIT – A HIT is any official base hit that happens.  There are four types of HITs.

WALK – A WALK is any walk.  This model does not care about hit by pitch or intentional walks.  A WALK is a WALK and discerning this entity into various types is pointless.

BASE – This entity represents the constitution of the base pads.  There are 7 different types of this entity from no one on base to bases loaded.  BASE is related to EVENT and used for various counting algorithms.

What is the purpose for all this?

The data model provides a structure for independent applications to count stuff without having to do much busy work.   Data models allow to visually see ways to simplify and reduce complexity — something that should be done at a system level.  

Once finished and populated with data an independent app can run through the model using simple code and output reports for whatever information someone may want to know.  For example I generated a table assigning unearned runs to fielders.  This kind of stat is not usually calculated anywhere because it cannot be discerned from a box score;  you need event data.  Having the event data properly stored in a well defined data model makes acquiring that data rather simple.  The ERROR entity is related to the RUN entity, the PLAYER entity, and EVENT entity.  The app that assigns unearned runs to fielders can be accomplished with simple logic because a framework  has already been built by front end scripts.  Any baseball league throughout the history of the game can fit into this data model.

Conclusion

Notice that none of this has anything to do with WAA or how to rate baseball players.  The definition of these entities or objects will be important in understanding the underlying concept behind this rating system.  My original goal for this exercise was to clear my thoughts by building a completely different data model than the one I was working on and to get more practice writing code to build the model.  

In the next installment we’ll walk through a simple count script using EVENT data.

Introduction to this site – Part 1

Introduction

The site has been up for a couple weeks now and I wanted to get familiar with the process of coming up with relevant topics to create a search for my data model, copying and pasting the results into this WordPress blog format, and writing some explanations as to what it all means.   It just so happened that the playoffs were under way providing more fodder for searches.  Now it is time to step back and provide background as to what all this data means and am I just making it up.   This post will be one of many because I won’t be able to organize all the information properly in one fell swoop.   Eventually all of this will be put into a FAQ and/or an About page.

Why are you doing this?

This project began last April when I heard Darwin Barney, 2012 gold glove winner for the Chicago Cubs was named that team’s most valuable player with a WAR=4.8. WAR is an acronym meaning Wins Above Replacement and there are several variations from different sites. I referenced  WAR derived by baseball-reference.com. What does the number 4.8 mean? If WAR is related to “value” that means Darwin Barney was worth more than Kyle Lohse who went 16-3 with a 2.86 ERA over 211 innings pitched and had a WAR=4.3. Even though the Cubs lost more than 100 games that year I didn’t think Barney was its best player let alone better than Kyle Lohse so I began to investigate.

First I attempted to analyze their mathematics and got lost. The more I read the more I became confused. At the same time I was working a data model for a networking project and got stuck. That model wasn’t right, all my entities seemed wrong, and I wondered how to make sense of it even though it was completely coded and operational.   This can happen with the abstract nature of data models (see below).  I couldn’t explain the model in my documentation. In order to clear my head I decided it might be fun to model something completely different like baseball.  I didn’t  plan on coming up with a player value system;  I just wanted to see if I could define entities and their relationships and create a framework to easily count stuff using simple scripts.

I cleaned the whiteboard and started to draw an Entity Relationship Diagram (see below) for baseball. Six months and over 7000 lines of perl code later I have every MLB player since 1890 rated and ranked as well as players in many years in many minor leagues.  The mathematics have proofs and those proofs are used as integrity checks that can tell if there is deficiencies in the source data or code.

What did I learn?   Ryan Dempster and Alfonso Soriano were the two best Cubs in 2012 (rank column is rank amongst all MLB pitchers and batters).  The purpose for this table format will be discussed later.  Darwin Barney played below average even though WAR had him highly valued.

Rank WAA BA/IP OBP/ERA PA/G RBI/W R/L Name_Tm BAT/PITCH
37 4.7 104.0 2.25 16 5 5 Ryan_Dempster_CHN PITCH
56 3.9 0.262 0.322 615 108 68 Alfonso_Soriano_CHN BAT
2173 -2.0 0.254 0.299 588 44 73 Darwin_Barney_CHN BAT

What is a Data Model?

The cut to the chase answer: A Data Model defines  scope of data, its entities, and how those entities relate with each other.  A properly defined data model allows various independent applications to share data seamlessly.   It is used to define data requirements for software development and for business processes.   Designers draw a “map” to keep track of the model called an Entity Relationship Diagram (ERD) displaying the relationships between entities in a concise format.   Data modeling is an entire field of software engineering so I won’t go further than this brief introduction and definition.

[expand title=”Read On”]

What is the Scope of Data?

The scope of data defines data required for the  project, defined to set limits otherwise a project becomes boundless.   The scope for this data model involves stats for every baseball player and every baseball event in every league in the history of the game.   Stats can be anything from game attendance to salary to batting average.  If you have an app that wants to look up a plumber in the yellow pages, that would be outside the scope of this project.

If you have an app that wants to look up the count Kirk Gibson had when he hit his two run walk off home run in game one of the 1988 World Series, that information is part of the scope of this data model.  Just because data falls into this scope doesn’t mean it can be retrieved.  The data must first be defined to be retrieved.   To find the count in Kirk Gibson’s HR requires a search through post season event entities.   Post season events for 1988 are available at retrosheet.org on this page.  I’ll get into more detail on the different kinds of data input later.

The scope of data for this project is rather large.  The data model as currently defined is a small subset of the overall scope covering all of MLB from 1890 to present and many years of AAA, AA, A+, and some A minor leagues.  Obviously individual events were not completely recorded throughout much of baseball’s history so even though they are part of this scope, they are ephemeral, never to be known.

What are Entities?

An entity represents an abstraction to a type of data or object.  For example, a car can be considered an entity that would represent all kinds of cars and separate from say trucks.  An entity can be defined as “Road Vehicle” which would encompass both cars and trucks.  This gets people wondering then where do minivans fall into and the right questions get asked.   Decisions like this should be made at system level such as this.

All  entities in the car example share many of the same characteristics such as having wheels, seats, dashboard, etc.  each of which can be entities upon their own.  The more one travels down this rabbit hole the more entities one encounters and must define.  This process of definition ultimately breaks down a problem space into smaller and smaller problems to solve until you’re down to the bolts holding the engine in place.   The resulting ERD can be used for many purposes downstream in the development cycle.

In this baseball data model I have defined entities such as players, teams, franchises, leagues, events, seasons, etc.  Later I’ll post a proper entity relationship diagrams that will properly show how the defined entities relate to each other.  Applications can then use the resulting model to perform complex searches using simple code.    Other applications that populate data to this model  perform  complex operations ahead of time to meet the definition of the model alleviating individual downstream applications from that task.

That is all for now.  Part 2 will come next followed by analysis of Japanese players transferring to the MLB.  Until then ….

[/expand]