Part 2 – Baseball Data Entities

Definition of Entities in Baseball

In the last post I defined the concept of an entity in data modeling. Now let’s identify and define various entities in baseball. The following lists some of the major entities. Let’s call the first entity PLAYER.

PLAYER – A PLAYER has three types: batter, pitcher, and fielder. Some players represent all three types, others fewer. All players represent at least one type. You may have noticed that in a certain type of table published here, the last column identifies that PLAYER as either PITCH or BAT, which changes what the columns mean.

SEASON – A SEASON represents just that: a single year or complete set of games. Recently I posted a table listing the best postseason players in the history of MLB. I chose to lump all postseason games, teams, and players into a single SEASON. A SEASON represents a pool of data to draw averages from. Treating each postseason year as its own SEASON would drastically shrink that pool and distort the averages used to rate PLAYERs. This entity will unfold as we get further into the model.

TEAM – A TEAM is ephemeral, lasting only one season, and is contained by a LEAGUE (see below). In the modern era of MLB a LEAGUE contains 30 TEAMs. In earlier MLB years and in other leagues the number of TEAMs in a LEAGUE differs, but the relationship remains. The 2013 Boston Red Sox TEAM will never occur again. Next year’s TEAM will be the 2014 Boston Red Sox, usually with different PLAYERs (though the roster doesn’t have to change).

LEAGUE – Like TEAM, a LEAGUE is ephemeral; its composition changes from one SEASON to the next. SEASON contains LEAGUE, and LEAGUE contains TEAM. This entity holds league-wide stats and averages. For example, the 2013 MLB LEAGUE contains 30 TEAMs. The 2013 AAA Pacific Coast LEAGUE contains 16 TEAMs. The 1931 MLB LEAGUE contains 16 TEAMs. At this level I decided not to treat the AL and NL as separate; all analysis treats MLB as a single LEAGUE. The process of defining a data model flushes out these kinds of decisions early on instead of in the coding phase.

FRANCHISE – This entity represents the historical franchise. A FRANCHISE will be related to many TEAMs, depending upon how long it has been in existence. The tables in this model use a three-letter abbreviation following the nomenclature defined by retrosheet.org event data. The meanings should be intuitively obvious, but soon I’ll put up a helper table, maybe a popup, somewhere in case of confusion. Historical trends relating to a particular FRANCHISE can be very useful and interesting. Many minor league teams are assigned a FRANCHISE.

Update: I made two mistakes with this one. An entity should be contained by only one other entity, and there is no reason for FRANCHISE to contain TEAM when TEAM is already contained by LEAGUE. Instead of containment, it’s a simple relationship between TEAM and FRANCHISE. At first I also had FRANCHISE containing SEASON, which makes no sense. This is why it’s useful to have a visual representation of these things. In this data model I decided to hard code this to the set of 30 modern FRANCHISEs defined by MLB, many of which date back to the turn of the 20th century. MLB FRANCHISEs wouldn’t make sense, however, if this model were being used to keep track of a little league or some foreign professional baseball league. The three-letter FRANCHISE tag is appended to every player in every table, with XYZ representing no MLB FRANCHISE.
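Here is a minimal sketch of the corrected containment rules: SEASON contains LEAGUE, LEAGUE contains TEAM, and TEAM merely refers to a FRANCHISE by its tag. This is an illustration only, not the actual model. The class names, field names, and the helper `teams_for_franchise` are assumptions made for this example, as is the choice to let a SEASON hold a list of LEAGUEs.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

NO_FRANCHISE = "XYZ"  # tag the post uses for "no MLB FRANCHISE"


@dataclass
class Team:
    name: str
    # A plain relationship to FRANCHISE (by tag), not containment.
    franchise_tag: str = NO_FRANCHISE


@dataclass
class League:
    name: str
    teams: List[Team] = field(default_factory=list)  # LEAGUE contains TEAM


@dataclass
class Season:
    year: int
    leagues: List[League] = field(default_factory=list)  # SEASON contains LEAGUE


def teams_for_franchise(seasons: List[Season], tag: str) -> List[Tuple[int, str]]:
    """Walk SEASON -> LEAGUE -> TEAM and collect (year, team name) for one tag."""
    return [
        (season.year, team.name)
        for season in seasons
        for league in season.leagues
        for team in league.teams
        if team.franchise_tag == tag
    ]


# Two distinct TEAM entities that share one FRANCHISE tag.
seasons = [
    Season(2013, [League("MLB", [Team("2013 Boston Red Sox", "BOS")])]),
    Season(2014, [League("MLB", [Team("2014 Boston Red Sox", "BOS")])]),
]
history = teams_for_franchise(seasons, "BOS")
```

Because each TEAM lives in exactly one LEAGUE, the franchise history falls out of a single walk down the containment chain instead of a second, competing hierarchy.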

GAME – A TEAM plays 162 GAMEs in a single season. Two TEAMs play in each GAME. GAMEs are contained by the SEASON and not by TEAM. In the modern era of MLB there are 2430 total GAMEs per SEASON. This number differs for other leagues. Each GAME is related to exactly two TEAMs, one home and one away. This entity also holds information such as attendance, length, location, etc.
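The 2430 figure follows from the two-TEAMs-per-GAME rule: every GAME shows up on two schedules, so the sum of all team schedules is halved. A small sketch of that arithmetic, where the function name `games_per_season` is made up for illustration:

```python
def games_per_season(num_teams: int, games_per_team: int) -> int:
    """Each GAME involves exactly two TEAMs, so halve the total team-games."""
    team_games = num_teams * games_per_team
    if team_games % 2:
        raise ValueError("total team-games must be even")
    return team_games // 2


mlb_games = games_per_season(30, 162)  # 30 * 162 / 2 = 2430
```

The same function covers other leagues by changing the two inputs.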

EVENT – Any time something happens in a baseball game it generates an EVENT. The most common EVENT is a Plate Appearance (PA), where a batter did something. A non-Plate Appearance (NPA) is when something happens, like a stolen base or pickoff, for which no plate appearance is recorded for the batter. Other event types include substitutions, inning changes, and the start and end of games. All EVENT entities in this data model were derived from data curated by retrosheet.org. In this model the SEASON contains all EVENTs that occurred.

RUN – This entity is generated any time a RUN scores. There are usually three PLAYERs related to a RUN: the pitcher who gave it up, a batter who hit it in (not always), and a runner who scored (always). An EVENT caused the RUN to happen and is related to the PLAYER, who is related to the TEAM, which wins or loses a GAME based upon how many RUNs it scored. There are three types of RUNs: Unearned, Earned, and Lucky. The first two are well known in baseball and are determined by official scorekeepers so that a RUN is not charged to a pitcher if it happened through a fielding mistake. One type of lucky RUN occurs when a RUN scores during a non-plate appearance, such as a wild pitch or some weird thing happening unrelated to the batter. I call it lucky because it needs to be called something and, for the most part, other than stealing home plate, they are all lucky for the team that gets the RUN. The RUN still counts and the runner gets credit for it, but the batter does not get credit for an RBI.
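The three RUN types can be sketched as a small classifier. The post does not say how a RUN that is both scored on an NPA and ruled unearned should be typed, so the precedence here (NPA first, then the scorekeeper's ruling) is an assumption, as are the names `RunType` and `classify_run`.

```python
from enum import Enum


class RunType(Enum):
    EARNED = "earned"
    UNEARNED = "unearned"
    LUCKY = "lucky"


def classify_run(is_plate_appearance: bool, scorer_ruled_earned: bool) -> RunType:
    """Assumed rule: an NPA run is LUCKY; otherwise follow the scorekeeper."""
    if not is_plate_appearance:
        return RunType.LUCKY
    return RunType.EARNED if scorer_ruled_earned else RunType.UNEARNED


# A run scoring on a wild pitch (an NPA event) versus runs on plate appearances.
wild_pitch_run = classify_run(is_plate_appearance=False, scorer_ruled_earned=True)
clean_single_run = classify_run(is_plate_appearance=True, scorer_ruled_earned=True)
after_error_run = classify_run(is_plate_appearance=True, scorer_ruled_earned=False)
```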

ERROR – An ERROR occurs when a fielder makes a mistake. A fielder can be any player, whether pitcher or batter. ERRORs are assigned by a judgment call made by an official MLB scorekeeper. ERRORs lead to unearned runs. Since ERRORs are related to the player who committed them, unearned runs can be assigned to that player as well.

HIT – A HIT is any official base hit. There are four types of HITs: single, double, triple, and home run.

WALK – A WALK is any walk. This model does not care about hit by pitch or intentional walks. A WALK is a WALK, and dividing this entity into various types is pointless.

BASE – This entity represents the state of the base paths. There are 8 different types of this entity, from no one on base to bases loaded, since each of the three bases is either empty or occupied. BASE is related to EVENT and used for various counting algorithms.
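Counting the BASE types is a matter of enumerating every empty/occupied combination of first, second, and third. Including both the empty and the loaded case, that comes to 2 × 2 × 2 = 8. A quick sketch, where the labels `1B`, `2B`, `3B` and the function name `base_states` are assumptions for illustration:

```python
from itertools import product

BASES = ("1B", "2B", "3B")


def base_states():
    """Enumerate every occupied/empty combination of the three bases."""
    states = []
    for occupied in product((False, True), repeat=len(BASES)):
        # Keep only the labels of the bases that hold a runner.
        runners = tuple(base for base, on in zip(BASES, occupied) if on)
        states.append(runners)
    return states


states = base_states()
# states[0] is the empty tuple (no one on); states[-1] is bases loaded.
```

Each tuple can serve as a dictionary key when a counting algorithm tallies EVENTs by base state.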

What is the purpose for all this?

The data model provides a structure for independent applications to count stuff without having to do much busy work. A data model also lets you see, visually, ways to simplify and reduce complexity, something that should be done at a system level.

Once the model is finished and populated with data, an independent app can run through it using simple code and output reports for whatever information someone may want to know. For example, I generated a table assigning unearned runs to fielders. This kind of stat is not usually calculated anywhere because it cannot be discerned from a box score; you need event data. Having the event data properly stored in a well-defined data model makes acquiring that data rather simple. The ERROR entity is related to the RUN entity, the PLAYER entity, and the EVENT entity. The app that assigns unearned runs to fielders needs only simple logic because a framework has already been built by front-end scripts. Any baseball league throughout the history of the game can fit into this data model.
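To give a feel for how little logic such an app needs once the relationships exist, here is a rough sketch of the unearned-runs-to-fielders count. It is not the code behind the published table. The record layouts, the `error_id` link between RUN and ERROR, and the function name are all assumptions, and the sample rows are made up.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Error:
    error_id: int
    fielder: str  # PLAYER charged with the ERROR


@dataclass
class Run:
    run_type: str  # "earned", "unearned", or "lucky"
    error_id: Optional[int] = None  # related ERROR, if any (assumed link)


def unearned_runs_by_fielder(errors: List[Error], runs: List[Run]) -> Counter:
    """Follow RUN -> ERROR -> PLAYER and count unearned RUNs per fielder."""
    fielder_of = {e.error_id: e.fielder for e in errors}
    counts = Counter()
    for run in runs:
        if run.run_type == "unearned" and run.error_id in fielder_of:
            counts[fielder_of[run.error_id]] += 1
    return counts


# Made-up sample data, for illustration only.
errors = [Error(1, "Fielder A"), Error(2, "Fielder B")]
runs = [
    Run("unearned", 1),
    Run("unearned", 1),
    Run("earned"),
    Run("unearned", 2),
    Run("lucky"),
]
totals = unearned_runs_by_fielder(errors, runs)
```

All of the hard work (deciding which ERROR a RUN traces back to) is assumed to have been done when the model was populated; the report itself is one loop.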

Conclusion

Notice that none of this has anything to do with WAA or how to rate baseball players.  The definition of these entities or objects will be important in understanding the underlying concept behind this rating system.  My original goal for this exercise was to clear my thoughts by building a completely different data model than the one I was working on and to get more practice writing code to build the model.  

In the next installment we’ll walk through a simple count script using EVENT data.