What is Event Data
Event data is a collection of things that happen in a baseball game, ordered by occurrence, that together describe the game, its outcome, and who did what. Event data collected from 1973 to the present is fairly accurate. Every event that occurs creates an event record in the data set. Retrosheet.org curates this data and has made it available to the public here. This model enhances Retrosheet data to make it easier for back-end scripts to count whatever they need to count. This post introduces the two data sets. Future posts will show applications that use them.
Retrosheet.org Event Data
First let’s look at its format using Kirk Gibson’s walk-off home run in Game 1 of the 1988 World Series, played in Los Angeles.
A detailed explanation of how to interpret these records can be read here. All event records are in comma-separated value format. The first column is the event type. In this set of 5 events we can see 4 play events and 1 sub event. The second column is the inning, which applies only to play events. From this set we can see it’s the 9th inning. We can’t determine the number of outs from these records alone. There are two.
The third value is 1 when the home team is at bat and 0 when the away team is. Here the home team, the Los Angeles Dodgers, is at bat. You can’t tell it’s the Dodgers from these records alone, which shows how order dependent Retrosheet’s event data is. To tell who is home and who is away, a script must start from the beginning of the game and keep track. The fourth column is a unique player id, which Retrosheet maps to a name in a roster file. There is no need to be concerned with names at this level, however. The player id will suffice.
The fifth column gives the count (balls, then strikes) at the time of the event. The sixth is the sequence of pitches and the last is the play event. The first record above shows davim002, Mike Davis, drawing a walk. The next record shows penaa001, Alejandro Pena, due up, but nothing happens with him at the plate: the play is NP, which Retrosheet uses for "no play" ahead of a substitution (the enhanced data labels it NEW_PLAYER). The next record shows Kirk Gibson, gibsk001, becoming the batter through substitution. A stolen base of second (SB2) happens next. This is considered a play event in these records. The last record shows Gibson working the count to 3-2 by fouling off the first three pitches and then taking three balls. He then hits a home run, scoring Davis from second base (2-H). That is the last play of the ball game: a walk-off home run.
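The column layout described above can be sketched in a few lines of Python. This is only a minimal sketch, not the script used for this model: the names PlayRecord and parse_play are made up for illustration, and the sample line is rebuilt from the values in the enhanced record for Davis's walk shown later in this post rather than copied from a Retrosheet file.

```python
import csv
from typing import NamedTuple


class PlayRecord(NamedTuple):
    """One Retrosheet 'play' record (hypothetical container for this sketch)."""
    inning: int
    home_batting: bool  # True when the third column is "1" (home team at bat)
    player_id: str
    count: str          # balls then strikes, e.g. "31"
    pitches: str
    play: str


def parse_play(line: str) -> PlayRecord:
    """Split one comma-separated play record into named fields."""
    fields = next(csv.reader([line]))
    if len(fields) != 7 or fields[0] != "play":
        raise ValueError(f"not a play record: {line!r}")
    _, inning, side, player_id, count, pitches, play = fields
    return PlayRecord(int(inning), side == "1", player_id, count, pitches, play)


# Illustrative line only, assembled from the fields discussed in the text.
sample = "play,9,1,davim002,31,CBBBB,W"
record = parse_play(sample)
print(record.player_id, record.count, record.play)
```

Note what the parsed record still cannot tell you: outs, score, and which team is batting all have to come from walking the game from its first record.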
Enhanced Event Data
The Retrosheet event records leave out a lot of useful context. From a single event record you can’t tell the number of outs at the time, the score, which bases are occupied, and so on. This data model sets out to create standalone records that can be extracted independently and still provide game situation context. Here is the snippet of enhanced event data for the 5 events mentioned earlier, followed by the records that close out the game:
9:1:3:8:2 LAN davim002 31 CBBBB W WALK 2 -0- -1- 3 4 0 0 0 10058316 198810150
NP9:1:3:8:2 LAN penaa001 00 ?? NP NEW_PLAYER 2 -1- -1- 3 4 0 0 0 10058317 198810150
SUB gibsk001 LAN 1 9 11 LAN198810150
SB9:1:3:8:2 LAN gibsk001 02 FFF1B SB2 STOLEN_BASE 2 -1- -20- 3 4 0 0 0 10058318 198810150
9:1:4:9:2 LAN gibsk001 32 FFF1BBBX HR/L9D.2-H HOME_RUN 2 -20- -0- 5 4 2 1 0 10058319 198810150
CHANGE_INNING 91 LAN 5 7 0 OAK 4 7 0 LOB 0 2 1 0 4 9 LAN198810150
END_OF_GAME 19881015 LAN 5 7 0 OAK 4 7 0 LAN198810150
NEW_GAME 56051 150 LAN OAK LAN198810160
These are space-delimited records. A script takes Retrosheet event data, calculates the game state, and adds it to each record so the record no longer depends on the events before and after it. Downstream scripts do not have to trace back to determine game state because that has already been done and error checked. This makes secondary counting scripts far simpler.
The first field is the inning field, which itself holds a set of values delimited by colons. Any record whose inning field begins with a digit (0-9) is considered a Plate Appearance (PA). All other records are considered Non-Plate Appearances (NPA). I won’t get into all the different types of NPAs in this post.
The 5 values in the inning field are, in order: inning, home/away, inning sequence, lineup sequence, outs. Inning sequence resets each inning and counts the batters who have batted since the beginning of the inning. Lineup sequence gives the original lineup slot. It’s modulo 9 and is used by scripts that want to count things by original lineup position.
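Here is a rough sketch of how a counting script might tell a PA from an NPA and unpack the inning field. The names InningField, is_plate_appearance and parse_inning_field are made up for this example, and I am assuming the home/away value follows the Retrosheet convention described earlier (1 means the home team is batting).

```python
from typing import NamedTuple


class InningField(NamedTuple):
    """The five colon-separated values of the inning field (hypothetical type)."""
    inning: int
    home_batting: bool  # assumption: 1 means the home team is at bat
    inning_seq: int     # batters so far this inning
    lineup_seq: int     # original lineup slot
    outs: int


def is_plate_appearance(record: str) -> bool:
    """A record is a PA when its first field starts with a digit."""
    fields = record.split(maxsplit=1)
    return bool(fields) and fields[0][:1].isdigit()


def parse_inning_field(field: str) -> InningField:
    """Unpack a PA inning field such as '9:1:4:9:2'."""
    parts = field.split(":")
    if len(parts) != 5:
        raise ValueError(f"expected 5 colon-separated values: {field!r}")
    inning, side, inning_seq, lineup_seq, outs = (int(p) for p in parts)
    return InningField(inning, side == 1, inning_seq, lineup_seq, outs)


walk = "9:1:3:8:2 LAN davim002 31 CBBBB W WALK 2 -0- -1- 3 4 0 0 0 10058316 198810150"
sub = "SUB gibsk001 LAN 1 9 11 LAN198810150"
print(is_plate_appearance(walk), is_plate_appearance(sub))
print(parse_inning_field(walk.split()[0]))
```

Records such as NP9:1:3:8:2 fail the digit test, so they are treated as NPAs and never reach parse_inning_field.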
The second field is the 3-letter id of the team batting. The rest, in order: 3rd player id, 4th count, 5th pitches, 6th play string, 7th play event in English, 8th outs again, 9th base pads before the play, 10th base pads after the play, 11th and 12th score, 13th, 14th and 15th runs, hits and errors from the play, 16th event id, 17th game id. Since the EVENT entity is contained by SEASON, the event id is unique on a per-season basis.
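To show why standalone records make counting easy, here is a minimal sketch that maps the 17 fields onto a Python dataclass and then counts home runs without looking at any neighboring record. EnhancedEvent and parse_enhanced are names I made up for this example. The two score fields are called score_first and score_second because the text does not say which team each one belongs to.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EnhancedEvent:
    """One enhanced PA record, fields in file order (hypothetical type)."""
    inning_field: str
    team: str
    player_id: str
    count: str
    pitches: str
    play: str
    play_name: str
    outs: int
    bases_before: str
    bases_after: str
    score_first: int   # 11th field; which team it belongs to is not assumed here
    score_second: int  # 12th field
    runs: int
    hits: int
    errors: int
    event_id: int
    game_id: str


def parse_enhanced(line: str) -> EnhancedEvent:
    """Split a space-delimited PA record into its 17 fields."""
    f = line.split()
    if len(f) != 17:
        raise ValueError(f"expected 17 fields, got {len(f)}: {line!r}")
    return EnhancedEvent(
        f[0], f[1], f[2], f[3], f[4], f[5], f[6], int(f[7]), f[8], f[9],
        int(f[10]), int(f[11]), int(f[12]), int(f[13]), int(f[14]),
        int(f[15]), f[16],
    )


lines = [
    "9:1:3:8:2 LAN davim002 31 CBBBB W WALK 2 -0- -1- 3 4 0 0 0 10058316 198810150",
    "9:1:4:9:2 LAN gibsk001 32 FFF1BBBX HR/L9D.2-H HOME_RUN 2 -20- -0- 5 4 2 1 0 10058319 198810150",
]
events = [parse_enhanced(line) for line in lines]
home_runs = [e.player_id for e in events if e.play_name == "HOME_RUN"]
print(home_runs)
```

Each record carries its own outs, bases and score, so the filter at the end needs no game-state tracking at all.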
This data is not meant for human consumption. I included the CHANGE_INNING, END_OF_GAME and NEW_GAME records, which have their own sets of fields. A script can search for END_OF_GAME and derive the date, the home and away teams, and a simple line score. Since END_OF_GAME data is derived from event data, a good integrity check is to take data from another source and make sure the line scores match. If they don’t, the event data was not compiled correctly.
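The integrity check could look something like the sketch below. The function names are made up, the field positions are read off the single END_OF_GAME record shown above, and the reference dictionary is just a stand-in for whatever second source you trust (here it simply repeats the values from the record).

```python
from typing import Dict, Tuple

LineScore = Tuple[int, int, int]  # runs, hits, errors


def parse_end_of_game(line: str) -> Tuple[str, Dict[str, LineScore]]:
    """Return (date, {team id: (runs, hits, errors)}) from an END_OF_GAME record."""
    f = line.split()
    if len(f) != 11 or f[0] != "END_OF_GAME":
        raise ValueError(f"not an END_OF_GAME record: {line!r}")
    scores = {
        f[2]: (int(f[3]), int(f[4]), int(f[5])),
        f[6]: (int(f[7]), int(f[8]), int(f[9])),
    }
    return f[1], scores


def line_scores_match(line: str, reference: Dict[str, LineScore]) -> bool:
    """True when the derived line score agrees with the reference source."""
    _, scores = parse_end_of_game(line)
    return scores == reference


record = "END_OF_GAME 19881015 LAN 5 7 0 OAK 4 7 0 LAN198810150"
# Stand-in for a second source; a real check would load this from elsewhere.
reference = {"LAN": (5, 7, 0), "OAK": (4, 7, 0)}
print(line_scores_match(record, reference))
```

Keying the line scores by team id keeps the comparison independent of which team the other source lists first.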
Many theories can be tested by tracing event data. In the next installment I’ll step through one such theory to show the value of the event data corpus.