The Simulation Part 2

This post will provide some inside baseball (pun intended) on some of the coding going on in this data model in general and the Monte Carlo simulations in particular.  Errors in compiling data will propagate into results, so every effort has been made to provide inline error checking everywhere.  A script will abort if an error absolutely needs fixing, and will print SYSERRs into the data if the error can be fixed later.

Cleaning up SYSERRs and bugs takes most of the time.  There are still many low priority SYSERRs in the data set that will eventually need to be purged, but they pose no harm to simulations, player evaluations, or rankings.

The dataset used for Monte Carlo simulation consists of all games from 1970-2016, a little over 100,000 games.  Event data from retrosheet.org, required to take day by day league snapshots, are pretty reliable in these years.  Although this data model has compiled game by game snapshots going back to around 1920, the early eras may skew simulations.

These simulations rely on three aspects: lineup value, starter value, and relief value.  In pre-1970 years, and perhaps pre-1980, relief pitching was not valued.  Washed up pitchers were put in relief and usually used when the game was already lost and everyone was thinking about which bars to hit afterward.  Starters pitched many complete games and often more than 300 innings per season.  Since relief is such an important aspect of modern baseball, it may not be wise to include seasons earlier than 1970 in the simulation dataset.  These simulations will be used to evaluate modern baseball.  Even between 1970 and 1980, relief wasn't valued like it is today.

March and April games are excluded each year because there isn’t enough current year data to accurately value players.  The error eliminated by excluding those months exceeds the error incurred from a slimmer dataset.

The Simulation

Each game consists of four pairs:

  1. away lineup -> home starter (l-s)
  2. away lineup -> home relief (l-r)
  3. home lineup -> away starter (l-s)
  4. home lineup -> away relief (l-r)

There are two pair types, lineup -> starter (l-s) and lineup -> relief (l-r).

Starters, and groups of players for lineup and relief, are placed into 5 tiers based upon league averages and standard deviations.  These calculations are made at the beginning of every day for every team between 1970-2016.  Each pair type has its own distribution for every tier combination, each consisting of many thousands of games.  The average combination, 3-3, average against average, has the distributions representing the most games.
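
As a minimal sketch, tier assignment from a value, a league mean, and a league standard deviation might look like this (the ±0.5 and ±1.5 standard deviation cutoffs are illustrative guesses on my part, not necessarily the model's actual boundaries):

# Assign a 1-5 tier from a value given the league mean and standard
# deviation.  Tier 3 straddles the league average.
sub tier {
    my ( $value, $mean, $stdev ) = @_ ;
    my $z = ( $value - $mean ) / $stdev ;
    return 1 if $z <= -1.5 ;
    return 2 if $z <= -0.5 ;
    return 3 if $z <   0.5 ;
    return 4 if $z <   1.5 ;
    return 5 ;
}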

The simulation runs millions of iterations.  Each iteration randomly grabs a pair of numbers from the lineup -> starter distribution.  This pair is the number of runs given up and innings pitched by a starter in an actual game from the dataset.  The lineup -> relief pair lookup returns an average runs/out given up from a random game.

The number of relief innings is determined by the starter innings pitched returned from the l-s lookup: whatever the starter didn't pitch, the bullpen did.  Convert those remaining innings to outs, multiply by the average relief runs/out, add the starter's runs given up, and you have total runs for a team.  Do this for home and away.  The lineup with the most runs wins that iteration; the other team loses.  Do this millions of times and you converge to a win/loss percentage as well as average runs per game.
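
A minimal sketch of one iteration for one team, assuming each l-s entry is stored as a [runs, innings] pair, each l-r entry as a single runs/out value, and a regulation 27 outs per game (all names and layouts here are my illustration, not the model's actual code):

# One lineup's runs for one iteration: draw a starter outcome, then
# charge the remaining outs to the bullpen.  $ls and $lr are array
# references pulled from %lstierlookup and %lrtierlookup.
sub team_runs {
    my ( $ls, $lr ) = @_ ;
    my ( $sruns, $sip ) = @{ $ls->[ int rand @$ls ] } ;   # starter runs, innings
    my $rrpo  = $lr->[ int rand @$lr ] ;                  # relief runs per out
    my $routs = 27 - 3 * $sip ;                           # outs left for the pen
    $routs = 0 if $routs < 0 ;
    return $sruns + $rrpo * $routs ;
}

Call it once for each lineup against the opposing staff's pair of distributions; the higher total wins the iteration.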

More detail will be provided in subsequent parts to this series.

The number of iterations in a simulation determines its error.  When the code first worked, 1000 iterations took 15 seconds per game and produced a large error, which was described in this post.  10K iterations would take about 3 minutes, 100K about 30 minutes, etc., which was totally unacceptable.  It could take 8 hours to produce results for one day of baseball.  Multiply that by 180 days in a baseball season and that gets ugly.
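
For a rough sense of scale (standard binomial error, not anything specific to this model): a win probability p estimated from N independent iterations has a standard error of about sqrt( p(1-p)/N ).  For a near-even matchup, 1000 iterations leaves roughly ±1.6 percentage points of noise, while 1,000,000 iterations cuts that to about ±0.05.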

The next step was figuring out how to use parallelism and eliminate waste in loops.  What I found was quite amazing: the solution to this problem was remarkably simple.

The Bug

Now let’s get into some inside baseball perl script nerd talk.

The game results for each pair type get pushed into an array.  These arrays can often hold data from more than 10,000 games each.  The beginning of the script reads the entire 100K game dataset and populates two hashes of arrays, %lstierlookup and %lrtierlookup, keyed by tier pair.

Later in the script, when it has to do a lookup, it must index an array like this:

$myindex = int rand ( scalar @myarray ) ;   # random index from 0 to $#myarray
$mylookup = $myarray[ $myindex ] ;          # single element, not a one-element slice

The variable $myindex is a random integer less than the size of the array, so it ends up indexing a random item in @myarray, which returns values from a real game in the dataset.  Not too complicated.

Often when I can't figure out the syntax for something I use the Keep It Simple Stupid method and use a workaround.  I couldn't figure out the syntax to index a hash of arrays so I did this.

my @lshomearray = @{$lstierlookup{$homerlstierkey}} ;
my @lsawayarray = @{$lstierlookup{$awaylstierkey}} ;
my @lrhomearray = @{$lrtierlookup{$homerlrtierkey}} ;
my @lrawayarray = @{$lrtierlookup{$awaylrtierkey}} ;

Now the array can be easily indexed using the syntax shown above.  The Perl code, however, actually copied the entire 10K+ items from each stored array into the new array.  This was inside the iteration loop, so at four copies per iteration, 1000 iterations meant 4000 full array copies.  The Keep It Simple Stupid method of being lazy and not looking up the proper syntax trumped keeping it simple when it comes to the CPU cycles required to process the loop.

Now 1,000,000 iterations takes about 10 seconds and has virtually no error due to simulation.  I can run the same sim 100 times and see virtually no variation in the outcome.

The proper syntax for indexing a hash of arrays is:

my %hasharray = () ;                                 # hash of array references
my $indexed_value = $hasharray{$mykey}[$myindex] ;   # direct element lookup, no copying
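
Combined with the random index from earlier, a lookup now touches only the one element it needs, e.g. (again assuming the [runs, innings] element layout sketched above):

my $bucket = $lstierlookup{$homerlstierkey} ;             # array reference, no copy
my ( $runs, $ip ) = @{ $bucket->[ int rand @$bucket ] } ;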

And that was all there was to it.  More detail on these simulations will be explained in subsequent parts to this series.  The above is the gist of how the win/loss percentages are calculated.

There is a problem with strict tiering.  For example, two Tier 3 players could be on opposite ends of their tier's boundaries, or a Tier 4 and a Tier 3 player could be almost equal in value, separated only by an arbitrary boundary.  This problem has been solved using a somewhat different method.

The four pairs (home and away, l-s, l-r) and what those distributions store and return to determine wins and losses remain the same as described above.  All of these simulations rest upon the foundation of this data model and how it assigns value to players and groups of players.  More on this later.  Until then ….