
What is an OPS? Part 5

I have been sitting on these tables for almost a month. In the last part of this series we meandered into run creation estimation. Run creation uses the Total Bases (TB) stat, which is the numerator of Slugging Ratio (SLG). On Base Percentage (OBP) is the chocolate and SLG is the peanut butter. Together they make up the OPS butter cup, proving that, in mathematics, any two numbers can be added together to make a third.

Total Bases is a useful game stat and SLG is a helpful reference for game by game management of players. It was shown in Part 4 of this series that TB/H, the number of Total Bases per hit, converges to almost exactly 1.5 or 3/2. It might be possible to prove that as the number of plate appearances approaches infinity, TB/H has to converge to 3/2.

If a batter is hitting 0.300 BA his SLG should be around 0.450. More than that and he’s getting a lot of extra bases; less and he’s getting fewer than average. Neither BA nor SLG is a value stat, but they could be used to help see matchup opportunities etc. The reason this has meandered into Run Creation is that it’s a big aspect of how the value stat WAR is determined. OPS is even used as a value stat by TV sports announcers. ( Hello Jim Deshaies! )

Another interesting factoid is how the number of runs scored converges to the following formula:

Runs = Hits/2 

There might be a way to prove that for an infinite number of PA the above is always true. The very basic runs created formula, according to Wikipedia, is:

(H+W)*TB/PA

The following table is a total compilation of data by decade.  We’ll drill down deeper in the next part to this series.  Highlighted in blue is the formula with less error for that decade.

Fun with numbers

Decade H/2 (KISS) (H+W)*TB/PA
1920-1929 1.021 0.953
1930-1939 0.988 0.952
1940-1949 1.034 0.955
1950-1959 0.994 0.981
1960-1969 1.041 0.973
1970-1979 1.046 0.983
1980-1989 1.026 0.986
1990-1999 0.969 0.992
2000-2009 0.957 1.013
1920-2017 1.004 0.985

The above ratios are the estimated runs using each formula divided by the actual number of runs scored, so a ratio of 1.000 means no error. Not sure what the above is supposed to mean other than the RC formula using TB has been much more accurate in modern baseball.
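Here is a minimal Perl sketch of how the two ratios in the table could be computed for a single set of totals. The sub names and the season totals are made up for illustration; they are not taken from the data model.

```perl
use strict;
use warnings;

# KISS estimate: half of the hits.
sub rc_kiss {
    my ($hits) = @_;
    return $hits / 2;
}

# Basic Runs Created: (H + W) * TB / PA.
sub rc_basic {
    my ( $hits, $walks, $total_bases, $pa ) = @_;
    return ( $hits + $walks ) * $total_bases / $pa;
}

# Ratio of estimated runs to actual runs; 1.000 means no error.
sub estimate_ratio {
    my ( $estimate, $actual_runs ) = @_;
    return $estimate / $actual_runs;
}

# Hypothetical team-season totals, invented for illustration only.
my %season = ( H => 1400, W => 560, TB => 2100, PA => 6200, R => 690 );

my $kiss  = rc_kiss( $season{H} );
my $basic = rc_basic( @season{qw(H W TB PA)} );

printf "H/2          %.3f\n", estimate_ratio( $kiss,  $season{R} );
printf "(H+W)*TB/PA  %.3f\n", estimate_ratio( $basic, $season{R} );
```

Summing H, W, TB, PA and R over a decade before calling these subs would give one row of the table.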

Below are the total averages for BA, OBP, and SLG for each decade in our study.  Not sure how this is relevant but added to provide some perspective.  In theory another column could have been added …. but didn’t want to confuse things.  All three stats in the below table are useful on their own.  Mashing them up into some other number obfuscates their benefit as a game stat.

Decade BA OBP SLG
1920-1929 0.285 0.336 0.397
1930-1939 0.279 0.336 0.399
1940-1949 0.260 0.327 0.368
1950-1959 0.259 0.327 0.391
1960-1969 0.249 0.311 0.374
1970-1979 0.256 0.320 0.377
1980-1989 0.259 0.320 0.388
1990-1999 0.265 0.331 0.410
2000-2009 0.265 0.332 0.424
2010-2017 0.255 0.318 0.405
1920-2017 0.262 0.325 0.395

The next table shows some interesting ratios from each decade. TB/H converges to almost 1.5 and Hits/Walks converges to almost 5/2. There are around 1/8 ~ 12.5% more recorded plate appearances (PA) than at bats (AB). We covered these stats earlier. Sac Flies and Bunts and all the other little stuff are negligible variables that should be eliminated. Walks are the reason for that 1/8 difference. Since this model uses the official PA stat, IBBs are not included. Those are negligible as well (i.e. they don’t matter in the big picture).

Decade PA/AB H/W TB/H
1920-1929 1.129 3.040 1.391
1930-1939 1.116 2.873 1.432
1940-1949 1.126 2.417 1.415
1950-1959 1.130 2.352 1.509
1960-1969 1.120 2.510 1.503
1970-1979 1.125 2.488 1.471
1980-1989 1.120 2.593 1.498
1990-1999 1.128 2.443 1.548
2000-2009 1.126 2.458 1.597
2010-2017 1.115 2.555 1.590
1920-2017 1.123 2.541 1.507

The next part of this series will break all this down where error can be measured from season to season, team to team, player to player.  Spoiler Alert:  The official RC formula with TB has far less error than Hits/2 on a season to season and team to team basis.   The two formulae are equal on a player to player basis, each having around a +/- 20% error compared to actual runs that scored.

More on this data and methodology to come.  Until then ….

The Simulation Part 2

This post will provide some inside baseball (pun intended) on some of the coding going on in this data model in general and the Monte Carlo simulations in particular.  Errors in compiling data will propagate into results so every effort has been made to provide inline error checking everywhere.  A script will abort if the error absolutely needs fixing and print SYSERRs into the data if that error can be fixed later.
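As a rough illustration of those two paths, here is a small Perl sketch. The sub name, the record fields and the specific checks are hypothetical; only the abort-or-mark-SYSERR split comes from the description above.

```perl
use strict;
use warnings;

# Hypothetical check on one game record (a hash reference).
# Fatal problems abort with die; fixable ones are tagged SYSERR in the data.
sub check_game {
    my ($game) = @_;

    # Without a game id nothing downstream can be trusted: abort.
    die "FATAL: game record has no id\n" unless defined $game->{id};

    # A missing or malformed run total can be repaired later: mark and move on.
    for my $side (qw(away_runs home_runs)) {
        my $runs = $game->{$side};
        if ( !defined $runs || $runs !~ /^\d+$/ ) {
            $game->{$side} = 'SYSERR';
        }
    }
    return $game;
}

my $checked = check_game( { id => 'G1', away_runs => 3, home_runs => undef } );
print join( ' ', map { "$_=$checked->{$_}" } sort keys %$checked ), "\n";
```

Searching the compiled data for the string SYSERR later gives the list of records still waiting to be fixed.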

Cleaning up SYSERRs and bugs takes most of the time.   There are still many low priority SYSERRs currently in the data set that eventually will need to be purged but they pose no harm to simulations, player evaluations or ranking.

The dataset used for Monte Carlo simulation consists of all games from the years 1970-2016, which is a little over 100,000 games. Event data from retrosheet.org, required to take day by day league snapshots, is pretty reliable in these years. Although this data model has compiled game by game snapshots going back to around 1920, the early eras may skew simulations.

These simulations rely on three aspects: lineup value, starter value, and relief value. In pre-1970 years, and perhaps pre-1980, relief pitching was not valued. Washed up pitchers were put in relief and usually used when the game was already lost and everyone was thinking about what bars to hit afterward. Starters pitched many complete games and often more than 300 innings per season. Since relief is such an important aspect of modern baseball, it may not be wise to include years earlier than 1970 in the simulation dataset. These simulations will be used to evaluate modern baseball. Even between 1970-1980 relief wasn’t valued like it is today.

March and April games are excluded each year because there isn’t enough current year data to accurately value players.  The error eliminated by excluding those months exceeds the error incurred from a slimmer dataset.

The Simulation

Each game consists of four pairs:

  1. away lineup -> home starter (l-s)
  2. away lineup -> home relief (l-r)
  3. home lineup -> away starter (l-s)
  4. home lineup -> away relief (l-r)

There are two pair types, lineup -> starter (l-s) and lineup -> relief (l-r).

Starters and groups of players for lineup and relief are placed into 5 tiers based upon league averages and standard deviations. These calculations are made at the beginning of every day for every team between 1970-2016. Each pair type will have its own distribution for each tier pair, which consists of many thousands of games. The average tier pair 3-3, average against average, will have the distributions representing the most games.
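A minimal Perl sketch of that kind of tiering follows. The cut points of 0.5 and 1.5 standard deviations, the choice of Tier 1 as the top tier, and the sub names are all assumptions made for illustration; the data model's actual boundaries are not given here.

```perl
use strict;
use warnings;

# Assign a tier from 1 (well above league average) to 5 (well below).
# The 0.5 and 1.5 standard deviation cut points are assumptions.
sub tier_for {
    my ( $value, $league_mean, $league_sd ) = @_;
    my $z = ( $value - $league_mean ) / $league_sd;
    return 1 if $z >= 1.5;
    return 2 if $z >= 0.5;
    return 3 if $z > -0.5;
    return 4 if $z > -1.5;
    return 5;
}

# Key for a tier pair, e.g. "3-3" for average against average.
sub tier_pair_key {
    my ( $lineup_tier, $pitcher_tier ) = @_;
    return "$lineup_tier-$pitcher_tier";
}

# Hypothetical values: a lineup slightly above average, a starter at average.
my $lineup_tier  = tier_for( 0.2, 0.0, 1.0 );
my $starter_tier = tier_for( 0.0, 0.0, 1.0 );
print tier_pair_key( $lineup_tier, $starter_tier ), "\n";
```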

The simulation will run millions of iterations. Each iteration will randomly grab a pair of numbers from the lineup -> starter distribution. This pair is the number of runs and innings pitched by a starter from an actual game in the dataset. An average runs/out given up from a random game will be returned for the lineup -> relief pair lookup.

The number of relief innings is determined by the starter’s innings pitched returned from the l-s set. Multiply the relief outs by the average relief runs/out, add the starter runs given up, and you have total runs for a team. Do this for home and away. The lineup with the most runs wins that iteration, the other team loses. Do this millions of times and you converge to a win/loss percentage as well as average runs per game.
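The loop described above can be sketched in Perl roughly like this. The toy distributions, the sub names, the nine-inning assumption (27 outs per team) and the decision to skip ties are all assumptions for illustration, not the actual script.

```perl
use strict;
use warnings;

# Toy distributions, invented for illustration. Each lineup -> starter entry
# is [ runs allowed by the starter, innings pitched by the starter ]; each
# lineup -> relief entry is the relief runs allowed per out in one game.
my %ls = (
    away => [ [ 2, 6 ], [ 4, 5 ], [ 1, 7 ], [ 5, 4 ] ],
    home => [ [ 3, 6 ], [ 0, 7 ], [ 4, 5 ], [ 2, 6 ] ],
);
my %lr = (
    away => [ 0.10, 0.20, 0.00, 0.15 ],
    home => [ 0.05, 0.25, 0.10, 0.15 ],
);

# Runs scored by one lineup in one iteration.
sub team_runs {
    my ( $ls_games, $lr_games ) = @_;
    my ( $starter_runs, $starter_innings ) =
        @{ $ls_games->[ int( rand( scalar @$ls_games ) ) ] };
    my $relief_rate = $lr_games->[ int( rand( scalar @$lr_games ) ) ];

    # Assumes a nine-inning game: relief covers the outs the starter did not.
    my $relief_outs = 27 - 3 * $starter_innings;
    return $starter_runs + $relief_rate * $relief_outs;
}

sub simulate {
    my ($iterations) = @_;
    my ( $away_wins, $home_wins ) = ( 0, 0 );
    for ( 1 .. $iterations ) {
        my $away = team_runs( $ls{away}, $lr{away} );
        my $home = team_runs( $ls{home}, $lr{home} );
        next if $away == $home;    # ties are skipped here (an assumption)
        if   ( $away > $home ) { $away_wins++ }
        else                   { $home_wins++ }
    }
    return ( $away_wins, $home_wins );
}

my ( $away_wins, $home_wins ) = simulate(100_000);
my $decided = $away_wins + $home_wins;
printf "away %.3f home %.3f\n", $away_wins / $decided, $home_wins / $decided;
```

In the real thing each of the four distributions would be selected by tier pair for the day's matchup rather than hard-coded.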

More detail will be provided in subsequent parts to this series.

The number of iterations in a simulation determines its error. When the code was first working, 1000 iterations was taking 15 seconds per game and producing a large error, which was described in this post. To do 10K iterations would take about 3 minutes, 100K about 30 minutes, etc., which was totally unacceptable. It could take 8 hours to produce results for one day of baseball. Multiply that by 180 days in a baseball season and that gets ugly.

The next step was figuring out how to use parallelism and eliminate waste in loops. It was quite amazing how simple the solution to this problem turned out to be.

The Bug

Now let’s get into some inside baseball Perl script nerd talk.

The game results for each pair type get pushed into an array. These arrays can often hold data from more than 10,000 games each. The beginning of the script reads the entire 100K game dataset and populates two hashes of arrays, %lstierlookup and %lrtierlookup, keyed by tier pair.

Later in the script when it has to do a lookup it must index an array like this:

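# rand(@myarray) uses the element count, so every index 0..$#myarray can be drawn
$myindex = int rand( @myarray ) ;
# a single element is read with the $ sigil; @myarray[ ... ] would be a slice
$mylookup = $myarray[ $myindex ] ;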

The variable $myindex is a random integer between 0 and the last index of the array, so it ends up indexing a random item in @myarray, which returns variables from a real game in the dataset. Not too complicated.

Often when I can’t figure out the syntax for something I use the Keep It Simple Stupid method and use a workaround. I couldn’t figure out the syntax to index a hash of arrays so I did this:

my @lshomearray = @{$lstierlookup{$homerlstierkey}} ;
my @lsawayarray = @{$lstierlookup{$awaylstierkey}} ;
my @lrhomearray = @{$lrtierlookup{$homerlrtierkey}} ;
my @lrawayarray = @{$lrtierlookup{$awaylrtierkey}} ;

Now the array can be easily indexed using the syntax shown above. The Perl code, however, actually copied the entire 10K+ items in each array to the other array. This was in the iteration loop, which means 1000 iterations meant 4000 array copies. The Keep It Simple Stupid method of being lazy and not looking up the proper syntax trumped the Keep It Simple Stupid principle when it comes to the CPU cycles required to process the loop.

With the copies removed, 1,000,000 iterations takes about 10 seconds and has virtually no error due to simulation. I can run the same sim 100 times and see virtually no variation in the outcome.

The proper syntax for indexing a hash of arrays is:

my %hasharray = () ;
my $indexed_value = $hasharray{$mykey}[$myindex] ;

And that was all there was to it.  More detail on these simulations will be explained in subsequent parts to this series.  The above is the gist of how the win/loss percentages are calculated.
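To make the difference concrete, here is a small self-contained Perl sketch of the two approaches. The toy hash contents and the sub names are made up; only the copy versus index-in-place contrast comes from the story above.

```perl
use strict;
use warnings;

# Toy hash of arrays keyed by tier pair; the contents are placeholders.
my %lstierlookup = ( '3-3' => [ 1 .. 10_000 ] );
my $key          = '3-3';

# Slow inside a loop: dereferencing into a new array copies every element.
sub pick_with_copy {
    my @games = @{ $lstierlookup{$key} };
    return $games[ int( rand( scalar @games ) ) ];
}

# Fast: index through the stored array reference, no copy is made.
sub pick_in_place {
    my $games = $lstierlookup{$key};    # just a reference
    return $games->[ int( rand( scalar @$games ) ) ];
}

print pick_with_copy(), " ", pick_in_place(), "\n";
```

Both subs return a random element from the same array. The first rebuilds a 10,000 item array on every call; the second only follows a reference. Timing the two with the core Benchmark module would be one way to see the gap.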

There is a problem with strict tiering.  For example, two Tier 3 players could be on opposite ends of their boundaries or a Tier 4 and a Tier 3 player could be almost equal in value, only separated by an arbitrary boundary.  This problem has been solved using a somewhat different method.

The four pairs (home and away, l-s, l-r) and what those distributions store and return to determine wins and losses remain the same as described above.  All of these simulations rest upon the foundation of this data model and how it assigns value to players and groups of players.  More on this later.  Until then ….

PHI CHN Lines Spotcheck

This is a spotcheck of the lines today for the Cubs Phillies.  The Monte Carlo approach in these simulations has some error and now it must be tested.  This post will simply be a walk through.

tl;dr The lines for the Cubs game settled from opening night of this series.  Lines for both teams yesterday and today are/were clear discards.

Ouija Board

DATE 06_07_2:20_PM PHI CHN

LINEAWAY PHI [ 0.444 ] < 0.483 >
STARTAWAY 1.13(0.582) Nick_Pivetta_PHI TIER 3
--------------------------------------------
LINEHOME CHN [ 0.592 ] < 0.539 >
STARTHOME 0.21(0.518) Tyler_Chatwood_CHN TIER 3
--------------------------------------------
PHI 32 27 CHN 34 24
DELTAWAA 5 WINPCT 0.542 CHN
--------------------------------------------
TIER COMBOS
PHI Lineup 4 ==> CHN Starter 3 / Relief 1 == 0.444 PHI 3.99 runs
CHN Lineup 1 ==> PHI Starter 3 / Relief 2 == 0.556 CHN 4.66 runs

According to current lines the markets think PHI has a 0.483 chance of winning, which is above their TIER COMBO simulation result of 0.444. At a 0.556 simulation result CHN is above what the market thinks of them (0.539), but the 0.017 difference is well below our 0.070 threshold. This simulation would need to produce a result greater than 0.610 (round up) in order to bet CHN. Both lines are clear discards.
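The discard test above boils down to a subtraction and a comparison. A minimal Perl sketch, with a hypothetical sub name and the assumption that an edge exactly equal to the threshold counts as playable:

```perl
use strict;
use warnings;

# A line is playable only when the simulated win probability beats the
# market's implied probability by at least the threshold (0.070 in this post).
sub line_decision {
    my ( $sim_prob, $market_prob, $threshold ) = @_;
    my $edge = $sim_prob - $market_prob;
    return ( ( $edge >= $threshold ? 'bet' : 'discard' ), $edge );
}

# The two lines from the walk through: team, simulation result, market.
for my $line ( [ 'PHI', 0.444, 0.483 ], [ 'CHN', 0.556, 0.539 ] ) {
    my ( $team, $sim, $market ) = @$line;
    my ( $decision, $edge ) = line_decision( $sim, $market, 0.070 );
    printf "%s edge %+.3f %s\n", $team, $edge, $decision;
}
```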

And that brings this to error. It has been discovered the error in the above simulation is +/-0.030 with 95% certainty. That means PHI is between 0.414 and 0.474, and CHN will fall between 0.526 and 0.586.

The first run of these sims is only using 1000 iterations. If we bump that up to 10,000 iterations the error drops to 0.01. This is stressing the limits of the computers running this here. Everything is written in Perl, which isn’t known for its real time capability.
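For a rough sanity check on those error figures, here is a Perl sketch using the textbook normal approximation for a proportion. It assumes each iteration is an independent win/loss trial; it is not necessarily how the error figures above were measured, but it lands close to them.

```perl
use strict;
use warnings;

# Approximate 95% margin of error for a simulated win percentage, treating
# each iteration as an independent win/loss trial (normal approximation).
sub margin_95 {
    my ( $win_pct, $iterations ) = @_;
    return 1.96 * sqrt( $win_pct * ( 1 - $win_pct ) / $iterations );
}

for my $n ( 1_000, 10_000, 1_000_000 ) {
    printf "%9d iterations: +/- %.4f\n", $n, margin_95( 0.556, $n );
}
```

Under this approximation the error shrinks with the square root of the iteration count, so 100 times more iterations buys about 10 times less error.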

Right now it will take an hour to process  matchups for every game in a day at 10K iterations.   For the next month or so the 10K sims will be used unless otherwise specified.

That is all for now.  Cubs status in a couple of days.  Until then ….

What is an OPS? Part 4

Today we’ll cover the S in OPS, Slugging ratio.  The P in OPS stands for Plus, thus the formula

OPS = OBP + SLG

OBP is the chocolate, Slugging ratio (SLG) is the peanut butter and OPS is a Reese’s peanut butter cup. Total Bases (TB) is a fundamental building block stat that SLG relies on. It weights a 1B = 1, 2B = 2, 3B = 3, and HR = 4. Not too complicated! This means it follows the Keep It Simple Stupid (KISS) principle of this data model.

Total Bases

Total Bases is an interesting concept and players accumulate these like they do Home Runs, Stolen Bases, etc.   Slugging ratio is simply

SLG = TB / AB

The range for this ratio is 0 – 4. The definition of a percentage is “any proportion or share in relation to a whole.” This requires a ratio with a range of 0 – 1, for which SLG does not qualify. Someone should fix the Wikipedia definition.

Since the Slugging stat implies actually hitting a baseball, they chose not to include walks in Total Bases, and thus its denominator is the At Bat (AB) stat and not Plate Appearances, which includes walks. But what does this Slugging stat mean?

First let’s answer the question: “On average, how many total bases occur for each hit?”   We’ll use a post White Sox scandal dataset (1920 – 2016).  Adding up all TBs and Hs from every player every year we get:

TB / H = 1.505  (1920 – 2016)

This is very close to the fraction 3/2 and there might be a way to prove that TB/H always converges to that when the number of seasons approaches infinity.   For now we’ll just call this baseball equation true for all seasons.

TB = 1.5 * H  <— KISS

Since BA = H / AB , then

SLG / BA = ( TB / AB ) / ( H / AB ) = ( 1.5 * H / AB ) / ( H / AB ) = 1.5

Therefore SLG = 1.5 * BA  <— KISS

What does this mean? If a batter has a BA of 0.300 his Slugging ratio should be 0.450. If it is above, that hitter is hitting with more power; below, with less than average power. If we assume that baseball constant of 1.5 is close enough, just like we assume 9 innings is close enough to be considered an entire game for ERA, then that simplifies these equations drastically. You can do the above calculation in your head.
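For anyone who would rather not do it in their head, here is the same check as a small Perl sketch. The sub names and the two batting lines are made up for illustration.

```perl
use strict;
use warnings;

# Expected SLG under the KISS rule SLG = 1.5 * BA.
sub expected_slg {
    my ($ba) = @_;
    return 1.5 * $ba;
}

# Positive result: more extra bases than average; negative: fewer.
sub power_gap {
    my ( $ba, $slg ) = @_;
    return $slg - expected_slg($ba);
}

# Hypothetical batting lines (BA, SLG), invented for illustration.
for my $line ( [ 0.300, 0.520 ], [ 0.300, 0.390 ] ) {
    my ( $ba, $slg ) = @$line;
    printf "BA %.3f SLG %.3f expected %.3f gap %+.3f\n",
        $ba, $slg, expected_slg($ba), power_gap( $ba, $slg );
}
```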

This only tells you whether or not a hitter hits a lot of extra base hits.  This may be important for a manager determining a lineup but it does not describe that player’s actual value to the team’s wins and losses.  SLG and TB are game stats.

Now that we introduced TB let’s meander.  Runs Created uses TB.

What is Runs Created?

Runs Created was a Bill James concept to estimate runs from certain kinds of hits and events that occur in the game. Wikipedia has a page that describes it. In its most basic form it boils down to this:

RC =  ( H + W ) * TB / PA

There are some algebra mistakes on that page but none of this matters. There is quite a bit of unproven nonsense that follows and expands upon the above. They take an end result and try to mold a formula to fit that end result. This is not how math is supposed to work.

In the next part I’ll run some numbers using the above formula to see how close it is to the actual runs produced.  I do not know exactly how WAR is calculated but know Runs Created is a big part.  We pull in the WAR value stat to compare with ours.  When we see a disparity in batters it’s usually due to error in how they calculate RC.

Here is another way to estimate Runs Created:

RC = H / 2  <— KISS

Whenever you see an RHE line in a baseball game divide the H column by 2. If that team’s runs exceed that number they are getting good value on their hitting. How do we know this?

H / R = 2.01  ( 1920 – 2016 ) 

I suspect this too converges to an even 2 for some reason if there are an infinite number of seasons played.  The standard deviation for this from year to year is 0.1 so it’s pretty consistent.  This is another interesting baseball constant that can be used with game stats and simple enough to calculate in your head and possibly more accurate than all the various concoctions the Sabermetric crowd comes up with.

In Part 5 we will continue this meander and show some Runs Created data and compare it to actual runs scored. Until then….

What is an OPS? Part 3

After reviewing what was written in Part 2 of this series there needs to be more clarification as to the difference between the Sabermetric and Keep It Simple Stupid (KISS) approaches. Our premise is that the complexity added to the Sabermetric version is, at best, a big headache to calculate that makes no discernible difference and, at worst, deceptive.

This data model is runs based and does not care about hits, walks, strikeouts, stolen bases, etc.  Those are all game stats based upon hits.  When the MLB commissioner determines the winner of a baseball game, he picks the team with the highest number in the R column ignoring the H column.  A pitcher can throw 20+ strikeouts in a game and still lose because the baseball commissioner doesn’t give any points for throwing strikeouts.  This goes for many of the myriad of stats that sites like Fangraphs and Draft Kings peddle.

Since people are going to want to know how many homers so and so hit or what kind of WHIP this pitcher has, it is important we draw in these stats. OPS is my favorite hitting stat because it is so popular with the media and it is also somewhat nonsense, which we will get to in subsequent parts of this series. We’re just killing time here before we hit the 1/6 mark of the baseball season when our real data model kicks in using current year datasets.

Let’s look again at the OBP  formula according to the Wikipedia page.

OBP = (H + BB + HBP) / (AB + BB + HBP + SF)

We described in Part 2 how the At Bat stat (AB) was developed for Batting Average.  The denominator in the above defines a different type of count for a player At Bat without giving it a name.  The Keep It Simple Stupid approach simply makes the denominator Plate Appearances.  KISS treats all walks equally and cares not whether it was through getting hit, intentional, or the old fashioned way of drawing 4 balls.

OBP = (H + W) / PA  where W = BB + HBP + IBB  <– KISS OBP

We demonstrated how the difference between KISS OBP and the Sabermetric OBP is negligible.  Eliminating Intentional Walks from OBP was clearly a human decision so as not to reward certain players like Barry Bonds or Albert Pujols.  They think it isn’t fair those guys get free passes while everyone else has to work for their walks.  Or maybe they think working the count is important in OBP so Intentional Walks shouldn’t count.

Here is a table showing the years of Barry Bonds’ career with the highest IBB/PA:

Year Name_TeamID Official OBP KISS OBP IBB/PA (%)
2002 Barry_Bonds_SFN 0.582 0.624 10.00
2003 Barry_Bonds_SFN 0.529 0.576 9.98
2004 Barry_Bonds_SFN 0.609 0.673 16.28

In 2004 Barry Bonds had an official OBP of 0.609 and 0.673 with the KISS formula. This is an extreme case. According to IBB/PA, Bonds was Intentionally Walked in 16.28% of his plate appearances that year, which is around 1 in every 6. The MLB historical average for Intentional Walks is 0.7% ( once every 140 PAs ), which means Bonds accrued more than 20x what is normal.

Let’s bring this down a notch and look at 3 years of your average very good player, Anthony Rizzo.

Year Name_TeamID Official OBP KISS OBP IBB/PA (%)
2014 Anthony_Rizzo_CHN 0.386 0.393 1.12
2015 Anthony_Rizzo_CHN 0.387 0.394 1.27
2016 Anthony_Rizzo_CHN 0.385 0.392 1.17

Rizzo gets about 50% more Intentional Walks than normal and the difference between his Official OBP and the Keep It Simple approach is negligible.  It becomes even more negligible when you drop down to the good players, average players,  and below.

The bottom line:  We get our data from official MLB sources which provide the PA stat without IBB.  We don’t calculate game stats so we will take and display whatever number they come up with for OBP.  For historical purposes OBP is meaningless.  It is not a value stat and cannot be used for ranking purposes.

Edit  5/3/2018: Since we take  PA from official sources, in order to incorporate Intentional Walks we would have to add them back.  That violates KISS and incorporates our bias into an equation.  From now on we’ll accept that Intentional Walks don’t count and since IBBs are so few they are irrelevant to the final OBP anyway.  Walks will be simply W = BB + HBP .

That is the end of OBP talk forever.  Now that we have exhausted the O in OPS, the next part to this series will hit the S.  Until then….