
About this site

This site is a public logbook on the development of a baseball data model that measures player value and ranks players from best to worst.  The model covers the current 30 MLB franchises, their minor league affiliates, and their historical teams.  It covers all seasons and all players from 1900 – 2017.

Browse the Table of Contents for more information.  We covered the 2017 season extensively.  Not much was published here in 2016, even though the Cubs won, and posting was sporadic in the years before that, going back to the start in September 2013.

The goal of this data model is to become an app that lets a user quickly evaluate a player being talked about without knowing anything about baseball.  They can then become the smartest person in the room about that player.  There will be a handicapping component but that is a work in progress and hasn’t been proven.  We have a solid proof for the WAA measure, something WAR does not have.

Cubs Reds Matchup

Cubs start a four game series with the Reds in Cincinnati.  Let’s look at this matchup focusing on the team the Cubs face.  A new Cubs status is coming soon.  Here is the current team status for CIN.

-3.9 -52.0 315 376 28 45 0.1 -1.2 CIN

BAT is slightly underwater but the PITCH looks bad.  CIN is 5th from worst in the league; the worst is KCA at -92.  We should see more evidence of this as we drill down into the Reds team.  They have real WAA=-17.  As we progress in the season, the current talent on a team will differ from the seasonal numbers posted above.  Let’s see what the Ouija Board thinks of this matchup.

Ouija Board

DATE 06_21_7:10_PM CHN CIN

LINEAWAY CHN [ 0.612 ] < 0.615 >
STARTAWAY 1.16(0.562) Kyle_Hendricks_CHN TIER 2
LINEHOME CIN [ 0.403 ] < 0.400 >
STARTHOME -2.62(0.310) Matt_Harvey_TOT TIER 5
CHN 42 29 CIN 28 45
CHN Lineup 2 ==> CIN Starter 5 / Relief 1 == 0.591 CHN 5.06 runs
CIN Lineup 3 ==> CHN Starter 2 / Relief 1 == 0.409 CIN 4.12 runs

Cubs lineups have been fluttering between Tier 1 and Tier 2 which means they’re right on the border.  Borders don’t matter in the simulations which will be explained more later.  Hendricks is back to Tier 2.  He also is right by a border, between Tier 2 and Tier 3.  Matt Harvey is having another bad year.  NYN dumped him and the Reds need someone to pitch and not be too terrible.  At Tier 5 Harvey is considered terrible.

The markets have the Cubs as clear favorites at 0.615.  Based upon win/loss records, the Cubs are, as they say, 15 games ahead of CIN, which translates into a deltaWAA=30 and, in turn, a 0.639 expected probability for CHN.
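The bookkeeping in that paragraph can be sketched in a few lines of Python.  This is only an illustration: real WAA is taken to be wins minus losses, as it is used throughout these posts, and the function names are made up.  The step from deltaWAA to the 0.639 probability belongs to the data model and is not reproduced here.

```python
def real_waa(wins, losses):
    """Real WAA as used in these posts: wins minus losses."""
    return wins - losses


def delta_waa(team_a, team_b):
    """Difference in real WAA between two (wins, losses) records."""
    return real_waa(*team_a) - real_waa(*team_b)


chn = (42, 29)  # Cubs record from the Ouija Board above
cin = (28, 45)  # Reds record from the Ouija Board above

delta = delta_waa(chn, cin)
games_ahead = delta / 2  # "games ahead" is half the WAA difference
print(delta, games_ahead)
```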

Win/loss records are seasonal.  Tier Combo simulations reflect who is actually on the roster in the lineup, relief, and the starter for today.  The simulations give the Cubs a 0.591 advantage.  All of these numbers align with the market so both lines are a complete discard.

This brings us to Matt Harvey.  Let’s look at his career because I recall him being a pretty good pitcher once.

Matt Harvey Career

Year WAA Name_TeamID Pos Rank
2012 1.8 Matt_Harvey_NYN PITCH +187+
2013 6.4 Matt_Harvey_NYN PITCH +010+
2015 5.3 Matt_Harvey_NYN PITCH +020+
2016 -1.5 Matt_Harvey_NYN PITCH XXXXX
2017 -5.0 Matt_Harvey_NYN PITCH -010-
2018 -1.8 Matt_Harvey_NYN PITCH -020-
2018 -0.8 Matt_Harvey_CIN PITCH -020-
Total 4.4

His best years were 2013 and 2015.  His worst season was last year and he’s on track to exceed that.  His upside capability is fantastic but hitters must have figured out how to hit him and maybe he can’t figure out how to fix that.  Let’s look at Tier Data for the Reds.

CIN Tier Data

Type Tier Name_Teamid WAA
Lineups 3 CIN 3.49
SP 5 Luis_Castillo_CIN -3.13
SP 3 Anthony_DeSclafani_CIN -0.19
SP 5 Matt_Harvey_CIN -2.62
SP 3 Tyler_Mahle_CIN 0.17
SP 5 Sal_Romano_CIN -2.75
RP 1 CIN 7.22

Lineup is Tier 3 average which is what we would expect from their BAT in team status.  Three Tier 5 starters in their rotation which is also what we would expect from their PITCH in team status.   They have a Tier 1 relief staff which is unexpected.  Let’s take a look at CIN relief.

Rank WAA Name_TeamID Pos
+041+ 2.39 Jared_Hughes_CIN PITCH
+135+ 1.39 Amir_Garrett_CIN PITCH
+163+ 1.22 Raisel_Iglesias_CIN PITCH
+170+ 1.18 David_Hernandez_CIN PITCH
+195+ 1.03 Michael_Lorenzen_CIN PITCH
XXXXX 0.88 Dylan_Floro_CIN PITCH
XXXXX -0.04 Jackson_Stephens_CIN PITCH
-092- -1.41 Wandy_Peralta_CIN PITCH
Total 6.64

Unless a reliever is an uber closer, they usually get the short end of the stick when it comes to most baseball analysts.  All relievers on a relief staff, whatever inning they pitch, are very valuable since as a group they usually pitch about 1/3 of each game.  A run given up in the 7th inning is equally important mathematically as a run given up in the 9th.  As we saw last year, most successful teams in the playoffs have Tier 1 relief squads.

CIN isn’t making the playoffs this season but maybe Theo can acquire one of these guys for next season or even this season somehow.  You can never have enough pitchers!

CIN Lineup

WAA Name_TeamID Pos PA 06202018
1.26 Scott_Schebler_CIN RF 220
-0.90 Tucker_Barnhart_CIN C 241
0.38 Joey_Votto_CIN 1B 313
2.25 Scooter_Gennett_CIN 2B 286
3.25 Eugenio_Suarez_CIN 3B 242
-0.99 Jesse_Winker_CIN LF 235
-0.99 Jose_Peraza_CIN SS 299
-0.53 Tyler_Mahle_CIN P 23
-0.25 Billy_Hamilton_CIN CF 244
TOTAL WAA=3.49 PA=2103 WinPct=0.532

This is a Tier 3 average lineup.  WAR has Joey Votto ranked much higher than this data model does.  Joey Votto has admitted to being a big fan of wRC+ and plays to maintain it.  His actual run production, for a Reds team that is 17 games under, is slightly above average.  WAR has him ranked #47 among all baseball players, pitchers and batters combined.  WAR also has Joey Votto ranked significantly higher than Anthony Rizzo.  Who would you rather have on your team?  That is how you evaluate a value stat.

That is all for today.  Cubs status coming soon and it’s almost time for our All Star picks once we get data as to who the fans picked.  That’s always fun.  Fans like to pick based upon players who do well for their Draft Kings teams,  which doesn’t always agree with who has been helping their real teams win.  Until then ….

What is an OPS Part 5

I have been sitting on these tables for almost a month.  In the last part of this series we meandered into run creation estimation.  Run creation uses the Total Bases (TB) stat, which is the numerator in Slugging (SLG).  On Base Percentage (OBP) is the chocolate and SLG is the peanut butter.  Together they make up the OPS butter cup, proving that, in mathematics, any two numbers can be added together to make a third.

Total Bases is a useful game stat and SLG is a helpful reference for game by game management of players.  It was shown in Part 4 of this series that TB/H, how many Total Bases per hit, converges to almost exactly 1.5 or 3/2.  It might be possible to prove that as plate appearances approach infinity, TB/H has to converge to 3/2.

If a batter is hitting 0.300 BA his SLG should be around 0.450.  More than that and he’s getting a lot of extra bases; less and he’s getting fewer than average.  Neither BA nor SLG is a value stat but they could be used to help see matchup opportunities etc.  This series has meandered into Run Creation because that’s a big aspect of how the value stat WAR is determined.  OPS is even used as a value stat by TV sports announcers.  ( Hello Jim Deshaies! )
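The 0.300 to 0.450 rule of thumb follows from SLG = TB/AB = (TB/H) x (H/AB), so with TB/H near 1.5 the expected SLG is about 1.5 times BA.  Here is a small Python sketch of that check.  The function names are invented for illustration and the 1.5 ratio is the long-run figure quoted above, not a constant of nature.

```python
TB_PER_HIT = 1.5  # long-run TB/H ratio quoted in this series


def expected_slg(batting_average, tb_per_hit=TB_PER_HIT):
    """SLG implied by a batting average if hits carry the average TB/H."""
    return batting_average * tb_per_hit


def extra_base_gap(batting_average, slugging):
    """Positive: more extra bases than average.  Negative: fewer."""
    return slugging - expected_slg(batting_average)


print(round(expected_slg(0.300), 3))
print(round(extra_base_gap(0.300, 0.500), 3))
```

A positive gap flags a hitter collecting more extra bases per hit than the league norm, which is the matchup reading described above.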

Another interesting factoid is how the number of runs scored converges to the following formula:

Runs = Hits/2 

There might be a way to prove that for an infinite number of PA the above is always true.  The very basic runs created formula, according to Wikipedia, is:

RC = (H + BB) x TB / (AB + BB)


The following table is a total compilation of data by decade.  We’ll drill down deeper in the next part to this series.  Highlighted in blue is the formula with less error for that decade.

Fun with numbers

Decade H/2 (KISS) RC
1920-1929 1.021 0.953
1930-1939 0.988 0.952
1940-1949 1.034 0.955
1950-1959 0.994 0.981
1960-1969 1.041 0.973
1970-1979 1.046 0.983
1980-1989 1.026 0.986
1990-1999 0.969 0.992
2000-2009 0.957 1.013
1920-2017 1.004 0.985

The above ratios are the estimated runs using each formula divided by the actual number of runs scored.  Not sure what the above is supposed to mean other than the RC formula using TB has been much more accurate in modern baseball.
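To make the ratio concrete, here is a short Python sketch of both estimators and the estimated/actual ratio.  It uses the basic runs created formula, RC = (H + BB) x TB / (AB + BB).  The team totals at the bottom are invented for illustration and are not taken from the data model.

```python
def runs_hits_over_two(hits):
    """The KISS estimate: runs are roughly half of hits."""
    return hits / 2


def runs_created_basic(hits, walks, total_bases, at_bats):
    """Basic runs created: (H + BB) * TB / (AB + BB)."""
    return (hits + walks) * total_bases / (at_bats + walks)


def estimate_ratio(estimated_runs, actual_runs):
    """Ratio shown in the table: estimated runs over actual runs."""
    return estimated_runs / actual_runs


# Hypothetical team-season totals, made up for this example only.
hits, walks, total_bases, at_bats, actual_runs = 1400, 550, 2200, 5500, 720

kiss = estimate_ratio(runs_hits_over_two(hits), actual_runs)
rc = estimate_ratio(runs_created_basic(hits, walks, total_bases, at_bats), actual_runs)
print(round(kiss, 3), round(rc, 3))
```

A ratio above 1 means the formula overestimates runs and a ratio below 1 means it underestimates them.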

Below are the total averages for BA, OBP, and SLG for each decade in our study.  Not sure how this is relevant but added to provide some perspective.  In theory another column could have been added …. but didn’t want to confuse things.  All three stats in the below table are useful on their own.  Mashing them up into some other number obfuscates their benefit as a game stat.

Decade BA OBP SLG
1920-1929 0.285 0.336 0.397
1930-1939 0.279 0.336 0.399
1940-1949 0.260 0.327 0.368
1950-1959 0.259 0.327 0.391
1960-1969 0.249 0.311 0.374
1970-1979 0.256 0.320 0.377
1980-1989 0.259 0.320 0.388
1990-1999 0.265 0.331 0.410
2000-2009 0.265 0.332 0.424
2010-2017 0.255 0.318 0.405
1920-2017 0.262 0.325 0.395

The next table shows some interesting ratios from each decade.  TB/H converges to almost 1.5 and Hits/Walks converges to almost 5/2.  There are around 1/8, or 12.5%, more recorded plate appearances (PA) than at bats (AB).  We covered these stats earlier.  Sac Flies and Bunts and all the other little stuff are negligible variables that should be eliminated.  Walks are the reason for that 1/8 difference.  Since this model uses the official PA stat, IBBs are not included.  Those are negligible as well (i.e. they don’t matter in the big picture).

Decade PA/AB H/W TB/H
1920-1929 1.129 3.040 1.391
1930-1939 1.116 2.873 1.432
1940-1949 1.126 2.417 1.415
1950-1959 1.130 2.352 1.509
1960-1969 1.120 2.510 1.503
1970-1979 1.125 2.488 1.471
1980-1989 1.120 2.593 1.498
1990-1999 1.128 2.443 1.548
2000-2009 1.126 2.458 1.597
2010-2017 1.115 2.555 1.590
1920-2017 1.123 2.541 1.507

The next part of this series will break all this down where error can be measured from season to season, team to team, player to player.  Spoiler Alert:  The official RC formula with TB has far less error than Hits/2 on a season to season and team to team basis.   The two formulae are equal on a player to player basis, each having around a +/- 20% error compared to actual runs that scored.

More on this data and methodology to come.  Until then ….

Cubs Dodgers Matchup

Tonight the Cubs start a series with the Dodgers at Wrigley.  Let’s take a look at LAN so far this season and what the Ouija Board thinks of these two teams.  Here is current team status for LAN.

21.3 31.5 334 278 37 33 -0.8 4.0 LAN

Above average PITCH and BAT making this a well balanced team.  At +4 real WAA they are, like the Cubs, under performing the PE estimate based upon run differential which is around +12 for LAN.  Let’s see how this plays out with the Ouija Board.
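For readers who want to check the run differential estimate, here is a Python sketch.  It assumes PE means the Pythagorean expectation, RS^x / (RS^x + RA^x).  The exponent this data model uses is not stated in the post, so the classic value of 2 is an assumption and is left as a parameter.

```python
def pythagorean_win_pct(runs_scored, runs_allowed, exponent=2.0):
    """Expected win percentage from runs scored and runs allowed."""
    rs = runs_scored ** exponent
    ra = runs_allowed ** exponent
    return rs / (rs + ra)


def expected_waa(runs_scored, runs_allowed, games, exponent=2.0):
    """Expected wins minus expected losses over the games played."""
    expected_wins = pythagorean_win_pct(runs_scored, runs_allowed, exponent) * games
    return expected_wins - (games - expected_wins)


# LAN team status line above: 334 runs scored, 278 allowed, 37-33 record.
lan_pe_waa = expected_waa(334, 278, 37 + 33)
lan_real_waa = 37 - 33
print(round(lan_pe_waa, 1), lan_real_waa)
```

With common exponents the LAN estimate lands in the +12 to +13 range, against a real WAA of +4.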

Ouija Board

DATE 06_18_8:05_PM LAN CHN

LINEAWAY LAN [ 0.545 ] < 0.524 >
STARTAWAY 0.59(0.546) Kenta_Maeda_LAN TIER 3
LINEHOME CHN [ 0.500 ] < 0.500 >
STARTHOME 0.02(0.501) Tyler_Chatwood_CHN TIER 3
LAN 37 33 CHN 40 28
LAN Lineup 1 ==> CHN Starter 3 / Relief 1 == 0.461 LAN 4.45 runs
CHN Lineup 2 ==> LAN Starter 3 / Relief 2 == 0.539 CHN 4.86 runs

Update 6/19/2018: There was a bug in the way data propagated that caused an error in the above.  The sim results should be 0.497 CHN, 0.503 LAN making this a clear discard.  Not sure how this happened but after the rain delay the results were different on the second day.  This is still a work in progress.  Cubs lost this game 4-3.

Still experimenting with colors to make this more readable.  The market set this game pretty much even steven with Dodger bettors paying the juice.  Both Tier Combo simulations and DeltaWAA give the Cubs a clear advantage but less than our 0.07 margin.  If you’re a Cubs fan it might be OK to bet this for fun since Cubs lines have not been a betting opportunity that often these last few years.  Our algorithm would kick both lines, however; do not bet.

LAN is fielding a Tier 1 lineup and Cubs’ lineup dropped to Tier 2 the last few days.  Cubs still have a Tier 1 Relief staff and both starters are mediocre Tier 3.  Let’s look at how Tier data for LAN breaks down.  Tier data for CHN was shown in the last Cubs status post a couple days ago.

LAN Tier Data

Type Tier Name_Teamid WAA
Lineups 1 LAN 7.75
SP 4 Caleb_Ferguson_LAN -0.88
SP 3 Kenta_Maeda_LAN 0.59
SP 1 Ross_Stripling_LAN 3.53
SP 3 Alex_Wood_LAN -0.29
RP 2 LAN 4.39

Above is how Lineups, Relief (RP), and starters break down for the Dodgers so far this season.  Let’s look at the Dodgers lineup from the last game.

LAN Lineup

WAA Name_TeamID Pos PA 06182018
1.68 Joc_Pederson_LAN LF 180
1.78 Max_Muncy_LAN 1B 158
0.21 Justin_Turner_LAN 3B 90
1.28 Cody_Bellinger_LAN CF 274
1.89 Yasmani_Grandal_LAN C 225
1.36 Yasiel_Puig_LAN RF 198
0.25 Enrique_Hernandez_LAN SS 169
-0.69 Logan_Forsythe_LAN 2B 129
-0.02 Caleb_Ferguson_LAN P 1
TOTAL WAA=7.75 PA=1424 WinPct=0.604

That is a top tier league lineup with a lot of well above average hitting talent.

That is all for now.  If lines go haywire a post will be made but don’t expect that in this series.  There have been some real haywire lines with Baltimore the last few days.  Not sure the reason but we’ll get into that more later.  Still working on the plumbing to hook simulation results with a betting algorithm.  Once that’s complete we can see if any of this actually works.  Until then ….

The Simulation Part 2

This post will provide some inside baseball (pun intended) on some of the coding going on in this data model in general and the Monte Carlo simulations in particular.  Errors in compiling data will propagate into results so every effort has been made to provide inline error checking everywhere.  A script will abort if the error absolutely needs fixing and print SYSERRs into the data if that error can be fixed later.

Cleaning up SYSERRs and bugs takes most of the time.   There are still many low priority SYSERRs currently in the data set that eventually will need to be purged but they pose no harm to simulations, player evaluations or ranking.

The dataset used for the Monte Carlo simulation is all games from 1970-2016, a little over 100,000 games.  Event data from retrosheet.org, required to take day by day league snapshots, is pretty reliable in these years.  Although this data model has compiled game by game snapshots going back to around 1920, the early eras may skew simulations.

These simulations rely on three aspects: lineup value, starter value, and relief value.  In pre-1970 years, and perhaps pre-1980, relief pitching was not valued.  Washed up pitchers were put in relief and usually used when the game was already lost and everyone was thinking about what bars to hit afterward.  Starters pitched many complete games and often more than 300 innings per season.  Since relief is such an important aspect of modern baseball, it may not be wise to include seasons earlier than 1970 in the simulation dataset.  These simulations will be used to evaluate modern baseball.  Even between 1970-1980 relief wasn’t valued like it is today.

March and April games are excluded each year because there isn’t enough current year data to accurately value players.  The error eliminated by excluding those months exceeds the error incurred from a slimmer dataset.

The Simulation

Each game consists of four pairs:

  1. away lineup -> home starter (l-s)
  2. away lineup -> home relief (l-r)
  3. home lineup -> away starter (l-s)
  4. home lineup -> away relief (l-r)

There are two pair types, lineup -> starter (l-s) and lineup -> relief (l-r).

Starters and groups of players for lineup and relief are placed into 5 tiers based upon league averages and standard deviations.  These calculations are made at the beginning of every day for every team between 1970-2016.  Each pair type will have its own distribution which consists of many thousands of games.  The average pair type 3-3 , average against average, will have distributions representing the most games.
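A rough Python sketch of tiering by league average and standard deviation follows.  The post does not give the actual cutoffs, so the boundaries below (1.0 and 0.5 standard deviations on either side of the mean) are hypothetical placeholders, as are the names.

```python
import statistics

# Hypothetical cutoffs in standard deviations from the league mean.
# The real boundaries used by the data model are not given in the post.
TIER_CUTOFFS = [(1.0, 1), (0.5, 2), (-0.5, 3), (-1.0, 4)]


def assign_tier(value, league_values):
    """Place a WAA value into tiers 1 (best) to 5 (worst)."""
    mean = statistics.mean(league_values)
    sd = statistics.pstdev(league_values)
    z = (value - mean) / sd
    for cutoff, tier in TIER_CUTOFFS:
        if z >= cutoff:
            return tier
    return 5


# Toy league with mean 0 and standard deviation 2.
league = [-3, -2, -1, 0, 1, 2, 3]
print([assign_tier(v, league) for v in (3, 1.5, 0, -1.5, -3)])
```

Whatever the real cutoffs are, Tier 3 sits around the league mean, which is why it is called average in these posts.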

The simulation will run millions of iterations.  Each iteration will randomly grab a pair of numbers from the lineup -> starter distribution.  This pair is the number of runs and the innings pitched by a starter from an actual game in the dataset.  An average runs/out given up from a random game will be returned for the lineup -> relief pair lookup.

The number of relief innings is determined by the starting innings pitched returned from the  l->s set.  Multiply that by the  average relief runs/out, add to starter runs given up, and you have total runs for a team.  Do this for home and away.  The lineup with the most runs wins that iteration, the other team loses.  Do this millions of times and you converge to a win/loss percentage as well as average runs per game.
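The loop described above can be sketched in Python as follows.  This is a toy version under stated assumptions, not the model's code: starter workload is counted in outs, every game is treated as 27 outs, ties are skipped, and the tiny distributions at the bottom are invented stand-ins for the thousands of real games per tier pair.

```python
import random

OUTS_PER_GAME = 27  # assumption: every game is nine full innings


def team_runs(ls_games, lr_rates, rng):
    """Runs for one lineup in one iteration.

    ls_games: (runs, starter_outs) pairs from lineup -> starter games.
    lr_rates: relief runs-per-out values from lineup -> relief games.
    """
    starter_runs, starter_outs = rng.choice(ls_games)
    relief_outs = max(0, OUTS_PER_GAME - starter_outs)
    return starter_runs + relief_outs * rng.choice(lr_rates)


def simulate(away_ls, away_lr, home_ls, home_lr, iterations, seed=0):
    """Return (away win pct, away runs per game, home runs per game)."""
    rng = random.Random(seed)
    away_wins = home_wins = 0
    away_total = home_total = 0.0
    for _ in range(iterations):
        away = team_runs(away_ls, away_lr, rng)
        home = team_runs(home_ls, home_lr, rng)
        away_total += away
        home_total += home
        if away > home:
            away_wins += 1
        elif home > away:
            home_wins += 1
        # exact ties are not counted (an assumption of this sketch)
    decided = away_wins + home_wins
    win_pct = away_wins / decided if decided else 0.5
    return win_pct, away_total / iterations, home_total / iterations


# Invented toy distributions, deliberately lopsided so the result is obvious.
away_ls, away_lr = [(4, 15), (5, 18), (6, 12)], [0.20, 0.25]
home_ls, home_lr = [(1, 21), (2, 18), (0, 24)], [0.05, 0.10]
result = simulate(away_ls, away_lr, home_ls, home_lr, iterations=10_000)
print(result)
```

The toy data is built so the away lineup always outscores the home lineup.  Real tier pair distributions overlap heavily, which is what produces win percentages like the 0.591 and 0.568 shown in the matchup posts.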

More detail will be provided in subsequent parts to this series.

The number of iterations in a simulation determines its error.  When the code was first working, 1000 iterations took 15 seconds per game and produced a large error, which was described in this post.  At that rate 10K iterations would take about 3 minutes, 100K about 30 minutes, etc., which was totally unacceptable.  It could take 8 hours to produce results for one day of baseball.  Multiply that by 180 days in a baseball season and that gets ugly.
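How fast the error shrinks with more iterations can be estimated with the standard error of a proportion, sqrt(p(1-p)/n).  The Python sketch below assumes each iteration is an independent win/loss draw, which is a simplification.  The function name is made up.

```python
import math


def win_pct_standard_error(p, iterations):
    """Standard error of a simulated win percentage p after n iterations."""
    return math.sqrt(p * (1 - p) / iterations)


for n in (1_000, 10_000, 100_000, 1_000_000):
    print(n, round(win_pct_standard_error(0.5, n), 4))
```

For an even matchup the error is roughly 0.016 at 1,000 iterations and 0.0005 at 1,000,000.  A thousand iterations cannot resolve differences much smaller than the 0.07 betting margin, while a million can.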

Next step was figuring out how to use parallelism and eliminate waste in loops.  What I found was quite amazing as to how simple the solution to this problem was.

The Bug

Now let’s get into some inside baseball perl script nerd talk.

The game results for each pair type get pushed into an array.  These arrays can often hold data from more than 10,000 games each.  The beginning of the script reads the entire 100K game dataset and populates two hashes of arrays, %lstierlookup and %lrtierlookup, keyed by tier pair.

Later in the script when it has to do a lookup it must index an array like this:

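# rand( @myarray ) returns a float in [0, number of elements), so int covers every index
$myindex = int rand ( @myarray ) ;
# a single element takes the $ sigil; @myarray[ ... ] would be a one element slice
$mylookup = $myarray[ $myindex ] ;

The variable $myindex is a random integer from 0 up to the last index of the array, so it ends up indexing a random item in @myarray, which returns variables from a real game in the dataset.  Not too complicated.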

Often when I can’t figure out syntax to something I use the Keep It Simple Stupid method and use a work around.  I couldn’t figure out the syntax to index an array hash so I did this.

my @lshomearray = @{$lstierlookup{$homerlstierkey}} ;
my @lsawayarray = @{$lstierlookup{$awaylstierkey}} ;
my @lrhomearray = @{$lrtierlookup{$homerlrtierkey}} ;
my @lrawayarray = @{$lrtierlookup{$awaylrtierkey}} ;

Now the array can be easily indexed using the syntax shown above.  The perl code, however, actually copied the entire 10K+ items in each array into the new array.  This was inside the iteration loop, which means 1000 iterations meant 4000 array copies.  The Keep It Simple Stupid method of being lazy and not looking up the proper syntax trumped the Keep It Simple Stupid method when it comes to the CPU cycles required to process the loop.

Now 1,000,000 iterations takes about 10 seconds and has virtually no error due to simulation.  I can run the same sim 100 times and see virtually no variation in the outcome.

The proper syntax for indexing a hashed array is:

my %hasharray = () ;
my $indexed_value = $hasharray{$mykey}[$myindex] ;

And that was all there was to it.  More detail on these simulations will be explained in subsequent parts to this series.  The above is the gist of how the win/loss percentages are calculated.

There is a problem with strict tiering.  For example, two Tier 3 players could be on opposite ends of their boundaries or a Tier 4 and a Tier 3 player could be almost equal in value, only separated by an arbitrary boundary.  This problem has been solved using a somewhat different method.

The four pairs (home and away, l-s, l-r) and what those distributions store and return to determine wins and losses remain the same as described above.  All of these simulations rest upon the foundation of this data model and how it assigns value to players and groups of players.  More on this later.  Until then ….

Cubs Cardinals Matchup

Cubs start a weekend series with the Cardinals in St. Louis.   Let’s look at the current status of SLN and who the Cubs pitchers and batters will face.

-0.2 24.9 132 105 17 12 3.0 -0.7 SLN 5/4/2018
-13.5 32.9 281 263 36 30 -0.8 -0.6 SLN 6/15/2018

Cardinals’ pitching up, hitting down and they went from a real WAA=+5 to +6 since May 4, the last time Cubs faced them.  SLN has been moving sideways.  Moving sideways is preferable to moving down and they are very much in contention — as they almost always are throughout every season.  Let’s see what the Ouija Board thinks of these two teams.

Ouija Board

DATE 06_15_8:15_PM CHN SLN

LINEAWAY CHN [ 0.524 ] < 0.519 >
STARTAWAY 3.42(0.700) Jon_Lester_CHN TIER 1
LINEHOME SLN [ 0.524 ] < 0.505 >
STARTHOME 2.90(0.670) Michael_Wacha_SLN TIER 1
CHN 38 27 SLN 36 30
CHN Lineup 1 ==> SLN Starter 1 / Relief 3 == 0.568 CHN 4.45 runs
SLN Lineup 3 ==> CHN Starter 1 / Relief 1 == 0.432 SLN 3.75 runs

The market started out with a “we don’t know” at 0.524 for each team. They narrowed the juice after betting began. DeltaWAA, based entirely on win/loss record, gives the Cubs a 0.542 expected probability. Tier Combo simulations give the Cubs 0.568 today. The Cubs are fielding a perfect 1-1-1 lineup-starter-relief set. SLN has a Tier 3 average lineup and Tier 3 average Relief, which is the reason they’re underdogs today. Tomorrow will be different with different starters. Lineup and Relief tiers don’t usually change that much on a day to day basis.

The Cubs could be a betting opportunity but the margin between 0.519 and 0.568, about 0.05, is below our threshold of 0.07.  These lines can move during the day and Cardinals fans have been known to bet on their team because they usually have a good chance of winning.  Stats aside, the Cubs play the Cardinals pretty much 50/50 no matter what the mismatch in talent is.
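The discard rule used in these matchup posts can be sketched in Python as below.  The 0.07 threshold comes from the post.  Comparing the simulated probability directly against the market line is an assumption about how the eventual betting algorithm will work, and the names are made up.

```python
MARGIN_THRESHOLD = 0.07  # margin quoted in these posts


def is_betting_opportunity(sim_prob, market_prob, threshold=MARGIN_THRESHOLD):
    """True when the simulation beats the market line by at least the margin."""
    return (sim_prob - market_prob) >= threshold


# Tonight's Cubs line: simulation 0.568 against a market line of 0.519.
edge = 0.568 - 0.519
print(round(edge, 3), is_betting_opportunity(0.568, 0.519))
```

The edge here is about 0.05, so the line is a discard, matching the call made above.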

Past results do not affect future results and past results from decades of games and different players really do not matter.  The current players matter.  If the past mattered the Cubs would have never won a World Series or the NLDS three years in a row.  Let’s look at Tier Data for SLN.

Tier Data

Type Tier Name_Teamid WAA
Lineups 3 SLN 1.09
SP 2 Jack_Flaherty_SLN 1.20
SP 1 Carlos_Martinez_SLN 2.14
SP 1 Miles_Mikolas_SLN 3.30
SP 1 Michael_Wacha_SLN 2.90
SP 4 Luke_Weaver_SLN -0.73
RP 3 SLN 3.19

Three Tier 1 starters who the Cubs will probably face.  Excellent starting rotation.  Since SLN PITCH in team status was above league average, but not that far above, the shortfall came from their relief staff, who are Tier 3, average.

SLN Relief

Rank WAA Name_TeamID Pos
+089+ 1.66 Jordan_Hicks_SLN PITCH
XXXXX 0.59 Mike_Mayers_SLN PITCH
XXXXX 0.55 John_Brebbia_SLN PITCH
XXXXX 0.48 Bud_Norris_SLN PITCH
XXXXX 0.38 Sam_Tuivailala_SLN PITCH
XXXXX 0.06 Austin_Gomber_SLN PITCH
XXXXX -0.53 Brett_Cecil_SLN PITCH
Total 3.19

Almost all average or above average pitchers.  Cardinals can improve this throughout the season.  And finally, below is the lineup SLN fielded their last game.  Don’t want to wait around for tonight’s lineup since it will probably be very close.

SLN Lineup

WAA Name_TeamID Pos PA 06132018
-0.02 Harrison_Bader_SLN RF 128
1.41 Tommy_Pham_SLN CF 248
1.41 Jose_Martinez_SLN 1B 261
0.25 Marcell_Ozuna_SLN LF 257
0.02 Yadier_Molina_SLN C 148
-0.10 Jedd_Gyorko_SLN 3B 140
-0.10 Yairo_Munoz_SLN SS 92
-1.62 Kolten_Wong_SLN 2B 177
-0.15 Luke_Weaver_SLN P 24
TOTAL WAA=1.09 PA=1475 WinPct=0.514

This is considered a Tier 3 or league average lineup.  Looks like their worst starter, Luke Weaver pitched 2 days ago so the Cubs will be facing Tier 1 or Tier 2 starters the next two games.   For all these lineup tables, the WAA value assigned pitchers is their value as a hitter (BAT).  Their PITCH value is a completely separate record and their BAT value does not get counted towards their career value.  That would be mean.  :-)

That is all for now.  A post will be made if the lines for this series go haywire.  Currently working on a league scan for current day betting opportunities.  A minor league update is coming soon and All Star picks at the end of the month.  Until then ….