
Cubs Brewers Matchup

The Cubs start a short two-game series with the Brewers today. Let's look at the Brewers to get an idea of what the Cubs are up against.

MIL Team Status

BAT PITCH Rs Ra W L UR LR TeamID DATE
-0.9 38.2 284 242 39 26 2.4 2.2 MIL 6/11/2018
-0.6 20.8 533 507 67 54 1.6 4.2 MIL 8/14/2018

The Brewers were +13 two months ago and they're +13 today.  Their BAT remained the same but their PITCH tanked a little.  Their PE estimate suggests they should be at around +6, so they are exceeding expectations with respect to run differential.  The MLB commissioner, however, only looks at the win/loss column when determining who goes to the playoffs.
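
As a reference for that PE number, here is a minimal sketch using the classic exponent-2 Pythagorean expectation and the run totals from the table above (this blog's exact PE variant may differ):

use strict ;
use warnings ;

# Pythagorean expectation: expected win% = Rs^2 / (Rs^2 + Ra^2)
my ( $rs, $ra, $games ) = ( 533, 507, 121 ) ;    # MIL 8/14/2018 row above
my $winpct = $rs**2 / ( $rs**2 + $ra**2 ) ;
my $exp_w  = $winpct * $games ;
printf "PE win%%: %.3f, expected record %.0f-%.0f (%+.0f)\n",
    $winpct, $exp_w, $games - $exp_w, 2 * $exp_w - $games ;    # prints +6

Let's hear what the people think of these two teams.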

Ouija Board

DATE 08_14_2:20_PM MIL CHN

LINEAWAY MIL [ 0.455 ] < 0.435 > +130 $230
STARTAWAY 0.59(0.519) Jhoulys_Chacin_MIL TIER 3
--------------------------------------------
LINEHOME CHN [ 0.583 ] < 0.583 > -140 $171
STARTHOME -0.29(0.489) Jose_Quintana_CHN TIER 3
--------------------------------------------
MIL 67 54 CHN 68 49
DELTAWAA 6 WINPCT 0.542 CHN
--------------------------------------------
TIER COMBOS
MIL Lineup 2 ==> CHN Starter 3 / Relief 2 == 0.495 MIL 4.67 runs
CHN Lineup 1 ==> MIL Starter 3 / Relief 2 == 0.505 CHN 4.73 runs
--------------------------------------------
EXPECTED VALUE
deltaWAA EV MIL 105 CHN 93
TCsim EV MIL 114 CHN 86

The Tier Combo simulations call this an almost even-steven game.  The market favors the Cubs with a 58.3% break-even probability of winning today.   Expected Value based upon Tier Combo simulations gives MIL 114, which is well above 100 but below our 120 threshold.  Both lines are a discard.
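
For anyone who wants to reproduce the numbers on the board, the moneyline arithmetic is simple.  A minimal sketch using the standard American-odds conversion; the $230 and $171 payouts, the break-even probabilities, and the 0.495/0.505 simulation probabilities all come straight from the board above:

use strict ;
use warnings ;

# Payout on a $100 bet from an American moneyline (the board truncates to whole dollars)
sub payout    { my $line = shift ; return $line > 0 ? 100 + $line : 100 + 10000 / -$line }
sub breakeven { my $line = shift ; return 100 / payout( $line ) }

printf "MIL payout \$%d break even %.3f\n", payout( 130 ),  breakeven( 130 ) ;   # $230  0.435
printf "CHN payout \$%d break even %.3f\n", payout( -140 ), breakeven( -140 ) ;  # $171  0.583
printf "MIL TCsim EV %.0f  CHN TCsim EV %.0f\n", 0.495 * 230, 0.505 * 171 ;      # 114 and 86

Let's look at the Tier breakdown for MIL.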

MIL Tier Data

Type Tier Name_Teamid WAA
Lineups 2 MIL 8.32
SP 3 Chase_Anderson_MIL 0.38
SP 3 Jhoulys_Chacin_MIL 0.59
SP 2 Junior_Guerra_MIL 1.97
SP 3 Wade_Miley_MIL 1.74
SP 3 Freddy_Peralta_MIL -0.50
RP 2 MIL 6.07

That's a pretty mediocre starting rotation.  They have a Tier 2, above-average relief staff (RP) however.  The lineup is also Tier 2, which exceeds what would be expected from their completely league-average BAT in team status.  Team status is based upon seasonal run differential, and as the season progresses, negative players go and positive players are acquired, especially by teams in the hunt for a playoff spot.  Below are the relievers the Cubs will face according to our source for current rosters.

MIL Relief

Rank WAA Name_TeamID Pos
+034+ 3.63 Jeremy_Jeffress_MIL PITCH
+037+ 3.46 Josh_Hader_MIL PITCH
XXXXX 1.11 Dan_Jennings_MIL PITCH
XXXXX 0.46 Jacob_Barnes_MIL PITCH
XXXXX 0.08 Corbin_Burnes_MIL PITCH
XXXXX -0.63 Jordan_Lyles_TOT PITCH
XXXXX -0.69 Corey_Knebel_MIL PITCH
-185- -1.24 Aaron_Wilkerson_MIL PITCH
Total 6.18

Note:  There is a difference in total value between this table and the Tier Data table.  This is due to Jordan Lyles, whose TOT value is an estimate, but it's very close.   The sims use the Tier Data table.

Now let's look at that Tier 2 lineup the Cubs' pitching will face.

MIL Lineup

WAA Name_TeamID Pos PA Aug_12_1:35_PM
-0.92 Lorenzo_Cain_MIL CF 439
3.97 Christian_Yelich_MIL LF-RF-CF 454
5.29 Jesus_Aguilar_MIL 1B 393
1.51 Ryan_Braun_MIL LF-1B 313
-0.50 Hernan_Perez_MIL 2B-RF-3B-LF-SS 257
1.07 Mike_Moustakas_TOT 3B-DH 470
-0.15 Jonathan_Schoop_TOT 2B 402
-1.16 Manny_Pina_MIL CR 266
-0.80 Chase_Anderson_MIL PR 38
TOTAL WAA=8.32 PA=3032 WinPct=0.553

According to this data model Jesus Aguilar is right behind Javier Baez for NL MVP.   Again, past results do not affect future results.  These numbers only show the capability to estimate a probability.  Anyone who thinks their math can predict the future is peddling astrology, a field many people believe is true.

That is all for now.  Game starts in 7 minutes … Go Cubs.

How often does the leadoff man get a hit?

The other day at the pub there was an argument over whether Joe Maddon preferred beer or wine.  Since he is a spokesman for Binny's Beverage Depot and is shown holding a glass of wine, people thought that was proof of his preference for wine.   That led to a question about the Binny's leadoff man promo:  How often does the leadoff man get a hit?

Since I don't get event data for this year until December from retrosheet.org I can't tell.  Luckily there's this site called Google that has answers to every question imaginable.  Here is current-year data according to Binny's Beverage Depot.

[Image: Binny's leadoff man promo data for the 2018 season, showing 28 payouts]

As of August 10 the Cubs had played around 114 games, so 28/114 = 0.245.   Binny's has a payout ratio of around 1/4.  If you were to bet the Binny's leadoff man promo this would be good to know.  It seems low, however.   Usually managers put the hottest hitter in the leadoff spot because the leadoff hitter bats the most.  Mathematically you want your best hitter to get the most plate appearances.  There are always exceptions.

The 0.245 above is not a batting average, and since retrosheet.org does not publish current-year event data until December we don't know how many walks there have been this season at the leadoff spot.  We do know there are always exactly 0 sac bunts and sac flies for the first batter in every game.  We have event data from past years.  Here is a table compiled for the year 2017.

Year Hits Binnys % CUBS BA LEAGUE % LEAGUE BA
2017 41 0.253 0.270 0.243 0.265

The second column shows Binny's paid out on 41 hits, with a win % of almost exactly 1/4.  The LEAGUE % column is what Binny's would have paid out for every team in every 2017 game.  It is lower than the Cubs' number, which should be expected because the Cubs had a good team last year.

The fourth column incorporates walks into the BA stat and it's 0.270, also higher than league average.  But what does this mean?  There are a bunch of tables in Part 5 of our OPS series.  Scroll down to the second table and you will see the average league-wide BA from 1920 – 2017 is 0.262, which almost matches the LEAGUE BA column above.

What does that mean?  I don’t know.  Here is a complete table from 2010 – 2017.

Year Hits Binnys % CUBS BA LEAGUE % LEAGUE BA
2010 33 0.204 0.213 0.238 0.258
2011 49 0.302 0.327 0.241 0.261
2012 40 0.247 0.265 0.242 0.262
2013 37 0.228 0.245 0.236 0.257
2014 40 0.247 0.263 0.244 0.263
2015 35 0.216 0.240 0.244 0.262
2016 52 0.321 0.382 0.260 0.284
2017 41 0.253 0.270 0.243 0.265

One of the baseball constants used in this data model is 1 game = 38.4 plate appearances.  PA is a measure of playing time and that's how you convert PA into games.  Pitchers have a baseball constant, used forever, which says there are exactly 9 innings to a game.  A game is not always 9 innings, but they kept the math simple because they didn't have calculators when Ted Williams played.  In the end it doesn't matter.  Nine innings/game is close enough, just like 38.4 PA/game.

The leadoff hitter then represents 1/38.4 = 0.026 = 2.6% of all plate appearances.  This lack of data leads to a large year-to-year variation in payouts, which you can see in the above table.  Binny's had to pay out almost a third of the time in 2016.
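
That swing is roughly what the small sample predicts.  A quick back-of-the-envelope sketch, assuming ~162 games per season and a true leadoff hit probability around 1/4:

use strict ;
use warnings ;

my ( $p, $n ) = ( 0.25, 162 ) ;            # assumed true hit rate and games per season
my $sd = sqrt( $p * ( 1 - $p ) / $n ) ;    # binomial standard error of one season's payout rate
printf "season payout rate: %.3f +/- %.3f\n", $p, $sd ;   # about 0.250 +/- 0.034

Plus or minus a couple of those standard errors covers most of the 0.204 – 0.321 spread in the table above.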

Not sure what the above is supposed to mean, however.  A new series with the Brewers starts tomorrow, so we'll have a look-see into their team.  Until then ….

The Ouija Board Part 4

The Ouija Board is once again expanding.  It is called a Ouija Board here because markets are influenced by many people, resulting in an outcome that comes very close to an actual probability, as if it were controlled by the beyond or the future.  We can measure error in the market by comparing its lines to actual results (because we're from the future now).  The market, or the house line, is very accurate in most cases.

The purpose of this handicapping model is to validate the player and team evaluation math behind this data model.   If this model is a more accurate representation of the past than Sabermetrics, then our handicapping should have less error than the market, which means the market can be exploited to gain a percentage edge on the house, much like what is done counting cards in blackjack.  The house only sets the opening line; the market, all the people pushing and pulling on that Ouija Board, sets line adjustments during the day.

In Part 4 of this series a league-wide table will be introduced showing Expected Value instead of probabilities.  This concept can get very complicated but our use case is rather simple.  We only have a single probability and a single value, so our equation looks like this:

EV = P(win) * Value

In past Ouija Board sections we have been eyeballing percentages.  Our margin was 0.07 over the break-even probability in order for a line to become a betting opportunity.   The break-even probabilities in all Ouija Board sections have an expected value of $100 on a $100 bet.  This means in the long run you break even betting a line where your expected probability equals that.   It makes no sense to bet if you only break even, less so if your expected value is less than your bet.  In all games in Vegas your expected value on every bet is less than your bet.  The house always gets a cut, which pays for all the flickering lights.

Although adding a rough margin to probabilities was fine for eyeball estimates to see if we're in the right ballpark (pun not intended), it's not proper to use that kind of math in an algorithm.  We must manipulate expected values, not probabilities, which are ratios between 0 and 1.

A simple example of expected value is flipping a coin.  Heads and tails are two equal outcomes so each has a probability = 0.500 or 1/2.  If you bet $1 on heads or tails your expected value of that bet would be:

EV = P(heads or tails) * $(1+1) = 0.5 * $2 = $1

Not too complicated.  Since your expected value on a $1 bet is exactly $1, if you played this bet a trillion times you would end up losing nothing and gaining nothing.  It would be a complete waste of time.  If you had a loaded coin where you increase the probability of your pick to, say, 0.600 instead of 0.500, then your EV would be:

EV = 0.600 * $2 = $1.20

This means for every bet you will average $0.20 profit, and if you play this a trillion times you'll be very rich with virtual certainty.
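
Since everything around here eventually becomes a Monte Carlo simulation, here is a toy sketch that converges on that $0.20 edge; a million flips stands in for a trillion:

use strict ;
use warnings ;

my ( $p, $payout, $bets, $total ) = ( 0.600, 2, 1_000_000, 0 ) ;
for ( 1 .. $bets ) {
    $total += $payout if rand() < $p ;   # a win pays back the $1 stake plus $1
}
printf "average return per \$1 bet: \$%.3f\n", $total / $bets ;   # converges to ~1.20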

Like a flip of a coin, the probability behind a baseball game is just as simple.  There are two possible outcomes, home team wins or away team wins — that’s it.  If you knew nothing of either team or where they’re playing you would have to assume each team has a probability of 0.500 — like a flip of a coin.

We have more information however.  Home field advantage is historically 0.540 for the home team, 0.460 for the visiting team.  If you knew nothing other than which team is home and away you could assume the home team probability is 0.540 and the away team is 0.460.

Just knowing home/away is not good enough however.   There are differences in win/loss records and differences in starting pitching, lineups, and relief.  These are somewhat independent of each other in that win/loss record has some influence on the value of the talent playing in a single game, but we don't know how much.  Teams with great records have been known to slide into oblivion at various points in the season and vice versa.  This demise or rise would be due to the makeup of the talent on the team.  This is where Tier Combo simulations come into play, and they are the basis for our handicapping.

Update 7/13/2018:   There is a relationship between win/loss record differences (deltaWAA) and the Tier Combo simulations based upon player talent.  That relationship is unknown right now.

As has been explained in the many matchup posts here, deltaWAA is a table lookup that represents the difference in wins and losses between the two teams.  DeltaWAA is double what most people call games behind.   We chose a dataset from 1970 – 2016, counted wins and losses based upon deltaWAA, and derived a table of win/loss percentages from it.  So if deltaWAA is, say, 10, the higher team should have a probability greater than 0.500.  That is one way of handicapping two teams.
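
Mechanically the lookup is nothing more than a hash keyed on deltaWAA.  A sketch of its shape; the 6 => 0.542 entry is taken from the MIL/CHN board earlier in this archive, the 0 => 0.500 entry is illustrative, and the full table derived from the 1970 – 2016 counts is not reproduced here:

use strict ;
use warnings ;

# win% for the higher team, keyed by deltaWAA; real table covers the whole range
my %dwaa_pct = ( 0 => 0.500, 6 => 0.542 ) ;
my $deltawaa = 6 ;                            # CHN +19 WAA minus MIL +13 WAA
printf "favorite win pct: %.3f\n", $dwaa_pct{$deltawaa} ;   # 0.542 as on the board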

The Tier Combo simulations take a snapshot of value for each day of those 46 years and run the tiering calculations that are done each day for this current season.  Historical data based on the combination of talent between the two teams is turned into a distribution, which is used in simulation to estimate a win/loss percentage, an expected probability for that game.

Which probability is correct: deltaWAA or Tier Combo?   Both of these could be independent or somewhat dependent.  A team that enjoys a high deltaWAA should have high-value talent playing.  If they don't, then perhaps the Tier Combo simulation results should take precedence.

The following is a first draft of a table showing all the expected values on $100 bets for each game today, 7/12/2018.  The Cubs start a new series tomorrow with SDN so more will be explained then.  I don't like showing tables with a lot of numbers, and the below will be consolidated in the future.

Expected Values for 7/12/2018

Away Home Away simEV Home simEV Away dWAA Home dWAA dWAA Fav dWAA Pct
ARI COL 92 103 112 83 ARI 0.565
NYA CLE 93 103 115 78 NYA 0.619
TBA MIN 92 105 107 88 TBA 0.582
PHI BAL 105 90 139 56 PHI 0.711
LAN SDN 90 110 94 103 LAN 0.619
MIL PIT NEW GUY STARTER FOR MIL
SEA ANA 100 95 121 74 SEA 0.619
TOR BOS 111 90 87 104 BOS 0.664
OAK HOU 126 83 110 91 HOU 0.619
WAS NYN 81 128 89 111 WAS 0.616

We're not going to get into how the above was calculated until we do the Cubs tomorrow and can show it in detail for an individual game.   The expected values in bold blue show two possible betting opportunities.  Currently $120 is our EV threshold.  This may be relaxed in the future.   We only bet on simulation results, never deltaWAA.  DeltaWAA EVs are used to wave off bets.   Simulations are based upon the entire spectrum of teams with different win/loss records.  If there are extreme differences in wins/losses it is better to play it safe and take a pass.  This also may change as we gather more data.

Highlighted are two possible betting opportunities: Oakland and the Mets.  Each has an EV greater than 120 based upon Tier Combo simulations.  The brown-colored numbers represent their corresponding EVs based upon deltaWAA.  OAK drops to 110 and NYN drops to 111.  If you average the two, both EVs fall under 120, making both a wave-off discard.  This means there are no betting opportunities today.  It also means we can't lose. :-)
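
A sketch of that wave-off logic as applied here; the 120 threshold and the simple average are the current rules, and both may change:

use strict ;
use warnings ;

sub bet_or_pass {
    my ( $team, $sim_ev, $dwaa_ev ) = @_ ;
    return "$team pass, sim EV under threshold" if $sim_ev < 120 ;
    return "$team waved off, average under 120" if ( $sim_ev + $dwaa_ev ) / 2 < 120 ;
    return "$team betting opportunity" ;
}
print bet_or_pass( 'OAK', 126, 110 ), "\n" ;   # (126+110)/2 = 118   => waved off
print bet_or_pass( 'NYN', 128, 111 ), "\n" ;   # (128+111)/2 = 119.5 => waved off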

Since MIL is starting a pitcher without enough innings this season (Wade Miley), the entire game is tossed for lack of information.  A starter is one of the three important factors in the Tier Combo simulations.  Games do not get tossed when there are new guys in a lineup or relief.

That is all for now.  This is still a work in progress and the above EV table will become simpler to visualize in the future.  Tomorrow CHN starts a series with SDN, so a new and expanded Ouija Board will be introduced to hopefully make more sense of the above table.  Until then ….

What is an OPS Part 5

I have been sitting on these tables for almost a month.  In the last part of this series we meandered into run creation estimation.  Run creation uses the Total Base (TB) stat, which is the numerator in Slugging Ratio (SLG).   On Base Percentage (OBP) is the chocolate and SLG is the peanut butter.  Together they make up the OPS butter cup, proving that, in mathematics, any two numbers can be added together to make a third.

Total Bases is a useful game stat and SLG is a helpful reference for game-by-game management of players.  It was shown in Part 4 of this series that TB/H, how many Total Bases per hit, converges to almost exactly 1.5 or 3/2.   It might be possible to prove that as plate appearances approach infinity, TB/H has to converge to 3/2.

If a batter is hitting a 0.300 BA his SLG should be around 0.450, since SLG = BA * (TB/H) and TB/H is about 3/2.  More and he's getting a lot of extra bases; less and he's getting fewer than average.  Neither BA nor SLG are value stats but they could be used to help see matchup opportunities etc.  The reason this has meandered into Run Creation is because that's a big aspect of how the value stat WAR is determined.  OPS is even used as a value stat by TV sports announcers.  ( Hello Jim Deshaies! )

Another interesting factoid is how the number of runs scored converges to the following formula:

Runs = Hits/2 

There might be a way to prove that for an infinite number of PA the above is always true.  The very basic runs created formula, according to Wikipedia, is:

(H+W)*TB/PA
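
Both estimators are one-liners.  A sketch with a hypothetical stat line; the season totals below are made up for illustration:

use strict ;
use warnings ;

# KISS estimate (Runs = Hits/2) vs the basic Wikipedia RC formula (H+W)*TB/PA
sub rc_kiss { my %s = @_ ; return $s{H} / 2 }
sub rc_tb   { my %s = @_ ; return ( $s{H} + $s{W} ) * $s{TB} / $s{PA} }

my %season = ( H => 1400, W => 550, TB => 2200, PA => 6200 ) ;   # hypothetical team totals
printf "KISS %.0f runs, RC %.0f runs\n", rc_kiss( %season ), rc_tb( %season ) ;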

The following table is a total compilation of data by decade.  We'll drill down deeper in the next part of this series.  Highlighted in blue is the formula with less error for that decade.

Fun with numbers

Decade H/2 (KISS) (H+W)*TB/PA
1920-1929 1.021 0.953
1930-1939 0.988 0.952
1940-1949 1.034 0.955
1950-1959 0.994 0.981
1960-1969 1.041 0.973
1970-1979 1.046 0.983
1980-1989 1.026 0.986
1990-1999 0.969 0.992
2000-2009 0.957 1.013
1920-2017 1.004 0.985

The above ratios are the estimated runs using each formula divided by the actual number of runs scored.  Not sure what the above is supposed to mean other than that the RC formula using TB has been much more accurate in modern baseball.

Below are the total averages for BA, OBP, and SLG for each decade in our study.  Not sure how this is relevant but it's added to provide some perspective.  In theory another column could have been added …. but I didn't want to confuse things.  All three stats in the below table are useful on their own.  Mashing them up into some other number obfuscates their benefit as a game stat.

Decade BA OBP SLG
1920-1929 0.285 0.336 0.397
1930-1939 0.279 0.336 0.399
1940-1949 0.260 0.327 0.368
1950-1959 0.259 0.327 0.391
1960-1969 0.249 0.311 0.374
1970-1979 0.256 0.320 0.377
1980-1989 0.259 0.320 0.388
1990-1999 0.265 0.331 0.410
2000-2009 0.265 0.332 0.424
2010-2017 0.255 0.318 0.405
1920-2017 0.262 0.325 0.395

The next table shows some interesting ratios from each decade.   TB/H converges to almost 1.5 and Hits/Walks converges to almost 5/2.  There are around 1/8 ~ 12.5% more recorded plate appearances (PA) than at bats (AB).  We covered these stats earlier.  Sac flies and bunts and all the other little stuff are negligible variables that should be eliminated.  Walks are the reason for that 1/8 difference.  Since this model uses the official PA stat, IBBs are not included.  Those are negligible as well (i.e. it doesn't matter in the big picture).

Decade PA/AB H/W TB/H
1920-1929 1.129 3.040 1.391
1930-1939 1.116 2.873 1.432
1940-1949 1.126 2.417 1.415
1950-1959 1.130 2.352 1.509
1960-1969 1.120 2.510 1.503
1970-1979 1.125 2.488 1.471
1980-1989 1.120 2.593 1.498
1990-1999 1.128 2.443 1.548
2000-2009 1.126 2.458 1.597
2010-2017 1.115 2.555 1.590
1920-2017 1.123 2.541 1.507

The next part of this series will break all this down so error can be measured from season to season, team to team, and player to player.  Spoiler alert:  The official RC formula with TB has far less error than Hits/2 on a season-to-season and team-to-team basis.   The two formulae are equal on a player-to-player basis, each having around a +/- 20% error compared to the actual runs that scored.

More on this data and methodology to come.  Until then ….

The Simulation Part 2

This post will provide some inside baseball (pun intended) on some of the coding going on in this data model in general and the Monte Carlo simulations in particular.  Errors in compiling data will propagate into results, so every effort has been made to provide inline error checking everywhere.  A script will abort if an error absolutely needs fixing, and it prints SYSERRs into the data if the error can be fixed later.

Cleaning up SYSERRs and bugs takes most of the time.   There are still many low-priority SYSERRs currently in the dataset that eventually will need to be purged, but they pose no harm to simulations, player evaluations, or rankings.

The dataset used for Monte Carlo simulation is all games from the years 1970-2016, which is a little over 100,000 games.  Event data from retrosheet.org, required to take day-by-day league snapshots, is pretty reliable in these years.  Although this data model has compiled game-by-game snapshots going back to around 1920, the early eras may skew simulations.

These simulations rely on three aspects: lineup value, starter value, and relief value.  In pre-1970 years, and perhaps pre-1980, relief pitching was not valued.  Washed-up pitchers were put in relief and usually used when the game was already lost and everyone was thinking about what bars to hit afterward.  Starters pitched many complete games and often more than 300 innings per season.   Since relief is such an important aspect of modern baseball, it may not be wise to include anything earlier than 1970 in the simulation dataset.  These simulations will be used to evaluate modern baseball.  Even between 1970-1980 relief wasn't valued like it is today.

March and April games are excluded each year because there isn’t enough current year data to accurately value players.  The error eliminated by excluding those months exceeds the error incurred from a slimmer dataset.

The Simulation

Each game consists of four pairs:

  1. away lineup -> home starter (l-s)
  2. away lineup -> home relief (l-r)
  3. home lineup -> away starter (l-s)
  4. home lineup -> away relief (l-r)

There are two pair types, lineup -> starter (l-s) and lineup -> relief (l-r).

Starters and groups of players for lineup and relief are placed into 5 tiers based upon league averages and standard deviations.  These calculations are made at the beginning of every day for every team between 1970-2016.  Each pair type has its own distribution, which consists of many thousands of games.  The average pair type, 3-3, average against average, has distributions representing the most games.
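
A sketch of how a value would get binned into one of the 5 tiers; the z-score boundaries and league numbers below are placeholders, since the model's actual boundaries are not published in this post:

use strict ;
use warnings ;

# Bin a value into one of 5 tiers given that day's league mean and standard deviation
sub tier {
    my ( $waa, $mean, $sd ) = @_ ;
    my $z = ( $waa - $mean ) / $sd ;
    return 1 if $z >  1.5 ;    # boundary values here are placeholders, not the model's
    return 2 if $z >  0.5 ;
    return 3 if $z > -0.5 ;    # Tier 3 straddles the league average
    return 4 if $z > -1.5 ;
    return 5 ;
}
printf "tier %d\n", tier( 6.07, 0.0, 5.0 ) ;   # MIL relief WAA with made-up league numbers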

The simulation runs millions of iterations.  Each iteration randomly grabs a pair of numbers from the lineup -> starter distribution.  This pair is the number of runs and innings pitched by a starter in an actual game from the dataset.   An average runs/out given up, from a random game, is returned for the lineup -> relief pair lookup.

The number of relief innings is determined by the starter innings pitched returned from the l->s set.  Multiply the remaining outs by the average relief runs/out, add the starter runs given up, and you have total runs for a team.  Do this for home and away.  The lineup with the most runs wins that iteration; the other team loses.  Do this millions of times and you converge to a win/loss percentage as well as average runs per game.
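
Putting that description into code, one iteration for one team looks roughly like this minimal sketch; the toy distributions stand in for the tier-keyed arrays described in the next section:

use strict ;
use warnings ;

# One iteration for one team: sample the starter, then rate the relief over what remains
sub team_runs {
    my ( $ls_dist, $lr_dist ) = @_ ;                # refs to the two pair-type distributions
    my ( $sruns, $sip ) = @{ $ls_dist->[ int rand @$ls_dist ] } ;   # starter runs and IP
    my $runs_per_out    = $lr_dist->[ int rand @$lr_dist ] ;        # relief runs per out
    return $sruns + $runs_per_out * ( 9 - $sip ) * 3 ;              # relief outs = innings * 3
}

my @ls = ( [ 3, 6 ], [ 2, 7 ], [ 5, 5 ] ) ;   # toy [ runs, innings pitched ] game records
my @lr = ( 0.15, 0.20, 0.10 ) ;               # toy relief runs/out samples
my ( $home, $away ) = ( team_runs( \@ls, \@lr ), team_runs( \@ls, \@lr ) ) ;
print $home > $away ? "home wins this iteration\n" : "away wins this iteration\n" ;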

More detail will be provided in subsequent parts to this series.

The number of iterations in a simulation determines its error.  When the code was first working, 1000 iterations were taking 15 seconds per game and producing a large error, which was described in this post.  To do 10K iterations would take around 3 minutes, 100K 30 minutes, etc., which was totally unacceptable.  It could take 8 hours to produce results for one day of baseball.  Multiply that by 180 days in a baseball season and that gets ugly.

The next step was figuring out how to use parallelism and eliminate waste in loops.  What I found was quite amazing in how simple the solution to this problem was.

The Bug

Now let's get into some inside baseball Perl script nerd talk.

The game results for each pair type get pushed into an array.  These arrays can often hold data from more than 10,000 games each.   The beginning of the script reads the entire 100K game dataset and populates two hashes of arrays, %lstierlookup and %lrtierlookup, keyed by tier pair.

Later in the script when it has to do a lookup it must index an array like this:

my $myindex = int rand @myarray ;        # random index from 0 to $#myarray
my $mylookup = $myarray[ $myindex ] ;    # scalar element access, not an array slice

The variable $myindex is a random integer within the size of the array, so it ends up indexing a random item in @myarray, which returns variables from a real game in the dataset.  Not too complicated.

Often when I can't figure out the syntax for something I use the Keep It Simple Stupid method and use a workaround.  I couldn't figure out the syntax to index a hash of arrays so I did this.

my @lshomearray = @{$lstierlookup{$homerlstierkey}} ;   # copies the ENTIRE 10K+ item array
my @lsawayarray = @{$lstierlookup{$awaylstierkey}} ;    # ... and again
my @lrhomearray = @{$lrtierlookup{$homerlrtierkey}} ;   # ... and again
my @lrawayarray = @{$lrtierlookup{$awaylrtierkey}} ;    # ... and again

Now each array can be easily indexed using the syntax shown above.  The Perl code, however, actually transferred the entire 10K+ items in each array to the other array.  This was in the iteration loop, which means 1000 iterations meant 4000 array transfers.   The Keep It Simple Stupid method of being lazy and not looking up the proper syntax trumped Keep It Simple Stupid when it comes to the CPU cycles required to process the loop.

Now 1,000,000 iterations take about 10 seconds and have virtually no error due to simulation.  I can run the same sim 100 times and see virtually no variation in the outcome.

The proper syntax for indexing a hash of arrays is:

my %hasharray = () ;
my $indexed_value = $hasharray{$mykey}[$myindex] ;
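
Applied to the lookup hashes above, the fix is a direct index with no copy.  A self-contained sketch; the toy data mimics the structure described earlier:

use strict ;
use warnings ;

my %lstierlookup = ( '3-3' => [ [ 3, 6 ], [ 2, 7 ] ] ) ;   # toy data in the post's structure
my $key = '3-3' ;
my $n   = scalar @{ $lstierlookup{$key} } ;                # size of the distribution
my ( $runs, $ip ) = @{ $lstierlookup{$key}[ int rand $n ] } ;   # direct index, no array copy
print "sampled $runs runs over $ip innings\n" ;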

And that was all there was to it.  More detail on these simulations will be explained in subsequent parts to this series.  The above is the gist of how the win/loss percentages are calculated.

There is a problem with strict tiering.  For example, two Tier 3 players could be on opposite ends of their boundaries or a Tier 4 and a Tier 3 player could be almost equal in value, only separated by an arbitrary boundary.  This problem has been solved using a somewhat different method.

The four pairs (home and away, l-s, l-r) and what those distributions store and return to determine wins and losses remain the same as described above.  All of these simulations rest upon the foundation of this data model and how it assigns value to players and groups of players.  More on this later.  Until then ….