Historical Baseball Data Part 1

We talk a lot here about home field advantage with respect to the data coming from the betting markets.  The 0.540 home field advantage number that has been touted here forever is not entirely accurate.  If you add up all the home wins since 1970 and divide by the total number of games, it comes to 0.540 exactly, with away teams winning 0.460.  It is a little more complicated than that, and we have been working on scripts to parse this data a little differently.  This may be a very long multi-part series.

The Data

In order to do this study we used game logs from retrosheet.org, which is a fantastic site for historical baseball data.  You too can download this data here.  Scripts had to be written to parse the comma-delimited fields and then compute wins, losses, and everything else on a per-season basis.  We chose to go back to 1970, which is almost 50 years of baseball or over 100,000 games.  This should be enough for now, but we could go back to 1911 if necessary.  The more data we have, the more accurate our findings will be.
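For anyone who wants to try this at home, here is a minimal sketch of the parsing step in Python.  It is not the script used for this study.  The column positions are an assumption based on our reading of the Retrosheet game log field guide (date, visiting team, home team, visiting score, home score), and the names Game, parse_game_log, and home_win_pct are made up for illustration.

```python
import csv
from collections import namedtuple

# Assumed 0-based column positions (check them against the Retrosheet
# game log field guide): date, visiting team, home team, visiting
# score, home score.
DATE, AWAY, HOME, AWAY_RUNS, HOME_RUNS = 0, 3, 6, 9, 10

# Hypothetical record type holding only the fields this study needs.
Game = namedtuple("Game", "date away home away_runs home_runs")


def parse_game_log(lines):
    """Turn comma-delimited game log lines into a list of Game records."""
    games = []
    for row in csv.reader(lines):
        if len(row) <= HOME_RUNS:
            continue  # skip blank or truncated lines
        games.append(Game(row[DATE], row[AWAY], row[HOME],
                          int(row[AWAY_RUNS]), int(row[HOME_RUNS])))
    return games


def home_win_pct(games):
    """Home wins divided by games that had a winner (ties are ignored)."""
    decided = [g for g in games if g.home_runs != g.away_runs]
    if not decided:
        return None
    home_wins = sum(1 for g in decided if g.home_runs > g.away_runs)
    return home_wins / len(decided)
```

parse_game_log accepts any iterable of text lines, so an open file handle for one season's log works as well as a list of strings.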

One question I had was how often does a 0.600 team win against a 0.400 team?  The answer isn’t very obvious.  A 0.600 team won 60% of its games against teams of all calibers.  Intuitively you could guess that a 0.600 team playing another 0.600 team would win half the time since they’re of equal status.  Good teams beat up on bad teams and play about even with, or lose to, better teams.

First let’s look at win-loss records using a team’s real WAA.  As has been shown here many times, WAA = Wins – Losses.  It’s as simple as that.  A team that is 60 – 80 for the season has a WAA = 60 – 80 = -20.  That is a 100% accurate measure that the commissioner of MLB uses to determine who goes to the playoffs.

When two teams play each other that game has a deltaWAA where

deltaWAA = abs ( WAA(home team) – WAA(away team) )

Not too complicated!  abs means absolute value, which means that if the subtraction comes out negative we flip it to positive.  We only want to know the size of the difference.  When the Cubs started their series with the Diamondbacks, which we described here, ARI had a WAA=+15 and the Cubs had a WAA=+8 so

deltaWAA = abs ( 8 – 15 ) = 7

Arizona has the better record and higher WinPct.  Logically if no other information is known about this game one would guess they would be favored.
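The two formulas above are small enough to write down directly.  This is only a sketch in Python, and the function names waa and delta_waa are ours for illustration.

```python
def waa(wins, losses):
    """Wins Above Average as defined in this post: wins minus losses."""
    return wins - losses


def delta_waa(home_waa, away_waa):
    """Absolute difference between the two teams' WAA for one game."""
    return abs(home_waa - away_waa)


# The 60 - 80 team from the text.
print(waa(60, 80))        # -20

# The Cubs (+8) against Arizona (+15).
print(delta_waa(8, 15))   # 7
```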

We ran numbers for the last 100,000+ games since 1970.  March and April games were excluded because team records are too volatile at the beginning of a season.  This reduces the number of games to around 86,000.  Below is the table of results.

Category   # Games   Total WinPct   Home WinPct   Away WinPct
1-3        11583     0.508          0.507         0.508
4-6        11239     0.542          0.543         0.541
7-9        10064     0.565          0.565         0.565
10-12      8782      0.582          0.587         0.575
13-15      7563      0.599          0.589         0.610
16-18      6462      0.616          0.617         0.614
19-21      5261      0.619          0.616         0.621
22-24      4596      0.631          0.626         0.636
25-27      3729      0.643          0.627         0.659
28-30      3109      0.639          0.648         0.630
31-33      2490      0.648          0.646         0.651
34-36      2170      0.649          0.644         0.655
37-39      1651      0.676          0.674         0.678
40-42      1327      0.664          0.668         0.660
43-45      1041      0.664          0.644         0.686
46-48      3886      0.711          0.694         0.731

Update 8/4/2017: There was a bug in the script used to generate the above table.  It should be correct now.  The 0.533 WinPct for the example game used below is now 0.565.  The bugged script tossed March and April games but also left their wins and losses out of each team’s win-loss record.  Everything looks more balanced now.
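To make the fix concrete, here is a rough Python sketch of the corrected logic: March and April games are kept out of the tally, but their results still count toward each team's running record.  This is not the actual script.  The input format (month, home, away, home_won) and the name tally_by_delta are made up, and we assume the WAA used for a game is each team's record going into it and that games with a deltaWAA of 0 are skipped since neither team is the better one.

```python
from collections import defaultdict


def tally_by_delta(games, skip_months=(3, 4)):
    """Tally how often the higher-WAA team wins, keyed by deltaWAA.

    games: (month, home, away, home_won) tuples for a single season in
    chronological order (a hypothetical format for this sketch).
    Returns {deltaWAA: [wins by the higher-WAA team, games]}.
    """
    waa = defaultdict(int)                  # running wins minus losses
    results = defaultdict(lambda: [0, 0])

    for month, home, away, home_won in games:
        delta = abs(waa[home] - waa[away])

        # Only games outside the skipped months go into the table.
        if month not in skip_months and delta > 0:
            better_is_home = waa[home] > waa[away]
            results[delta][1] += 1
            if better_is_home == home_won:
                results[delta][0] += 1

        # The fix: every game, skipped or not, updates the records.
        waa[home] += 1 if home_won else -1
        waa[away] += -1 if home_won else 1

    return dict(results)
```

The bugged version would be the same loop with the record update moved inside the month check, which starts every team at 0 on May 1.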

In order to make more sense of this data we clumped deltaWAAs into groups of 3, which is shown in the category column.  The # Games column shows how many games were evaluated for that category, and for this study we show a WinPct for all games (Total) and for Home and Away.  If you do a quick scan of this you’ll notice home field advantage and away field disadvantage disappear with higher deltaWAAs.  You’ll also notice that as deltaWAA gets higher the number of games decreases.  Most games are relatively evenly matched and those lopsided matchups are rare.  Because there are so few games (<1000) in some categories the WinPct calculation will have more error.
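The grouping into 3s is simple integer arithmetic.  Here is a small Python sketch, where the function name category is ours and the lookup dictionary holds just the first three Total values copied from the table above.

```python
def category(delta, width=3):
    """Map a deltaWAA to its table category label, e.g. 7 -> '7-9'."""
    if delta < 1:
        return None  # the table starts at 1, so a deltaWAA of 0 has no row
    low = ((delta - 1) // width) * width + 1
    return f"{low}-{low + width - 1}"


# First three rows of the Total column from the table above.
TOTAL_WINPCT = {"1-3": 0.508, "4-6": 0.542, "7-9": 0.565}

print(category(7))                 # 7-9
print(TOTAL_WINPCT[category(7)])   # 0.565
```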

In the Cubs/Diamondbacks example above there was a deltaWAA of 7, which corresponds to the 7-9 row.  The team with the higher WAA (Arizona) has a 0.565 winning percentage for that game according to the historical data.  Arizona lost that game but that doesn’t prove or disprove anything.

If you wanted to know the probability of a coin landing heads you could flip it 1000 times, 10,000 times, etc. and the more flips the more the result will converge to what we know is the true underlying probability of 0.500 or 1/2.  It is possible however to change that probability by controlling aspects of the flip, like using a mechanical flipper in a vacuum.  Then the underlying probability is not 1/2 anymore.  The more controlled the flip the more you can make it heads every time.
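One way to put a number on "the more flips the better" is the textbook standard error of a proportion, sqrt(p * (1 - p) / n).  This is a standard statistics formula we are bringing in for illustration, not something computed in this study, and it assumes every flip (or game) is independent with the same underlying probability, which is only roughly true for baseball.  The function name standard_error is ours.

```python
import math


def standard_error(p, n):
    """Standard error of an observed proportion p over n independent trials."""
    return math.sqrt(p * (1 - p) / n)


# Coin flips: ten times the flips shrinks the error by about a factor of 3.
print(round(standard_error(0.5, 1000), 4))    # 0.0158
print(round(standard_error(0.5, 10000), 4))   # 0.005

# Two rows from the table above: the fewer games, the larger the error.
print(round(standard_error(0.508, 11583), 4))  # 1-3 row
print(round(standard_error(0.664, 1041), 4))   # 43-45 row
```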

A baseball game is far more complicated than a coin toss but as with a coin toss, the more knowledge one has about the game the better one can estimate the true underlying probability.  The betting markets do a very good job because they’re driven by lots and lots of people with intricate knowledge of every game.  They aren’t always correct and sometimes they are way off.

We follow the Cubs here because we’re Cubs fans and it’s beneficial to talk through this data.  In future parts of this series we’ll dive down into the WAAs generated by this model.  What is the WinPct for games using deltaWAAs for starting pitching?  When Lester pitched yesterday the deltaWAA for starting pitching was around 1.5 in favor of the Cubs.  According to the table in the next part the Cubs would have a 0.530 WinPct in that situation.  We’ll also explain the 0.600 team vs. 0.400 team example in a future episode.

It gets more and more complicated as we travel down this rabbit hole.  In the next part of this series we’ll show tables for starting pitchers and lineups, which represent differences in PITCH and BAT for each team.  The Cubs play another game with ARI tomorrow and have to face Zack Greinke, who is having a career year again.  Ouch!  Until then….