
Cubs Clinch

This blog has been offline for a while as work continued on the baseball-handbook site, which is starting to take shape. The purpose of this public log book was to ferret out table formats and test the code that produces them from data sources. Every table produced by this code had to be manually copied and pasted into these WordPress posts. All of this code was reused to quickly put together the clickable site, which provides player, team, and handicapping information for any team, any player, and any game since 1900 in a (hopefully) easy-to-use format.

Ironically, the Cubs clinched a playoff spot at almost the same point in the current year, 2020, as they did in 1984. This year there was little to no fanfare.

Last night in 1984, when the Cubs beat the Pirates 4-1 to clinch the NL East, Cubs fans celebrated as intensely as they did in 2016 after winning the World Series. Unlike in 2016, however, the Cubs had to play another game the next day, on September 25.

The Cubs lost that game 7-1 with all their regular players nursing hangovers or something. Reggie Patterson, who had previously pitched 1 inning in 1984, started, and the Cubs lineup went from a typical tier of 4.00 to a tier of -1.97 today.

Lineups are dynamic from game to game so that would be reflected in the handicapping. With only one inning pitched, Patterson would disqualify simulation for this game and the handicapping would be a discard. One of the problems with simulation is that it only takes into account data model factors and not other influences, like most players not playing with their normal competitive mindset. Factors like this would be difficult if not impossible to model mathematically.

Cubs were cruising to the playoffs after our matchup post over a month ago so it was only a matter of time before it became official. Memories of the 1969 season lingered but collapses like that are unusual and considered anomalies with low probability of occurrence.

Dredging up 1984 was in response to no baseball this year. Since then MLB has concocted an exhibition season which is coming to a close in a couple of days. The Cubs and White Sox are both in the postseason and both teams have interesting profiles.

By calendar days, if this were a normal year, we would be at the beginning of June. A normal baseball season is a marathon enveloping summer. This year it is a sprint. This data model right now has just enough data to start handicapping and the season is almost over.

We’ll cover both 2020 and 1984 playoffs with commentary and links into baseball-handbook data which are already up for 1984 playoffs.

The days of posting Cubs matchups and status here are over as that is all done automatically with more colorful tables for any team you choose. This log book will still stay active for various coding rants as the handbook site develops and becomes an app for that. There’s also an issue with WAR and 1969 Ernie Banks which is rather peculiar. Until then ….

Baseball-Handbook.com

The very first draft of baseball-handbook.com is up and running. This will be a work in progress from now until eternity but the next few months will show the greatest improvement.

Right now careers are working well. The Today tab, which will show current data, will take a while because we don’t yet have enough current data. It is currently evaluating lineups, starters, and relief using 2017, 2018, and 2019 cumulative data.

We’re currently 23 days into this season with 264 games played and 39 not played, 17 of those involving the Cardinals. This is not normal. It is usually assumed that most teams and players have around the same playing time, which isn’t the case this season and could affect the way these averages work. Still not sure.

The Today tab currently lists all current Vegas odds with team strengths for each game and soon will have links to rosters and all games played this season so far and to come. Stay tuned as this site develops over the next few months.

The 1984 Houston series was skipped and we’ll resume when the Cubs host the Reds on 8/17. Until then ….

The 2020 MLB Season

Back to the future we travel, to the year 2020, when MLB starts its season tomorrow, July 23, with two night games; the rest of the league starts the next day. The season ends September 27 with only 60 games scheduled for each of the 30 teams.

Normally in April we do a playoff horse race using 3 year splits based on opening day rosters of each team when all rosters are known. Rosters are divided into Starters, Relief, Hitters, and then added to make a Total which gets sorted from highest to lowest.
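As a rough illustration of that ranking step, here is a minimal Python sketch. The team labels and split values are made up, and the names `TeamSplit` and `horse_race` are hypothetical; they are not taken from the actual scripts behind this model.

```python
from typing import List, NamedTuple


class TeamSplit(NamedTuple):
    """Three-year split value for one opening day roster (values are made up)."""
    team: str
    starters: float
    relief: float
    hitters: float

    @property
    def total(self) -> float:
        # Total is simply the three roster groups added together.
        return self.starters + self.relief + self.hitters


def horse_race(splits: List[TeamSplit]) -> List[TeamSplit]:
    """Sort teams from highest Total to lowest."""
    return sorted(splits, key=lambda s: s.total, reverse=True)


# Placeholder teams with invented numbers, only to show the mechanics.
splits = [
    TeamSplit("AAA", starters=5.0, relief=2.0, hitters=8.5),
    TeamSplit("BBB", starters=-1.5, relief=0.5, hitters=3.0),
    TeamSplit("CCC", starters=9.0, relief=-2.0, hitters=4.0),
]

ranking = horse_race(splits)
for s in ranking:
    print(f"{s.team}  starters={s.starters:+.2f}  relief={s.relief:+.2f}  "
          f"hitters={s.hitters:+.2f}  total={s.total:+.2f}")
```

Keeping the three groups separate until the final sum makes it easy to see whether a team's Total comes from pitching or hitting.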

Although most teams in the top third of these lists either make the playoffs or are competitive throughout the season and the bottom third, not so much, there are always exceptions. Cubs and Houston had the two lowest totals at the beginning of 2015 but both teams had young talent with little to no accrued WAA value. The next few years both these teams rose to the top when their young guys started racking up value in three year splits while their teams dominated the league.

This model needs a month of current year data in order to properly rank players. It needs at least 45 days, or 25% of a season, in order to simulate and handicap games. Think of a baseball season like a marathon or the Indy 500. At the start runners and cars are bunched together, and as time goes on the field spreads out to reveal the front runners, the middle of the pack, and those bringing up the rear. Even the win/loss records of most MLB teams are still bunched up closely by mid-May.

Vegas, however, starts handicapping on day one of a season and probably even takes bets on meaningless pre-season games. In order to handicap on day one the simulator for this model would have to depend upon player data from previous seasons. The simulator currently doesn’t work that way but it is something that should be looked into and perhaps a shortened season like this will provide some insight into how that can be done.

Simulation draws data from complete past seasons in which teams were limited to 25-player rosters. This season will have 30-player rosters and other roster shenanigans that may invalidate comparison with team valuations in games from the past. Expanded rosters in September are already a problem for the simulator, for many reasons that we won’t get into here, and figuring that out would still have been a work in progress had this been a normal season. This entire shortened season resembles a normal September with expanded rosters.

A normal baseball season is a 6 month marathon. This 2 month sprint for each team requires different managing strategies. It’s like a marathon race requiring runners to only run 10 miles. Runners will run differently if they know they don’t have to run the entire distance, and baseball players will know they can take a few days off without worrying about IL/DL rules.

In April this model fixes broken scripts from data sources that change their APIs. This will be the first year this model estimates current year rosters which will be a lot more accurate than past years when they were being downloaded from Wikipedia. Wikipedia was a good source but sometimes lagged a few days.

We’ll be getting detailed box scores from the mlb.com API, which will allow for a much more accurate valuation of relief squads than in past seasons. MLB.com switched their API from XML to JSON this season. The code to decipher JSON was being worked on in March when everything in the world went haywire.

Like in past Aprils, after analyzing all opening day rosters there really isn’t much to discuss for the next few weeks, until we get a handle on which players and teams are breaking out this year. Right now I don’t even know what teams are in the Cubs’ division or even how playoff matchups get chosen.

Speaking of the Cubs: while testing the JSON API from mlb.com using a recent Cubs White Sox game, it appeared the Cubs were 8-14 and the White Sox were 12-7 before the Cubs lost that game. When this season starts and we have full rosters we’ll cover both the White Sox and Cubs. We knew the White Sox would be contenders this season but are not sure about the Cubs. We can’t really tell without knowing exactly who is on each team, however, which we’ll know in a few days.

The 1984 Cubs season will continue unabated until the very tragic end late in October. Meanwhile, the Cubs split a 4 game set with the last place Giants, losing money for Cubs bettors, and it’s on to Philadelphia for a three game series with Rick Sutcliffe on the mound tomorrow. Until then …

Update 4/24/2020

Under normal conditions we would be one week away from showing player rankings. Rosters would be available to talk about and Part 1 of playoff horse race would have already been published using 3 year split data.

Unfortunately none of that has happened or most likely will happen this year. It’s difficult to work on this when there isn’t a stream of live data to deal with. The off season projects of rebooting the simulator and moving everything into a formal database are almost complete, although nothing is ever really complete in these kinds of projects.

The White Sox looked to have a very good team this year according to this data model based upon some of their off season moves. Since we don’t have roster data there isn’t any way to measure and show that.

As for the Cubs, I don’t know. Right about now we would be doing the first Cubs status for this season. If they get baseball season started at training camps we’ll start collecting data and doing reports. Until then ….

Simulation Reboot Part 3

In order to test the integrity of the database used in simulation we need to run tests. With inaccurate data or bugs in scripts, the estimated probabilities the simulator produces will be inaccurate. In this part we’ll look at Tier Combo data from three baseball eras (2000-2019, 1980-1999, and 1950-1969) to test the integrity of this data model.

Real team WAA, using real team wins and losses, the only stat in baseball that determines who makes the playoffs, was tiered in Part 2 of this series. There is no dispute over real team WAA but there may be dispute over how this data model calculates it for players. This exercise will demonstrate whether the player WAA and theories espoused by this data model have any merit.

A baseball season is much like any long race, such as a running marathon, the Tour de France, or the Indy 500. Everyone is equal at the start, and as the race proceeds contestants become more and more separated, so the winners, the losers, and the mediocre become more and more defined.

Real team WAA is simply wins – losses. This data model calculates and assigns WAA to players such that the sum of WAA for all players on a team equals that team’s real WAA. In April and much of May, not only are teams more bunched together in real team WAA, so are players, making tiering much more error prone. This model now doesn’t start handicapping until day 60, which is around the third week in May nowadays. This allows the standard deviations for lineups, starters, and relief squads used to calculate tiers to increase, meaning teams are separated enough to somewhat determine who is truly good this season and who is not.
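A minimal sketch of that bookkeeping in Python, assuming made-up player values and hypothetical function names (`team_waa`, `books_balance`). How the model actually divides WAA among players is not shown here; this only checks that the books balance.

```python
import math
from typing import Dict


def team_waa(wins: int, losses: int) -> int:
    """Real team WAA: wins minus losses."""
    return wins - losses


def books_balance(player_waa: Dict[str, float], wins: int, losses: int) -> bool:
    """True when the players' WAA adds up to the team's real WAA."""
    return math.isclose(sum(player_waa.values()), team_waa(wins, losses), abs_tol=1e-6)


# Invented example: a 30-25 team is +5, so its players must sum to +5.
players = {"hitter_a": 2.5, "hitter_b": 0.75, "starter_a": 3.0, "reliever_a": -1.25}
print(team_waa(30, 25), books_balance(players, 30, 25))
```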

Much like marathons or Indy 500s, teams and players often crash and burn by the end of season.  This model quickly adjusts to reflect that.  Stats like batting averages do not.

There are two types of tier combos used in simulation: lineup -> starter and lineup -> relief. Each game contains two pairs: one pair for the away team and one pair for the home team.

Tier combos are calculated by subtracting the pitching component tier number (starter or relief) from the lineup tier number.  Tier numbers are calculated by this simple formula:

Tier Number = 2 * ( WAA – league WAA average ) / league standard deviation

WAA for a lineup is the sum of player WAA for that lineup. The league WAA average and standard deviation are computed over a running window of the 30 teams’ last 3 lineups (90 lineups). A snapshot is taken at the beginning of each day, then averages and tier numbers for each team are calculated.

WAA for a starter relies on a single player. WAA for relief is the sum of a relief squad. Relief squads are estimated from event data and are pretty accurate.

Tier numbers are floating point numbers. When subtracted to make a tier combo, the result gets rounded up or down to an integer. Right now tier numbers have a range of -4 to +4 and tier combos have a range of -6 to +6. The simulator only cares about tier combos.
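The tier number formula can be sketched in a few lines of Python. This is only an illustration: the function name `tier_number` and the snapshot values are made up, and using the population standard deviation (`pstdev`) rather than the sample version is an assumption, since the post does not say which one the model uses.

```python
from statistics import mean, pstdev
from typing import Sequence


def tier_number(waa: float, league_waa: Sequence[float]) -> float:
    """Tier Number = 2 * (WAA - league WAA average) / league standard deviation."""
    league_avg = mean(league_waa)
    league_sd = pstdev(league_waa)  # assumption: population standard deviation
    if league_sd == 0:
        raise ValueError("league WAA values are all identical; tier is undefined")
    return 2 * (waa - league_avg) / league_sd


# Invented snapshot standing in for 90 lineup WAA values (30 teams x last 3 lineups).
snapshot = [-1.0, 1.0] * 45
print(tier_number(2.0, snapshot))
```

The guard for a zero standard deviation matters mostly at the very start of a season, when every team looks the same and tiers mean little anyway.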
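A tier combo could then be computed along these lines. This is a sketch under two assumptions that are not spelled out above: ties at exactly .5 follow Python's built-in `round` (nearest even integer), and differences beyond the stated -6 to +6 range are clamped to it. The name `tier_combo` is hypothetical.

```python
def tier_combo(lineup_tier: float, pitching_tier: float, limit: int = 6) -> int:
    """Lineup tier minus pitching tier (starter or relief), as an integer."""
    raw = lineup_tier - pitching_tier
    combo = round(raw)  # exact .5 ties go to the nearest even integer
    # Assumption: clamp to the stated tier combo range of -limit..+limit.
    return max(-limit, min(limit, combo))


# A strong lineup (tier 4.00) against a weak pitching component (tier -1.97).
print(tier_combo(4.00, -1.97))
# A weak lineup (tier -1.97) against a strong pitching component (tier 3.20).
print(tier_combo(-1.97, 3.20))
```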

The run used to make the below tables looks at all games between 6/1 and 8/31.  Tiers fluctuate too much in April and May and in September player expansion can distort roster value.  Although we may handicap games in September and late May, we’re sticking to a much narrower window for the dataset simulation draws from.
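The date window is easy to express in code. A minimal sketch, assuming game dates are available as `datetime.date` objects; the function name `in_sample_window` and the sample dates are illustrative only.

```python
from datetime import date


def in_sample_window(game_date: date) -> bool:
    """True for games played from 6/1 through 8/31 of any season."""
    return (6, 1) <= (game_date.month, game_date.day) <= (8, 31)


# Invented list of game dates from one season.
games = [date(1984, 5, 31), date(1984, 6, 1), date(1984, 8, 31), date(1984, 9, 24)]
kept = [d for d in games if in_sample_window(d)]
print(kept)
```

Comparing (month, day) tuples keeps the filter independent of the year, so the same check works across every era in the dataset.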

The tables below show all the tier combo sets from -6 to +6, with columns for runs per inning (R/Inn) and outs recorded per game (Outs) for both the lineup -> relief and lineup -> starter combos.

First let’s look at the modern era from 2000-2019 which encompasses around 25,000 baseball games from 6/1 to 8/31.

2000 – 2019 Tier Combos

TC    Lineup -> Relief    Lineup -> Starter
      R/Inn    Outs       R/Inn    Outs
-6    0.359    8.76       0.353    19.86
-5    0.391    8.93       0.372    19.40
-4    0.390    8.86       0.409    18.93
-3    0.415    9.07       0.432    18.47
-2    0.424    9.14       0.449    18.19
-1    0.429    9.09       0.473    17.88
 0    0.442    9.21       0.488    17.71
 1    0.462    9.38       0.512    17.42
 2    0.470    9.30       0.514    17.47
 3    0.488    9.47       0.534    17.26
 4    0.490    9.37       0.546    17.05
 5    0.526    9.62       0.585    16.90
 6    0.561    9.60       0.600    16.87

A Tier Combo of -6 is a terrible lineup facing a very good relief squad or starter. The opposite is true for a Tier Combo of +6. The above shows runs per inning for starters going from 0.353 at TC = -6 to 0.600 at TC = +6, the best lineups vs. the worst starters. Runs per inning increase by almost the same amount with the lineup -> relief combos.

The number of outs for starters goes from 19.86 outs per game, with the best starters facing the worst lineups, down to 16.87 outs for the worst starters facing the best lineups. Divide by 3 to get innings: roughly 6.6 innings down to 5.6. Outs per game for relief do not vary much between -6 and +6, probably because the number of outs a relief staff must pitch has more to do with the starter than with the value of the relief squad.

Relief gives up fewer runs per inning than starters, which should be expected. Tier Combo 0 is even steven between lineups and relief or starter. The starter runs per inning is almost exactly league average for this 20 year span.
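One way to read a row of the 2000 – 2019 table is to turn runs per inning and outs per game into earned runs per game. The sketch below does that for the Tier Combo 0 row. It is just arithmetic on the published averages, with a hypothetical function name, and is not a description of how the simulator itself consumes these numbers.

```python
def earned_runs_per_game(runs_per_inning: float, outs_per_game: float) -> float:
    """Runs per inning times innings per game (outs divided by 3)."""
    return runs_per_inning * outs_per_game / 3.0


# Tier Combo 0 row from the 2000 - 2019 table.
starter = earned_runs_per_game(0.488, 17.71)
relief = earned_runs_per_game(0.442, 9.21)
total_outs = 17.71 + 9.21

print(round(starter, 2), round(relief, 2), round(starter + relief, 2), round(total_outs, 2))
```

That works out to roughly 2.9 earned runs off the starter and 1.4 off relief, about 4.2 per team per game, and the outs add up to 26.92, close to the 27 in a nine inning game.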

All runs counted for pitchers above are earned runs. When determining who wins a baseball game, the commissioner counts unearned runs equally with earned runs. This model counts and tiers unearned runs separately for use in simulation because all runs must be accounted for to make the books balance here. A pitcher should not be blamed for runs that are not his fault, and an official scorekeeper has kept track of that for every play in every game since the beginning of baseball.

The next table will show the 1980 to 1999 era.

1980 – 1999 Tier Combos

TC    Lineup -> Relief    Lineup -> Starter
      R/Inn    Outs       R/Inn    Outs
-6    0.361    8.04       0.355    20.84
-5    0.378    8.40       0.374    20.44
-4    0.392    8.11       0.390    19.99
-3    0.375    8.19       0.409    19.51
-2    0.404    8.10       0.435    19.18
-1    0.416    8.12       0.456    18.87
 0    0.418    8.41       0.466    18.77
 1    0.449    8.27       0.477    18.46
 2    0.465    8.49       0.500    18.37
 3    0.460    8.24       0.503    18.25
 4    0.458    8.67       0.536    17.92
 5    0.504    8.63       0.532    17.92
 6    0.495    8.92       0.553    18.03

The league had 26 teams for most of this era and went to 30 teams in 1998, which means fewer pitchers. A 30 team league will have around 150 starters, a 26 team league around 130. The above shows much narrower differences between -6 and +6 tier combos for both relief and starter, which should be expected because talent is more concentrated.

This can be a problem in simulation that is still a work in progress. As we go back to 1950-1969 we get to 16 team leagues with around half the number of players. Without some kind of adjustment, it may not be possible to pull values from a tier combo in a 24 or 16 team league when we’re handicapping a 30 team league with much higher disparity of talent.

As we go back in time, starters pitch more outs and relief fewer. This means we can’t simply pull a pitcher’s innings pitched and earned runs from an early era and use them directly in simulation either.

Below is a look at the Tier Combo spread from 1950 to 1969.

1950 – 1969 Tier Combos

TC    Lineup -> Relief    Lineup -> Starter
      R/Inn    Outs       R/Inn    Outs
-6    0.320    7.57       0.325    22.06
-5    0.361    6.99       0.344    21.50
-4    0.358    7.14       0.356    21.13
-3    0.365    6.87       0.380    20.48
-2    0.385    7.28       0.389    20.19
-1    0.414    7.22       0.399    19.97
 0    0.416    7.23       0.416    19.78
 1    0.433    7.69       0.439    19.35
 2    0.425    7.72       0.443    19.21
 3    0.478    7.84       0.458    19.04
 4    0.493    7.92       0.479    19.03
 5    0.485    8.41       0.504    18.35
 6    0.560    8.63       0.502    18.65

The above are averages. The percentage of games in which starters pitched all 9 innings is almost an order of magnitude (10x) higher than in modern era baseball. Runs/inning are even more constricted with mostly 16 team leagues.

In past years this data model pulled data from 1970 – present without any alteration.  This probably introduced error even though it beat Vegas albeit not by enough to advertise.

Adjustments will have to be made on an era by era basis. There is too much variation to come up with factoring coefficients on a yearly basis. The eras shown above were thrown together arbitrarily to fit with the logistics of rebuilding this database. Right now I’m thinking 1920-1960, 1961-1976, 1977-1997, 1998-2019.

The biggest factor in narrowing Tier Combo results is the number of players in a league, which is directly related to the number of teams. 1961 – 1976 went from 20 teams to 24. The next era went from 26 to 28, and our modern era since 1998 has been at 30 teams.

The number of innings starters pitched has also declined a lot in recent years but that’s fodder for another post.

Looks like baseball season might be cancelled <insert sad emoji>. This model was going to get detailed box scores from mlb.com this season, which would have made regular season handicapping much more interesting, as roster value, especially for relief, would be far more accurate than in past seasons. Unfortunately we may have to wait until next year.

Still working on this simulation and the baseball-handbook.com website, which will allow easy click through for any team, any player since 1900, and any game since 1920. Until then ….