More Useless Stats

April is a dead zone for this data model, as player stats cannot be accurately compiled until May and team stats until around mid-April.  This is because there are huge fluctuations throughout the league, rendering deceptive results.  That doesn’t stop certain Cubs announcers from rattling off meaningless team slash lines that make no sense (Hi JD!), but whatever.  He just reads what some stat heads write on a cue card, thinking it adds value to the color commentary.  It doesn’t.  But I digress ….

In the meantime let’s bide some time and waste it on even more useless stats.  The other day at the local pub a person who played at a pretty high level of baseball mentioned an interesting theory: he said pitchers throw more strikes on the first pitch, hoping the batter will be taking.  Is this true?

Since we have event data from retrosheet.org that shows pitch sequences back to 1988, this is something that can be either proven or disproven.  First, a pitch-counting script needed to be written.  In order to not get too crazy, only years 2015 – 2018 were processed, which should be enough.
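The counting step could look something like the sketch below.  This is not the actual script used here, just a minimal illustration.  It assumes each plate appearance comes with a Retrosheet-style pitch sequence string, and that the characters in NON_PITCH (pickoff throws, runner-going markers and the like) are the ones to skip.  That set and the function names are assumptions, so check them against the Retrosheet event file documentation before trusting any counts.

```python
# Characters in a pitch-sequence string that are assumed NOT to be pitches
# (pickoff throws, runner-going markers, no-pitch entries and so on).
# This set is an assumption for illustration; verify it against the
# Retrosheet event file documentation.
NON_PITCH = set("+*.>123N")


def pitch_codes(sequence):
    """Return only the characters that represent actual pitches."""
    return [ch for ch in sequence if ch not in NON_PITCH]


def count_pitches(sequence):
    """Number of pitches thrown in one plate appearance."""
    return len(pitch_codes(sequence))


def average_pitches(sequences):
    """Average pitch count over a list of plate appearances."""
    counts = [count_pitches(s) for s in sequences]
    return sum(counts) / len(counts) if counts else 0.0


if __name__ == "__main__":
    # Made-up sequences, not real event data.
    sample = ["CB1>S.FX", "BX", "CSFS"]
    print(average_pitches(sample))
```

Splitting the sequences by how the plate appearance ended (strikeout, out, walk, hit) before averaging would give the per-event averages.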

My first question when writing this script was what the average pitch count per batter is.  This comes to almost exactly 4.  Next I made these calculations:

  • Average # pitches / STRIKEOUT = 4.9
  • Average # pitches / OUT = 3.5
  • Average # pitches / WALK = 5.9
  • Average # pitches / HIT = 3.5

Whether the batter makes an OUT or gets a HIT, the average is the same at 3.5 pitches / batter.  Pitchers who like to throw strikeouts add another 1.4 pitches / batter to their pitch count, and that goes up another whole pitch if they walk a batter.  Since many Sabermetric stats demand pitchers throw strikeouts, this can spike pitch counts for no reason other than a pitcher needing to game FIP for his next contract or help his DraftKings teams win.

More Fun With Numbers

The following table shows 4 event types and how often they occur, based upon pitch number and type of pitch.  Row 1 B means the first pitch was a Ball, and the columns show how that plate appearance ended.  C means Called strike, F means Foul ball, S means Swinging strike.  The values across each row must add to 1.  More explanation below the fold.

Pitch Type STRIKEOUT OUT WALK HIT
1 B 0.179 0.447 0.141 0.233
1 C 0.293 0.444 0.049 0.214
1 F 0.295 0.436 0.046 0.222
1 S 0.362 0.389 0.052 0.196
2 B 0.215 0.410 0.164 0.211
2 C 0.339 0.395 0.070 0.195
2 F 0.358 0.392 0.054 0.196
2 S 0.447 0.338 0.053 0.162
3 B 0.266 0.351 0.203 0.179
3 C 0.396 0.335 0.106 0.163
3 F 0.378 0.368 0.069 0.185
3 S 0.624 0.214 0.059 0.104
4 B 0.271 0.299 0.277 0.153
4 C 0.408 0.249 0.215 0.128
4 F 0.378 0.358 0.086 0.177
4 S 0.790 0.107 0.050 0.053

The above shows the first 4 pitches, which is almost exactly the league average pitch count per batter.  How it came to exactly 4 is as fascinating as how Hits/2 almost exactly equals runs scored.

The second pitch above does not care what happened on the first pitch.  Ditto for pitches 3 and 4.  You would need to do some conditional probability to figure out anything in more detail, and it’s probably not worthwhile.
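For anyone curious what the difference looks like, here is a rough sketch of both tallies.  It assumes each plate appearance has already been boiled down to a pair of (pitch codes, outcome).  That data layout, the function names, and the sample rows are all made up for illustration and are not how the real script is written.  The first function ignores earlier pitches like the table does; the second conditions on the whole sequence so far.

```python
from collections import Counter, defaultdict

OUTCOMES = ("STRIKEOUT", "OUT", "WALK", "HIT")


def outcome_table(plate_appearances, max_pitch=4):
    """Outcome shares keyed by (pitch number, pitch code), ignoring earlier pitches."""
    counts = defaultdict(Counter)
    for pitches, outcome in plate_appearances:
        if outcome not in OUTCOMES:
            continue  # skip anything outside the four event types
        for number, code in enumerate(pitches[:max_pitch], start=1):
            if code in "BCFS":
                counts[(number, code)][outcome] += 1
    return {
        key: {o: row[o] / sum(row.values()) for o in OUTCOMES}
        for key, row in counts.items()
    }


def outcomes_given_prefix(plate_appearances, prefix):
    """Outcome shares for plate appearances that START with the given pitches."""
    tally = Counter(
        outcome
        for pitches, outcome in plate_appearances
        if outcome in OUTCOMES and pitches.startswith(prefix)
    )
    total = sum(tally.values())
    return {o: tally[o] / total for o in OUTCOMES} if total else {}


if __name__ == "__main__":
    # Hypothetical plate appearances, not real Retrosheet data.
    sample = [("BCX", "HIT"), ("BBBB", "WALK"), ("CSS", "STRIKEOUT"), ("BX", "OUT")]
    print(outcome_table(sample)[(1, "B")])
    print(outcomes_given_prefix(sample, "BB"))
```

The prefix version splits the data into far more buckets, so each bucket has fewer plate appearances behind it, which is part of why it may not be worth the trouble.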

This post last April showed that the MLB Batting Average was 0.255 for all batters between 2010 and 2017.  The above Hit % is not a batting average, as it uses Plate Appearances (PA) instead of At Bats (AB) as the divisor.  For this exercise, using PA is a more accurate and less confusing measure.

Scanning this table you’ll see both Hits and Walks are most likely when a Ball is thrown which seems intuitively obvious.  Not sure how useful any of the above data is other than gaining an advantage on a bunch of friends at a game who like to bet on every pitch and batter.
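To make the PA versus AB point concrete, here is a tiny sketch.  The season line is made up, and lumping hit-by-pitch and sacrifices into a single "other" number is a simplification for illustration only.

```python
def batting_average(hits, at_bats):
    """Traditional batting average: hits divided by at bats."""
    return hits / at_bats


def hits_per_pa(hits, plate_appearances):
    """Hit share as used in the table: hits divided by plate appearances."""
    return hits / plate_appearances


if __name__ == "__main__":
    # Hypothetical season line, not a real player.
    hits, at_bats, walks, other = 150, 600, 55, 5  # other = HBP, sacrifices
    plate_appearances = at_bats + walks + other
    print(round(batting_average(hits, at_bats), 3))
    print(round(hits_per_pa(hits, plate_appearances), 3))
```

Because PA counts walks and the like while AB does not, the per-PA hit share will always come out at or below the batting average for the same batter.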

The last table is the crux of this entire study.  Aside from Pitch number there are 5 categories of things that can happen listed:

  • SWING – batter swings and misses
  • CALLED – called strike
  • FOUL – batter makes contact and hits a foul ball
  • BALL – ball
  • CONTACT – batter puts ball in play (out or hit)

Swing, Foul, and Contact are unknowns because we don’t know whether the ball was in the strike zone when that happened.  We know a Called strike was in the strike zone and a called Ball was not.  This table shows pitches 1 through 4.  The values across each row must add to 1.

Pitch Swing Called Foul Ball Contact
1 0.066 0.321 0.104 0.397 0.112
2 0.107 0.165 0.167 0.386 0.174
3 0.120 0.118 0.192 0.375 0.195
4 0.124 0.111 0.206 0.352 0.207

If a batter lays back and does nothing, the above suggests it’s more likely the pitch will be a ball instead of a called strike.  Called strikes on the first pitch (0.321) are almost double the rate on the second pitch (0.165) and nearly triple that of pitches 3 and 4, so my friend does have a point.  Pitchers do throw more accurately on the first pitch compared to all the others.

Clarification 4/16/2019:  The above statement is wrong.  Batters may tend to lay back on the first pitch, which is why Called strikes are so high.  A high percentage of Swings and Fouls would have been called strikes had the batter not swung.  What that percentage is we can’t tell from event data.  The radar guys keeping track of every pitch thrown would know.  The split between pitches in and out of the strike zone could be estimated by estimating this probability.  It appears Called and Balls would be somewhat equal according to the above table.
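Here is a rough sketch of that kind of estimate.  The shares are copied from the table above.  The probability that a swing, foul, or ball in play was on a pitch in the strike zone (p_in_zone) is a pure guess that event data cannot supply, so the values tried below are placeholders and not measurements.

```python
# Pitch-type shares for pitches 1 through 4, copied from the table above.
PITCH_SHARES = {
    1: {"swing": 0.066, "called": 0.321, "foul": 0.104, "ball": 0.397, "contact": 0.112},
    2: {"swing": 0.107, "called": 0.165, "foul": 0.167, "ball": 0.386, "contact": 0.174},
    3: {"swing": 0.120, "called": 0.118, "foul": 0.192, "ball": 0.375, "contact": 0.195},
    4: {"swing": 0.124, "called": 0.111, "foul": 0.206, "ball": 0.352, "contact": 0.207},
}


def zone_split(row, p_in_zone):
    """Estimate (in-zone, out-of-zone) shares for one pitch number.

    p_in_zone is the ASSUMED chance that a swing, foul, or ball in play
    came on a pitch that was in the strike zone.
    """
    unknown = row["swing"] + row["foul"] + row["contact"]
    in_zone = row["called"] + p_in_zone * unknown
    out_zone = row["ball"] + (1 - p_in_zone) * unknown
    return in_zone, out_zone


if __name__ == "__main__":
    for p in (0.5, 0.7):  # placeholder guesses, not measured values
        for pitch, row in PITCH_SHARES.items():
            in_zone, out_zone = zone_split(row, p)
            print(f"p={p} pitch {pitch}: in zone {in_zone:.3f}, out of zone {out_zone:.3f}")
```

Running it with a few different guesses shows how much the first-pitch conclusion depends on a number only pitch-tracking data can provide.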

That is all for now.  Cubs are having a tumultuous April.  We can do a brief CHN team status with no player rankings in perhaps 7 – 10 days.  Hopefully things settle down for them.  This model had CHN with the highest Total value in MLB based upon 2016, 2017, and 2018 splits.  This should be expected since the Cubs have had an incredible run of winning these last three years and most of the players who racked up those wins are still on the team.

As always, past results do not affect future results, and we really witnessed that these last 10 games, even though they won their home opener today 10-0.  No one can predict the future, and anyone claiming they did with respect to the Cubs is lying.  The only thing this model can do is provide an accurate view of the past.  Most other systems can’t even do that.

Starting in May, when 1/6 of the season is in the books and many players have more than 100 PAs, we can use current year data to estimate a handicap for single upcoming games.  We can’t estimate what will happen for the next 133 games because that is impossible.  Things change daily and weekly and, as we have shown here over and over, stats like BA, OPS, WAR, etc. etc. do not react to changes as quickly as this model does.  That’s our advantage, which is why we beat ELO by over 10% the last two seasons, and Vegas by 10% in 2018 but only 2.5% in 2017 (this is currently being looked into).  More on this later.  Until then ….