Justin Smoak and BABIP Trends

The podcast I listen to said something like, “I bet Justin Smoak must have one of the lowest BABIPs of anyone to this point in his career.” I thought it might be a good chance to explore this nebulous BABIP a little, what are the historical league and career trends in BABIP. Maybe 3 sentences are needed.

Or maybe a second paragraph. Or 2.

library(Lahman)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
library(hrbrthemes)

BABIP is calculated as balls in play (hits minus home runs) divided by total at bats that caused these balls in play (at bats minus strikeouts minus home runs plus sacrifice flies). Note that walks aren’t included because they don’t count against AB and the ball wasn’t put in play.

\[ BABP = {{H - HR} \over AB - K - HR + SF}\]

As part of the exploration I discovered that there were some missing categories for years before 1954. I would assume these were sac flies since the others seem pretty straightforward.

data("Batting")
Batting %>% group_by(yearID) %>% summarize(sum(H), sum(HR), sum(AB), sum(SO), sum(SF)) %>% 
  filter(yearID > 1948 & yearID < 1958)
## # A tibble: 9 × 6
##   yearID `sum(H)` `sum(HR)` `sum(AB)` `sum(SO)` `sum(SF)`
##    <int>    <int>     <int>     <int>     <int>     <int>
## 1   1949    22168      1704     84380      8951        NA
## 2   1950    22559      2073     84823      9554        NA
## 3   1951    22191      1863     85065      9335        NA
## 4   1952    21272      1701     84195     10380        NA
## 5   1953    22459      2076     84997     10213        NA
## 6   1954    21908      1937     83936     10215       795
## 7   1955    21610      2224     83590     10803       698
## 8   1956    21653      2294     83856     11505       644
## 9   1957    21865      2202     84906     11948       687

I charted the league BABIP for all years since 1954 and came up with the figure below. It seems that BABIP has been going up a little and has recently been around .295-.300.

Batting %>% group_by(yearID) %>% 
  # I think I needed all these na.rm because of pitchers who didn't have any ABs their whole career
  summarize(BABIP = (sum(H, na.rm = T) - sum(HR, na.rm = T))/
              (sum(AB, na.rm = T) - sum(SO, na.rm = T) - sum(HR, na.rm = T) + sum(SF, na.rm = T))) %>% 
  filter(yearID>1953) %>% 
  ggplot(aes(x = yearID, y = BABIP)) + geom_point() + geom_smooth(method = "loess") +
  labs(title = "Batting Average on Balls in Play, 1953-2015",
       x = "Year",
       y = "BABIP") +
  theme_ipsum()

Let’s break down the announcer’s statement. What is Smoak’s BABIP? First I determined his playerID then calculated his games played and his BABIP.

data("Master")
Master %>% filter(nameLast == "Smoak") %>% select(playerID, birthYear, birthMonth, birthDay, nameFirst, nameLast)
##    playerID birthYear birthMonth birthDay nameFirst nameLast
## 1 smoakju01      1986         12        5    Justin    Smoak

Here are the by year BABIP numbers.

Batting %>% filter(playerID == "smoakju01") %>% 
  mutate(BABIP = (H - HR) / (AB - HR - SO + SF)) %>%
  select(yearID, AB, BABIP)
##   yearID  AB     BABIP
## 1   2010 235 0.2383721
## 2   2010 113 0.2933333
## 3   2011 427 0.2733119
## 4   2012 483 0.2422535
## 5   2013 454 0.2784810
## 6   2014 248 0.2429379
## 7   2015 296 0.2538860

And the total career to 2015.

Batting %>% filter(playerID == "smoakju01") %>%
  select(H, HR, AB, SO, SF) %>% summarize_all(sum) %>%   
  mutate(BABIP = (H - HR) / (AB - HR - SO + SF))
##     H HR   AB  SO SF     BABIP
## 1 506 92 2256 578 13 0.2589118

This is quite a bit lower than the BABIP in those years (again around .300). Smoak is a first baseman but not a particularly portly one, so I’m not sure that it should be that low.

Smoak has played for about 6 years. How do I filter players from around the same time who have played at least 6 years and calculate their BABIP to that first 6 years? Or maybe it might be easier by cumulative AB.

Batting %>% filter(yearID > 1953) %>%
  group_by(playerID) %>%
  mutate(totalAB = sum(AB), cumAB = cumsum(AB)) %>%
  filter(totalAB >= 2256 & cumAB <3000) %>% select(playerID, yearID, H, AB, HR, SO, SF, totalAB, cumAB) %>%
  group_by(playerID) %>% summarize(totH = sum(H), totAB = sum(AB), totHR = sum(HR), totSO = sum(SO), totSF = sum(SF)) %>% 
  mutate(BABIP = (totH - totHR) / (totAB - totHR - totSO + totSF)) %>% arrange(BABIP)
## # A tibble: 1,302 × 7
##     playerID  totH totAB totHR totSO totSF     BABIP
##        <chr> <int> <int> <int> <int> <int>     <dbl>
## 1  crandde01   636  2542   117   266    24 0.2377462
## 2  hernaen01   522  2327     2   151     8 0.2383135
## 3  grosswa01   676  2906   110   448    24 0.2386172
## 4  duncada01   617  2885   109   677    18 0.2399622
## 5   kingji01   699  2918   117   401    22 0.2402973
## 6  nettlgr01   623  2560   105   326    15 0.2416045
## 7  blefacu01   699  2947   112   444    25 0.2429636
## 8  martibu01   618  2743    58   419    35 0.2433724
## 9  mcgwima01   647  2656   178   592    37 0.2438898
## 10 brinked01   551  2644    28   490    14 0.2443925
## # ... with 1,292 more rows
# %>%
#   filter(playerID == "aaronha01")

Other data: BABIP by position BABIP by year of career BABIP by age