The local newspaper released a dataset of high school all stars from 1972 to 2017, and I thought it might be interesting to analyze the information. Some interesting questions were what schools have the most all stars, at what positions, and how this has changed over the years. I suspect that Saint Louis, Kahuku, and Waianae will be the top teams but who knows? Maybe there’s a secret dynasty that I never heard of.
First I loaded the packages I would need.
library(googlesheets)
suppressPackageStartupMessages(library(dplyr))
library(ggplot2)
I loaded the dataset from my Google Drive. I had a little issue with an error message but it turned out that there was a zombie R instance using the port and a quick kill
was all that was needed. Method for that came from this post.
my_sheets <- gs_ls()
fb <- gs_title("HI_football_allstars")
## Sheet successfully identified: "HI_football_allstars"
x <- gs_read(fb)
## Accessing worksheet titled 'Sheet1'.
## Parsed with column specification:
## cols(
## year = col_integer(),
## position = col_character(),
## first = col_character(),
## last = col_character(),
## team = col_character(),
## poy = col_character(),
## paper = col_character()
## )
First I looked at the variables and made the positions and paper names into factors.
glimpse(x)
## Observations: 2,014
## Variables: 7
## $ year <int> 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017,...
## $ position <chr> "RB", "QB", "REC", "REC", "REC", "RB", "OL", "OL", "O...
## $ first <chr> "Enoch", "Chevan", "Mitchell", "Jonah", "Scott", "Kah...
## $ last <chr> "Nawahine", "Cordeiro", "Quinn", "Panoke", "McLeod", ...
## $ team <chr> "Kahuku", "Saint Louis", "Saint Louis", "Saint Louis"...
## $ poy <chr> NA, "OPOY", NA, NA, NA, NA, NA, NA, NA, NA, "COY", NA...
## $ paper <chr> "HSA", "HSA", "HSA", "HSA", "HSA", "HSA", "HSA", "HSA...
x$position <- factor(x$position)
x$paper <- factor(x$paper)
Historically there were two newspapers in Honolulu, the Honolulu Star-Bulletin and the Honolulu Advertiser. These newspapers published separate all star lists until they merged in early 2010 and became the Honolulu Star-Advertiser.
I calculated the size of the all star teams over the years.
# players over years
x %>% mutate(one = 1) %>% group_by(year, paper) %>% summarize(nplayers = sum(one)) %>%
ggplot(aes(x = year, y = nplayers, color = paper)) + geom_point()
## Warning: Removed 1 rows containing missing values (geom_point).
Looking at the data there’s one year that’s an outlier (and one NA observation). The 1975 list is odd in that there are 12 defensive linemen and 7 linebackers chosen for the team. I’d guess that means they published first and second team players in this list, but it’s not clear who would have been first team and who second team.
x %>% filter(year == 1975)
## # A tibble: 54 x 7
## year position first last team poy paper
## <int> <fct> <chr> <chr> <chr> <chr> <fct>
## 1 1975 coach Cal Chai Kamehameha <NA> HA
## 2 1975 E Casey Kunitomo Kamehameha <NA> HA
## 3 1975 E Jim Gesser Kaiser <NA> HA
## 4 1975 E Ben Apuna Waipahu <NA> HA
## 5 1975 E Derek Yamada Iolani <NA> HA
## 6 1975 E Aaron Lorenzo Kamehameha <NA> HA
## 7 1975 E Charles Pico St. Anthony <NA> HA
## 8 1975 E David Corbelli Hilo <NA> HA
## 9 1975 T Dayle Turner Kamehameha <NA> HA
## 10 1975 T Robert Hema Waianae <NA> HA
## # ... with 44 more rows
x %>% filter(year == 1975) %>% group_by(position) %>% summarize(n())
## # A tibble: 10 x 2
## position `n()`
## <fct> <int>
## 1 C 2
## 2 coach 1
## 3 DB 6
## 4 DL 12
## 5 E 7
## 6 G 4
## 7 LB 7
## 8 QB 2
## 9 RB 8
## 10 T 5
The NA year is just one row of all NA entries so I’ll just go ahead and drop it.
x %>% filter(is.na(year))
## # A tibble: 1 x 7
## year position first last team poy paper
## <int> <fct> <chr> <chr> <chr> <chr> <fct>
## 1 NA <NA> <NA> <NA> <NA> <NA> HSA
which(is.na(x$year))
## [1] 28
x <- x[-which(is.na(x$year)),]
Which School Has the Most All Stars
For a first pass I looked at the raw number of times each school was listed. This list included duplicates, for example when Mark Tuinei was selected as an all star by both the Star Bulletin and the Advertiser in 1977.
# school with most players (not deduplicated)
x %>% group_by(team) %>%
summarize(nplayers = n()) %>%
arrange(desc(nplayers))
## # A tibble: 45 x 2
## team nplayers
## <chr> <int>
## 1 Saint Louis 422
## 2 Kahuku 278
## 3 Kamehameha 174
## 4 Waianae 149
## 5 Punahou 133
## 6 Farrington 119
## 7 Leilehua 93
## 8 Iolani 89
## 9 Pac-Five 63
## 10 Castle 47
## # ... with 35 more rows
After accounting for the duplicated players, Saint Louis had the most first team players, followed by Kahuku. The third place team (Kamehameha), had fewer than half the number that Saint Louis had.
x %>% mutate(name = paste(first, last)) %>%
group_by(year) %>%
distinct(name, .keep_all = T) %>%
group_by(team) %>%
summarize(nplayers = n()) %>%
arrange(desc(nplayers))
## # A tibble: 45 x 2
## team nplayers
## <chr> <int>
## 1 Saint Louis 281
## 2 Kahuku 197
## 3 Kamehameha 129
## 4 Punahou 108
## 5 Waianae 104
## 6 Farrington 89
## 7 Iolani 71
## 8 Leilehua 69
## 9 Pac-Five 46
## 10 Mililani 41
## # ... with 35 more rows
Maybe a better way to look at it would be an ordered bar chart. Help for the code came from here.
x %>% mutate(name = paste(first, last)) %>%
group_by(year) %>%
distinct(name, .keep_all = T) %>%
group_by(team) %>%
summarize(nplayers = n()) %>%
arrange(-nplayers) %>%
mutate(team = factor(team, team)) %>% # to sort the teams
filter(nplayers > 4) %>%
ggplot(aes(x = team, y = nplayers)) +
geom_bar(stat = "identity") +
# coord_flip() +
labs(title = "Football Teams with the Most All State Players",
subtitle = "1972-2017, minimum 5 all state players per team",
y = "players") +
theme(axis.text.x = element_text(angle=45, hjust = 1))
x %>% mutate(name = paste(first, last)) %>%
group_by(year) %>%
distinct(name, .keep_all = T) %>%
group_by(team) %>%
summarize(nplayers = n()) %>%
ggplot(aes(x=nplayers)) +
geom_histogram(binwidth = 10) +
ggtitle("Distribution of Football All Stars by School, 1972-2017") +
labs(x = "Number of All Stars", y = "Number of Teams")
How Have Team Strengths Changed over Time?
Some schools were powerhouses back in the day but have not been producing all stars lately. Examples include Pac Five and Kaiser. I thought it would be neat to plot the number of all stars per year by team.
# Get rank order of teams
team_rank <- x %>% mutate(name = paste(first, last)) %>%
group_by(year) %>%
distinct(name, .keep_all = T) %>%
group_by(team) %>%
summarize(nplayers = n()) %>%
arrange(-nplayers) %>%
mutate(team = factor(team, team))
# Find teams with at least 10 all stars
min_ten <- x %>% mutate(name = paste(first, last)) %>%
group_by(year) %>%
distinct(name, .keep_all = T) %>%
group_by(team) %>%
summarize(nplayers = n()) %>%
arrange(-nplayers) %>%
mutate(team = factor(team, team)) %>% # to sort the teams
filter(nplayers > 9)
# Create graph
x %>%
# Deduplicate players
mutate(name = paste(first, last), one = 1) %>%
group_by(year) %>%
distinct(name, .keep_all = T) %>%
# Create running count of all-star players by team
arrange(year) %>%
group_by(team) %>%
mutate(cumulative = cumsum(one)) %>%
# Group by team, sort, and filter
group_by(year, team) %>%
summarize(nplayers = max(cumulative)) %>%
arrange(-nplayers) %>%
mutate(team = factor(team, levels(team_rank$team))) %>% # to sort the teams
filter(team %in% min_ten$team) %>% # limit to only teams with > 10 all stars
# Create graph
ggplot(aes(x = year, y = nplayers, color = team)) +
geom_line() +
labs(title = "Cumulative Football All Stars per School",
subtitle = "1972-2017, minimum 10 players",
y = "Number of All Stars")
It turned out that Saint Louis has been dominant for decades but Kahuku has come on strong lately. Kahuku’s rise has correlated with a slight downturn in Kamehameha and Waianae. It’s possible that Kamehameha used to get Kahuku players to come to their school but now they choose to stay home. On the other hand it doesn’t seem likely that Waianae’s players moved to Kahuku to play. Future steps might be looking at the early decades since they’re all crushed together there at the bottom of the chart.
Position U
I hypothesized that some schools might have more all stars at certain positions than other schools. Here’s the list for quarterback.
x %>%
mutate(name = paste(first, last), one = 1) %>%
group_by(year) %>%
distinct(name, .keep_all = T) %>%
group_by(position, team) %>%
summarize(nplayers = n()) %>%
filter(position == "QB") %>%
arrange(-nplayers)
## # A tibble: 13 x 3
## # Groups: position [1]
## position team nplayers
## <fct> <chr> <int>
## 1 QB Saint Louis 19
## 2 QB Leilehua 5
## 3 QB Pac-Five 5
## 4 QB Punahou 5
## 5 QB Iolani 4
## 6 QB Kahuku 3
## 7 QB Waianae 3
## 8 QB Mililani 2
## 9 QB Farrington 1
## 10 QB Kaiser 1
## 11 QB Kamehameha 1
## 12 QB Lahainaluna 1
## 13 QB McKinley 1
Running back was surprisingly different. Farrington came out on top here.
x %>%
mutate(name = paste(first, last), one = 1) %>%
group_by(year) %>%
distinct(name, .keep_all = T) %>%
group_by(position, team) %>%
summarize(nplayers = n()) %>%
filter(position == "RB") %>%
arrange(-nplayers)
## # A tibble: 21 x 3
## # Groups: position [1]
## position team nplayers
## <fct> <chr> <int>
## 1 RB Farrington 16
## 2 RB Kahuku 15
## 3 RB Kamehameha 11
## 4 RB Saint Louis 9
## 5 RB Punahou 7
## 6 RB Waianae 5
## 7 RB Damien 4
## 8 RB Leilehua 4
## 9 RB Mililani 4
## 10 RB Iolani 3
## # ... with 11 more rows
Other positions were pretty much dominated by Saint Louis, just like the regular rankings.
x %>%
mutate(name = paste(first, last), one = 1) %>%
group_by(year) %>%
distinct(name, .keep_all = T) %>%
group_by(position, team) %>%
summarize(nplayers = n()) %>%
filter(position == "LB") %>%
arrange(-nplayers)
## # A tibble: 29 x 3
## # Groups: position [1]
## position team nplayers
## <fct> <chr> <int>
## 1 LB Saint Louis 39
## 2 LB Kahuku 28
## 3 LB Waianae 17
## 4 LB Kamehameha 15
## 5 LB Punahou 14
## 6 LB Farrington 12
## 7 LB Castle 6
## 8 LB Iolani 6
## 9 LB Leilehua 6
## 10 LB Damien 5
## # ... with 19 more rows
What about player of the year? This was pretty messy in how it was coded. Still there were enough OPOY and DPOY that it might be worth looking at.
table(x$poy)
##
## Back of the year Back of the Year
## 3 3
## co dpoy co opoy
## 2 3
## co-opoy coach of the year
## 1 1
## cOY COY
## 1 4
## D POY Defensive player of the year
## 2 1
## Defensive Player of the Year dpoy
## 13 3
## DPOY Lineman of the year
## 38 6
## Lineman of the Year O POY
## 2 1
## Offensive Player of the Year Offensive POY
## 1 1
## opoy OPOY
## 1 57
## Player of the Year So
## 1 1
## soph
## 2
```
x %>%
mutate(name = paste(first, last), one = 1) %>%
group_by(year) %>%
distinct(name, .keep_all = T) %>%
filter(!is.na(poy)) %>%
filter(poy == "DPOY") %>%
group_by(poy, team) %>%
summarize(nplayers = n()) %>%
arrange(-nplayers)
## # A tibble: 8 x 3
## # Groups: poy [1]
## poy team nplayers
## <chr> <chr> <int>
## 1 DPOY Kahuku 8
## 2 DPOY Waianae 4
## 3 DPOY Leilehua 3
## 4 DPOY Kamehameha 2
## 5 DPOY Mililani 2
## 6 DPOY Saint Louis 2
## 7 DPOY Campbell 1
## 8 DPOY Punahou 1
x %>%
mutate(name = paste(first, last), one = 1) %>%
group_by(year) %>%
distinct(name, .keep_all = T) %>%
filter(!is.na(poy)) %>%
filter(poy == "OPOY") %>%
group_by(poy, team) %>%
summarize(nplayers = n()) %>%
arrange(-nplayers)
## # A tibble: 11 x 3
## # Groups: poy [1]
## poy team nplayers
## <chr> <chr> <int>
## 1 OPOY Saint Louis 11
## 2 OPOY Kahuku 5
## 3 OPOY Leilehua 4
## 4 OPOY Punahou 4
## 5 OPOY Iolani 2
## 6 OPOY Kamehameha 2
## 7 OPOY Mililani 2
## 8 OPOY Waianae 2
## 9 OPOY Damien 1
## 10 OPOY Farrington 1
## 11 OPOY Kaiser 1