New COVID-19 School Indicators

The CDC recently updated its school opening indicators. It’s really just a judgment call on how much transmission is happening in the communities. Since the data presented on the Hawaii DOH website isn’t easy to compare directly to the CDC numbers, I calculated the numbers myself.

I downloaded the data from the “How is Hawaii Doing at Flattening the Epidemic Curve?” Tableau figure on the Hawaii DOH COVID-19 data site. It has daily counts by county and the positive and total tests.

I like how this dataset requires pretty minimal processing. I did relabel the variables in Excel just to not have to deal with variables that have spaces in their names.

library(tidyverse)

## ── Attaching packages ────────────────────────────── tidyverse 1.3.0 ──

## ✓ ggplot2 3.3.2     ✓ purrr   0.3.4
## ✓ tibble  3.0.3     ✓ dplyr   1.0.2
## ✓ tidyr   1.1.2     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.5.0

## ── Conflicts ───────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

library(lubridate)

## 
## Attaching package: 'lubridate'

## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union

dat <- read_csv("../datasets/hawaiicovid20210221.csv")

## Parsed with column specification:
## cols(
##   County = col_character(),
##   Date = col_character(),
##   NewCases = col_double(),
##   NewPositiveTests = col_double(),
##   TotalTestEncounters = col_double()
## )

dat

## # A tibble: 1,771 x 5
##    County Date    NewCases NewPositiveTests TotalTestEncounters
##    <chr>  <chr>      <dbl>            <dbl>               <dbl>
##  1 Hawaii 2/15/20        0                0                   0
##  2 Hawaii 2/28/20        0                0                   0
##  3 Hawaii 3/2/20         0                0                   0
##  4 Hawaii 3/3/20         0                0                   1
##  5 Hawaii 3/6/20         0                0                   0
##  6 Hawaii 3/7/20         0                0                   0
##  7 Hawaii 3/8/20         0                0                   0
##  8 Hawaii 3/9/20         0                0                   0
##  9 Hawaii 3/10/20        0                0                   0
## 10 Hawaii 3/11/20        0                0                   0
## # … with 1,761 more rows

dat$County <- factor(dat$County)
dat$Date <- mdy(dat$Date)

I added in the 2019 census estimates for county population size, with a Missing category since there was one in the dataset.

county_pops <- data.frame(County = c("Hawaii", "Honolulu", "Kauai", "Maui", "Missing"), 
                          pops = c(201513, 974563, 72293, 167417, NA))

Ok, now for the calculation. I first added county population to the dataset. Then I grouped the data by county, calculated a cumulative sum and the cumulative sum from 7 days before. I took the difference to get the 7 day total cases. Then I divided that sum by the population of the county, and multipled by 100,000 to get the CDC metric. To make it easy to see the trends, I plotted the data and added horizontal lines at each threshold value.

dat %>% left_join(county_pops, by = "County") %>%
  group_by(County) %>%
  arrange(Date) %>% 
  mutate(total_cases = cumsum(NewCases)) %>%
  mutate(cases_7d_ago = lag(total_cases, 7, default = 0)) %>%
  mutate(cases_7d_sum_per100k = (total_cases-cases_7d_ago)/pops*100000) %>%
  filter(Date > "2020-12-31") %>%
  ggplot(aes(x = Date, y = cases_7d_sum_per100k, color = County)) +
  geom_line() +
  labs(title = "Total Cases Per 100,000 in Last 7 Days", 
       subtitle = "CDC Indicators to Inform Decision Making") +
  ylab("Total cases over last 7 days per 100,000 population") +
  geom_hline(yintercept = 9.5, linetype = "dotted", color = "blue") +
  geom_hline(yintercept = 49.5, linetype = "dotted", color = "yellow") +
  geom_hline(yintercept = 99.5, linetype = "dotted", color = "orange")

## Warning: Removed 49 row(s) containing missing values (geom_path).

I also needed the test percentage positive over last 7 days. That one is basically the same method as the one before but instead of just calculating the cumulative sum of the tests, I had to do the positive and total test sums. Then I got the 7 day prior cumulative sum for both and calculated the difference. I divided the 7 day total positives by 7 day total tests and plotted those results.

dat %>% 
  group_by(County) %>%
  arrange(Date) %>% 
  mutate(total_tests = cumsum(TotalTestEncounters)) %>%
  mutate(total_positives = cumsum(NewPositiveTests)) %>%
  mutate(total_tests_7d_ago = lag(total_tests, 7, default = 0)) %>%
  mutate(total_positives_7d_ago = lag(total_positives, 7, default = 0)) %>%
  mutate(test_positive_rate_7d = (total_positives-total_positives_7d_ago) /
           (total_tests - total_tests_7d_ago)) %>%
  select(County, Date, test_positive_rate_7d) %>%
  filter(Date > "2020-12-31") %>%
  ggplot(aes(x = Date, y = test_positive_rate_7d*100, color = County)) +
  geom_line() +
  labs(title = "Test Positive Rate in Last 7 Days", 
       subtitle = "CDC Indicators to Inform Decision Making") +
  ylab("Test positive rate over last 7 days (%) ") +
  geom_hline(yintercept = 5, linetype = "dotted", color = "blue") +
  geom_hline(yintercept = 7.95, linetype = "dotted", color = "yellow") +
  geom_hline(yintercept = 9.95, linetype = "dotted", color = "orange")

In retrospect, I think I could have done this without calculating the cumulative sum, just using the lag. I would have to create a function for that…

lagsum <- function(x, lag_start, lag_end) {
  require(dplyr)
  total = 0
  for(i in lag_start:lag_end) {
    total <- total + lag(x, i, default = 0)
  }
  return(total)
}

And it works!

dat %>% 
  group_by(County) %>%
  arrange(Date) %>% 
  
  # helper function lagsum method
  mutate(total_cases_7d_lagsum = lagsum(NewCases, 0, 6)) %>%
  
  # cumsum method
  mutate(total_cases = cumsum(NewCases)) %>%
  mutate(cases_7d_ago = lag(total_cases, 7, default = 0)) %>%
  mutate(total_cases_7d_cumsum = (total_cases-cases_7d_ago)) %>%
  
  # compare results
  select(County,Date, total_cases_7d_lagsum, total_cases_7d_cumsum) %>%
  filter(Date > "2021-02-14")

## # A tibble: 20 x 4
## # Groups:   County [5]
##    County   Date       total_cases_7d_lagsum total_cases_7d_cumsum
##    <fct>    <date>                     <dbl>                 <dbl>
##  1 Hawaii   2021-02-15                    14                    14
##  2 Honolulu 2021-02-15                   232                   232
##  3 Kauai    2021-02-15                     0                     0
##  4 Maui     2021-02-15                    79                    79
##  5 Missing  2021-02-15                     0                     0
##  6 Hawaii   2021-02-16                    18                    18
##  7 Honolulu 2021-02-16                   215                   215
##  8 Kauai    2021-02-16                     1                     1
##  9 Maui     2021-02-16                    67                    67
## 10 Missing  2021-02-16                     0                     0
## 11 Hawaii   2021-02-17                    15                    15
## 12 Honolulu 2021-02-17                   205                   205
## 13 Kauai    2021-02-17                     1                     1
## 14 Maui     2021-02-17                    64                    64
## 15 Missing  2021-02-17                     0                     0
## 16 Hawaii   2021-02-18                    15                    15
## 17 Honolulu 2021-02-18                   193                   193
## 18 Kauai    2021-02-18                     1                     1
## 19 Maui     2021-02-18                    65                    65
## 20 Missing  2021-02-18                     0                     0

After googling some more, I discovered the RcppRoll::roll_sum function. Anything with Rcpp in its name is intimidating, so for now, I’ll just use my homemade function.