I joined the Great Virtual Race across Tennessee this summer. It’s by the guys who produce the Barkley Marathons, the crazy nearly impossible ultramarathon series. The goal for this event is to run across Tennessee virtually, a little more than 1000 km. Each day the data table is published as a csv and so I thought I would take a look at what I could do with it.
Some basics:
- The run goes from May 1 to August 31, which is 122 days.
- The distance is 634.84 miles / 1021.68 km
- Daily distance needed to complete the race then is 634.84/122 or 5.204 miles.
- Because I started a little late (5/13), I actually have only 110 days to complete the race or just over 5.77 miles per day.
Data Loading
suppressPackageStartupMessages(library(tidyverse))
suppressPackageStartupMessages(library(lubridate))
gv <- read_csv("../datasets/gvrat_20200520.csv")
## Parsed with column specification:
## cols(
## .default = col_double(),
## Position = col_character(),
## `Participant's Name` = col_character(),
## Event = col_character(),
## Home = col_character(),
## G = col_character(),
## KM = col_number(),
## `Your Approximate Location` = col_character(),
## `Comp%` = col_character(),
## `Proj Fin` = col_character()
## )
## See spec(...) for full column specifications.
Exploration
I decided I would try to separate out the tables as name, gender, age, etc. and a separate one for the daily miles since this will quickly get huge.
gv_roster <- gv %>%
select(Bib, name = `Participant's Name`, Event, G, A, Home, Miles) %>%
mutate(Event = as.factor(Event)) %>%
mutate(Home = as.factor(Home)) %>%
mutate(G = as.factor(G))
gv_miles <- gv %>%
select(-c(Position, `Participant's Name`, `Your Approximate Location`,
`Comp%`, `Proj Fin`, KM, Home, G, A, Miles)) %>%
pivot_longer(contains("/"), names_to = "run_date", values_to = "miles_d") %>%
mutate(run_date = mdy(paste0(run_date, "/2020"))) %>%
mutate(Event = as.factor(Event))
Event
There are two events within this race, the dog version and the human version.
DRAT4AS = “Doggie Run Across Tennessee For Animal Shelters” and GVRAT = human version.
table(gv_roster$Event)
##
## DRAT4AS GVRAT
## 295 19016
Gender
In lots of races there are more women than men, and this one is no different.
table(gv_roster$G, gv_roster$Event)
##
## DRAT4AS GVRAT
## F 165 10738
## M 130 8278
prop.table(table(gv_roster$G, gv_roster$Event), 2)
##
## DRAT4AS GVRAT
## F 0.5593220 0.5646824
## M 0.4406780 0.4353176
ggplot(gv_roster, aes(x=G, fill = G)) + geom_bar()
Definitely more ladies than men, even doggy ladies, but interestingly enough the same percentage of doggy ladies as human ladies.
chisq.test(table(gv_roster$G, gv_roster$Event))
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: table(gv_roster$G, gv_roster$Event)
## X-squared = 0.01565, df = 1, p-value = 0.9004
Age
Taking out the doggies, we can look at the age distribution.
gv_roster %>% filter(Event == "GVRAT") %>%
ggplot(aes(x=A)) +
geom_histogram(binwidth = 1)
## Warning: Removed 2 rows containing non-finite values (stat_bin).
The mean participant age is 43.3639942. The median age of human participants is 43.
I am exactly the median age for this race.
Age vs. Gender
The distributions of ages by gender look remarkably similar.
gv_roster %>% filter(Event == "GVRAT") %>%
ggplot(aes(x=A, fill = G)) + geom_histogram(binwidth = 1) +
facet_grid(G ~ .)
## Warning: Removed 2 rows containing non-finite values (stat_bin).
Looks normally distributed enough to just use normal distribution tests but I went with a t-test anyway. Men were about 7 months older than women in this race.
t.test(gv_roster$A ~ gv_roster$G) # includes doggos
##
## Welch Two Sample t-test
##
## data: gv_roster$A by gv_roster$G
## t = -3.6332, df = 17470, p-value = 0.0002808
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.9128226 -0.2730463
## sample estimates:
## mean in group F mean in group M
## 43.10585 43.69879
I guess this shows that with a large enough sample, even small differences in age become significant.
Mileage
Here’s the total miles distribution. There are lots of 0 mile runners. The “Buzzard” is a pretend character that is flying along at the goal pace of 5.204 miles per day. Runners can use the Buzzard to gauge where they are relative to the goal. By 5/20, it was at 104.0721311 miles.
gv_miles %>% filter(Event == "GVRAT") %>%
group_by(Bib) %>%
summarize(total_miles = sum(miles_d, na.rm = T)) %>%
ggplot(aes(x = total_miles)) +
geom_histogram(binwidth = 5) +
geom_vline(xintercept = 634.84/122*20, color = "red")
Uh oh, looks like about just about half of the participants aren’t on track to make it.
gv_totals <- gv_miles %>% filter(Event == "GVRAT") %>%
group_by(Bib) %>%
summarize(total_miles = sum(miles_d, na.rm = T))
quantile(gv_totals$total_miles, seq(0, 1, 0.1))
## 0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
## 0 34 66 86 98 107 117 129 146 180 624
Total Miles vs. Gender
gv_roster %>% filter(Event == "GVRAT") %>%
ggplot(aes(x=Miles, fill = G)) +
geom_histogram(binwidth = 5) +
facet_grid(G~.)
t.test(Miles ~ G, dat = gv_roster)
##
## Welch Two Sample t-test
##
## data: Miles by G
## t = -14.472, df = 16078, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -15.03826 -11.45047
## sample estimates:
## mean in group F mean in group M
## 104.0505 117.2949
Interestingly men had run a little farther than women on average (including the 0 mile people).
Age vs. Mileage
gv_roster %>% filter(Event == "GVRAT") %>%
ggplot(aes(x = A, y = Miles)) +
geom_point(alpha = 0.01, color = "blue")
## Warning: Removed 2 rows containing missing values (geom_point).
There didn’t seem to be a significant correlation between age and miles run, although on the regression, the older runners were running more in a statistically significant way. The regression parameter estimate is 0.84 total miles/year of age.
summary(lm(Miles ~ A, data = gv_roster))
##
## Call:
## lm(formula = Miles ~ A, data = gv_roster)
##
## Residuals:
## Min 1Q Median 3Q Max
## -149.21 -33.07 -2.92 25.98 523.93
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 73.18551 1.76606 41.44 <2e-16 ***
## A 0.84467 0.03944 21.41 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 61.13 on 19306 degrees of freedom
## (4 observations deleted due to missingness)
## Multiple R-squared: 0.0232, Adjusted R-squared: 0.02315
## F-statistic: 458.6 on 1 and 19306 DF, p-value: < 2.2e-16
Runners by State and Country
I decided to look at where runners come from. Here’s the list sorted by most runners.
gv_roster %>% filter(Event == "GVRAT") %>%
group_by(Home) %>%
summarize(nrunners = n()) %>%
arrange(-nrunners)
## # A tibble: 142 x 2
## Home nrunners
## <fct> <int>
## 1 USA - Tennessee 1157
## 2 USA - California 995
## 3 Canada -Ontario 959
## 4 USA - Texas 869
## 5 England 847
## 6 USA - Virginia 792
## 7 USA - Ohio 685
## 8 USA - Illinois 682
## 9 USA - New York 652
## 10 USA - Pennsylvania 647
## # … with 132 more rows
Interesting but not surprising that the state of Tennessee had the most runners signed up. Certainly this is way higher per capita than California!
Runners from Hawaii
Here’s the runners from my home state of Hawaii:
gv_roster %>% filter(grepl("Hawaii", Home))
## # A tibble: 42 x 7
## Bib name Event G A Home Miles
## <dbl> <chr> <fct> <fct> <dbl> <fct> <dbl>
## 1 9385 Duane Zitta GVRAT M 39 USA - Hawaii 261.
## 2 11868 Sally Cravens GVRAT F 61 USA - Hawaii 192.
## 3 17429 Aaron Reed GVRAT M 28 USA - Hawaii 182.
## 4 6786 April Shrum GVRAT F 42 USA - Hawaii 169.
## 5 16019 Kana Yamamoto GVRAT F 45 USA - Hawaii 158.
## 6 6906 Robert Davidson GVRAT M 37 USA - Hawaii 155.
## 7 16629 michael hee GVRAT M 60 USA - Hawaii 154.
## 8 11636 Sean Thompson GVRAT M 39 USA - Hawaii 150.
## 9 13815 Andrew Borrebach GVRAT M 29 USA - Hawaii 143.
## 10 7202 Chris Robinson GVRAT M 39 USA - Hawaii 134.
## # … with 32 more rows
Miles by State
gv_roster %>% filter(Event == "GVRAT") %>%
filter(grepl("USA", Home)) %>%
ggplot(aes(x = reorder(Home, Miles, FUN = median),y = Miles)) +
geom_boxplot() +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)) +
xlab("State")
Hawaii is the 49th place USA location by median miles. It’s hard to tell from the boxplots but the difference in miles between Hawaii and Nebraska is over 20 miles, so it’s a pretty significant percent difference.
gv_roster %>% filter(Event == "GVRAT") %>%
filter(grepl("USA", Home)) %>%
group_by(Home) %>%
summarize(medianMiles = median(Miles)) %>%
arrange(-medianMiles)
## # A tibble: 53 x 2
## Home medianMiles
## <fct> <dbl>
## 1 USA - Nebraska 118.
## 2 USA - West Virginia 117.
## 3 USA - APO 112.
## 4 USA - Alabama 112.
## 5 USA - North Dakota 112.
## 6 USA - Colorado 112.
## 7 USA - New Mexico 111.
## 8 USA - Connecticut 109.
## 9 USA - Kentucky 108.
## 10 USA - Florida 108.
## # … with 43 more rows
Discussion
I didn’t include the finishers in the race who are now going back across Tennessee to Arkansas. There were 11 as of this data pull. Ideas for next time:
- looking closer at my progress towards the goal (how my distance run affects my daily mileage required etc.)
- how many people were cheating at logging early miles (not supposed to log miles made before you signed up)
- re-creating the tidal wave chart that shows the histogram of total miles by date