I just found out about Tidy Tuesday, an educational exercise from the R for Data Science folks. The idea is that they publish a dataset every Tuesday for people to play around with. If you make something you’re proud of, you can publish it on Twitter with the hashtag #tidytuesday. I hadn’t tried it before, but today’s dataset covered a topic that interests me: computer security and passwords.
Data Loading
readr::read_csv() accepts a URL, so I could load the data directly from the raw CSV on GitHub.
library(dplyr)
library(ggplot2)
passwords <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-01-14/passwords.csv')
## Parsed with column specification:
## cols(
## rank = col_double(),
## password = col_character(),
## category = col_character(),
## value = col_double(),
## time_unit = col_character(),
## offline_crack_sec = col_double(),
## rank_alt = col_double(),
## strength = col_double(),
## font_size = col_double()
## )
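# tbl_df() is deprecated in current dplyr; read_csv() already returns a tibble, so this is a no-op
passwords <- as_tibble(passwords)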
glimpse(passwords)
## Observations: 507
## Variables: 9
## $ rank <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, …
## $ password <chr> "password", "123456", "12345678", "1234", "qwerty",…
## $ category <chr> "password-related", "simple-alphanumeric", "simple-…
## $ value <dbl> 6.91, 18.52, 1.29, 11.11, 3.72, 1.85, 3.72, 6.91, 6…
## $ time_unit <chr> "years", "minutes", "days", "seconds", "days", "min…
## $ offline_crack_sec <dbl> 2.170e+00, 1.110e-05, 1.110e-03, 1.110e-07, 3.210e-…
## $ rank_alt <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, …
## $ strength <dbl> 8, 4, 4, 4, 8, 4, 8, 4, 7, 8, 8, 1, 32, 9, 9, 8, 8,…
## $ font_size <dbl> 11, 8, 8, 8, 11, 8, 11, 8, 11, 11, 11, 4, 23, 12, 1…
Data Munging
readr guessed most of the column types sensibly, but category is a categorical variable, so I converted it to a factor.
passwords$category <- as.factor(passwords$category)
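The conversion could also happen at read time. This is a minimal sketch, assuming readr's col_types argument with col_factor(); the three-row file written to a temporary path is made up for illustration and is not the Tidy Tuesday data.

```r
library(readr)

# Hypothetical miniature CSV, written to a temp file so the example runs offline
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  "rank,password,category",
  "1,password,password-related",
  "2,123456,simple-alphanumeric",
  "3,12345678,simple-alphanumeric"
), tmp)

# Parse `category` as a factor; the remaining columns are still guessed
toy <- read_csv(tmp, col_types = cols(category = col_factor()))

is.factor(toy$category)
levels(toy$category)
```

With col_factor() and no explicit levels, the levels are taken from the data in order of appearance, whereas as.factor() sorts them alphabetically.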
Exploration
I decided to look at the passwords by category. Here is the distribution of categories, first as counts and then as proportions. Seven passwords had no category (NA), so I dropped them afterwards.
table(passwords$category, useNA = "ifany")
##
## animal cool-macho fluffy food
## 29 79 44 11
## name nerdy-pop password-related rebellious-rude
## 183 30 15 11
## simple-alphanumeric sport <NA>
## 61 37 7
prop.table(table(passwords$category, useNA = "ifany"))
##
## animal cool-macho fluffy food
## 0.05719921 0.15581854 0.08678501 0.02169625
## name nerdy-pop password-related rebellious-rude
## 0.36094675 0.05917160 0.02958580 0.02169625
## simple-alphanumeric sport <NA>
## 0.12031558 0.07297830 0.01380671
passwords <- passwords %>% filter(!is.na(category))
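The counts and proportions can also be produced in one data frame with dplyr. This is a sketch on a made-up five-row table (the toy object and its values are illustrative, not taken from the dataset), assuming dplyr's count() and mutate().

```r
library(dplyr)

# Made-up stand-in for the passwords table, including one missing category
toy <- tibble(
  password = c("password", "123456", "dragon", "monkey", "qwerty"),
  category = c("password-related", "simple-alphanumeric", "animal", "animal", NA)
)

# One row per category with its count and share; NA gets its own row
category_share <- toy %>%
  count(category, sort = TRUE) %>%
  mutate(prop = n / sum(n))

category_share

# Drop the rows with no category, as in the post
toy_clean <- toy %>% filter(!is.na(category))
```

Keeping the result as a data frame makes it easy to pass on to ggplot() or to sort, which is harder with the output of table().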
Here’s how hard the passwords were to crack by category. They were all easy to crack: offline_crack_sec tops out at about 29 seconds, and most values are well under one second.
table(passwords$offline_crack_sec)
##
## 1.11e-07 1.11e-06 4.75e-06 1.11e-05 0.000111 0.000124 0.000622 0.00111
## 11 2 31 18 3 39 1 5
## 0.00321 0.0111 0.0224 0.0835 0.806 2.17 29.02 29.27
## 233 1 4 87 5 56 3 1
passwords %>% ggplot(aes(x = category, y = offline_crack_sec)) +
  geom_boxplot() + scale_y_log10() + coord_flip()
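A numeric summary can sit alongside the boxplot. The sketch below assumes dplyr's group_by() and summarise(); the toy rows are invented for illustration, so the numbers say nothing about the real categories.

```r
library(dplyr)

# Invented rows that mimic the two columns used in the plot
toy <- tibble(
  category = c("animal", "animal", "name", "name", "name"),
  offline_crack_sec = c(0.00321, 2.17, 0.00321, 0.0835, 29.02)
)

# Count, median and maximum crack time per category, slowest median first
crack_summary <- toy %>%
  group_by(category) %>%
  summarise(
    n = n(),
    median_sec = median(offline_crack_sec),
    max_sec = max(offline_crack_sec)
  ) %>%
  arrange(desc(median_sec))

crack_summary
```

The median is used rather than the mean because the crack times span several orders of magnitude, which is also why the plot uses a log scale.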
Popularity vs Difficulty to Crack
Was there a correlation between how popular a password was and how easy it was to crack? Not an obvious one.
passwords %>% mutate(rank_crack = min_rank(offline_crack_sec)) %>%
  ggplot(aes(x = rank, y = rank_crack, color = category)) +
  geom_point()
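One way to put a number on this would be a rank correlation. The sketch below uses base R's cor() with method = "spearman" on six invented rows; the toy values are assumptions for illustration, so no result for the real data is implied.

```r
# Invented rows with the two columns of interest
toy <- data.frame(
  rank = 1:6,
  offline_crack_sec = c(2.17, 0.0000111, 0.00111, 0.000000111, 0.00321, 0.0835)
)

# Spearman's rho compares the orderings rather than the raw values,
# which suits crack times that span many orders of magnitude
rho <- cor(toy$rank, toy$offline_crack_sec, method = "spearman")
rho
```

Since offline_crack_sec has only 16 distinct values after filtering, the real data contains many ties. min_rank() gives tied values the same rank, which is why the scatterplot falls into horizontal bands, while Spearman's method uses average ranks for ties.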
Conclusion
I couldn’t figure out much to do with this dataset, but at least I learned how to load a raw CSV from GitHub.