In the Security Now podcast, hosts Steve Gibson and Leo Laporte often cite past episodes that explain certain basic topics, but they can never remember which episode covered which topic (e.g., how TLS works). As a follow-up to my previous post about text mining with the tidytext package, I decided to see if I could make it easier to create an index for the series using term frequency and inverse document frequency (tf-idf) analysis.
Methods
I again used the tidyverse packages along with tidytext by Julia Silge and David Robinson, who wrote the book from which most of these methods come.
library(dplyr)
## Warning: Installed Rcpp (0.12.12) different from Rcpp used to build dplyr (0.12.11).
## Please reinstall dplyr to avoid random crashes or undefined behavior.
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidytext)
library(ggplot2)
library(tidyr)
Obtaining Text
I downloaded the episode transcripts from GRC.com.
DIRECTORY <- "~/Dropbox/Mike/securitynow/shows/"  # trailing slash needed so paste0() builds a valid path

# Change the range below to run from the first show number you want to the last.
# As of 1 July 2017, the most recent show is #618.
for(i in 400:499) {
  shortname <- paste0("sn-", i, ".txt")
  showname <- paste0("https://www.grc.com/sn/", shortname)
  print(shortname)
  download.file(showname, destfile = paste0(DIRECTORY, shortname), method = "curl")
}
I read each of the transcripts into one tidy data frame.
DIRECTORY <- "~/Dropbox/Mike/securitynow/shows/"

sn_text <- data_frame(linenumber = NA, episode = NA, linetext = NA)

for(i in 400:403) {
  sn_text_raw <- readLines(paste0(DIRECTORY, "sn-", i, ".txt"))
  sn_text_df <- data_frame(linenumber = 1:length(sn_text_raw),
                           episode = i,
                           linetext = sn_text_raw)
  sn_text <- bind_rows(sn_text, sn_text_df)
}

sn_text <- sn_text[-1, ]
rm(sn_text_df)
rm(sn_text_raw)
Term Frequency
I broke the lines into individual words using unnest_tokens() and counted how many times each word appeared in each episode. I then computed the total number of words in each episode.
episode_words <- sn_text %>%
  unnest_tokens(word, linetext) %>%
  # anti_join(custom_stop_words, by = "word") %>%
  count(episode, word, sort = TRUE) %>%
  ungroup()

total_words <- episode_words %>%
  group_by(episode) %>%
  summarize(total = sum(n))

episode_words <- left_join(episode_words, total_words, by = "episode")

episode_words
## # A tibble: 9,101 x 4
## episode word n total
## <int> <chr> <int> <int>
## 1 403 the 624 14736
## 2 401 the 529 15367
## 3 402 the 488 13008
## 4 401 and 423 15367
## 5 402 and 398 13008
## 6 403 and 395 14736
## 7 400 the 381 11366
## 8 403 to 370 14736
## 9 401 a 356 15367
## 10 401 i 354 15367
## # ... with 9,091 more rows
It was not surprising to see that common stop words like “the” accounted for most of the words.
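As an aside, here is a minimal sketch of how the standard stop words could be stripped out using the stop_words lexicon that ships with tidytext. I did not apply it in this analysis (the commented-out anti_join above anticipated a custom list instead), and the episode_words_nostop name is just illustrative.

# Hedged sketch: remove common English stop words ("the", "and", "to", ...)
# before counting, using tidytext's built-in stop_words data frame.
episode_words_nostop <- sn_text %>%
  unnest_tokens(word, linetext) %>%
  anti_join(stop_words, by = "word") %>%
  count(episode, word, sort = TRUE) %>%
  ungroup()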
I then calculated the term frequency (the count of a word as a proportion of the total number of words in the episode) and ranked each word within its episode by its raw count.
freq_by_rank <- episode_words %>%
  group_by(episode) %>%
  mutate(rank = row_number(),
         `term frequency` = n/total)

freq_by_rank
## # A tibble: 9,101 x 6
## # Groups: episode [4]
## episode word n total rank `term frequency`
## <int> <chr> <int> <int> <int> <dbl>
## 1 403 the 624 14736 1 0.04234528
## 2 401 the 529 15367 1 0.03442442
## 3 402 the 488 13008 1 0.03751538
## 4 401 and 423 15367 2 0.02752652
## 5 402 and 398 13008 2 0.03059656
## 6 403 and 395 14736 2 0.02680510
## 7 400 the 381 11366 1 0.03352103
## 8 403 to 370 14736 3 0.02510858
## 9 401 a 356 15367 3 0.02316653
## 10 401 i 354 15367 4 0.02303638
## # ... with 9,091 more rows
It’s also possible to plot term frequency against rank on logarithmic axes. The result appears to obey Zipf’s Law: a term’s frequency is inversely proportional to its rank. The relationship is approximately:
\[ frequency \propto {1 \over rank} \]
Taking the logarithm of both sides, we have:
\[ \log(frequency) = -\log(rank) + \text{constant} \]
so the data should fall on a roughly straight line with slope near -1, as in the following plot of log term frequency against log rank.
freq_by_rank %>%
  ggplot(aes(rank, `term frequency`, color = factor(episode))) +
  geom_line(size = 1.2, alpha = 0.2) +
  scale_x_log10() +
  scale_y_log10()
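As a quick check that the slope really is close to -1 (not part of the original analysis; the rank cutoffs below are arbitrary choices to trim the extreme head and tail of the distribution), a linear fit on the log-log scale can be sketched like this:

# Hedged sketch: estimate the Zipf exponent from the middle ranks.
# A fitted slope near -1 is consistent with Zipf's Law.
rank_subset <- freq_by_rank %>%
  filter(rank > 10, rank < 500)

lm(log10(`term frequency`) ~ log10(rank), data = rank_subset)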
Inverse Document Frequency
The inverse document frequency is a rough rule of thumb for identifying which words are important to a specific document relative to the whole corpus: idf singles out words that are not evenly distributed across the documents.
It is defined as the natural log of the number of documents divided by the number of documents containing the term. For a term that appears in only one document, the ratio rises to the total number of documents; for a term that appears in every document, the ratio falls to 1, whose natural log is 0.
\[ idf(term) = \ln\left({n_{documents} \over n_{documents \space containing \space term}}\right) \]
When we multiply idf by term frequency, the result weights each word by how often it appears in a particular document against how widely it appears across the corpus.
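To make those numbers concrete, with the four episodes used so far (400 through 403) the idf can only take a handful of values, which is why the same figures recur in the tables below:

# idf for a corpus of 4 documents (episodes 400-403)
log(4/1)  # word appears in 1 of 4 episodes: 1.3862944
log(4/2)  # word appears in 2 of 4 episodes: 0.6931472
log(4/4)  # word appears in every episode:   0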
The tidytext package contains a bind_tf_idf() function that makes this easy to calculate.
episode_words <- episode_words %>%
  bind_tf_idf(word, episode, n)

episode_words
## # A tibble: 9,101 x 7
## episode word n total tf idf tf_idf
## <int> <chr> <int> <int> <dbl> <dbl> <dbl>
## 1 403 the 624 14736 0.04234528 0 0
## 2 401 the 529 15367 0.03442442 0 0
## 3 402 the 488 13008 0.03751538 0 0
## 4 401 and 423 15367 0.02752652 0 0
## 5 402 and 398 13008 0.03059656 0 0
## 6 403 and 395 14736 0.02680510 0 0
## 7 400 the 381 11366 0.03352103 0 0
## 8 403 to 370 14736 0.02510858 0 0
## 9 401 a 356 15367 0.02316653 0 0
## 10 401 i 354 15367 0.02303638 0 0
## # ... with 9,091 more rows
I arranged the output by descending tf_idf to see which words set the episodes apart from one another.
episode_words %>%
  select(-total) %>%
  arrange(desc(tf_idf))
## # A tibble: 9,101 x 6
## episode word n tf idf tf_idf
## <int> <chr> <int> <dbl> <dbl> <dbl>
## 1 402 secret 46 0.0035362854 0.6931472 0.002451166
## 2 400 java 37 0.0032553229 0.6931472 0.002256418
## 3 400 minder 16 0.0014077072 1.3862944 0.001951497
## 4 402 bittorrent 33 0.0025369004 0.6931472 0.001758445
## 5 402 folder 16 0.0012300123 1.3862944 0.001705159
## 6 403 quinto 16 0.0010857763 1.3862944 0.001505206
## 7 403 quantum 15 0.0010179153 1.3862944 0.001411130
## 8 401 spheres 14 0.0009110431 1.3862944 0.001262974
## 9 400 applet 10 0.0008798170 1.3862944 0.001219685
## 10 400 clip 10 0.0008798170 1.3862944 0.001219685
## # ... with 9,091 more rows
plot_episode <- episode_words %>%
  arrange(desc(tf_idf)) %>%
  mutate(word = factor(word, levels = rev(unique(word))))

plot_episode %>%
  top_n(20) %>%
  ggplot(aes(word, tf_idf, fill = factor(episode))) +
  geom_col() +
  labs(x = NULL, y = "tf-idf") +
  coord_flip()
## Selecting by tf_idf
Looks like episode 403 had a bunch about quantum computing while episode 401 featured stories on TrueCrypt and Nest. Episode 402 seemed to talk about secrets (perhaps Bittorrent Sync).
By episode, we can visualize the top 10 tf_idf words.
plot_episode %>%
  group_by(episode) %>%
  top_n(10, tf_idf) %>%
  ungroup() %>%
  mutate(word = reorder(word, tf_idf)) %>%
  ggplot(aes(word, tf_idf, fill = factor(episode))) +
  geom_col(show.legend = FALSE) +
  labs(x = NULL, y = "tf-idf") +
  facet_wrap(~episode, ncol = 2, scales = "free") +
  coord_flip()
Based on this, I thought I could identify which episode talked the most about any particular topic by finding the episode with the highest tf-idf for that word. Applied across one hundred episodes, it ended up being kind of a mess.
sn_text <- data_frame(linenumber = NA, episode = NA, linetext = NA)

for(i in 400:499) {
  sn_text_raw <- readLines(paste0(DIRECTORY, "sn-", i, ".txt"))
  sn_text_df <- data_frame(linenumber = 1:length(sn_text_raw),
                           episode = i,
                           linetext = sn_text_raw)
  sn_text <- bind_rows(sn_text, sn_text_df)
}

sn_text <- sn_text[-1, ]
rm(sn_text_df)
rm(sn_text_raw)
episode_words <- sn_text %>%
  unnest_tokens(word, linetext) %>%
  # anti_join(custom_stop_words, by = "word") %>%
  count(episode, word, sort = TRUE) %>%
  ungroup()

total_words <- episode_words %>%
  group_by(episode) %>%
  summarize(total = sum(n))

episode_words <- left_join(episode_words, total_words, by = "episode")

episode_words %>%
  bind_tf_idf(word, episode, n) %>%
  filter(tf_idf > 0, n > 15) %>%
  group_by(word) %>%
  arrange(desc(tf_idf)) %>%
  top_n(1) %>%
  ungroup() %>%
  arrange(desc(tf_idf)) %>%
  select(episode, word, n, tf_idf)
## Selecting by tf_idf
## # A tibble: 657 x 4
## episode word n tf_idf
## <int> <chr> <int> <dbl>
## 1 463 fr 103 0.02151263
## 2 404 brokers 40 0.01546787
## 3 431 radius 54 0.01416449
## 4 404 iyaz 55 0.01228141
## 5 497 lee 65 0.01186939
## 6 497 pat 63 0.01150418
## 7 457 brett 101 0.01087039
## 8 404 bluekai 28 0.01082751
## 9 423 tom 114 0.01066308
## 10 490 enigma 57 0.01019011
## # ... with 647 more rows
At least we can see that there were some episodes that talked about Radius and Bluekai. Enigma was in episode 490. Scrolling further yields episodes on Sony, Bittorrent Sync, etc. Acxiom came up in episode 404, while pairing (of something or other) must have been big in episode 465.
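Still, the same table could be queried as a crude index for a single topic word. Here is a hedged sketch using "tls", the example from the introduction (the token is lowercase because unnest_tokens() lowercases by default; I have not verified what this particular query returns):

# Hedged sketch: find the episodes where a given topic word carries the most
# weight, i.e., has the highest tf-idf.
episode_words %>%
  bind_tf_idf(word, episode, n) %>%
  filter(word == "tls") %>%
  arrange(desc(tf_idf)) %>%
  select(episode, word, n, tf_idf) %>%
  head(5)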
Discussion
Generating an index algorithmically, at least for Security Now, did not produce a useful result. A lot of manual validation would still be needed to turn this output into a usable index.
Conclusion
Tidytext makes it easy to see which words are important to which documents, but that doesn't guarantee the result will be meaningful.