Using Bookdown to create a Security Now Ebook

Recently I have been exploring the blogdown package to redo my website using Hugo, and it resurrected an old idea of mine. Security Now, one of my favorite podcasts, publishes a transcript of each episode, and I thought it would be neat to collect them into an e-book and read the old episodes that way. I tried using Calibre, and while it did work, I wasn’t able to figure out how to format the result nicely. The creator of blogdown, Yihui Xie, has also written a package called bookdown, which looked like it had the right features to make a proper-looking e-book.

library(bookdown)
library(dplyr)
## Warning: Installed Rcpp (0.12.12) different from Rcpp used to build dplyr (0.12.11).
## Please reinstall dplyr to avoid random crashes or undefined behavior.
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

I downloaded the episode transcripts from GRC.com.

Obtaining Text

DIRECTORY <- "../datasets/sn501/"

# Change the range of i to run from the first show number you want to the last. As of 1 July 2017, the most recent show is #618.

for(i in 500:599) {
  shortname <- paste0("sn-", i, ".txt")
  showname <- paste0("https://www.grc.com/sn/", shortname)
  print(shortname)
  download.file(showname, destfile = paste0(DIRECTORY, shortname), method = "curl")
}
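
Re-running the loop above fetches every file again. Here is a minimal sketch of a variant that only requests transcripts not yet on disk. `transcript_url()`, `missing_shows()`, and `download_missing()` are hypothetical helper names of my own, and the URL pattern is simply the one used in the loop above.

```r
# Hypothetical helpers; the URL pattern is copied from the loop above.
transcript_url <- function(i) {
  paste0("https://www.grc.com/sn/sn-", i, ".txt")
}

# Return the show numbers whose transcript is not yet in `directory`.
missing_shows <- function(directory, shows) {
  files <- file.path(directory, paste0("sn-", shows, ".txt"))
  shows[!file.exists(files)]
}

# Download only the missing transcripts, creating the directory if needed.
download_missing <- function(directory, shows) {
  dir.create(directory, recursive = TRUE, showWarnings = FALSE)
  for (i in missing_shows(directory, shows)) {
    destfile <- file.path(directory, paste0("sn-", i, ".txt"))
    download.file(transcript_url(i), destfile = destfile, method = "curl")
  }
}

# Usage (not run here, since it hits the network):
# download_missing("../datasets/sn501", 500:599)
```

Keeping the "what is missing" check separate from the download itself makes it easy to preview which episodes would be fetched before touching the network.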

I loaded an example file with readLines(), which returns a character vector with one element per line.

DIRECTORY <- "../datasets/sn501/"
# list.files(DIRECTORY)
episode <- readLines(paste0(DIRECTORY, "sn-501.txt"))
episode[1:17]
##  [1] "GIBSON RESEARCH CORPORATION\t\thttps://www.GRC.com/"
##  [2] ""
##  [3] "SERIES:\t\tSecurity Now!"
##  [4] "EPISODE:\t#501"
##  [5] "DATE:\t\tMarch 31, 2015"
##  [6] "TITLE:\t\tListener Feedback #209"
##  [7] "HOSTS:\tSteve Gibson & Leo Laporte"
##  [8] "SOURCE:\thttp://media.GRC.com/sn/SN-501.mp3"
##  [9] "ARCHIVE:\thttp://www.GRC.com/securitynow.htm "
## [10] ""
## [11] "DESCRIPTION:  Leo and I discuss the week's major security events and discuss questions and comments from listeners of previous episodes.  We tie up loose ends, explore a wide range of topics that are too small to fill their own episode, clarify any confusion from previous installments, and present real world application notes for any of the security technologies and issues we have previously discussed."
## [12] ""
## [13] "SHOW TEASE:  It's time for Security Now!.  Steve Gibson is here.  I'm here.  We're going to talk about the latest security news.  Yes, another survey of bad passwords coming up.  And then 10 great questions from you, our audience members.  Stay tuned.  Security Now! is next."
## [14] ""
## [15] "LEO LAPORTE:  This is Security Now! with Steve Gibson, Episode 501, recorded Tuesday, March 31st, 2015:  Your questions, Steve's answers, #209."
## [16] ""
## [17] "It's time for Security Now!, the show where we protect you and your loved ones online, your privacy, your security, with this guy right here, the Explainer in Chief, Steven \"Tiberius\" Gibson.  And he is here once again to both put us all on edge, and then make to us feel better about..."

From what I could tell, the first 13 lines are metadata. These could be reformatted into header text for bookdown to work with.
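
To make that observation concrete, here is a rough sketch that pulls the tab-separated `KEY:` lines into a named character vector. `parse_metadata()` is a hypothetical helper of my own, and the sample lines are copied from the sn-501 output above. It assumes keys are upper-case words followed by a colon and at least one tab, so the DESCRIPTION and SHOW TEASE lines (which use spaces after the colon) are deliberately left out.

```r
# Sample header lines copied from the sn-501 output shown above.
header <- c(
  "SERIES:\t\tSecurity Now!",
  "EPISODE:\t#501",
  "DATE:\t\tMarch 31, 2015",
  "TITLE:\t\tListener Feedback #209",
  "HOSTS:\tSteve Gibson & Leo Laporte",
  "",
  "DESCRIPTION:  Leo and I discuss the week's major security events."
)

# parse_metadata() is a hypothetical helper, not part of bookdown.
parse_metadata <- function(lines) {
  # Keep only lines shaped like "KEY:<tab(s)>value".
  is_meta <- grepl("^[A-Z ]+:\t+", lines)
  meta <- lines[is_meta]
  # Split each line into the key before the colon and the value after the tabs.
  keys <- sub("^([A-Z ]+):\t+.*$", "\\1", meta)
  values <- trimws(sub("^[A-Z ]+:\t+", "", meta))
  setNames(values, keys)
}

meta <- parse_metadata(header)
meta[["TITLE"]]
```

Having the fields by name means the title, date, or hosts can be looked up directly instead of relying on fixed line positions.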

Bookdown requires the first line of each file to be a chapter title written as a first-level heading. I think the episode title is the natural choice for that heading. Other metadata, like Series and Archive, can be discarded since they would probably be the same for every file.
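
As a quick sanity check on that requirement, a sketch like the following tests whether a chapter's first line is a first-level heading. `has_chapter_title()` is a hypothetical helper name, not a bookdown function.

```r
# has_chapter_title() is a hypothetical helper, not part of bookdown.
has_chapter_title <- function(lines) {
  # TRUE when the first line is "# " followed by some non-space text.
  length(lines) > 0 && grepl("^# [^[:space:]]", lines[1])
}

# A chapter that starts with a first-level heading passes the check.
has_chapter_title(c("# Listener Feedback #209", "", "EPISODE: #501"))

# A raw transcript header does not.
has_chapter_title(c("TITLE:  Listener Feedback #209", ""))
```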

Conceptually, for each transcript, I had to discard lines 1, 3, and 9, and move the title to the top. The rest would probably be OK to keep for now. But what if not every episode is formatted exactly like this one? Matching lines by position would break, so I should use regular expressions instead.

line_grc <- grep("^GIBSON", episode)
line_series <- grep("^SERIES", episode)
line_archive <- grep("^ARCHIVE", episode)

episode <- episode[-c(line_grc, line_series, line_archive)]

line_title <- grep("^TITLE", episode)
episode_length <- length(episode)
episode <- episode[c(line_title, 1:episode_length)]
episode <- episode[-(line_title + 1)]

episode <- gsub("\t", " ", episode)
episode <- sub("TITLE:  ", "# ", episode)
head(episode)
## [1] "# Listener Feedback #209"
## [2] ""
## [3] "EPISODE: #501"
## [4] "DATE:  March 31, 2015"
## [5] "HOSTS: Steve Gibson & Leo Laporte"
## [6] "SOURCE: http://media.GRC.com/sn/SN-501.mp3"
writeLines(episode, paste0(DIRECTORY, "sn-501.Rmd"))

So that’s the process for one episode. How ’bout episodes 501 through 599?

for(i in 501:599) {
  episode <- readLines(paste0(DIRECTORY, "sn-", i, ".txt"))
  
  line_grc <- grep("^GIBSON", episode)
  line_series <- grep("^SERIES", episode)
  line_archive <- grep("^ARCHIVE", episode)

  episode <- episode[-c(line_grc, line_series, line_archive)]

  line_title <- grep("^TITLE", episode)
  episode_length <- length(episode)
  episode <- episode[c(line_title, 1:episode_length)]
  episode <- episode[-(line_title + 1)]

  episode <- gsub("\t", " ", episode)
  episode <- sub("TITLE:  ", "# ", episode)

  writeLines(episode, paste0(DIRECTORY, "sn-", i, ".Rmd"))
}
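
One caveat with the negative indexing used above: if none of the three patterns match, `-c(line_grc, line_series, line_archive)` is an empty index and the subset comes back empty, wiping out the episode. Below is a minimal sketch of a guarded variant that uses logical indexing instead. `clean_episode()` is a hypothetical function name, and the sketch assumes each transcript has exactly one TITLE line.

```r
# clean_episode() is a hypothetical wrapper around the steps above.
clean_episode <- function(episode) {
  # Logical indexing is safe even when no line matches, unlike
  # episode[-integer(0)], which returns an empty vector.
  drop <- grepl("^(GIBSON|SERIES|ARCHIVE)", episode)
  episode <- episode[!drop]

  line_title <- grep("^TITLE", episode)
  if (length(line_title) != 1) {
    stop("expected exactly one TITLE line, found ", length(line_title))
  }

  # Move the title to the top and turn it into a first-level heading.
  title <- sub("^TITLE:[[:space:]]*", "# ", episode[line_title])
  body <- gsub("\t", " ", episode[-line_title])
  c(title, body)
}

# Sample lines copied from the sn-501 transcript shown earlier.
sample_episode <- c(
  "GIBSON RESEARCH CORPORATION\t\thttps://www.GRC.com/",
  "",
  "SERIES:\t\tSecurity Now!",
  "EPISODE:\t#501",
  "DATE:\t\tMarch 31, 2015",
  "TITLE:\t\tListener Feedback #209",
  "HOSTS:\tSteve Gibson & Leo Laporte"
)
cleaned <- clean_episode(sample_episode)
head(cleaned)
```

Inside the loop, the body would then shrink to reading the file, calling `clean_episode()`, and writing the result, and a transcript with a missing or duplicated TITLE line would stop with an error rather than produce a malformed chapter.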