Eighty-five Cent Words

Tonight my wife came to me with the problem of Dollar Words. The idea is that you can assign a numerical value to each letter of the alphabet (a = 1, b = 2 … z = 26). If you replace each letter of a word with the value and add the values up, you get a sum. If the sum adds up to 100, that’s called a dollar word. There are many lists of dollar words on the internet, but the challenge my wife presented was to come up with as many 85 cent words as I could. I tried for a few minutes but was not able to find any! Of course, this called for a brute force attack.

I remembered some time ago looking at natural language processing to break a “corpus” of text up into “tokens”, or smaller units of text. These units could be single words. Each word could then be calculated and then I could figure out which ones were 85 cents.

I used the Tidy Text Mining with R example to come up with a corpus of words. The example they use is all the words in Jane Austen’s published works, which is a little dated, but sufficient.

library(tidytext)
library(janeaustenr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(stringr)
library(ggplot2)

First, some helper functions:

# converts a letter into its value, such as a to 1, z to 26
letterToInt <- function(x) {
  which(letters == x)
}

# determines if a word has non-letters such as punctuation or numbers
hasNonLetters <- function(x) {
  str_detect(x, "\\W|[[:digit:]]|[[:punct:]]")
}

# calculates the dollar value of a word
valueCalculate <- function(x) {
  x <- tolower(x)                               # makes word all lowercase
  wordsplit <- strsplit(x, split = "")          # splits word into letters
  sum(sapply(unlist(wordsplit), letterToInt))   # calculates word value
}

Then I loaded the Jane Austen corpus and made them into a list of words. This is copied entirely from the Tidy Text Mining web site.

original_books <- austen_books() %>%
  group_by(book) %>%
  mutate(linenumber = row_number(),
         chapter = cumsum(str_detect(text, 
                                     regex("^chapter [\\divxlc]",
                                           ignore_case = TRUE)))) %>%
  ungroup()


tidy_books <- original_books %>%
  unnest_tokens(word, text)

This results in a list of 725,055 tokens such as:

head(tidy_books)
## # A tibble: 6 x 4
##   book                linenumber chapter word       
##   <fct>                    <int>   <int> <chr>      
## 1 Sense & Sensibility          1       0 sense      
## 2 Sense & Sensibility          1       0 and        
## 3 Sense & Sensibility          1       0 sensibility
## 4 Sense & Sensibility          3       0 by         
## 5 Sense & Sensibility          3       0 jane       
## 6 Sense & Sensibility          3       0 austen

Looking just at the words, there were only 14,520 unique words, or tokens.

wordsall <- unique(tidy_books$word)
length(wordsall)
## [1] 14520

Here’s the calculation heavy step, getting rid of all the words that have letters or punctuation. I suppose I could have just removed all the letters or punctuations from words instead…

wordsall <- wordsall[!sapply(wordsall, hasNonLetters)]

After this, all I had to do was calculate the values of the words that were left, and come up with a list of the ones that were equal to 85 cents.

wordsall <- sapply(wordsall, valueCalculate)
words85 <- which(wordsall == 85)
sort(labels(words85))
##   [1] "actress"     "admirably"   "advising"    "affluent"    "aggrandise" 
##   [6] "attainable"  "awakening"   "believing"   "belonging"   "blister"    
##  [11] "blundered"   "bottom"      "bracelets"   "british"     "burning"    
##  [16] "catalogue"   "censure"     "chaperone"   "cherries"    "circular"   
##  [21] "compelled"   "confines"    "contained"   "convince"    "creation"   
##  [26] "deceitful"   "decrying"    "dejection"   "deliberated" "desiring"   
##  [31] "despatched"  "devolve"     "disregard"   "dixons"      "donwell"    
##  [36] "dormant"     "dramatis"    "eltons"      "endurance"   "enrolled"   
##  [41] "entitle"     "espalier"    "events"      "excites"     "falsehood"  
##  [46] "faulty"      "fearless"    "ferrars"     "figures"     "fitting"    
##  [51] "fixedly"     "floors"      "formidable"  "forward"     "fricassee"  
##  [56] "glories"     "gossip"      "grandmamma"  "grieves"     "hawkins"    
##  [61] "hedgerow"    "heightened"  "hollow"      "humming"     "impeaching" 
##  [66] "inaction"    "inclines"    "incurable"   "indelicacy"  "instance"   
##  [71] "intends"     "kitty"       "lengthen"    "lengths"     "library"    
##  [76] "lively"      "maddoxes"    "mansion"     "memento"     "mingling"   
##  [81] "moderated"   "nought"      "objecting"   "obtrude"     "oddities"   
##  [86] "original"    "others"      "palpably"    "parting"     "passing"    
##  [91] "patient"     "pattened"    "peculiar"    "pelisse"     "pembroke"   
##  [96] "poachers"    "presence"    "proceeds"    "rapidly"     "reciting"   
## [101] "reckons"     "remarks"     "remnant"     "replacing"   "repute"     
## [106] "residing"    "resumed"     "retract"     "revived"     "ridiculed"  
## [111] "rubbers"     "sagacity"    "sensible"    "sentence"    "separate"   
## [116] "settled"     "shameful"    "shewing"     "silver"      "sprigged"   
## [121] "starve"      "stealth"     "steeles"     "stepped"     "stirs"      
## [126] "stolen"      "stoop"       "storm"       "streamed"    "styled"     
## [131] "swears"      "thereon"     "tilney"      "tiptoe"      "titles"     
## [136] "tupman"      "unaffected"  "uneasy"      "unfeigned"   "unlocked"   
## [141] "urgent"      "vacation"    "vaulted"     "violence"    "waiving"    
## [146] "watching"    "weddings"    "wildly"      "wright"      "xlviii"     
## [151] "yielding"

The result consisted of 151 words reflecting the beauty of Austen’s writing. What a fun exercise!

Before I go to bed, here’s the distribution of dollar values of words in Austen’s writing with the dotted red line at 85:

ggplot(as_tibble(wordsall), aes(x = value)) + 
  geom_histogram(binwidth = 1) +
  xlab("Dollar Value") +
  ggtitle("Distribution of Words' Dollar Values in Jane Austen's Novels") +
  geom_vline(xintercept = 85, linetype = "dotted", color = "red")