Goodreads Analysis

I just finished my 2017 Reading Challenge on Goodreads. My goal was to read 15 books this year. Poking around the site, I discovered that I could export my data, so I decided to see what my reading habits looked like. Since I was doing this for myself anyway, I looked at my wife’s data too.

Dataset

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyr)
library(ggplot2)
library(lubridate)
## 
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
## 
##     date
library(hrbrthemes)
books <- read.csv("../datasets/goodreads.csv", colClasses = "character")
books_wife <- read.csv("../datasets/goodreads_wife.csv", colClasses = "character")
# as_tibble() replaces the deprecated tbl_df()
books <- as_tibble(books)
books_wife <- as_tibble(books_wife)
books <- mutate(books, reader = "me")
books_wife <- mutate(books_wife, reader = "wife")
books <- full_join(books, books_wife)
## Joining, by = c("Book.Id", "Title", "Author", "Author.l.f", "Additional.Authors", "ISBN", "ISBN13", "My.Rating", "Average.Rating", "Publisher", "Binding", "Number.of.Pages", "Year.Published", "Original.Publication.Year", "Date.Read", "Date.Added", "Bookshelves", "Bookshelves.with.positions", "Exclusive.Shelf", "My.Review", "Spoiler", "Private.Notes", "Read.Count", "Recommended.For", "Recommended.By", "Owned.Copies", "Original.Purchase.Date", "Original.Purchase.Location", "Condition", "Condition.Description", "BCID", "reader")

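Each row is one book, with variables such as Title and Author as well as Number.of.Pages, Original.Publication.Year, Date.Read, My.Rating, and Average.Rating. Everything was read in as character, so the column types needed fixing before any analysis.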

str(books)
## Classes 'tbl_df', 'tbl' and 'data.frame':    491 obs. of  32 variables:
##  $ Book.Id                   : chr  "3852882" "31920777" "30653783" "840" ...
##  $ Title                     : chr  "Your Hate Mail Will Be Graded: A Decade of Whatever, 1998-2008" "American Kingpin: The Epic Hunt for the Criminal Mastermind Behind the Silk Road" "Smart Baseball: The Story Behind the Old Stats That Are Ruining the Game, the New Ones That Are Running It, and"| __truncated__ "The Design of Everyday Things" ...
##  $ Author                    : chr  "John Scalzi" "Nick Bilton" "Keith Law" "Donald A. Norman" ...
##  $ Author.l.f                : chr  "Scalzi, John" "Bilton, Nick" "Law, Keith" "Norman, Donald A." ...
##  $ Additional.Authors        : chr  "" "" "Tbd" "" ...
##  $ ISBN                      : chr  "=\"1596062118\"" "=\"1591848148\"" "=\"0062490222\"" "=\"0465067107\"" ...
##  $ ISBN13                    : chr  "=\"9781596062115\"" "=\"9781591848141\"" "=\"9780062490223\"" "=\"9780465067107\"" ...
##  $ My.Rating                 : chr  "4" "0" "0" "0" ...
##  $ Average.Rating            : chr  "3.67" "4.36" "4.10" "4.18" ...
##  $ Publisher                 : chr  "Subterranean" "Portfolio" "Harper Collins" "Basic Books" ...
##  $ Binding                   : chr  "Hardcover" "Hardcover" "Hardcover" "Paperback" ...
##  $ Number.of.Pages           : chr  "368" "352" "304" "240" ...
##  $ Year.Published            : chr  "2008" "2017" "2017" "2002" ...
##  $ Original.Publication.Year : chr  "2008" "2017" "2017" "1988" ...
##  $ Date.Read                 : chr  "2017/06/17" "" "" "" ...
##  $ Date.Added                : chr  "2017/06/17" "2017/06/16" "2017/06/10" "2017/06/10" ...
##  $ Bookshelves               : chr  "" "to-read" "to-read" "to-read" ...
##  $ Bookshelves.with.positions: chr  "" "to-read (#16)" "to-read (#15)" "to-read (#14)" ...
##  $ Exclusive.Shelf           : chr  "read" "to-read" "to-read" "to-read" ...
##  $ My.Review                 : chr  "" "" "" "" ...
##  $ Spoiler                   : chr  "" "" "" "" ...
##  $ Private.Notes             : chr  "" "" "" "" ...
##  $ Read.Count                : chr  "1" "0" "0" "0" ...
##  $ Recommended.For           : chr  "" "" "" "" ...
##  $ Recommended.By            : chr  "" "" "" "" ...
##  $ Owned.Copies              : chr  "0" "0" "0" "0" ...
##  $ Original.Purchase.Date    : chr  "" "" "" "" ...
##  $ Original.Purchase.Location: chr  "" "" "" "" ...
##  $ Condition                 : chr  "" "" "" "" ...
##  $ Condition.Description     : chr  "" "" "" "" ...
##  $ BCID                      : chr  "" "" "" "" ...
##  $ reader                    : chr  "me" "me" "me" "me" ...

Data Cleaning

Some boring data cleaning code…

# Factor author names
books$Author <- factor(books$Author)

# Factor bookshelves
books$Exclusive.Shelf <- factor(books$Exclusive.Shelf)

# Numeric ratings
books$My.Rating <- as.numeric(books$My.Rating)
books$Average.Rating <- as.numeric(books$Average.Rating)

# Number of Pages
books$Number.of.Pages <- as.integer(books$Number.of.Pages)

# Years
books$Year.Published <- as.integer(books$Year.Published)
books$Original.Publication.Year <- as.integer(books$Original.Publication.Year)

# Dates
books$Date.Added <- ymd(books$Date.Added)
books$Date.Read <- ymd(books$Date.Read)
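
One thing the cleaning code above leaves alone is the ISBN columns, which the str() output shows wrapped as ="...". I did not need them for this analysis, but a minimal sketch of stripping the wrapper could look like the following. The helper name clean_isbn is made up for illustration, and treating ="" as a missing ISBN is an assumption about the export rather than something checked here.

```r
# Toy values in the same shape as the str() output; with the real data
# these would be books$ISBN and books$ISBN13.
isbn_raw <- c("=\"1596062118\"", "=\"1591848148\"", "=\"\"")

# Hypothetical helper: drop the leading =" and the trailing ", then turn
# empty strings into NA so missing ISBNs are explicit.
clean_isbn <- function(x) {
  x <- gsub("^=\"|\"$", "", x)
  ifelse(x == "", NA_character_, x)
}

isbn_clean <- clean_isbn(isbn_raw)
isbn_clean
```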

Books Read vs Added

I’ve recorded 150 books as read, and my wife has recorded 302.

# Count books on each exclusive shelf, per reader
books %>% select(Exclusive.Shelf, reader) %>% group_by(reader, Exclusive.Shelf) %>%
  summarize(n = n())
## Source: local data frame [6 x 3]
## Groups: reader [?]
## 
##   reader   Exclusive.Shelf     n
##    <chr>            <fctr> <int>
## 1     me currently-reading     2
## 2     me              read   150
## 3     me           to-read    16
## 4   wife currently-reading     1
## 5   wife              read   302
## 6   wife           to-read    20

Dates Added and Read

I have only been adding to this list off and on since joining Goodreads. The plot below shows, for each of us, how many books were added and how many were recorded as read in each year.

tmp1 <- books %>% select(Book.Id, Date.Added, reader) %>% mutate(action = "added") %>% rename(year = Date.Added)
tmp2 <- books %>% select(Book.Id, Date.Read, reader) %>% mutate(action = "read") %>% rename(year = Date.Read)
bind_rows(tmp1, tmp2) %>% filter(!is.na(year)) %>%
  ggplot(aes(x = year, fill = action)) +
  geom_histogram(binwidth = 365, position=position_dodge()) +
  ggtitle("Books Added and Read per Year") +
  ylab("number of books") +
  xlab("year") +
  theme_ipsum() +
  facet_grid(reader ~ .)

It looks like I signed up for Goodreads in 2012 and started adding books to my list of read books. If I couldn’t remember when I read the book, I left the date read field blank. My wife started in 2009 and had a similar pattern of behavior. After this initial flurry of adding books, I recorded little activity on the website until about 2014-2015 when I started using Goodreads in earnest. This graph doesn’t really represent my reading history since there’s a lot of missing data, but it does represent pretty well how I’ve used this website.
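
To put a number on the missing data mentioned above, something like the sketch below would count how many books on the read shelf have no Date.Read, per reader. The toy data frame is made up so the snippet runs on its own; with the real data, books would take its place.

```r
library(dplyr)

# Made-up rows using the same column names as `books`.
toy <- data.frame(
  reader = c("me", "me", "me", "wife", "wife"),
  Exclusive.Shelf = c("read", "read", "to-read", "read", "read"),
  Date.Read = as.Date(c("2017-06-17", NA, NA, NA, "2016-01-02")),
  stringsAsFactors = FALSE
)

# Among books marked as read, how many lack a read date?
missing_dates <- toy %>%
  filter(Exclusive.Shelf == "read") %>%
  group_by(reader) %>%
  summarize(n_read = n(),
            n_undated = sum(is.na(Date.Read)),
            share_undated = mean(is.na(Date.Read)))
missing_dates
```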

My Ratings

I wondered about the ratings we had given books.

# Unrated books got a rating of zero
books$My.Rating <- ifelse(books$My.Rating == 0, NA, books$My.Rating)

books %>% filter(!is.na(My.Rating)) %>%
  ggplot(aes(x = My.Rating, fill = reader)) + 
  geom_histogram(binwidth = 1, position=position_dodge()) +
  ggtitle("Our Ratings") +
  xlab("Rating") + 
  theme_ipsum()

The ratings are heavily skewed toward 4 and 5 stars. In fact, both of our median ratings were 4.

books %>% group_by(reader) %>% summarize(median(My.Rating, na.rm = T))
## # A tibble: 2 × 2
##   reader `median(My.Rating, na.rm = T)`
##    <chr>                          <dbl>
## 1     me                              4
## 2   wife                              4
books %>% 
  ggplot(aes(x=Average.Rating, fill = reader)) + 
  geom_histogram(binwidth = 0.1) + 
  ggtitle("Community Ratings of Books that We've Added") +
  xlab("Rating") +
  theme_ipsum()

The median community rating across all of the books we added was actually pretty similar to our own medians.

median(books$Average.Rating)
## [1] 3.97
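
The median above pools both of our lists. A per-reader comparison of our own ratings against the community ratings could be sketched as below. The numbers in toy are invented for illustration and are not taken from the export.

```r
library(dplyr)

# Invented ratings, same column names as `books`.
toy <- data.frame(
  reader = c("me", "me", "me", "wife", "wife", "wife"),
  My.Rating = c(4, 5, NA, 3, 4, 4),
  Average.Rating = c(3.9, 4.1, 3.5, 4.0, 3.8, 4.2),
  stringsAsFactors = FALSE
)

# One row per reader: median of own ratings next to the median
# community rating of the books on that reader's list.
medians <- toy %>%
  group_by(reader) %>%
  summarize(my_median = median(My.Rating, na.rm = TRUE),
            community_median = median(Average.Rating, na.rm = TRUE))
medians
```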

Difference between My Ratings and the Masses

Were there books that I enjoyed way more or less than the community? I didn’t have the distribution of the ratings for each book, but I did have the mean and could calculate the difference between the community average rating and mine.

books %>% select(Title, Author, My.Rating, Average.Rating, reader) %>% 
  mutate(dRating = My.Rating - Average.Rating) %>% 
  filter(!is.na(dRating)) %>% arrange(desc(dRating)) %>%
  ggplot(aes(x = dRating, fill = reader)) +
  geom_histogram(binwidth = 0.25, position = "identity", alpha = 0.5) +
  ggtitle("Difference between My Ratings and Community Ratings") +
  xlab("My Rating - Community Rating") +
  theme_ipsum()

Here are the 10 books that we rated furthest above the community average.

books %>% select(Title, Author, My.Rating, Average.Rating) %>% mutate(dRating = My.Rating - Average.Rating) %>% 
  filter(!is.na(dRating)) %>% arrange(desc(dRating))
## # A tibble: 440 × 5
##                                                                          Title
##                                                                          <chr>
## 1                        FOUND IT! Introducing Geocaching to Kids and Families
## 2                                                      A Hologram for the King
## 3                                                      Twilight (Twilight, #1)
## 4                                   The Meanings of Craft Beer (Kindle Single)
## 5       The Fortune Cookie Chronicles: Adventures in the World of Chinese Food
## 6  Three Cups of Tea: One Man's Mission to Promote Peace ... One School at a T
## 7                                                       The Fantastic Mr. Wani
## 8                                                                   After Dark
## 9                                                After You (Me Before You, #2)
## 10                                                Breaking Dawn (Twilight, #4)
## # ... with 430 more rows, and 4 more variables: Author <fctr>,
## #   My.Rating <dbl>, Average.Rating <dbl>, dRating <dbl>

And the ones we rated furthest below the community average.

books %>% select(Title, Author, My.Rating, Average.Rating) %>% mutate(dRating = My.Rating - Average.Rating) %>% 
  filter(!is.na(dRating)) %>% arrange(dRating)
## # A tibble: 440 × 5
##                                                                          Title
##                                                                          <chr>
## 1                                            Under Pressure: Cooking Sous Vide
## 2                           A Wind in the Door (A Wrinkle in Time Quintet, #2)
## 3                           Be Different: Adventures of a Free-Range Aspergian
## 4                                        Angels & Demons  (Robert Langdon, #1)
## 5                                             Every Day is an Atheist Holiday!
## 6                                                           Last Chance Saloon
## 7                                                      Grey (Fifty Shades, #4)
## 8                       The Orchid Thief: A True Story of Beauty and Obsession
## 9                                                         The Mermaid's Sister
## 10 Apps for Autism: An Essential Guide to Over 200 Effective Apps for Improvin
## # ... with 430 more rows, and 4 more variables: Author <fctr>,
## #   My.Rating <dbl>, Average.Rating <dbl>, dRating <dbl>
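
In both listings the long titles push dRating out of the printed tibble, and there is no way to tell whose rating each row is. A sketch that keeps reader, puts the narrow columns first, and limits the output to 10 rows might look like this. The toy rows are made up; the pipeline mirrors the one above.

```r
library(dplyr)

# Made-up rows using the same column names as `books`.
toy <- data.frame(
  Title = c("Book A", "Book B", "Book C", "Book D"),
  reader = c("me", "wife", "me", "wife"),
  My.Rating = c(5, 2, NA, 4),
  Average.Rating = c(3.5, 4.0, 4.2, 3.9),
  stringsAsFactors = FALSE
)

# Same difference as above, with the narrow columns first so they
# are not pushed out by long titles.
rating_gaps <- toy %>%
  mutate(dRating = My.Rating - Average.Rating) %>%
  filter(!is.na(dRating)) %>%
  select(reader, dRating, Title) %>%
  arrange(desc(dRating))

head(rating_gaps, 10)  # rated furthest above the community
tail(rating_gaps, 10)  # rated furthest below the community
```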

Publication Date

How old were the books I’ve been reading?

books %>% filter(!is.na(Original.Publication.Year)) %>%
  ggplot(aes(x = Original.Publication.Year, fill = reader)) + 
  geom_histogram(binwidth = 5, position = "identity", alpha = 0.5) +
  ggtitle("Original Publication Year") +
  xlab("Publication Year") + 
  theme_ipsum()

These two 17th-century books were Shakespeare plays I had read before going to see them live (Twelfth Night, Othello). Restricting the plot to books originally published after 1900, which also drops a handful of 19th-century novels, led to this admittedly still skewed distribution.

books %>% filter(Original.Publication.Year > 1900) %>%
  ggplot(aes(x = Original.Publication.Year, fill = reader)) + 
  geom_histogram(binwidth = 2, position = "identity", alpha = 0.5) +
  xlab("Original Publication Year") +
  ggtitle("Original Publication Year (Books Written since 1900)") +
  theme_ipsum()

The distribution is heavily skewed toward recent books, with a median original publication year of 2010.

median(books$Original.Publication.Year, na.rm = T)
## [1] 2010

The older books, those originally published before 1980, were:

books %>% filter(Original.Publication.Year < 1980) %>% 
  arrange(Original.Publication.Year) %>% 
  select(Original.Publication.Year, Title) %>% 
  as.data.frame()
##    Original.Publication.Year
## 1                       1601
## 2                       1603
## 3                       1811
## 4                       1813
## 5                       1814
## 6                       1817
## 7                       1817
## 8                       1847
## 9                       1868
## 10                      1919
## 11                      1937
## 12                      1937
## 13                      1937
## 14                      1951
## 15                      1955
## 16                      1962
## 17                      1970
## 18                      1970
## 19                      1971
## 20                      1971
## 21                      1973
## 22                      1974
## 23                      1975
## 24                      1978
## 25                      1979
##                                                                    Title
## 1                                                          Twelfth Night
## 2                                                                Othello
## 3                                                  Sense and Sensibility
## 4                                                    Pride and Prejudice
## 5                                                         Mansfield Park
## 6                                                       Northanger Abbey
## 7                                                             Persuasion
## 8                                                              Jane Eyre
## 9                                        Little Women (Little Women, #1)
## 10                                                                 South
## 11                                                            The Hobbit
## 12                                          Their Eyes Were Watching God
## 13                                                       Of Mice and Men
## 14                                                The Catcher in the Rye
## 15                   The Lord of the Rings (The Lord of the Rings, #1-3)
## 16                                  A Wrinkle in Time (Time Quintet, #1)
## 17 Bury My Heart at Wounded Knee: An Indian History of the American West
## 18                         Frog and Toad Are Friends (Frog and Toad, #1)
## 19                                         Encounters with the Archdruid
## 20                   Suzuki Violin School, Volume 1: Piano Accompaniment
## 21                    A Wind in the Door (A Wrinkle in Time Quintet, #2)
## 22                                               Where the Sidewalk Ends
## 23                                       Danny the Champion of the World
## 24                                                         Once a Runner
## 25                                        Wind/Pinball: Two Early Novels

Number of Pages (and over time)

The last thing I looked at was the number of pages we’ve read since we began recording in earnest.

books %>% mutate(Year_read = factor(year(Date.Read))) %>%
  filter(!is.na(Year_read)) %>%
  group_by(Year_read, reader) %>% 
  summarize(npages = sum(Number.of.Pages, na.rm = T)) %>%
  ggplot(aes(x = Year_read, y = npages, fill = reader)) +
  geom_bar(stat="identity", position = "identity", alpha = 0.5) + 
  xlab("Year Read") + 
  ylab("Number of Pages") +
  ggtitle("Pages Recorded as Read per Year") +
  theme_ipsum() 

This year (2017) has been a big reading year and it’s not even half over yet. I think the summer reading program from my library and the Goodreads Reading Challenge have been big reasons that I have read so much this year.
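
One caveat on the page totals: sum(Number.of.Pages, na.rm = T) quietly counts a book with no page count as zero pages. A sketch that reports the number of books and the number of missing page counts next to each yearly total could look like this. The toy rows are invented for illustration.

```r
library(dplyr)
library(lubridate)

# Invented rows using the same column names as `books`.
toy <- data.frame(
  reader = c("me", "me", "me", "wife"),
  Date.Read = ymd(c("2016-03-01", "2017-01-15", "2017-05-20", "2017-02-02")),
  Number.of.Pages = c(300L, 250L, NA, 400L),
  stringsAsFactors = FALSE
)

# Per reader and year: books read, pages summed, and how many books
# contributed nothing to the sum because the page count was missing.
per_year <- toy %>%
  filter(!is.na(Date.Read)) %>%
  group_by(reader, Year_read = year(Date.Read)) %>%
  summarize(nbooks = n(),
            npages = sum(Number.of.Pages, na.rm = TRUE),
            n_missing_pages = sum(is.na(Number.of.Pages))) %>%
  ungroup()
per_year
```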

Date Published vs. Date Read

books %>% select(Date.Read, Original.Publication.Year, reader) %>% 
  filter(!is.na(Date.Read) & !is.na(Original.Publication.Year)) %>%
  ggplot(aes(x = Date.Read, y = Original.Publication.Year, color = reader)) +
  geom_point(alpha = 0.2) +
  theme_ipsum() +
  labs(title = "Date Read vs. Original Publication Year",
       x = "Date Read",
       y = "Original Publication Year")

Shoot, those Shakespeare plays really mess with the plot. I don’t think even using a log scale would help. I filtered them out to get the plot below.

library(scales)
books %>% select(Date.Read, Original.Publication.Year, reader) %>% 
  filter(!is.na(Date.Read) & !is.na(Original.Publication.Year)) %>%
  filter(Original.Publication.Year > 1700) %>% 
  ggplot(aes(x = Date.Read, y = Original.Publication.Year, color = reader)) +
  geom_point(alpha = 0.2) +
  theme_ipsum() +
  labs(title = "Date Read vs. Original Publication Year (Books after 1700)",
       x = "Date Read",
       y = "Original Publication Year")
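
On the log-scale remark above, a quick bit of arithmetic shows why it would not help: on a log10 scale the years 1601 and 2016 are only about 0.1 apart, so the recent books would stay bunched together. One alternative, which I did not use here, is to plot a book's age when it was read on a log scale. The years below are illustrative values and read_year is fixed at 2017 for simplicity.

```r
# Illustrative publication years; read_year is fixed for simplicity.
pub_year <- c(1601, 1813, 1937, 2010, 2016)
read_year <- 2017

# Raw years span only about 0.1 on a log10 scale.
year_spread <- diff(range(log10(pub_year)))

# Age at reading spans more than two orders of magnitude.
age <- read_year - pub_year
age_spread <- diff(range(log10(age)))

c(year_spread = year_spread, age_spread = age_spread)
```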