---
title: "Using NLP to Analyze Stand Up Comedy"
output:
  html_document:
    keep_md: true
    toc: true
    toc_float: true
    code_folding: hide
    fig_height: 6
    fig_width: 12
    fig_align: 'center'
    theme: 'journal'
    css: styles.css
---
```{r, echo=FALSE}
knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE)
```
```{r load_libraries, include=FALSE}
library(tidyverse)
library(readr)
library(NLP)
library(tm)
library(stringr) #increases capabilities to work with strings
library(wordcloud) #creates wordclouds
library(pander)
library(readxl)
library(RColorBrewer) #for color palettes used in wordclouds.
library(DT) #for creating the transcript stats data table.
library(sentimentr)
```
Unlike Python, R, Java, and other programming languages, a natural language is one that has evolved naturally through human history (English, Chinese, Spanish...). Natural Language Processing (NLP) is how computers make sense of these natural languages: in other words, how to work with text data. It's a branch of AI that really interests me. In this post, I'll walk you through a basic example of applying NLP techniques to stand-up comedy transcripts using R.
## Background
About five years ago, I was not really into stand-up comedy; for some reason, most performers' sense of humor didn't click with me. That was until I discovered Sarah Silverman. There was something about her comedy that made me laugh to tears. So in this analysis I want to know: why do I like her so much? Is there something in the language she uses that sets her apart? What makes Sarah Silverman's routine stand out?
## Data Gathering
We'll need the transcript of one of her specials. My favorite is A Speck of Dust, but the special analyzed here is Jesus Is Magic; its transcript can be found [here](https://scrapsfromtheloft.com/2017/11/10/sarah-silverman-jesus-is-magic-2005-full-transcript/).
On that same site, I found transcripts for five other famous comedians to compare her against. Here is the full compilation of text data we'll use:
1. Ali Wong ~ [Baby Cobra](ali.txt)
2. Bill Burr ~ [I'm Sorry You Feel That Way](bill.txt)
3. Dave Chappelle ~ [The Age of Spin](dave.txt)
4. Hasan Minhaj ~ [Homecoming King](hasan.txt)
5. Louis C.K. ~ [Oh My God](louis.txt)
6. Sarah Silverman ~ [Jesus is Magic](sarah.txt)
## Data Wrangling
For a computer to read a sentence such as "Hello? Is anyone there? I thought this was 543 W 2 S", the text needs to be cleaned, tokenized, and put into a matrix. To do this we will:
A. Remove punctuation
B. Remove numbers
C. Lowercase all letters
Our sentence would then look like this: "hello is anyone there i thought this was w s".
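As a quick sanity check, here is a minimal sketch (not evaluated, and not part of the main pipeline) that applies those three steps to the example sentence, using the same `gsub` patterns and `tm::stripWhitespace` that appear in the chunk below:
```{r, eval=FALSE}
# Minimal sketch: the three cleaning steps applied to the example sentence.
example <- "Hello? Is anyone there? I thought this was 543 W 2 S"
example <- gsub(pattern = '\\W', replace = ' ', example) # remove punctuation
example <- gsub(pattern = '\\d', replace = ' ', example) # remove numbers
example <- tolower(example)                              # lowercase everything
stripWhitespace(example)
# "hello is anyone there i thought this was w s"
```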
```{r tidy_data}
#We read each transcript and collapse it into a single string.
ali <- paste(readLines('ali.txt'), collapse = " ")
bill <- paste(readLines('bill.txt'), collapse = " ")
dave <- paste(readLines('dave.txt'), collapse = " ")
hasan <- paste(readLines('hasan.txt'), collapse = " ")
louis <- paste(readLines('louis.txt'), collapse = " ")
sarah <- paste(readLines('sarah.txt'), collapse = " ")
#We remove punctuation.
ali <- gsub(pattern = '\\W', replace = ' ', ali)
bill <- gsub(pattern = '\\W', replace = ' ', bill)
dave <- gsub(pattern = '\\W', replace = ' ', dave)
hasan <- gsub(pattern = '\\W', replace = ' ', hasan)
louis <- gsub(pattern = '\\W', replace = ' ', louis)
sarah <- gsub(pattern = '\\W', replace = ' ', sarah)
#We remove numbers.
ali <- gsub(pattern = '\\d', replace = ' ', ali)
bill <- gsub(pattern = '\\d', replace = ' ', bill)
dave <- gsub(pattern = '\\d', replace = ' ', dave)
hasan <- gsub(pattern = '\\d', replace = ' ', hasan)
louis <- gsub(pattern = '\\d', replace = ' ', louis)
sarah <- gsub(pattern = '\\d', replace = ' ', sarah)
#We lowercase everything.
ali <- tolower(ali)
bill <- tolower(bill)
dave <- tolower(dave)
hasan <- tolower(hasan)
louis <- tolower(louis)
sarah <- tolower(sarah)
#We remove stop words (e.g. and, or, the...)
custom_stopwords <- read.csv("stop-word-list.csv", header = FALSE)
custom_stopwords <- as.character(custom_stopwords$V1)
final_stopwords <- c(custom_stopwords, stopwords())
ali <- removeWords(ali, final_stopwords)
bill <- removeWords(bill, final_stopwords)
dave <- removeWords(dave, final_stopwords)
hasan <- removeWords(hasan, final_stopwords)
louis <- removeWords(louis, final_stopwords)
sarah <- removeWords(sarah, final_stopwords)
#Sarah value needs to have "â" removed.
sarah <- gsub(pattern = 'â', replace = ' ', sarah)
#We remove single letter words.
ali <- gsub(pattern = '\\b[a-z]\\b{1}', replace = ' ', ali)
bill <- gsub(pattern = '\\b[a-z]\\b{1}', replace = ' ', bill)
dave <- gsub(pattern = '\\b[a-z]\\b{1}', replace = ' ', dave)
hasan <- gsub(pattern = '\\b[a-z]\\b{1}', replace = ' ', hasan)
louis <- gsub(pattern = '\\b[a-z]\\b{1}', replace = ' ', louis)
sarah <- gsub(pattern = '\\b[a-z]\\b{1}', replace = ' ', sarah)
#Clean up extra white space.
ali <- stripWhitespace(ali)
bill <- stripWhitespace(bill)
dave <- stripWhitespace(dave)
hasan <- stripWhitespace(hasan)
louis <- stripWhitespace(louis)
sarah <- stripWhitespace(sarah)
```
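The chunk above repeats every step six times, which keeps things explicit but gets verbose. For reference, here is a roughly equivalent sketch (not evaluated, and not used in the rest of the post) that wraps the same steps in one helper; the helper name and the `files` vector are illustrative:
```{r, eval=FALSE}
# Roughly equivalent refactor of the cleaning steps above.
clean_transcript <- function(path, stop_words) {
  text <- paste(readLines(path), collapse = " ")
  text <- gsub(pattern = '\\W', replace = ' ', text)          # remove punctuation
  text <- gsub(pattern = '\\d', replace = ' ', text)          # remove numbers
  text <- tolower(text)                                       # lowercase everything
  text <- removeWords(text, stop_words)                       # drop stop words
  text <- gsub(pattern = 'â', replace = ' ', text)            # strip stray encoding character
  text <- gsub(pattern = '\\b[a-z]\\b', replace = ' ', text)  # drop single-letter tokens
  stripWhitespace(text)
}
files <- c('ali.txt', 'bill.txt', 'dave.txt', 'hasan.txt', 'louis.txt', 'sarah.txt')
cleaned <- lapply(files, clean_transcript, stop_words = final_stopwords)
```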
```{r}
#We put our new variables into a table
Comedian <- c("Ali Wong", "Bill Burr", 'Dave Chappelle', 'Hasan Minhaj', 'Louis C.K.', 'Sarah Silverman')
Transcript <- c(ali, bill, dave, hasan, louis, sarah)
clean_transcripts <- data.frame(Comedian, Transcript, stringsAsFactors = FALSE)
#We split the transcript into separate words
ali_bag <- str_split(ali, pattern = "\\s+")
bill_bag <- str_split(bill, pattern = "\\s+")
dave_bag <- str_split(dave, pattern = "\\s+")
hasan_bag <- str_split(hasan, pattern = "\\s+")
louis_bag <- str_split(louis, pattern = "\\s+")
sarah_bag <- str_split(sarah, pattern = "\\s+")
#We unlist the variables above to make a bag of words (single char vectors)
ali_bag <- unlist(ali_bag)
bill_bag <- unlist(bill_bag)
dave_bag <- unlist(dave_bag)
hasan_bag <- unlist(hasan_bag)
louis_bag <- unlist(louis_bag)
sarah_bag <- unlist(sarah_bag)
```
## Top 30 Words of Each Comedian
We'll use word clouds to visualize the 30 words each comedian used most in their performance.
```{r plot_data}
par(mfrow = c(1,2))
wordcloud(ali, max.words = 30, random.order = FALSE, colors = rainbow(2))
wordcloud(bill, max.words = 30, random.order = FALSE, colors = rainbow(2))
wordcloud(dave, max.words = 30, random.order = FALSE, colors = rainbow(2))
wordcloud(hasan, max.words = 30, random.order = FALSE, colors = rainbow(2))
wordcloud(louis, max.words = 30, random.order = FALSE, colors = rainbow(2))
wordcloud(sarah, max.words = 30, random.order = FALSE, colors = rainbow(2))
```
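If you prefer exact counts to a cloud, a small helper like the sketch below (not evaluated here; the function name is illustrative) would list the most frequent terms from any of the word bags built earlier:
```{r, eval=FALSE}
# Illustrative helper: tabulate the most frequent tokens in a bag of words.
top_words <- function(bag, n = 30) {
  bag <- bag[bag != ""]                        # drop empty tokens left by splitting
  head(sort(table(bag), decreasing = TRUE), n) # n most frequent words with counts
}
top_words(sarah_bag)
```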
#### Findings
1. "God" is a word used by both female comedians but not by the rest of the male comedians.
2. All comedians use a lot of profanity.
3. It's interesting Sarah Silverman uses a lot of adjectives (positive ones) as well as negative words, such as dying, die, bling, crying; maybe this reflect her big use of sarcasm which I think it's funny.
4. It looks like she talks about some controversies such as god, kids, death, sex, disablities. I like people who talks about taboos, especially when it's in a funny way.
## Number of Words
Here the goal is to find how big a vocabulary each comedian has. We do this by counting the number of unique words they use.
```{r}
ali_unique <- n_distinct(ali_bag)
bill_unique <- n_distinct(bill_bag)
dave_unique <- n_distinct(dave_bag)
hasan_unique <- n_distinct(hasan_bag)
louis_unique <- n_distinct(louis_bag)
sarah_unique <- n_distinct(sarah_bag)
ali_total <- length(ali_bag)
bill_total <- length(bill_bag)
dave_total <- length(dave_bag)
hasan_total <- length(hasan_bag)
louis_total <- length(louis_bag)
sarah_total <- length(sarah_bag)
# From the internet, we find how long each performance lasts (in minutes).
ali_time <- 60
bill_time <- 80
dave_time <- 67
hasan_time <- 73
louis_time <- 58
sarah_time <- 72
# Now we can determine each comedian's speed (words per minute).
ali_speed <- as.integer(ali_total/ali_time)
bill_speed <- as.integer(bill_total/bill_time)
dave_speed <- as.integer(dave_total/dave_time)
hasan_speed <- as.integer(hasan_total/hasan_time)
louis_speed <- as.integer(louis_total/louis_time)
sarah_speed <- as.integer(sarah_total/sarah_time)
# Now let's put all of that in a table.
Unique_words <- c(ali_unique, bill_unique, dave_unique, hasan_unique, louis_unique, sarah_unique)
Total_words <- c(ali_total, bill_total, dave_total, hasan_total, louis_total, sarah_total)
Time <- c(ali_time, bill_time, dave_time, hasan_time, louis_time, sarah_time)
Speed <- c(ali_speed, bill_speed, dave_speed, hasan_speed, louis_speed, sarah_speed)
transcripts_stats <- data.frame(Comedian, Unique_words, Total_words, Time, Speed)
datatable(transcripts_stats)
```
```{r}
transcripts_stats <- transcripts_stats %>%
mutate(highlight_sarah = ifelse(Comedian == 'Sarah Silverman', T, F))
# Note: par(mfrow) only affects base graphics, so the two ggplot figures print separately.
ggplot(transcripts_stats, aes(x = reorder(Comedian, -Unique_words), y = Unique_words)) +
  geom_bar(stat = "identity", width = 0.5, aes(fill = highlight_sarah)) +
  labs(title = "Number of Unique Words") +
  theme_classic() +
  theme(axis.text.x = element_text(angle = 65, vjust = 0.6),
        legend.position = 'none') +
  ylab('') +
  xlab('Comedian') +
  geom_hline(yintercept = mean(Unique_words), linetype = "dashed", color = "black")
ggplot(transcripts_stats, aes(x = reorder(Comedian, -Speed), y = Speed)) +
  geom_bar(stat = "identity", width = 0.5, aes(fill = highlight_sarah)) +
  labs(title = "Number of Words per Minute") +
  theme_classic() +
  theme(axis.text.x = element_text(angle = 65, vjust = 0.6),
        legend.position = 'none') +
  ylab('') +
  xlab('Comedian') +
  geom_hline(yintercept = mean(Speed), linetype = "dashed", color = "black")
```
#### Findings
1. Sarah doesn't use a broad vocabulary (840 unique words).
2. She also uses about 14 words per minute (counting the cleaned, non-stop words), which makes her the slowest speaker of the group.
These two findings could help explain why I like Sarah: it makes sense that, as an English-as-a-Second-Language learner, I would gravitate toward comedians who use a simple vocabulary and speak more slowly.
## Amount of Profanity
Now we'll look at the amount of profanity the comedians use.
```{r}
ali_bad <- length(which(ali_bag == "fuck")) + length(which(ali_bag == "shit")) + length(which(ali_bag == "fucking"))
bill_bad <- length(which(bill_bag == "fuck")) + length(which(bill_bag == "shit")) + length(which(bill_bag == "fucking"))
dave_bad <- length(which(dave_bag == "fuck")) + length(which(dave_bag == "shit")) + length(which(dave_bag == "fucking"))
hasan_bad <- length(which(hasan_bag == "fuck")) + length(which(hasan_bag == "shit")) + length(which(hasan_bag == "fucking"))
louis_bad <- length(which(louis_bag == "fuck")) + length(which(louis_bag == "shit")) + length(which(louis_bag == "fucking"))
sarah_bad <- length(which(sarah_bag == "fuck")) + length(which(sarah_bag == "shit")) + length(which(sarah_bag == "fucking"))
Swear_words <- c(ali_bad, bill_bad, dave_bad, hasan_bad, louis_bad, sarah_bad)
transcripts_stats <- data.frame(Comedian, Unique_words, Total_words, Time, Swear_words)
datatable(transcripts_stats)
```
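As a side note, the six hand-written sums above could be collapsed into a small helper. Here is an equivalent sketch (not evaluated here) that counts the same three words:
```{r, eval=FALSE}
# Equivalent, more compact way to count the same three swear words.
swears <- c("fuck", "fucking", "shit")
bags <- list(ali_bag, bill_bag, dave_bag, hasan_bag, louis_bag, sarah_bag)
Swear_words <- sapply(bags, function(bag) sum(bag %in% swears))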
```{r}
transcripts_stats <- transcripts_stats %>%
mutate(highlight_sarah = ifelse(Comedian == 'Sarah Silverman', T, F))
ggplot(transcripts_stats,
aes(x = Total_words,
    y = Swear_words,
label = Comedian)) +
geom_point(aes(
color = highlight_sarah)) +
geom_text(aes(label = Comedian,
color = highlight_sarah),
hjust = -0.09,
vjust = 0) +
theme_bw() +
ylab('Swear Words') +
xlab('Total Words') +
ggtitle('Amount of Profanity') +
theme(legend.position = 'none')
```
#### Findings
1. Sarah Silverman doesn't use a lot of profanity in her performances (1 swear word every 94 words).
This makes sense, since I don't find a swear word funny just because it's a swear word. If used too much, I may even find it distracting.
## Sentiment Analysis
Using NLP techniques, we'll discover how positive or negative the comedians are in their performances as well as how opinionated they are.
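`sentimentr` scores text at the sentence level on a positive/negative polarity scale (note that, because punctuation was stripped during cleaning, each transcript is effectively scored as one long sentence here). A tiny illustration, not evaluated in this post:
```{r, eval=FALSE}
# Each row of the result holds a sentence-level polarity score;
# positive values lean positive, negative values lean negative.
sentiment(c("I love this so much.", "This is terrible."))
```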
```{r}
ali_score <- sentiment(ali)
bill_score <- sentiment(bill)
dave_score <- sentiment(dave)
hasan_score <- sentiment(hasan)
louis_score <- sentiment(louis)
sarah_score <- sentiment(sarah)
Sentiment <- c(ali_score$sentiment, bill_score$sentiment,dave_score$sentiment,hasan_score$sentiment,louis_score$sentiment,sarah_score$sentiment)
transcripts_stats <- data.frame(Comedian, Unique_words, Total_words, Time, Swear_words, Sentiment)
transcripts_stats <- transcripts_stats %>%
mutate(highlight_sarah = ifelse(Comedian == 'Sarah Silverman', T, F))
datatable(transcripts_stats)
```
```{r}
ggplot(transcripts_stats,
aes(x = Comedian,
y = Sentiment,
label= Sentiment)) +
geom_point(stat = 'identity',
aes(col = highlight_sarah), size=6) +
geom_text(color="white", size=2) +
labs(title="Sentiment Analysis") +
ylab("Negative Sentiment <-----------> Positive Sentiment") +
geom_hline(yintercept = mean(Sentiment), linetype = "dashed", color = "red") +
coord_flip() +
theme_bw() +
theme(legend.position = 'none')
```
#### Findings
1. Sarah scores positive in the sentiment analysis, but not quite as high as Hasan Minhaj.
## Conclusions
After reviewing all the visualizations, we conclude that a comedian who meets the following criteria will probably be funny to me:
1. Talks about controversial topics such as god and religion, death, sex, race, and disabilities.
2. Uses sarcastic language (a combination of negative and positive words/topics).
3. Doesn't use a broad vocabulary (relatively low count of unique words).
4. Doesn't talk too fast (relatively low number of words per minute).
5. Doesn't use a lot of profanity in the performance.
6. Has a generally positive sentiment score.
So these are the results (following the order of the list above):
1. 1st place: Louis, 2nd place: Ali
2. 1st place: Louis, 2nd place: Ali
3. 1st place: Louis, 2nd place: Ali
4. 1st place: Louis, 2nd place: Ali
5. 1st place: Ali, 2nd place: Louis
6. 1st place: Louis, 2nd place: Hasan
Therefore, we conclude that a Louis C.K. performance might be a great fit.