
Commit eb43cc7

committed
draft 2 slides
1 parent bf51cf1 commit eb43cc7

10 files changed

Lines changed: 4132 additions & 241 deletions

File tree

README.Rmd

Lines changed: 1 addition & 4 deletions
@@ -21,10 +21,7 @@ knitr::opts_chunk$set(
2121
### Install Packages
2222

2323
```
24-
install.packages("janeaustenr")
25-
install.packages("tidyverse")
26-
install.packages("tidytext")
27-
install.packages("wordcloud2")
24+
install.packages(c("tidyverse", "tidytext", "janeaustenr", "wordcloud2"))
2825
```
2926
### Process
3027

README.md

Lines changed: 8 additions & 12 deletions
@@ -1,21 +1,17 @@
11
README
22
================
3-
2020-10-30
3+
2021-04-13
44

55
<!-- README.md is generated from README.Rmd. Please edit that file -->
66

77
## Text Mining: Sentiment analysis and word clouds
88

99
<!-- badges: start -->
10-
1110
<!-- badges: end -->
1211

1312
### Install Packages
1413

15-
install.packages("janeaustenr")
16-
install.packages("tidyverse")
17-
install.packages("tidytext")
18-
install.packages("wordcloud2")
14+
install.packages(c("tidyverse", "tidytext", "janeaustenr", "wordcloud2"))
1915

2016
### Process
2117

@@ -33,14 +29,14 @@ README
3329

3430
### Resources
3531

36-
- [Tidytext package](https://juliasilge.github.io/tidytext/)
37-
- Book: [Text Mining with R](https://www.tidytextmining.com/) by Silge
32+
- [Tidytext package](https://juliasilge.github.io/tidytext/)
33+
- Book: [Text Mining with R](https://www.tidytextmining.com/) by Silge
3834
and Robinson
39-
- Data Wrangling with dplyr:
40-
([video](https://juliasilge.github.io/tidytext/) |
35+
- Data Wrangling with dplyr:
36+
([video](https://juliasilge.github.io/tidytext/) \|
4137
[workshop](https://rfun.library.duke.edu/portfolio/r_flipped/))
42-
- Data Visualization with ggplot2:
43-
([video](https://warpwire.duke.edu/w/80YEAA/) |
38+
- Data Visualization with ggplot2:
39+
([video](https://warpwire.duke.edu/w/80YEAA/) \|
4440
[workshop](https://rfun.library.duke.edu/portfolio/ggplot_workshop/))
4541

4642
![Word Cloud](images/word_cloud.PNG "Word Cloud")
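
The consolidated `install.packages(c(...))` call in the README reinstalls every package each time it runs. A minimal sketch of a gentler variant follows, assuming the same four packages; the helper name `missing_packages()` is hypothetical and not part of this repository. It only installs what is not already in the library, and only in an interactive session.

```r
# Packages named in the README's install step
pkgs <- c("tidyverse", "tidytext", "janeaustenr", "wordcloud2")

# Hypothetical helper: return the subset of `pkgs` that is not yet installed
missing_packages <- function(pkgs) {
  setdiff(pkgs, rownames(installed.packages()))
}

to_install <- missing_packages(pkgs)

# Install only when something is missing, and only when run interactively
if (interactive() && length(to_install) > 0) {
  install.packages(to_install)
}
```

Running this a second time should leave `to_install` empty, so nothing is downloaded again.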

slides/images/tidy1.svg

Lines changed: 3618 additions & 0 deletions

slides/images/tidy2.svg

Lines changed: 80 additions & 0 deletions

slides/index.Rmd

Lines changed: 105 additions & 112 deletions
@@ -1,5 +1,5 @@
11
---
2-
title: "<br>Text Mining"
2+
title: "<br>Sentiment Analysis"
33
subtitle: "R case study"
44
author: "John Little"
55
institute: "Cntr for Data & Viz"
@@ -38,9 +38,25 @@ tagList(rmarkdown::html_dependency_font_awesome())
3838
# )
3939
```
4040

41+
## Packages for today
42+
43+
_Sentiment Analysis: R case study_
44+
45+
.bg-washed-blue.b--navy.l-10.t-20pct.w-90.ba.bw2.br3.shadow-5.ph4.mt5[
46+
`install.packages(c("tidyverse", "tidytext",`<br> &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;`"janeaustenr", "wordcloud2"))`
47+
]
48+
49+
.l-60.t-80pct.fl-w-third[
50+
```{r echo=FALSE}
51+
countdown::countdown(20)
52+
```
53+
]
54+
55+
---
56+
4157
## Duke University: Land Acknowledgement
4258

43-
.f4[I would like to take a moment to honor the land in Durham, NC. Duke University sits on the ancestral lands of the Shakori, Eno and Catawba people. This institution of higher education is built on land stolen from those peoples. These tribes were here before the colonizers arrived. Additionally this land has borne witness to over 400 years of the enslavement, torture, and systematic mistreatment of African people and their descendants. Recognizing this history is an honest attempt to break out beyond persistent patterns of colonization and to rewrite the erasure of Indigenous and Black peoples. There is value in acknowledging the history of our occupied spaces and places. I hope we can glimpse an understanding of these histories by recognizing the origins of collective journeys.]
59+
.f3[I would like to take a moment to honor the land in Durham, NC. Duke University sits on the ancestral lands of the Shakori, Eno and Catawba people. This institution of higher education is built on land stolen from those peoples. These tribes were here before the colonizers arrived. Additionally this land has borne witness to over 400 years of the enslavement, torture, and systematic mistreatment of African people and their descendants. Recognizing this history is an honest attempt to break out beyond persistent patterns of colonization and to rewrite the erasure of Indigenous and Black peoples. There is value in acknowledging the history of our occupied spaces and places. I hope we can glimpse an understanding of these histories by recognizing the origins of collective journeys.]
4460

4561

4662
---
@@ -56,196 +72,173 @@ layout: true
5672

5773
## Demonstration Goals
5874

59-
- Gather some tweets
75+
- Data _cleaning_ & data _wrangling_
76+
77+
- Tokenize corpora (unit of analysis)
6078

61-
- Define APIs and the Twitter Developer portal (Academic Use)
79+
- Visualize word clouds (novelty)
6280

63-
- Rudimentary text analysis and visualization
81+
- Sentiment analysis
6482

65-
- Point out useful documentation / resources
83+
- Analyzing word frequencies (tf-idf)
6684

6785

6886
***
6987

7088
.f6.i.moon-gray.center[This is not a text analysis workshop. The foundations of text analysis require considerably more time than we have.
71-
This is a demonstration on leveraging the following tidy packages (tidyverse, and tidytext) and sharing resources. ]
89+
This is a demonstration on leveraging tidy packages (tidyverse and tidytext) and sharing resources. ]
7290

7391

7492
---
7593

7694
class: img-right-full
7795

78-
![](images/attendance.png)
96+
# _Text Mining with R_
7997

80-
# Three tenets
98+
![](https://www.tidytextmining.com/images/cover.png)
8199

100+
#### by Silge & Robinson
82101

83-
- Just numbers
84-
- Benefits of review
85-
- Dashboard fatigue is a real thing
86102

103+
- [www.tidytextmining.com](https://www.tidytextmining.com/)
87104

88-
???
105+
- [juliasilge.github.io/tidytext](https://juliasilge.github.io/tidytext/)
106+
107+
- [github.com/juliasilge/janeaustenr](https://github.com/juliasilge/janeaustenr)
89108

90-
- The implications of dashboard fatigue might be the most interesting thing to discuss in the QA
91109

92110
---
93-
layout: false
94-
class: img-left-full
95111

96-
![](images/by_dept_compare.png)
112+
class: img-left-full
113+
layout: false
97114

98-
## Drivers
115+
![](images/tidy1.svg)
99116

100-
- Goal: create a dashboard of workshop attendance
101-
- CDVS motivated by the possibility of exploring data
102-
- Dashboard can be the basis of another workshop
117+
# Tidy data
103118

104-
.footercc[
105-
<i class="fab fa-creative-commons"></i>&nbsp; <i class="fab fa-creative-commons-by"></i><i class="fab fa-creative-commons-nc"></i> <a href = "https://JohnLittle.info"><span class = "opacity30">https://</span>JohnLittle<span class = "opacity30">.info</span></a>
106-
<span class = "opacity30"> | <a href="https://github.com/libjohn/workshop_textmining">https://github.com/libjohn/workshop_textmining</a> | `r Sys.Date()` </span>
107-
]
119+
- Each variable is a column
108120

109-
???
121+
- Each observation is a row
110122

111-
- These are not exactly the best drivers for creating a dashboard. They’re not bad either.
123+
- Each type of observational unit is a table
112124

125+
.footer.center[.f6[-- Wickham 2014 ] ]
113126

114-
---
115-
layout: false
116-
class: middle, center
127+
.footercc[ [Tidy data. Chapter 12. _R for Data Science_](https://r4ds.had.co.nz/tidy-data.html) by Wickham & Grolemund]
117128

118-
<br>
129+
---
119130

120-
.bg-washed-blue.b--navy.ba.bw2.br3.shadow-5.ph4.mt5[
131+
class: img-right-full
121132

122-
![Rfun](images/rfun.png# fl l-4 w-2-12th)
133+
![](images/tidy2.svg)
123134

124-
## John R Little
135+
# Tidy Text format
125136

126-
.prussian[
127-
.f5[Data Science Librarian
128-
Center for Data & Visualization Sciences
129-
Duke University Libraries
130-
]
131-
]
137+
- A token is a meaningful unit of text
132138

133-
.f7[https://johnlittle.info
134-
https://Rfun.library.duke.edu
135-
https://library.duke.edu/data
136-
]
137-
]
139+
- Tokenization is the process of splitting text into tokens `tidytext::unnest_tokens()`
138140

141+
- A table with **one-token-per-row**
139142

143+
.footer[.f6[ &nbsp; -- Silge & Robinson] ]
140144

141-
<i class="fab fa-creative-commons fa-2x"></i> &nbsp; <i class="fab fa-creative-commons-by fa-2x"></i><i class="fab fa-creative-commons-nc fa-2x"></i>
142-
.f6.moon-gray[Creative Commons: Attribution-NonCommercial 4.0]
143-
.f7.moon-gray[https://creativecommons.org/licenses/by-nc/4.0]
144145

145146
---
146-
class: inverse
147+
class: col-2
147148

148-
# Appendix
149+
# Other data structures
149150

150-
## screen shots
151+
#### String
152+
153+
Text / character vectors
154+
155+
156+
#### Corpus
157+
Raw strings annotated with additional metadata
158+
159+
#### Document-term matrix
160+
161+
A sparse matrix describing a collection of documents (i.e. _corpus_) with one row for each document and one column for each term. (tf-idf)
162+
163+
164+
.footer[.f6[ &nbsp; -- Silge & Robinson]]
151165

152166
---
167+
153168
layout: true
154169

155170
.footercc[
156171
<i class="fab fa-creative-commons"></i>&nbsp; <i class="fab fa-creative-commons-by"></i><i class="fab fa-creative-commons-nc"></i> <a href = "https://JohnLittle.info"><span class = "opacity30">https://</span>JohnLittle<span class = "opacity30">.info</span></a>
157172
<span class = "opacity30"> | <a href="https://github.com/libjohn/workshop_textmining">https://github.com/libjohn/workshop_textmining</a> | `r Sys.Date()` </span>
158173
]
159174

160-
---
161-
162-
![Tidyverse](images/tidyverse.png# w-10pct t-1 db fr mr-4)
163-
164-
## Technology stack
165175

176+
---
166177

178+
# Other packages <sup>✤</sup>
167179

168-
- R
169-
- R is a data-first coding language
170-
- R can be a universal interface for analysis and workflow
171-
- Tidyverse is a well developed approach to workflow & the data lifecycle
172-
- Bias towards enabling reproducibility
173-
- scripting
174-
- reporting
175-
180+
- [tm](https://tm.r-forge.r-project.org/) -- _Text Mining Infrastructure in R_
176181

177-
![flexdashboards](images/flexdashboard.png# w-10pct fr fm mr-4)
182+
- [quanteda](https://quanteda.io/) -- _Package for managing and analyzing textual data_
178183

179-
![r logo](images/r_logo.png# fm fr w-10pct mr-4)
184+
- [gutenbergr](https://docs.ropensci.org/gutenbergr/) -- public domain text from **Project Gutenberg**
180185

181-
![rmarkdown](images/rmarkdown.png# w-10pct fr mr-4)
182186

183-
184-
???
187+
.footnote[.small[ ✤ Not covered in this case study]]
185188

186-
- Reuse analysis code to produce reports, email alerts, interactive dashboards, etc.
187189

188190
---
189191

190-
## Lesson
192+
# Further study
191193

192-
.fl-10.w-60.bg.b.ba.bw1.br3.shadow-5.ph4.mt4.center.prussian[The last thing you should do is
193-
build the dashboard
194-
]
194+
Read more of [_Text Mining with R: A Tidy Approach_](https://www.tidytextmining.com)
195195

196-
- Identify target audience and scope
197-
- Create summary reports
198-
- Build a static analysis
199-
- Generate push-reports based on dynamic thresholds
200-
- Advanced: Build a reporting application
196+
1. The tidy text format
197+
2. **Sentiment analysis with tidy data**
198+
3. **Analyzing word and document frequency: tf-idf**
199+
4. Relationships between words: n-grams and correlations
200+
5. Converting to and from non-tidy formats
201+
6. **Topic modeling** (unsupervised classification)
202+
7. Case study: comparing Twitter archives<br>_plus more case studies_
201203

202-
???
203-
Or, in this case, build a workshop attendance application
204204

205205
---
206-
## Other important question(s)
207206

208-
- If developing the dashboard in R...
209-
- Flexdashboard (dashboards)
210-
- Shiny (Web applications)
207+
# Further study
211208

212-
Not mutually exclusive but Flexdashboards has a significantly lower barrier to entry
209+
_Summer Institute for Computational Social Science_
210+
co-founded by [Chris Bail & Matthew Salganik](https://sicss.io/people)
213211

214-
.center[![people](images/happy_people2.jpg# h-10pct w-33pct)]
212+
[SICSS Text Analysis curriculum](https://sicss.io/curriculum)
215213

216214
---
217-
## Actual Goals
218-
219-
- Host **cleaned and disaggregated data**
220-
221-
- Provide a **summary of attendance**
222-
223-
![survey](images/survey_1.png# absolute ofv r-3 w-75pct h-7-12th)
224-
215+
layout: false
216+
class: middle, center
225217

226-
227-
???
218+
<br>
228219

229-
- Host **cleaned and disaggregated data**
230-
- A data archive for clean data
231-
- exported from the SpringShare registration system
232-
- accounts for attendance
233-
- Provide a **summary of attendance** so that staff can
234-
- Assess their workshop’s impact over time (as measured by attendance and registration)
235-
- See current semester attendance totals within the context of multi-year totals
220+
.bg-washed-blue.b--navy.ba.bw2.br3.shadow-5.ph4.mt5[
236221

237-
---
238-
class: center
222+
![Rfun](images/rfun.png# fl l-4 w-2-12th)
239223

240-
![](images/full_attendance.png# l-0 t-0 w-two-thirds h-80pct ofv absolute)
241-
![](images/full_demographics.png# w-third h-40pct t-0 r-0 ofv absolute)
242-
![](images/full_survey.png# w-third h-40pct t-40pct r-0 ofv absolute)
243-
![](images/slice_tables.jpg# l-0 t-80pct w-100pct ofv absolute)
224+
## John R Little
244225

245226
.prussian[
246-
.absolute.w-5-12th.pa-3.l-4-12th.t-8-12th.b.ba.bw-4.br-4.shadow-5.bg-white-80[
247-
Collage of dashboard screens
227+
.f5[Data Science Librarian
228+
Center for Data & Visualization Sciences
229+
Duke University Libraries
248230
]
249231
]
250232

233+
.f7[https://johnlittle.info
234+
https://Rfun.library.duke.edu
235+
https://library.duke.edu/data
236+
]
237+
]
238+
239+
240+
241+
<i class="fab fa-creative-commons fa-2x"></i> &nbsp; <i class="fab fa-creative-commons-by fa-2x"></i><i class="fab fa-creative-commons-nc fa-2x"></i>
242+
.f6.moon-gray[Creative Commons: Attribution-NonCommercial 4.0]
243+
.f7.moon-gray[https://creativecommons.org/licenses/by-nc/4.0]
251244
