11---
2- title : " <br>Text Mining "
2+ title : " <br>Sentiment Analysis "
33subtitle : " R case study"
44author : " John Little"
55institute : " Cntr for Data & Viz"
@@ -38,9 +38,25 @@ tagList(rmarkdown::html_dependency_font_awesome())
3838# )
3939```
4040
41+ ## Packages for today
42+
43+ _Sentiment Analysis: R case study_
44+
45+ .bg-washed-blue.b--navy.l-10.t-20pct.w-90.ba.bw2.br3.shadow-5.ph4.mt5[
46+ ` install.packages(c("tidyverse", "tidytext", ` <br >   ;   ;   ;   ;   ;   ;   ;   ;   ;   ;   ;   ;   ;   ;   ;   ;   ; ` "janeaustenr", "wordcloud2")) `
47+ ]
48+
49+ .l-60.t-80pct.fl-w-third[
50+ ``` {r echo=FALSE}
51+ countdown::countdown(20)
52+ ```
53+ ]
54+
55+ ---
56+
4157## Duke University: Land Acknowledgement
4258
43- .f4 [ I would like to take a moment to honor the land in Durham, NC. Duke University sits on the ancestral lands of the Shakori, Eno and Catawba people. This institution of higher education is built on land stolen from those peoples. These tribes were here before the colonizers arrived. Additionally this land has borne witness to over 400 years of the enslavement, torture, and systematic mistreatment of African people and their descendants. Recognizing this history is an honest attempt to breakout beyond persistent patterns of colonization and to rewrite the erasure of Indigenous and Black peoples. There is value in acknowledging the history of our occupied spaces and places. I hope we can glimpse an understanding of these histories by recognizing the origins of collective journeys.]
59+ .f3 [ I would like to take a moment to honor the land in Durham, NC. Duke University sits on the ancestral lands of the Shakori, Eno and Catawba people. This institution of higher education is built on land stolen from those peoples. These tribes were here before the colonizers arrived. Additionally, this land has borne witness to over 400 years of the enslavement, torture, and systematic mistreatment of African people and their descendants. Recognizing this history is an honest attempt to break out beyond persistent patterns of colonization and to rewrite the erasure of Indigenous and Black peoples. There is value in acknowledging the history of our occupied spaces and places. I hope we can glimpse an understanding of these histories by recognizing the origins of collective journeys.]
4460
4561
4662---
@@ -56,196 +72,173 @@ layout: true
5672
5773## Demonstration Goals
5874
59- - Gather some tweets
75+ - Data _cleaning_ & data _wrangling_
76+
77+ - Tokenize corpora (unit of analysis)
6078
61- - Define APIs and the Twitter Developer portal (Academic Use )
79+ - Visualize word clouds (novelty )
6280
63- - Rudimentary text analysis and visualization
81+ - Sentiment analysis
6482
65- - Point out useful documentation / resources
83+ - Analyzing word frequencies (tf-idf)
6684
6785
6886***
6987
7088.f6.i.moon-gray.center[ This is not a text analysis workshop. The foundations of text analysis require considerably more time than we have.
71- This is a demonstration on leveraging the following tidy packages (tidyverse, and tidytext) and sharing resources. ]
89+ This is a demonstration on leveraging tidy packages (tidyverse and tidytext) and sharing resources. ]
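
As a rough sketch of where these goals lead, the pipeline below tokenizes the Jane Austen novels, drops stop words, and tallies positive and negative words per book. It assumes the packages from the install slide are available and uses the Bing lexicon bundled with tidytext. The object names (`tidy_books`, `sentiment_by_book`) are illustrative and are not taken from the workshop code.

```r
library(dplyr)
library(tidytext)
library(janeaustenr)

# Tokenize: one row per word instead of one row per line of text
tidy_books <- austen_books() %>%
  unnest_tokens(output = word, input = text) %>%
  anti_join(stop_words, by = "word")   # drop common words such as "the" and "of"

# Keep only words found in the Bing lexicon, then count by book and sentiment.
# Recent dplyr versions may warn about a many-to-many join here, because a few
# lexicon words carry more than one label.
sentiment_by_book <- tidy_books %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(book, sentiment)
```

Printing `sentiment_by_book` should give one row per book and sentiment label, which is a convenient shape for a bar chart with ggplot2.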
7290
7391
7492---
7593
7694class: img-right-full
7795
78- ![ ] ( images/attendance.png )
96+ # _Text Mining with R_
7997
80- # Three tenets
98+ ![](https://www.tidytextmining.com/images/cover.png)
8199
100+ #### by Silge & Robinson
82101
83- - Just numbers
84- - Benefits of review
85- - Dashboard fatigue is a real thing
86102
103+ - [www.tidytextmining.com](https://www.tidytextmining.com/)
87104
88- ???
105+ - [juliasilge.github.io/tidytext](https://juliasilge.github.io/tidytext/)
106+
107+ - [github.com/juliasilge/janeaustenr](https://github.com/juliasilge/janeaustenr)
89108
90- - The implications of dashboard fatigue might be the most interesting thing to discuss in the QA
91109
92110---
93- layout: false
94- class: img-left-full
95111
96- ![ ] ( images/by_dept_compare.png )
112+ class: img-left-full
113+ layout: false
97114
98- ## Drivers
115+ ![](images/tidy1.svg)
99116
100- - Goal: create a dashboard of workshop attendance
101- - CDVS motivated by the possibility of exploring data
102- - Dashboard can be the basis of another workshop
117+ # Tidy data
103118
104- .footercc[
105- <i class =" fab fa-creative-commons " ></i >  ; <i class =" fab fa-creative-commons-by " ></i ><i class =" fab fa-creative-commons-nc " ></i > <a href = " https://JohnLittle.info " ><span class = " opacity30 " >https://</span >JohnLittle<span class = " opacity30 " >.info</span ></a >
106- <span class = " opacity30 " > | <a href =" https://github.com/libjohn/workshop_textmining " >https://github.com/libjohn/workshop_textmining </a > | ` r Sys.Date() ` </span >
107- ]
119+ - Each variable is a column
108120
109- ???
121+ - Each observation is a row
110122
111- - These are not exactly the best drivers for creating a dashboard. They’re not bad either.
123+ - Each type of observational unit is a table
112124
125+ .footer.center[ .f6[ -- Wickham 2014 ] ]
113126
114- ---
115- layout: false
116- class: middle, center
126+ .footercc[ [Tidy data. Chapter 12. _R for Data Science_](https://r4ds.had.co.nz/tidy-data.html) by Wickham & Grolemund]
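
To make the three rules concrete, here is a small sketch with made-up counts. The wide table stores values of one variable (sentiment) as column names. Reshaping it with `tidyr::pivot_longer()` gives one column per variable and one row per observation. The data and object names are hypothetical.

```r
library(dplyr)
library(tidyr)

# Untidy: the sentiment labels are spread across column names
untidy <- tibble(
  book     = c("Emma", "Persuasion"),
  positive = c(10, 7),
  negative = c(4, 5)
)

# Tidy: book, sentiment, and n are each a column; each book-sentiment pair is a row
tidy <- untidy %>%
  pivot_longer(
    cols      = c(positive, negative),
    names_to  = "sentiment",
    values_to = "n"
  )
```

The tidy version has four rows and works directly with `group_by()`, `count()`, and ggplot2 aesthetics.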
117128
118- < br >
129+ ---
119130
120- .bg-washed-blue.b--navy.ba.bw2.br3.shadow-5.ph4.mt5 [
131+ class: img-right-full
121132
122- ![ Rfun ] (images/rfun.png# fl l-4 w-2-12th )
133+ ![](images/tidy2.svg)
123134
124- ## John R Little
135+ # Tidy Text format
125136
126- .prussian[
127- .f5[ Data Science Librarian
128- Center for Data & Visualization Sciences
129- Duke University Libraries
130- ]
131- ]
137+ - A token is a meaningful unit of text
132138
133- .f7[ https://johnlittle.info
134- https://Rfun.library.duke.edu
135- https://library.duke.edu/data
136- ]
137- ]
139+ - Tokenization is the process of splitting text into tokens ` tidytext::unnest_tokens() `
138140
141+ - A table with **one-token-per-row**
139142
143+ .footer[ .f6[ &nbsp; -- Silge & Robinson] ]
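
A minimal sketch of tokenization, assuming a toy two-line corpus rather than the workshop data. By default `unnest_tokens()` splits on words, lowercases them, and strips punctuation. The `lines` and `tokens` names are illustrative.

```r
library(dplyr)
library(tidytext)

# A tiny corpus: one row per line of text
lines <- tibble(
  line = 1:2,
  text = c("It is a truth universally acknowledged,",
           "that a single man in possession of a good fortune")
)

# One token (word) per row; the other columns, here `line`, are kept
tokens <- lines %>%
  unnest_tokens(output = word, input = text)
```

Each row of `tokens` now holds a single lowercased word alongside the line it came from, which is the one-token-per-row table described above.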
140144
141- <i class =" fab fa-creative-commons fa-2x " ></i >   ; <i class =" fab fa-creative-commons-by fa-2x " ></i ><i class =" fab fa-creative-commons-nc fa-2x " ></i >
142- .f6.moon-gray[ Creative Commons: Attribution-NonCommercial 4.0]
143- .f7.moon-gray[ https://creativecommons.org/licenses/by-nc/4.0 ]
144145
145146---
146- class: inverse
147+ class: col-2
147148
148- # Appendix
149+ # Other data structures
149150
150- ## screen shots
151+ #### String
152+
153+ Text / character vectors
154+
155+
156+ #### Corpus
157+ Raw strings annotated with additional metadata
158+
159+ #### Document-term matrix
160+
161+ A sparse matrix describing a collection of documents (i.e. _corpus_) with one row for each document and one column for each term. (tf-idf)
162+
163+
164+ .footer[ .f6[ &nbsp; -- Silge & Robinson]]
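
The relationship between a tidy word-count table, tf-idf, and a document-term matrix can be sketched as follows. The counts are made up, and the object names are hypothetical. `bind_tf_idf()` adds term frequency, inverse document frequency, and their product. `cast_sparse()` reshapes the same table into a sparse matrix with one row per document and one column per term.

```r
library(dplyr)
library(tidytext)

# Tidy word counts: one row per document-term pair
word_counts <- tibble(
  document = c("a", "a", "b", "b"),
  word     = c("whale", "sea", "sea", "garden"),
  n        = c(3, 1, 2, 2)
)

# tf = share of a document's words; idf = ln(documents / documents containing the term)
tf_idf <- word_counts %>%
  bind_tf_idf(term = word, document = document, n = n)

# Sparse document-term matrix: rows are documents, columns are terms
dtm <- word_counts %>%
  cast_sparse(row = document, column = word, value = n)
```

A term that appears in every document, such as "sea" here, gets an idf of zero, so its tf-idf is zero as well. Terms unique to one document score highest.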
151165
152166---
167+
153168layout: true
154169
155170.footercc[
156171<i class="fab fa-creative-commons"></i>&nbsp;<i class="fab fa-creative-commons-by"></i><i class="fab fa-creative-commons-nc"></i> <a href="https://JohnLittle.info"><span class="opacity30">https://</span>JohnLittle<span class="opacity30">.info</span></a>
157172<span class="opacity30"> | <a href="https://github.com/libjohn/workshop_textmining">https://github.com/libjohn/workshop_textmining</a> | `r Sys.Date()` </span>
158173]
159174
160- ---
161-
162- ![ Tidyverse] (images/tidyverse.png# w-10pct t-1 db fr mr-4)
163-
164- ## Technology stack
165175
176+ ---
166177
178+ # Other packages <sup >✤</sup >
167179
168- - R
169- - R is a data-first coding language
170- - R can be a universal interface for analysis and workflow
171- - Tidyverse is a well developed approach to workflow & the data lifecycle
172- - Bias towards enabling reproducibility
173- - scripting
174- - reporting
175-
180+ - [tm](https://tm.r-forge.r-project.org/) -- _Text Mining Infrastructure in R_
176181
177- ![ flexdashboards ] (images/flexdashboard.png# w-10pct fr fm mr-4)
182+ - [quanteda](https://quanteda.io/) -- _Package for managing and analyzing textual data_
178183
179- ![ r logo ] (images/r_logo.png# fm fr w-10pct mr-4)
184+ - [gutenbergr](https://docs.ropensci.org/gutenbergr/) -- public domain text from **Project Gutenberg**
180185
181- ![ rmarkdown] (images/rmarkdown.png# w-10pct fr mr-4)
182186
183-
184- ???
187+ .footnote[ .small[ ✤ Not covered in this case study]]
185188
186- - Reuse analysis code to produce reports, email alerts, interactive dashboards, etc.
187189
188190---
189191
190- ## Lesson
192+ # Further study
191193
192- .fl-10.w-60.bg.b.ba.bw1.br3.shadow-5.ph4.mt4.center.prussian[ The last thing you should do is
193- build the dashboard
194- ]
194+ Read more of [_Text Mining with R: A Tidy Approach_](https://www.tidytextmining.com)
195195
196- - Identify target audience and scope
197- - Create summary reports
198- - Build a static analysis
199- - Generate push-reports based on dynamic thresholds
200- - Advanced: Build a reporting application
196+ 1. The tidy text format
197+ 2. **Sentiment analysis with tidy data**
198+ 3. **Analyzing word and document frequency: tf-idf**
199+ 4. Relationships between words: n-grams and correlations
200+ 5. Converting to and from non-tidy formats
201+ 6. **Topic modeling** (unsupervised classification)
202+ 7. Case study: comparing Twitter archives<br>_plus more case studies_
201203
202- ???
203- Or, in this case, build a workshop attendance application
204204
205205---
206- ## Other important question(s)
207206
208- - If developing the dashboard in R...
209- - Flexdashboard (dashboards)
210- - Shiny (Web applications)
207+ # Further study
211208
212- Not mutually exclusive but Flexdashboards has a significantly lower barrier to entry
209+ _Summer Institute for Computational Social Science_
210+ co-founded by [Chris Bail & Matthew Salganik](https://sicss.io/people)
213211
214- .center [ ![ people ] (images/happy_people2.jpg# h-10pct w-33pct) ]
212+ [SICSS Text Analysis curriculum](https://sicss.io/curriculum)
215213
216214---
217- ## Actual Goals
218-
219- - Host ** cleaned and disaggregated data**
220-
221- - Provide a ** summary of attendance**
222-
223- ![ survey] (images/survey_1.png# absolute ofv r-3 w-75pct h-7-12th)
224-
215+ layout: false
216+ class: middle, center
225217
226-
227- ???
218+ <br >
228219
229- - Host ** cleaned and disaggregated data**
230- - A data archive for clean data
231- - exported from the SpringShare registration system
232- - accounts for attendance
233- - Provide a ** summary of attendance** so that staff can
234- - Assess their workshop’s impact over time (as measured by attendance and registration)
235- - See current semester attendance totals within the context of multi-year totals
220+ .bg-washed-blue.b--navy.ba.bw2.br3.shadow-5.ph4.mt5[
236221
237- ---
238- class: center
222+ ![ Rfun] (images/rfun.png# fl l-4 w-2-12th)
239223
240- ![ ] (images/full_attendance.png# l-0 t-0 w-two-thirds h-80pct ofv absolute)
241- ![ ] (images/full_demographics.png# w-third h-40pct t-0 r-0 ofv absolute)
242- ![ ] (images/full_survey.png# w-third h-40pct t-40pct r-0 ofv absolute)
243- ![ ] (images/slice_tables.jpg# l-0 t-80pct w-100pct ofv absolute)
224+ ## John R Little
244225
245226.prussian[
246- .absolute.w-5-12th.pa-3.l-4-12th.t-8-12th.b.ba.bw-4.br-4.shadow-5.bg-white-80[
247- Collage of dashboard screens
227+ .f5[ Data Science Librarian
228+ Center for Data & Visualization Sciences
229+ Duke University Libraries
248230]
249231]
250232
233+ .f7[ https://johnlittle.info
234+ https://Rfun.library.duke.edu
235+ https://library.duke.edu/data
236+ ]
237+ ]
238+
239+
240+
241+ <i class="fab fa-creative-commons fa-2x"></i>&nbsp;<i class="fab fa-creative-commons-by fa-2x"></i><i class="fab fa-creative-commons-nc fa-2x"></i>
242+ .f6.moon-gray[ Creative Commons: Attribution-NonCommercial 4.0]
243+ .f7.moon-gray[ https://creativecommons.org/licenses/by-nc/4.0 ]
251244