Note: I will keep updating this page until the day of the workshop…
Requirements for attendance
- Bring your own laptop;
- Preinstall Tableau (if you are not elegible for a student licence, activate your 14-day trial);
- Familiarise yourself with Tableau (start here and here);
- Preinstall R and RStudio (instructions can be found here);
- Familiarise yourself with R and RStudio (you can start by reading the ‘The R User Interface’ section here);
- Install and try to load all the following R packages: ‘stm’, ‘tidyverse’, ‘tidytext’, ‘tidytext’, ‘xml2’, ‘pdftools’, ‘stringr’, ‘gutenbergr’, ‘jsonlite’, ‘tsne’. If you don’t know how to install a package in R, have a look here.
Suggested pre-readings
- Kong, Q., Booth, E., Bailo, F., Johns, A., & Rizoiu, M.-A. (2022). Slipping to the Extreme: A Mixed Method to Explain How Extreme Opinions Infiltrate Online Discussions. Proceedings of the International AAAI Conference on Web and Social Media, 16(1), 524–535. https://ojs.aaai.org/index.php/ICWSM/article/view/19312
This is a paper I co-authored. I will discuss it during the workshop so you might want to have a look! Warning Mathematical notation is used in the article: Reader discretion is advised.
- Robinson, J. S. D. (2017). Text mining with R: A tidy approach. Sebastopol, CA: O’Reilly. tidytextmining.com.
This is of course a reference text. I would suggest you read the prefance and skim through the rest of the chapters.
- Welbers, K., Van Atteveldt, W., & Benoit, K. (2017). Text analysis in R. Communication Methods and Measures, 11(4), 245–265. doi.org/10.1080/19312458.2017.1387238
This article gives you a vert good sense of the most common steps and operations in a computational text analysis’. The article offers plenty of R code snippets. You are not required to replicate these steps on your computer - but it might be helpful to try to understand what they do :-)
Additional readings on R and text analysis
- Wickham, H., & Grolemund, G. (2017). R for data science. Sebastopol, CA: O’Reilly. r4ds.had.co.nz and tidyverse.org
- Manning, C. D., Raghavan, P., & Schütze, H. (2008). Introduction to information retrieval. New York, NY: Cambridge University Press. nlp.stanford.edu/IR-book
Date and Location
- 7 July 2022
- Room 5B55a and Room 5B55b, The University of Canberra
Data
To download the data
- “View the project on GitHub” (left column on this page),
- Click on “Code”,
- then “Download ZIP”.
- Open the ZIP file and navigate to the “data” folder.
- Twitter data only Download the data files contained in this share folder (to request access the folder you need a Google account).
Workshop notes
Natural language processing
The widespread approach (research and industry) is to divide our text (aka corpus) in documents. There is no one way to do it. But documents. Documents tell our models how to make sense of the relations among words (aka terms, ≈ tokens).
- Corpus V
- Documents V
- Terms
- Terms are not distributed randomly within a document;
- Documents are not distributed randomly within a corpus.
Research use
- Classify: Clustering “similar” documents together (measuring similarity)
- Discovery: Finding “similar” documents
Industry use
In addition to the above…
- Predict/Complete: Predict what is the next word/sentence
- Translate: Predict what’s the best sentence to translate a sentence between two languages
Common NPL approaches (Welbers et al., 2017)
- Counting (from a dictionary)
- Supervised machine learning
- Unsupervised machine learning
Common NLP techniques
- Bag-of-words
- Lemmatisation (replace words with their lemmas)
- Part-of-speech tagging
- Named entity recognition
- Word positions and syntax
If you have any issue before or after the workshop: francesco.bailo@uts.edu.au.