From the social sciences to the humanities and healthcare, much of today’s data is contained in text. However, text is unstructured information that is difficult to process automatically. Text mining creates a more structured representation of a text, making its content more accessible to researchers. This course provides a comprehensive introduction to text mining with R. The course has a strong practical focus: students will gain experience in applying text mining to real data from, for example, the social science and healthcare domains, and in interpreting the results. Through lectures and labs, students will learn the skills necessary to design, implement, and understand their own text mining pipeline. Topics covered in this course include regular expressions, text preprocessing, text classification and clustering, and word embedding approaches for text data.
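To give an idea of what a "more structured representation" of text can look like, here is a minimal sketch in base R. The two toy documents and the helper name `tokenize` are made up for illustration and are not part of the course materials. The sketch uses a regular expression to split text into words and then counts the words per document.

```r
# Two toy documents (hypothetical example data, not course material)
docs <- c(
  doc1 = "Text mining turns raw text into data.",
  doc2 = "Raw text is hard to process; text mining helps!"
)

# Preprocessing: lower-case the text, then split on anything that is not a letter
tokenize <- function(x) {
  tokens <- strsplit(tolower(x), "[^a-z]+")[[1]]
  tokens[tokens != ""]  # drop empty strings left by leading separators
}

tokens <- lapply(docs, tokenize)

# Structured representation: a document-term matrix of word counts
vocab <- sort(unique(unlist(tokens)))
dtm <- t(vapply(
  tokens,
  function(tok) tabulate(match(tok, vocab), nbins = length(vocab)),
  integer(length(vocab))
))
colnames(dtm) <- vocab

dtm
```

Each row of `dtm` is a document and each column a word, so the text can now be handled like any other data table. Packages such as tidytext offer more convenient tools for the same idea.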
The course deals with the following topics:
The course starts at a very basic level and builds up gradually. By the end of the course, participants will have mastered text mining skills with R.
Participants should have a basic knowledge of scripting in R.
Participants are requested to bring their own laptop for the lab meetings.
Participants will receive a certificate at the end of the course.
1. Jurafsky, D., & Martin, J. H. (2024). Speech and language processing (3rd ed.). Find online chapters here
2. Eisenstein, J. (2018). Natural language processing. Find online chapters here
3. Silge, J., & Robinson, D. (2017). Text mining with R: A tidy approach. O’Reilly Media, Inc. Find the book here
Start time | End time | Type |
---|---|---|
09:00 | 10:30 | Lecture |
10:30 | 10:45 | Break |
10:45 | 11:45 | Practical |
11:45 | 12:30 | Discussion |
12:30 | 14:00 | Lunch |
14:00 | 15:30 | Lecture |
15:30 | 15:45 | Break |
15:45 | 16:30 | Practical |
16:30 | 17:00 | Discussion |
If you have no experience with R or another programming language, you will need to catch up before and during the course.
Some good sources are:
- Open R, play around, and read the workflow basics chapter in Hadley Wickham’s R for Data Science.
- Open R as in the previous point, type the following lines in the console one by one, and follow the guide to run the interactive course R Programming: The basics of programming in R.
Bring a laptop computer to the course and make sure that you have full write access and administrator rights to the machine. We will explore programming and compiling in this course, so you need full access to your machine. Some corporate laptops come with limited access for their users; we therefore advise you to bring a personal laptop computer, if you have one.
R
R can be obtained here. We won’t use R directly in the course, but rather call R through RStudio. Therefore, it needs to be installed.
RStudio Desktop
RStudio is an Integrated Development Environment (IDE). It can be obtained as stand-alone software here. The free and open source RStudio Desktop version is sufficient.
Execute the following lines of code in the console window:
install.packages(c("ggplot2", "tidyverse", "dplyr", "magrittr", "xlsx",
                   "wordcloud", "stringr", "caret", "knitr", "rmarkdown",
                   "plotly", "e1071", "SnowballC", "devtools", "rpart", "proxy",
                   "topicmodels", "tidyr", "dbscan", "text2vec", "tidytext",
                   "tensorflow", "keras"),
                 dependencies = TRUE)
If you are not sure where to execute code, use the following figure to identify the console:
Just copy and paste the installation command and press the return key. When asked, type Yes in the console and press the return key.
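After the installation has finished, you may want to check that everything arrived. The following is a minimal sketch using base R only. The helper name `find_missing` is our own and hypothetical, and the package vector simply repeats the names from the installation command above.

```r
# Packages named in the installation command above
course_pkgs <- c("ggplot2", "tidyverse", "dplyr", "magrittr", "xlsx",
                 "wordcloud", "stringr", "caret", "knitr", "rmarkdown",
                 "plotly", "e1071", "SnowballC", "devtools", "rpart", "proxy",
                 "topicmodels", "tidyr", "dbscan", "text2vec", "tidytext",
                 "tensorflow", "keras")

# Return the names in `pkgs` that are not found in any of your R libraries
# (hypothetical helper, written for this illustration)
find_missing <- function(pkgs) {
  setdiff(pkgs, rownames(installed.packages()))
}

missing_pkgs <- find_missing(course_pkgs)

if (length(missing_pkgs) == 0) {
  message("All course packages are installed.")
} else {
  message("Still missing: ", paste(missing_pkgs, collapse = ", "))
}
```

If any names are reported as missing, rerunning `install.packages()` for just those packages and reading the error messages is usually the quickest way to find out what went wrong.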
R knowledge
The following is the minimum of what you should know about R before starting with the first practical:
- What R is (a fancy calculator) and what an .R file is (a recipe for calculations)
- What an R package is (a set of functions you can download to use in your own code)
- How to run R code in RStudio
- How to assign a value to an object and how to store the result of a function call:
    x <- 10
    y <- fun(x = 10)
- What the following statements do (run them in R line by line):
    y <- "What?"
    x <- "R!"
    z <- paste(x, "No, text mining is the best.", y)
    rep(z, 3)
    1:10
    sample(1:20, 4)
    sample(1:20, 40, replace = TRUE)
    z <- c(1, 2, 3, 4, 5, 4, 3, 2, 1)
    z^2
    z == 2
    z > 2
- How to install and load a package:
    install.packages("dplyr")
    library(dplyr)
- How to get help (type ?plot in the console)

If all else fails and you have insufficient rights to your machine, the following web-based service will offer a solution.
An RStudio environment is available there. Naturally, you will need internet access to use this service.

We adapt the course as we go. To ensure that you work with the latest iteration of the course materials, we advise all course participants to access the materials online. All lectures are in HTML format. Practicals are walkthrough files that guide you through the exercises; use the show/hide code option in front of each question when you feel you need a tip.
Here you will find the materials for Monday:
Here you will find the materials for Tuesday:
Here you will find the materials for Wednesday:
Here you will find the materials for Thursday:
On the last day of the course, all the materials will be available in a compact file for download:
We wish all the participants success with their Text Mining projects!