Quick Overview

Outline

From the social sciences to the humanities and healthcare, much of today’s data is contained in text. However, text is unstructured information that is difficult to process automatically. Text mining creates a more structured representation of a text, making its content more accessible to researchers. This course provides a comprehensive introduction to text mining with R. It has a strong practical focus: students will gain experience in applying text mining to real data from, for example, the social science and healthcare domains, and in interpreting the results. Through lectures and labs, students will learn the skills necessary to design, implement, and understand their own text mining pipeline. Topics covered include regular expressions, text preprocessing, text classification and clustering, and word embedding approaches for text data.

By the end of the course, participants will be able to:

  • Understand and explain the fundamental approaches to text mining;
  • Understand and apply current methods for analyzing texts;
  • Understand how text is handled, manipulated, preprocessed and cleaned;
  • Define a text mining pipeline given a practical data science problem;
  • Implement generic text mining tools such as regular expressions, text clustering, text classification, sentiment analysis, and word embeddings.

The course starts at a very basic level and builds up gradually. By the end of the course, participants will have mastered text mining skills with R.
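To give a flavor of the preprocessing topics listed above, here is a minimal sketch in base R. It assumes two made-up sentences rather than course data, and it is only one simple way to clean and tokenize text; the course practicals may use different packages and steps.

```r
# Toy documents (made up for illustration; not course data)
docs <- c("Text mining is fun!", "Mining text, with R, is FUN.")

# Lowercase, then replace everything except letters and whitespace by a space
clean <- tolower(docs)
clean <- gsub("[^a-z[:space:]]", " ", clean)

# Split on runs of whitespace to obtain one vector of tokens per document
tokens <- strsplit(trimws(clean), "[[:space:]]+")

# Count how often each word occurs across all documents
word_counts <- sort(table(unlist(tokens)), decreasing = TRUE)
print(word_counts)
```

The regular expression and the whitespace tokenizer are deliberately crude: they drop digits and split contractions, which a real pipeline would usually handle more carefully.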

Prerequisites

Participants should have a basic knowledge of scripting in R.

Requirements

Participants are requested to bring their own laptop for the lab meetings.

Certificate

Participants will receive a certificate at the end of the course.

Additional references

1. Jurafsky, D., & Martin, J. H. (2024). Speech and language processing (3rd ed.). Chapters available online.

2. Eisenstein, J. (2018). Natural language processing. Chapters available online.

3. Silge, J., & Robinson, D. (2017). Text mining with R: A tidy approach. O’Reilly Media. Book available online.

Daily schedule

Start time   End time   Type
09:00        10:30      Lecture
10:30        10:45      Break
10:45        11:45      Practical
11:45        12:30      Discussion
12:30        14:00      Lunch
14:00        15:30      Lecture
15:30        15:45      Break
15:45        16:30      Practical
16:30        17:00      Discussion

How to Prepare

Preparing yourself and your machine for the course

If you have no experience with R or another programming language, you will need to catch up before and during the course.

A good starting point is the interactive swirl package. Install and start it from the R console:

install.packages("swirl")
library(swirl)
swirl()

Then follow the on-screen guide to run the interactive course "R Programming: The basics of programming in R".

System requirements

Bring a laptop computer to the course and make sure that you have full write access and administrator rights to the machine. We will explore programming and compiling in this course, so you need full access to your machine. Some corporate laptops give their users only limited access; we therefore advise you to bring a personal laptop, if you have one.

1. Install R

R can be obtained from the Comprehensive R Archive Network (CRAN). We won’t use R directly in the course, but rather call it through RStudio. RStudio does not include R itself, so R needs to be installed first.

2. Install RStudio Desktop

RStudio is an Integrated Development Environment (IDE). It can be obtained as stand-alone software from the Posit (formerly RStudio) website. The free and open source RStudio Desktop version is sufficient.

3. Start RStudio and install the following packages.

Execute the following lines of code in the console window:

install.packages(c("ggplot2", "tidyverse", "dplyr", "magrittr", "xlsx", 
                   "wordcloud", "stringr", "caret", "knitr", "rmarkdown", 
                   "plotly", "e1071", "SnowballC", "devtools", "rpart", "proxy",
                   "topicmodels", "tidyr", "dbscan", "text2vec", "tidytext", 
                   "tensorflow", "keras"),
                 dependencies = TRUE)

If you are not sure where to execute code, use the following figure to identify the console:

[Figure: the RStudio window, showing where the console is located]

Just copy and paste the installation command and press the return key. When asked

Do you want to install from sources the package which needs 
compilation? (Yes/no/cancel)

type Yes in the console and press the return key.
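After the installation finishes, it can be useful to check whether anything failed to install. The following is a small sketch using base R only; the helper name missing_packages is hypothetical (not part of the course materials), and only a subset of the package list is checked for brevity.

```r
# Packages to check; a subset of the course list is used here for brevity
course_pkgs <- c("ggplot2", "dplyr", "stringr", "tidytext", "text2vec")

# Hypothetical helper (not part of the course materials):
# returns the names in `pkgs` that are not found in the installed library
missing_packages <- function(pkgs) {
  installed <- rownames(installed.packages())
  setdiff(pkgs, installed)
}

still_missing <- missing_packages(course_pkgs)
if (length(still_missing) == 0) {
  message("All listed packages are installed.")
} else {
  message("Not yet installed: ", paste(still_missing, collapse = ", "))
}
```

Any package reported as not installed can be passed to install.packages() again on its own, which usually makes the underlying error message easier to read.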

Required R knowledge

The following is the minimum you should know about R before starting the first practical:

  • What is R (a fancy calculator) and what is an .R file (a recipe for calculations)
  • What is an R package (a set of functions you can download to use in your own code)
  • How to run R code in RStudio
  • What is a variable x <- 10
  • What is a function y <- fun(x = 10)
  • Understand what the following statements do (tip: you may run them in R line by line)
y <- "What?"
x <- "R!"
z <- paste(x, "No, text mining is the best.", y)
rep(z, 3)
1:10
sample(1:20, 4)
sample(1:20, 40, replace = TRUE)
z <- c(1, 2, 3, 4, 5, 4, 3, 2, 1)
z^2
z == 2
z > 2
install.packages("dplyr")
library(dplyr)
  • Be able to read the help file of any function (e.g., type ?plot in the console)
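As a complement to the list above, here is a short sketch of defining and calling your own function. The function name count_chars and the example words are made up for illustration.

```r
# A made-up function for illustration: counts the characters in each string
count_chars <- function(x) {
  nchar(x)
}

words <- c("text", "mining", "R")    # a character vector stored in a variable
n_chars <- count_chars(x = words)    # call the function with a named argument

n_chars               # 4 6 1
n_chars > 2           # TRUE TRUE FALSE
words[n_chars > 2]    # "text" "mining"
```

If you can predict what each of the last three lines prints before running it, you are well prepared for the first practical.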

What if the steps above do not work for me?

If all else fails and you have insufficient rights to your machine, the following web-based service offers a solution.

  • Open a free account on rstudio.cloud. You can run your own cloud-based RStudio environment there. You will need internet access to use this service.

Monday

Materials

We adapt the course as we go. To ensure that you work with the latest version of the course materials, we advise all participants to access the materials online. All lectures are in HTML format. Practicals are walkthrough files that guide you through the exercises; use the show/hide code button in front of each question when you need a tip.

Here you will find the materials for Monday:

Additional references

  • Chapters 1, 2, 3 of Ref 1
  • Chapter 1 of Ref 2
  • Chapters 1, 3, 4, 5 of Ref 3

Tuesday

Materials

Here you will find the materials for Tuesday:

Additional references

  • Chapters 4, 5, 20 of Ref 1
  • Chapters 2, 3, 4 of Ref 2
  • Chapter 2 of Ref 3

Wednesday

Materials

Here you will find the materials for Wednesday:

Additional references

  • Chapters 6 and 7 of Ref 1
  • Chapters 5 and 14 of Ref 2
  • Chapter 6 of Ref 3

Thursday

Materials

Here you will find the materials for Thursday:

Additional references

  • Chapters 4, 5, 20 of Ref 1
  • Chapters 2, 3, 4 of Ref 2
  • Chapter 2 of Ref 3

Archive

On the last day of the course, all the materials will be available for download as a single compressed file:

Download the Materials

We wish all the participants success with their Text Mining projects!