Quick Overview

Outline

From the social sciences to the humanities and healthcare, much of today’s data is contained in text. However, text is unstructured information that is difficult to process automatically. Text mining creates a more structured representation of a text, making its content more accessible to researchers. This course provides a comprehensive introduction to text mining with R. The course has a strong practical focus: students will gain experience in applying text mining to real data from, for example, the social science and healthcare domains, and in interpreting the results. Through lectures and labs, students will learn the skills necessary to design, implement, and understand their own text mining pipeline. Topics covered in this course include regular expressions, text preprocessing, text classification and clustering, and word embedding approaches for text data.

By the end of the course, participants will be able to:

  • Understand and explain the fundamental approaches to text mining;
  • Understand and apply current methods for analyzing texts;
  • Understand how text is handled, manipulated, preprocessed and cleaned;
  • Define a text mining pipeline given a practical data science problem;
  • Implement generic text mining tools such as regular expressions, text clustering, text classification, sentiment analysis, and word embeddings.

The course starts at a very basic level and builds up gradually. By the end of the course, participants will have mastered text mining skills with R.
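To give a feel for what a text mining pipeline looks like, here is a minimal sketch in base R. It is only an illustration: the three documents and the tiny stop-word list are made up for this example, and the course itself uses dedicated packages rather than base R alone.

```r
# Toy corpus: three short, made-up documents (illustrative only).
docs <- c(
  "Text mining turns raw text into structured data.",
  "Raw text is messy; cleaning text comes first!",
  "Structured data is easier to analyse."
)

# 1. Preprocess: lower-case the text, then use a regular expression to
#    replace every character that is not a letter or whitespace with a space.
clean <- gsub("[^a-z[:space:]]", " ", tolower(docs))

# 2. Tokenise: split each document on runs of whitespace.
tokens <- strsplit(trimws(clean), "[[:space:]]+")

# 3. Remove stop words. This hand-picked list is an assumption for the
#    example, not a standard stop-word list.
stop_words <- c("is", "to", "into", "comes")
tokens <- lapply(tokens, function(x) x[!x %in% stop_words])

# 4. Count: term frequencies over the whole corpus, most frequent first.
freq <- sort(table(unlist(tokens)), decreasing = TRUE)
print(head(freq, 3))
```

The result is a small frequency table, which is one simple example of the "more structured representation" of text described above.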

Prerequisites

Participants should have a basic knowledge of scripting in R.

Requirements

Participants are requested to bring their own laptop for the lab meetings.

Certificate

Participants will receive a certificate at the end of the course.

Additional references

1- Jurafsky, D., Martin, J.H. (2021). Speech and language processing, third edition. Find online chapters here

2- Eisenstein, J. (2018). Natural Language Processing. Find online chapters here

3- Silge, J., & Robinson, D. (2017). Text mining with R: A tidy approach. O’Reilly Media, Inc. Find the book here

Daily schedule

Start time   End time   Type
09:00        10:30      Lecture
                        Break
10:45        11:45      Practical
11:45        12:30      Discussion
                        Lunch
14:00        15:30      Lecture
                        Break
15:45        16:30      Practical
16:30        17:00      Discussion

How to Prepare

Preparing yourself and your machine for the course

If you have no experience with R or another programming language, you will need to catch up before the course starts and keep up during the course.

One good source is the swirl package, which teaches R interactively inside R itself. Install and start it with:

install.packages("swirl")
library(swirl)
swirl()

Then follow the on-screen guide to run the interactive course R Programming: The basics of programming in R.

System requirements

Bring a laptop computer to the course and make sure that you have full write access and administrator rights to the machine. We will explore programming and compiling in this course, so you need full access to your machine. Some corporate laptops come with limited access for their users; we therefore advise you to bring a personal laptop computer if you have one.

1. Install R

R can be obtained here. We won’t use R directly in the course, but rather call R through RStudio. RStudio does not work without R, so R needs to be installed first.

2. Install RStudio Desktop

RStudio is an Integrated Development Environment (IDE). It can be obtained as stand-alone software here. The free and open source RStudio Desktop version is sufficient.
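Once both are installed, a quick check along the following lines can confirm that R is running and that you can write to your package library. This is a sketch, not an official test: the variable names are ours, and file.access() can be unreliable on some systems (notably Windows), so treat a negative result as a hint rather than proof.

```r
# Print the version of R that this session is running.
print(R.version.string)

# The first entry of .libPaths() is the default location where
# install.packages() puts new packages.
lib <- .libPaths()[1]

# file.access() returns 0 when the requested access is permitted;
# mode = 2 asks for write permission.
can_write <- unname(file.access(lib, mode = 2) == 0)

if (can_write) {
  message("Package library is writable: ", lib)
} else {
  message("No write access to: ", lib,
          " (you may need administrator rights)")
}
```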

3. Start RStudio and install the following packages.

Execute the following lines of code in the console window:

# Install all packages used in the course, together with their dependencies.
install.packages(c("ggplot2", "tidyverse", "dplyr", "magrittr", "xlsx",
                   "wordcloud", "stringr", "caret", "knitr", "rmarkdown",
                   "plotly", "e1071", "SnowballC", "devtools", "rpart", "proxy",
                   "topicmodels", "tidyr", "dbscan", "text2vec", "tidytext",
                   "tensorflow", "keras"),
                 dependencies = TRUE)
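After the installation finishes, it can be worth checking that nothing failed silently. The following is a minimal sketch of such a check; the variable names are our own, and the package list simply repeats the one from the install.packages() call above.

```r
# Packages requested in the install.packages() call above.
pkgs <- c("ggplot2", "tidyverse", "dplyr", "magrittr", "xlsx",
          "wordcloud", "stringr", "caret", "knitr", "rmarkdown",
          "plotly", "e1071", "SnowballC", "devtools", "rpart", "proxy",
          "topicmodels", "tidyr", "dbscan", "text2vec", "tidytext",
          "tensorflow", "keras")

# Names of all packages currently installed in the libraries on .libPaths().
installed <- rownames(installed.packages())

# Course packages that are not installed (yet).
not_installed <- setdiff(pkgs, installed)

if (length(not_installed) == 0) {
  message("All course packages are installed.")
} else {
  message("Still missing: ", paste(not_installed, collapse = ", "))
}
```

If any packages are reported as missing, rerunning install.packages() for just those names usually shows the error message that explains why.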

If you are not sure where to execute code, use the following figure to identify the console: