This dashboard contains the materials for course S41: Data Science: Introduction to Text Mining with R.

Instructors:

Study load: 1.5 ECTS

Location: Koningsberger Building, lecture hall Pangea

Quick Overview

Outline

From the social sciences to the humanities and healthcare, much of today’s data is contained in text. However, text is unstructured information that is difficult to process automatically. Text mining creates a more structured representation of a text, making its content more accessible to researchers. This course provides a comprehensive introduction to text mining with R. It has a strong practical focus: students will gain experience in applying text mining to real data from, for example, the social science and healthcare domains, and in interpreting the results. Through lectures and labs, students will learn the skills necessary to design, implement, and understand their own text mining pipeline. Topics covered include regular expressions, text preprocessing, text classification and clustering, and word embedding approaches for text data.

By the end of the course, participants will be able to:

Understand and explain the fundamental approaches to text mining;
Understand and apply current methods for analyzing texts;
Understand how text is handled, manipulated, preprocessed and cleaned;
Define a text mining pipeline given a practical data science problem;
Implement generic text mining tools such as regular expressions, text clustering, text classification, sentiment analysis, and word embeddings.
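
To give a feel for what a text mining pipeline looks like, here is a minimal sketch in base R that cleans text with a regular expression, splits it into tokens, and counts terms. The two example sentences are invented for illustration, and the course materials may well use dedicated packages instead of base R.

```r
# Toy documents, invented for illustration only.
docs <- c("Text mining with R is fun!",
          "R makes text mining easy.")

# Preprocess: lower-case the text, then replace every character that is
# not a lower-case letter or a space with a space.
clean <- gsub("[^a-z ]", " ", tolower(docs))

# Tokenize: split on runs of whitespace and drop any empty strings.
tokens <- lapply(strsplit(clean, "\\s+"), function(x) x[x != ""])

# Count how often each term occurs across all documents.
term_counts <- sort(table(unlist(tokens)), decreasing = TRUE)
term_counts
```

The resulting table of term counts is about the simplest structured representation of a text; classification and clustering methods typically start from a richer version of it.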

The course starts at a very basic level and builds up gradually. By the end of the course, participants will have mastered text mining skills with R.

Requirements

Participants should have a basic knowledge of scripting in R.

Prerequisites

Participants are requested to bring their own laptop for the lab meetings.

Certificate

Participants will receive a certificate at the end of the course.

Additional references

1- Jurafsky, D., & Martin, J. H. (2021). Speech and language processing (3rd ed.). Chapters are available online.

2- Eisenstein, J. (2018). Natural language processing. Chapters are available online.

3- Silge, J., & Robinson, D. (2017). Text mining with R: A tidy approach. O’Reilly Media, Inc. The book is available online.

Daily schedule

Start time	End time	Type
09:00	10:30	Lecture
10:30	10:45	Break
10:45	11:45	Practical
11:45	12:30	Discussion
12:30	14:00	Lunch
14:00	15:30	Lecture
15:30	15:45	Break
15:45	16:30	Practical
16:30	17:00	Discussion

How to Prepare

Preparing yourself and your machine for the course

If you have no experience with R or another programming language, you will need to catch up before and during the course.

Some good sources are:

The first two chapters of Introduction to R on DataCamp
Install R, play around, and read the workflow basics chapter in Hadley Wickham’s R for Data Science
Interactive R course: install R as in the previous point, then type the following lines in the console one by one:

install.packages("swirl")
library(swirl)
swirl()

and follow the guide to run the R Programming: The basics of programming in R interactive course.

System requirements

Bring a laptop computer to the course and make sure that you have full write access and administrator rights to the machine. We will explore programming and compiling in this course, which requires full access to your machine. Some corporate laptops restrict access for their users, so we advise you to bring a personal laptop if you have one.

1. Install `R`

R can be obtained here. We won’t use R directly in the course but will call it through RStudio, so R still needs to be installed.

2. Install `RStudio` Desktop

RStudio is an Integrated Development Environment (IDE). It can be obtained as stand-alone software here. The free and open-source RStudio Desktop version is sufficient.

3. Start RStudio and install the following packages.

Execute the following lines of code in the console window:

install.packages(c("ggplot2", "tidyverse", "dplyr", "magrittr", "xlsx", 
                   "wordcloud", "stringr", "caret", "knitr", "rmarkdown", 
                   "plotly", "e1071", "SnowballC", "devtools", "rpart", "proxy",
                   "topicmodels", "tidyr", "dbscan", "text2vec", "tidytext", 
                   "tensorflow", "keras"),
                 dependencies = TRUE)

If you are not sure where to execute code, use the following figure to identify the console:

Just copy and paste the installation command and press the return key. When asked

Do you want to install from sources the package which needs 
compilation? (Yes/no/cancel)

type Yes in the console and press the return key.
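
Once the installation has finished, it can be worth checking that nothing was skipped. The sketch below is one way to do that; `find_missing_packages` is a hypothetical helper written for this page, not a function from any package, and it only checks that each package is present in your library, not that it loads correctly.

```r
# Hypothetical helper (not part of the course materials): returns the
# names in `pkgs` that are not found among the installed packages.
find_missing_packages <- function(pkgs) {
  installed <- rownames(installed.packages())
  setdiff(pkgs, installed)
}

# The same package names as in the installation command above.
course_pkgs <- c("ggplot2", "tidyverse", "dplyr", "magrittr", "xlsx",
                 "wordcloud", "stringr", "caret", "knitr", "rmarkdown",
                 "plotly", "e1071", "SnowballC", "devtools", "rpart", "proxy",
                 "topicmodels", "tidyr", "dbscan", "text2vec", "tidytext",
                 "tensorflow", "keras")

missing_pkgs <- find_missing_packages(course_pkgs)
if (length(missing_pkgs) == 0) {
  message("All course packages are installed.")
} else {
  message("Still missing: ", paste(missing_pkgs, collapse = ", "))
}
```

If any names are reported as missing, rerun `install.packages()` for just those packages and read the error messages in the console.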

Required `R` knowledge

The following is the minimum you should know about R before starting the first practical:

What is R (a fancy calculator) and what is an .R file (a recipe for calculations)
What is an R package (a set of functions you can download to use in your own code)
How to run R code in RStudio
What is a variable x <- 10
What is a function y <- fun(x = 10)
Understand what the following statements do (tip: you may run them in R line by line)

y <- "What?"
x <- "R!"
z <- paste(x, "No, text mining is the best.", y)
rep(z, 3)
1:10
sample(1:20, 4)
sample(1:20, 40, replace = TRUE)
z <- c(1, 2, 3, 4, 5, 4, 3, 2, 1)
z^2
z == 2
z > 2
install.packages("dplyr")
library(dplyr)

Be able to read the help file of any function (e.g., type ?plot in the console)
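
As a small illustration of the points above about variables, functions, and help files, the sketch below defines a variable and a function and then inspects a built-in function. The name `add_one` is made up for this example.

```r
# A variable stores a value under a name.
x <- 10

# A function takes arguments and returns a value.
# `add_one` is a made-up example, not a built-in R function.
add_one <- function(x) {
  x + 1
}

# Call the function with a named argument and store the result.
y <- add_one(x = 10)
y

# Most R functions and operators work on whole vectors at once.
z <- c(1, 2, 3)
add_one(z)
z > 2

# Show the arguments of a built-in function; in RStudio, typing
# ?paste in the console opens its full help page instead.
args(paste)
```

If each line here does what you expect, you are ready for the first practical.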

What if the steps to the left do not work for me?

If all else fails, or you have insufficient rights on your machine, the following web-based service offers a solution.

Open a free account on rstudio.cloud, where you can run your own cloud-based RStudio environment. You will need internet access to use this service.

Monday

Materials

We adapt the course as we go. To ensure that you work with the latest iteration of the course materials, we advise all course participants to access the materials online. All lectures are in HTML format. Practicals are walkthrough files that guide you through the exercises; use the show/hide code button in front of each question when you need a tip.

Here you will find the materials for Monday:

Part 1: Introduction
Part 2: Text preprocessing

Additional references

Chapters 1, 2, 3 of Ref 1
Chapter 1 of Ref 2
Chapters 1, 3, 4, 5 of Ref 3

Tuesday

Materials

Here you will find the materials for Tuesday:

Part 3: Text representation & classification
Part 4: Sentiment analysis

Additional references

Chapters 4, 5, 20 of Ref 1
Chapters 2, 3, 4 of Ref 2
Chapter 2 of Ref 3

Wednesday

Materials

Here you will find the materials for Wednesday:

Part 5: Feature selection & text clustering
Part 6: Topic modeling

Additional references

Chapters 6 and 7 of Ref 1
Chapters 5 and 14 of Ref 2
Chapter 6 of Ref 3

Thursday

Materials

Here you will find the materials for Thursday:

Part 7: Word embedding
Part 8: Deep learning for text

Additional references

Chapters 4, 5, 20 of Ref 1
Chapters 2, 3, 4 of Ref 2
Chapter 2 of Ref 3
