Quick Overview


Outline

Given the rapid rate at which text data are being digitally gathered in many domains of science, there is a growing need for automated tools that can analyse, classify, and interpret this kind of data. Text mining techniques can be applied to create a structured representation of text, making its content more accessible for researchers. Applications of text mining are everywhere: social media, web search, advertising, emails, customer service, healthcare, marketing, etc. This course offers an extensive exploration of text mining with Python. The course has a strongly practical, hands-on focus: students will gain experience in applying text mining to real data from, for example, the social sciences and healthcare, and in interpreting the results. Through lectures and practicals, students will learn the skills needed to design, implement, and understand their own text mining pipeline. The topics in this course include text preprocessing, text classification, topic modeling, word embeddings, deep learning models, and responsible text mining.

By the end of the course, participants will be able to:

  • Review the fundamental approaches to text mining;
  • Understand and apply current methods for analysing texts;
  • Define a text mining pipeline given a practical data science problem;
  • Implement all steps in a text mining pipeline: feature extraction, feature selection, model learning, model evaluation;
  • Understand and apply state-of-the-art methods in text mining;
  • Implement word embedding and advanced deep learning techniques.
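The pipeline steps named in the list above (feature extraction, feature selection, model learning, model evaluation) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the course's own code: the toy reviews, the `min_df` threshold, the choice of a multinomial naive Bayes classifier, and all function names are hypothetical, and only the standard library is used so that the sketch runs anywhere. A real project would typically rely on a dedicated library.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Feature extraction: lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def select_vocabulary(token_lists, min_df=2):
    """Feature selection: keep tokens that occur in at least min_df documents."""
    doc_freq = Counter(tok for tokens in token_lists for tok in set(tokens))
    return {tok for tok, n in doc_freq.items() if n >= min_df}


def train_naive_bayes(token_lists, labels, vocab):
    """Model learning: multinomial naive Bayes with add-one smoothing."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for tokens, label in zip(token_lists, labels):
        word_counts[label].update(t for t in tokens if t in vocab)
    model = {}
    for label, n_docs in class_counts.items():
        total = sum(word_counts[label].values()) + len(vocab)
        model[label] = {
            "prior": math.log(n_docs / len(labels)),
            "likelihood": {
                t: math.log((word_counts[label][t] + 1) / total) for t in vocab
            },
        }
    return model


def predict(model, tokens):
    """Return the label with the highest log-probability; unknown tokens are ignored."""
    scores = {
        label: params["prior"]
        + sum(params["likelihood"][t] for t in tokens if t in params["likelihood"])
        for label, params in model.items()
    }
    return max(scores, key=scores.get)


def accuracy(model, texts, labels):
    """Model evaluation: fraction of texts whose predicted label is correct."""
    hits = sum(predict(model, tokenize(t)) == y for t, y in zip(texts, labels))
    return hits / len(labels)


# Hypothetical toy data, made up for this illustration.
train_texts = [
    "great movie with a great cast",
    "a great and moving story",
    "boring plot and a boring cast",
    "a boring and dull story",
]
train_labels = ["pos", "pos", "neg", "neg"]
test_texts = ["great story", "boring movie"]
test_labels = ["pos", "neg"]

train_tokens = [tokenize(t) for t in train_texts]
vocab = select_vocabulary(train_tokens, min_df=2)
model = train_naive_bayes(train_tokens, train_labels, vocab)
print(accuracy(model, test_texts, test_labels))
```

Keeping each step in its own function mirrors the pipeline structure: any step (for example, the feature selection rule) can be swapped out without touching the others.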

The course starts by reviewing basic concepts of text mining and then moves on to implementing advanced concepts in natural language processing. By the end of the week, participants will have mastered advanced text mining skills in Python.

Requirements

Participants should have a basic knowledge of data science and programming, and should be motivated to script and program in Python.

Prerequisites

Participants are requested to bring their own laptop for the lab meetings.

Certificate

Participants will receive a certificate at the end of the course.

Additional references

1. Jurafsky, D., & Martin, J. H. (2021). Speech and Language Processing (3rd ed.). Chapters are available online.

2. Eisenstein, J. (2018). Natural Language Processing. Chapters are available online.


Daily schedule

Start time   End time   Type
09:00        10:30      Lecture
                        Break
10:45        11:45      Practical
11:45        12:15      Discussion
                        Lunch
13:45        15:15      Lecture
                        Break
15:30        16:30      Practical
16:30        17:00      Discussion

How to prepare


Preparing your machine for the course

Dear all,

This summer you will participate in the S42: Data Science: Applied Text Mining course in Utrecht, the Netherlands. To help you get up to speed quickly, we will work in the Python programming language using Google Colab. The steps below guide you through using Python and working on the practicals in this course.

If you follow this course online, please have a look at this instructional page on MS Teams.

We look forward to seeing you all in Utrecht and online.

The Applied Text Mining team

System requirements

Bring a laptop computer to the course and make sure that you have an Internet connection so that you can use Python in Google Colab. If you are using PyCharm or Jupyter Notebook, also check that you have full write access and administrator rights to the machine. We will explore programming and compiling in this course. Some corporate laptops come with limited access for their users; we therefore advise you to bring a personal laptop computer, if you have one.

Python in Google Colab

Python is a general-purpose interpreted, interactive, object-oriented, and high-level programming language. It is a powerful environment for scientific computing.

We expect that many of you will have some experience with Python; for the rest of you, this section will serve as a quick crash course both on the Python programming language and on the use of Python in Google Colab:

Follow the tutorial on Python in Google Colab for the Applied Text Mining course here.

This tutorial is largely based on the CS231n Python Tutorial With Google Colab.
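As a small taste of the kind of Python used in such a crash course, the sketch below applies a few core language features (strings, lists, list comprehensions, functions, and `collections.Counter`) to a single sentence. It is an illustrative example, not taken from the tutorial itself; the sentence and the `type_token_ratio` helper are made up for this purpose. It can be pasted into a Google Colab cell as is.

```python
from collections import Counter

# Hypothetical example sentence, chosen only for illustration.
sentence = "Text mining turns raw text into structured data"

# Strings: lowercase the sentence and split it on whitespace into a list of tokens.
tokens = sentence.lower().split()

# List comprehension: keep only the tokens longer than four characters.
long_tokens = [tok for tok in tokens if len(tok) > 4]

# Counter: a dictionary subclass that maps each token to its frequency.
counts = Counter(tokens)


def type_token_ratio(tokens):
    """Return the number of distinct tokens divided by the total number of tokens."""
    return len(set(tokens)) / len(tokens)


print(long_tokens)
print(counts.most_common(1))
print(type_token_ratio(tokens))
```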


Alternative approach

I prefer to use Jupyter Notebook or PyCharm when I am not using Google Colab.

  • You can find an extensive tutorial by JetBrains on how to install and use PyCharm here.

  • A beginner’s tutorial on how to use Jupyter Notebook: link

Monday


Materials

We adapt the course as we go. To ensure that you work with the latest iteration of the course materials, we advise all course participants to access the materials online. All lectures are in HTML format. Impractical files contain the exercises, without walkthrough, explanations, and solutions. Practicals are walkthrough files that guide you through the exercises.

Here you will find the materials for Monday:


Additional references

  • Ref 1: Chapters 2, 3, 4, 5, 6
  • Ref 2: Chapters 1, 2, 3, 4

Tuesday


Materials


Here you will find the materials for Tuesday:


Additional references

  • Ref 2: Chapter 5

Wednesday


Materials


Here you will find the materials for Wednesday:


Additional references

  • Ref 1: Chapters 6, 7, 9
  • Ref 2: Chapters 6, 14

Thursday


Materials


Here you will find the materials for Thursday:


Additional references

  • Ref 1: Chapters 6, 7, 9
  • Ref 2: Chapters 6, 14

Archive

All of the course materials are available in a single compressed file for download:

Download the Materials

Special thanks to Zhenwei Yang, my former master's student who is now a PhD candidate, for helping me with the materials.

I wish all the participants success with their Text Mining projects!