Course Logistics

Course materials

Teachers

Anastasia

Arjan

Luka

Dong

Daniel

Program

Time        | Monday       | Tuesday      | Wednesday    | Thursday     | Friday
------------|--------------|--------------|--------------|--------------|---------------
9:00–10:30  | Lecture 1    | Lecture 3    | Lecture 5    | Lecture 7    | Lecture 9
10:30–10:45 | Break        | Break        | Break        | Break        | Break
10:45–11:45 | Practical 1  | Practical 3  | Practical 5  | Practical 7  | Practical 9
11:45–12:15 | Discussion 1 | Discussion 3 | Discussion 5 | Discussion 7 | Discussion 9
12:15–13:45 | Lunch        | Lunch        | Lunch        | Lunch        | Lunch
13:45–15:15 | Lecture 2    | Lecture 4    | Lecture 6    | Lecture 8    | Lecture 10
15:15–15:30 | Break        | Break        | Break        | Break        | Break
15:30–16:30 | Practical 2  | Practical 4  | Practical 6  | Practical 8  | Practical 10
16:30–17:00 | Discussion 2 | Discussion 4 | Discussion 6 | Discussion 8 | Discussion 10

Goal of the course

  • Text data are everywhere!
  • Much of the world’s data is in the form of unstructured text
  • This course teaches
    • text mining techniques
    • using Python
    • on a variety of applications
    • in many domains.

Python?

  • How familiar are you with Python? What is your experience level?

Python IDE?

  • Which Python IDE do you mostly use? If you use more than one environment, fill in the other text boxes.

Google Colab?

  • How familiar are you with Google Colab? (1 = limited, 5 = expert)

Python

Google Colab

What is Text Mining?

Text mining: an example

  • This is Garry!

  • Garry works at Bol.com (a webshop in the Netherlands)

  • He works in the department of Customer Relationship Management.

  • He uses Excel to read and search customers’ reviews, extract the aspects they write about, and identify their sentiments.

  • Curious about his job? See two examples!

This is a nice book for both young and old. It gives beautiful life lessons in a fun way. Definitely worth the money!

+ Educational

+ Funny

+ Price


Nice story for older children.

+ Funny

- Readability

Example

  • Garry likes his job a lot, but sometimes it is frustrating!

  • This is mainly because the company is expanding quickly!

  • Garry decides to hire Larry as his assistant.

Example

  • Still, a lot to do for two people!

  • Garry has some budget left to hire another assistant for a couple of years!

  • He decides to hire Harry too!

  • Still, manual labeling using Excel is labor-intensive!

Language is hard!

  • Different things can mean more or less the same (“data science” vs. “statistics”)
  • Context dependency (“You have very nice shoes”)
  • Same words with different meanings (“to sanction”, “bank”)
  • Syntactic ambiguity (“we saw her duck”)
  • Irony and sarcasm (“That’s just what I needed today!”, “Great!”, “Well, what a surprise.”)
  • Figurative language (“He has a heart of stone”)
  • Negation (“not good” vs. “good”), spelling variations, jargon, abbreviations
  • All of the above differ across languages, and most of the work to date is on English!

Text mining

  • “the discovery by computer of new, previously unknown information, by automatically extracting information from different written resources” Hearst (1999)

  • Text mining is about looking for patterns in text, in a similar way that data mining can be loosely described as looking for patterns in data.

  • Text mining describes a set of linguistic, statistical, and machine learning techniques that model and structure the information content of textual sources. (Wikipedia)

Can be quite effective!

  • We won’t solve linguistics …
  • In spite of the problems, text mining can be quite effective!

Process & Tasks

Text mining process

Text mining tasks

  • Text classification
  • Text clustering
  • Sentiment analysis
  • Feature selection
  • Topic modelling
  • Responsible text mining
  • Text summarization

And more in NLP

Text Preprocessing

Text preprocessing

  • is the process of cleaning text data and removing noise.
  • brings your text into a form that is analyzable for your task.
  • transforms text into a more digestible form so that machine learning algorithms can perform better.

Typical steps

  • Tokenization (“text”, “mining”, “is”, “the”, “best”, “!”)
  • Stemming (“lungs”→“lung”) or Lemmatization (“were”→“be”)
  • Lowercasing (“Disease”→“disease”)
  • Stopword removal (“text mining is best!”)
  • Punctuation removal (“text mining is the best”)
  • Number removal (“I42”→“I”)
  • Spell correction (“hart”→“heart”)

Not all of these are appropriate at all times!
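
A minimal sketch of several of these steps with NLTK (assuming the package and its tokenizer/stopword resources are installed; output shown is indicative):

```python
import string

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

nltk.download("punkt")      # tokenizer models (newer NLTK may need "punkt_tab")
nltk.download("stopwords")  # stopword lists

text = "Text mining is the best!"

text = text.lower()                # lowercasing
tokens = nltk.word_tokenize(text)  # tokenization

# Stopword and punctuation removal
stop_words = set(stopwords.words("english"))
tokens = [t for t in tokens if t not in stop_words and t not in string.punctuation]

# Stemming with the Porter stemmer
stemmer = PorterStemmer()
tokens = [stemmer.stem(t) for t in tokens]

print(tokens)  # ['text', 'mine', 'best']
```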

Tokenization/Segmentation

  • Split text into words and sentences
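
For example, with NLTK’s tokenizers (a sketch; spaCy or similar toolkits work equally well):

```python
import nltk

nltk.download("punkt")  # tokenizer models

text = "Text mining is great. It finds patterns in text!"

print(nltk.sent_tokenize(text))
# ['Text mining is great.', 'It finds patterns in text!']

print(nltk.word_tokenize(text))
# ['Text', 'mining', 'is', 'great', '.', 'It', 'finds', 'patterns', 'in', 'text', '!']
```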

N-grams

  • N-grams: a contiguous sequence of N tokens from a given piece of text

    • E.g., ‘Text mining is to identify useful information.’

    • Bigrams: ‘text_mining’, ‘mining_is’, ‘is_to’, ‘to_identify’, ‘identify_useful’, ‘useful_information’, ‘information_.’

  • Pros: capture local dependency and order
  • Cons: increase the vocabulary size
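
A small sketch of extracting the bigrams from the example sentence with NLTK:

```python
from nltk import word_tokenize
from nltk.util import ngrams

tokens = word_tokenize("Text mining is to identify useful information.".lower())

# Join each contiguous pair of tokens into a bigram
bigrams = ["_".join(gram) for gram in ngrams(tokens, 2)]
print(bigrams)
# ['text_mining', 'mining_is', 'is_to', 'to_identify',
#  'identify_useful', 'useful_information', 'information_.']
```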

Part Of Speech (POS) tagging

  • Annotate each word in a sentence with a part-of-speech.

  • Useful for subsequent syntactic parsing and word sense disambiguation.
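
A sketch with NLTK’s default tagger (tags follow the Penn Treebank tagset; exact output may vary by model version):

```python
import nltk

nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")  # newer NLTK may need the "_eng" variant

tokens = nltk.word_tokenize("Garry reads customer reviews.")
print(nltk.pos_tag(tokens))
# [('Garry', 'NNP'), ('reads', 'VBZ'), ('customer', 'NN'), ('reviews', 'NNS'), ('.', '.')]
```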

Vector Space Model

Basic idea

  • Text is “unstructured data”
  • How do we get to something structured that we can compute with?
  • Text must be represented somehow
  • Represent the text as something that makes sense to a computer

How to represent a document

  • Represent by a string?

    • No semantic meaning
  • Represent by a list of sentences?

    • Sentence is just like a short document (recursive definition)
  • Represent by a vector?

    • A vector is an ordered finite list of numbers.

Vector space model

  • A vector space is a collection of vectors

  • Represent documents by concept vectors

    • Each concept defines one dimension

    • k concepts define a high-dimensional space

    • Element of vector corresponds to concept weight

Vector space model

  • Distance between the vectors in this concept space

    • Relationship among documents
  • The process of converting text into numbers is called Vectorization

Vector space model

  • Terms are generic features that can be extracted from text

  • Typically, terms are single words, keywords, n-grams, or phrases

  • Documents are represented as vectors of terms

  • Each dimension (concept) corresponds to a separate term

\[d = (w_1, ..., w_n)\]
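
For example, with the toy vocabulary (“text”, “mining”, “apple”) and raw occurrence counts as weights, the document “text mining” would be represented as

\[d = (1, 1, 0)\]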

An illustration of VS model

  • All documents are projected into this concept space

VSM: How do we represent vectors?

Bag of Words (BOW)

  • Terms are words (more generally, we can use n-grams)
  • Weights are the number of occurrences of the terms in the document:
    • Binary
    • Term Frequency (TF)
    • Term Frequency-Inverse Document Frequency (TF-IDF)

Binary

  • Doc1: Text mining is to identify useful information.

  • Doc2: Useful information is mined from text.

  • Doc3: Apple is delicious.
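
A sketch of the binary representation for these three documents, using scikit-learn (assuming it is installed):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "Text mining is to identify useful information.",
    "Useful information is mined from text.",
    "Apple is delicious.",
]

# binary=True records presence/absence of each term instead of counts
vectorizer = CountVectorizer(binary=True)
X = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())
print(X.toarray())  # one row per document, one column per term
```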

Term Frequency

  • Idea: a term is more important if it occurs more frequently in a document

  • TF formulas

    • Let \(c(t,d)\) be the frequency count of term \(t\) in document \(d\)

    • Raw TF: \(tf(t,d) = c(t,d)\)

TF: Document-Term Matrix (DTM)
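
A sketch of building the raw-TF document-term matrix for the same three documents (pandas is used only for display):

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "Text mining is to identify useful information.",
    "Useful information is mined from text.",
    "Apple is delicious.",
]

vectorizer = CountVectorizer()  # default weights are raw term counts c(t, d)
dtm = vectorizer.fit_transform(docs)

print(pd.DataFrame(dtm.toarray(),
                   columns=vectorizer.get_feature_names_out(),
                   index=["Doc1", "Doc2", "Doc3"]))
```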

TF-IDF

  • Idea: a term is more discriminative if it occurs frequently, but only in a few documents

Let \(n_{d,t}\) denote the number of times the \(t\)-th term appears in the \(d\)-th document:

\[TF_{d,t} = \frac{n_{d,t}}{\sum_i{n_{d,i}}}\]

Let \(N\) denote the number of documents and \(N_t\) denote the number of documents containing the \(t\)-th term:

\[IDF_t = \log\left(\frac{N}{N_t}\right)\]

TF-IDF weight:

\[w_{d,t} = TF_{d,t} \cdot IDF_t\]

TF-IDF: Document-Term Matrix (DTM)
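
The TF-IDF weighted matrix can be sketched the same way. Note that scikit-learn uses a smoothed IDF variant rather than the plain \(\log(N/N_t)\) above, so the exact numbers differ:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "Text mining is to identify useful information.",
    "Useful information is mined from text.",
    "Apple is delicious.",
]

vectorizer = TfidfVectorizer()  # smoothed IDF and L2 normalization by default
tfidf = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())
print(tfidf.toarray().round(2))
```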

How to define a good similarity metric?

  • Euclidean distance

    \(dist(d_i, d_j) = \sqrt{\sum_{t\in V}{[tf(t,d_i)idf(t) - tf(t, d_j)idf(t)]^2}}\)

    • Longer documents will be penalized by the extra words

    • We care more about how these two vectors are overlapped

  • Cosine similarity

    • Angle between two vectors:

      \(cosine(d_i, d_j) = \frac{V_{d_i}^T V_{d_j}}{\|V_{d_i}\|_2 \times \|V_{d_j}\|_2}\), where \(V_d\) is the TF-IDF vector of document \(d\)

    • Documents are normalized by length
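
A sketch comparing the two metrics on TF-IDF vectors with scikit-learn (norm=None keeps the raw weights so the length effect on Euclidean distance stays visible):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, euclidean_distances

docs = [
    "Text mining is to identify useful information.",
    "Useful information is mined from text.",
    "Apple is delicious.",
]

tfidf = TfidfVectorizer(norm=None).fit_transform(docs)

print(euclidean_distances(tfidf))  # grows with differences in document length
print(cosine_similarity(tfidf))    # angle-based, invariant to document length
```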

Next

  • Text classification

Summary

  • Text data are everywhere!
  • Language is hard!
  • The basic problem of text mining is that text is not a neat data set
  • Solution: text pre-processing & VSM

Practical 1