Opening examples: ChatGPT-4o; Ask Delphi (Allen Institute for AI); Van Dalfsen et al. (2024), “Opinions on Plants and Animals through Time”.

Course Logistics

Course materials

Teachers

Pablo

Arjan

Jelte

Dong

Ayoub

Program

Time        | Monday       | Tuesday      | Wednesday    | Thursday     | Friday
9:00–10:30  | Lecture 1    | Lecture 3    | Lecture 5    | Lecture 7    | Lecture 9
10:30–10:50 | Break        | Break        | Break        | Break        | Break
10:50–12:00 | Practical 1  | Practical 3  | Practical 5  | Practical 7  | Practical 9
12:00–12:30 | Discussion 1 | Discussion 3 | Discussion 5 | Discussion 7 | Discussion 9
12:30–14:00 | Lunch        | Lunch        | Lunch        | Lunch        | Lunch
14:00–15:20 | Lecture 2    | Lecture 4    | Lecture 6    | Lecture 8    | Lecture 10
15:20–15:30 | Break        | Break        | Break        | Break        | Break
15:30–16:30 | Practical 2  | Practical 4  | Practical 6  | Practical 8  | Practical 10
16:30–17:00 | Discussion 2 | Discussion 4 | Discussion 6 | Discussion 8 | Discussion 10

Goal of the course

  • Text data are everywhere!
  • Much of the world’s data is in the form of unstructured text
  • This course teaches
    • text mining techniques
    • using Python
    • on a variety of applications
    • in many domains.

Python?

How familiar are you with Python?

  • What is your experience level with Python?


Python IDE?

  • Which Python IDE do you mostly use? If you use more than one environment, fill in the other text boxes.


Google Colab?

  • How familiar are you with Google Colab? (1: limited to 5: expert)


Python

Google Colab

What is Text Mining?

Text mining in an example

  • This is Garry!

  • Garry works at Bol.com (a webshop in the Netherlands)

  • He works in the Customer Relationship Management department.

  • He uses Excel to read and search customers’ reviews, extract the aspects they wrote about, and identify their sentiments.

  • Curious about his job? See two examples!

This is a nice book for both young and old. It gives beautiful life lessons in a fun way. Definitely worth the money!

+ Educational

+ Funny

+ Price


Nice story for older children.

+ Funny

- Readability

Example

  • Garry likes his job a lot, but sometimes it is frustrating!

  • This is mainly because his company is expanding quickly!

  • Garry decides to hire Larry as his assistant.

Example

  • Still, a lot to do for two people!

  • Garry has some budget left to hire another assistant for a couple of years!

  • He decides to hire Harry too!

  • Still, manual labeling using Excel is labor-intensive!

Language is hard!

  • Different things can mean more or less the same (“data science” vs. “statistics”)
  • Context dependency (“You have very nice shoes”);
  • Same words with different meanings (“to sanction”, “bank”);
  • Lexical ambiguity (“we saw her duck”)
  • Irony, sarcasm (“That’s just what I needed today!”, “Great!”, “Well, what a surprise.”)
  • Figurative language (“He has a heart of stone”)
  • Negation (“not good” vs. “good”), spelling variations, jargon, abbreviations
  • All of the above differ across languages, yet ~99% of the work is on English!

Text mining

  • “the discovery by computer of new, previously unknown information, by automatically extracting information from different written resources” Hearst (1999)

  • Text mining is about looking for patterns in text, in a similar way that data mining can be loosely described as looking for patterns in data.

  • Text mining describes a set of linguistic, statistical, and machine learning techniques that model and structure the information content of textual sources. (Wikipedia)

Can be quite effective!

  • We won’t solve linguistics …
  • In spite of the problems, text mining can be quite effective!

Process & Tasks

Text mining process

Text mining tasks

  • Text classification
  • Text clustering
  • Sentiment analysis
  • Feature selection
  • Topic modelling
  • Responsible text mining
  • Text summarization

And more in NLP

Text Preprocessing

Text preprocessing

  • is an approach for cleaning text data and removing noise.
  • brings your text into a form that is analyzable for your task.
  • transforms text into a more digestible form so that machine learning algorithms can perform better.

Typical steps

  • Tokenization (“text”, “mining”, “is”, “the”, “best”, “!”)
  • Stemming (“lungs”→“lung”) or Lemmatization (“were”→“be”)
  • Lowercasing (“Disease”→“disease”)
  • Stopword removal (“text mining best!”)
  • Punctuation removal (“text mining is the best”)
  • Number removal (“I42”→“I”)
  • Spell correction (“hart”→“heart”)

Not all of these are appropriate at all times!
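
A minimal sketch of a few of these steps, assuming the NLTK library is available (an assumption; any comparable toolkit works):

```python
# Rough preprocessing sketch with NLTK (pip install nltk).
# Resource names may vary slightly across NLTK versions.
import string
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

nltk.download("punkt")       # tokenizer models
nltk.download("stopwords")   # English stopword list

text = "Text mining is the best!"

tokens = nltk.word_tokenize(text)                            # tokenization
tokens = [t.lower() for t in tokens]                         # lowercasing
tokens = [t for t in tokens if t not in string.punctuation]  # punctuation removal
stop = set(stopwords.words("english"))
tokens = [t for t in tokens if t not in stop]                # stopword removal
stemmer = PorterStemmer()
tokens = [stemmer.stem(t) for t in tokens]                   # stemming
print(tokens)  # something like ['text', 'mine', 'best']
```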

Tokenization/Segmentation

  • Split text into words and sentences
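
For example, with NLTK’s tokenizers (an assumption; any tokenizer works):

```python
# Sentence and word segmentation with NLTK (requires the 'punkt' models).
from nltk.tokenize import sent_tokenize, word_tokenize

text = "Text mining is fun. It is also useful."
print(sent_tokenize(text))  # ['Text mining is fun.', 'It is also useful.']
print(word_tokenize(text))  # ['Text', 'mining', 'is', 'fun', '.', 'It', ...]
```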

N-grams

  • N-grams: a contiguous sequence of N tokens from a given piece of text

    • E.g., ‘Text mining is to identify useful information.’

    • Bigrams: ‘text_mining’, ‘mining_is’, ‘is_to’, ‘to_identify’, ‘identify_useful’, ‘useful_information’, ‘information_.’

  • Pros: capture local dependency and order
  • Cons: increase the vocabulary size
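
A small sketch that reproduces the slide’s bigrams from a token list:

```python
# Build bigrams by pairing each token with its successor.
tokens = ["text", "mining", "is", "to", "identify", "useful", "information", "."]
bigrams = ["_".join(pair) for pair in zip(tokens, tokens[1:])]
print(bigrams)
# ['text_mining', 'mining_is', 'is_to', 'to_identify',
#  'identify_useful', 'useful_information', 'information_.']
```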

Part Of Speech (POS) tagging

  • Annotate each word in a sentence with a part-of-speech.

  • Useful for subsequent syntactic parsing and word sense disambiguation.
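
A sketch with NLTK’s default English tagger (an assumption; the resource name differs across NLTK versions):

```python
# POS tagging of the ambiguous example from earlier.
import nltk
nltk.download("averaged_perceptron_tagger")  # 'averaged_perceptron_tagger_eng' in newer NLTK
tokens = "We saw her duck".split()
print(nltk.pos_tag(tokens))
# e.g. [('We', 'PRP'), ('saw', 'VBD'), ('her', 'PRP$'), ('duck', 'NN')]
```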

Vector Space Model

Basic idea

  • Text is “unstructured data”
  • How do we get to something structured that we can compute with?
  • Text must be represented somehow
  • Represent the text as something that makes sense to a computer

How to represent a document

  • Represent by a string?

    • No semantic meaning
  • Represent by a list of sentences?

    • Sentence is just like a short document (recursive definition)
  • Represent by a vector?

    • A vector is an ordered finite list of numbers.

Vector space model

  • A vector space is a collection of vectors

  • Represent documents by concept vectors

    • Each concept defines one dimension

    • k concepts define a high-dimensional space

    • Element of vector corresponds to concept weight

Vector space model

  • Distance between the vectors in this concept space

    • reflects the relationship among documents
  • The process of converting text into numbers is called Vectorization

Vector space model

  • Terms are generic features that can be extracted from text

  • Typically, terms are single words, keywords, n-grams, or phrases

  • Documents are represented as vectors of terms

  • Each dimension (concept) corresponds to a separate term

\[d = (w_1, ..., w_n)\]

An illustration of VS model

  • All documents are projected into this concept space

VSM: How do we represent vectors?

Bag of Words (BOW)

  • Terms are words (more generally we can use n-grams)
  • Weights are derived from the occurrences of the terms in the document:
    • Binary
    • Term Frequency (TF)
    • Term Frequency inverse Document Frequency (TFiDF)

Binary

  • Doc1: Text mining is to identify useful information.

  • Doc2: Useful information is mined from text.

  • Doc3: Apple is delicious.
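
One way to build the binary document-term matrix for these three documents, assuming scikit-learn:

```python
# Binary BOW: cell (d, t) is 1 if term t occurs in document d, else 0.
from sklearn.feature_extraction.text import CountVectorizer

docs = ["Text mining is to identify useful information.",
        "Useful information is mined from text.",
        "Apple is delicious."]
vec = CountVectorizer(binary=True)
X = vec.fit_transform(docs)
print(vec.get_feature_names_out())  # the vocabulary (dimensions)
print(X.toarray())                  # one row per document
```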

Term Frequency

  • Idea: a term is more important if it occurs more frequently in a document

  • TF formulas

    • Let \(c(t,d)\) be the frequency count of term \(t\) in document \(d\)

    • Raw TF: \(tf(t,d) = c(t,d)\)

TF: Document - Term Matrix (DTM)
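
The raw counts themselves give the TF document-term matrix; with scikit-learn (an assumption), the only change from the binary version above is dropping `binary=True`:

```python
# Raw TF DTM: cell (d, t) holds c(t, d), the count of term t in document d.
from sklearn.feature_extraction.text import CountVectorizer

docs = ["Text mining is to identify useful information.",
        "Useful information is mined from text.",
        "Apple is delicious."]
X = CountVectorizer().fit_transform(docs)
print(X.toarray())
```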

TFiDF

  • Idea: a term is more discriminative if it occurs frequently, but only in a few documents

Let \(n_{d,t}\) denote the number of times the \(t\)-th term appears in the \(d\)-th document.

\[TF_{d,t} = \frac{n_{d,t}}{\sum_i{n_{d,i}}}\] Let \(N\) denote the number of documents and \(N_t\) denote the number of documents containing the \(t\)-th term.

\[IDF_t = \log\left(\frac{N}{N_t}\right)\] TFiDF weight:

\[w_{d,t} = TF_{d,t} \cdot IDF_t\]
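
A NumPy sketch of exactly these formulas on a toy count matrix (the counts are made up for illustration; scikit-learn’s TfidfVectorizer uses a smoothed variant):

```python
import numpy as np

# n_{d,t}: rows are documents, columns are terms (illustrative counts).
counts = np.array([[2., 1., 0.],
                   [1., 0., 1.],
                   [0., 1., 1.]])

tf  = counts / counts.sum(axis=1, keepdims=True)  # TF_{d,t}
N   = counts.shape[0]                             # number of documents
N_t = (counts > 0).sum(axis=0)                    # documents containing term t
idf = np.log(N / N_t)                             # IDF_t
w   = tf * idf                                    # w_{d,t} = TF_{d,t} * IDF_t
print(w)
```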

TFiDF: Document - Term matrix (DTM)

How to define a good similarity metric?

  • Euclidean distance

    \(dist(d_i, d_j) = \sqrt{\sum_{t\in V}{[tf(t,d_i)idf(t) - tf(t, d_j)idf(t)]^2}}\)

    • Longer documents will be penalized by the extra words

    • We care more about how much the two vectors overlap

  • Cosine similarity

    • Angle between two vectors:

      \(cosine(d_i, d_j) = \frac{V_{d_i}^\top V_{d_j}}{\lVert V_{d_i} \rVert_2 \times \lVert V_{d_j} \rVert_2}\), where \(V_d\) is the TF-IDF vector of document \(d\)

    • Documents are normalized by length
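
Putting it together with scikit-learn (an assumption; note its TF-IDF uses a smoothed IDF and L2-normalizes rows, so values differ slightly from the formulas above):

```python
# Pairwise cosine similarity between TF-IDF document vectors.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["Text mining is to identify useful information.",
        "Useful information is mined from text.",
        "Apple is delicious."]
X = TfidfVectorizer().fit_transform(docs)
print(cosine_similarity(X))  # 3x3 matrix; the diagonal is 1.0
```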

Next

  • Text classification

Summary

  • Text data are everywhere!
  • Language is hard!
  • The basic problem of text mining is that text is not a neat data set
  • Solution: text pre-processing & VSM

Practical 1