You can access the course materials quickly from
| Time | Monday | Tuesday | Wednesday | Thursday | Friday |
|---|---|---|---|---|---|
| 9:00 – 10:30 | Lecture 1 | Lecture 3 | Lecture 5 | Lecture 7 | Lecture 9 |
| 10:30 – 10:50 | Break | Break | Break | Break | Break |
| 10:50 – 12:00 | Practical 1 | Practical 3 | Practical 5 | Practical 7 | Practical 9 |
| 12:00 – 12:30 | Discussion 1 | Discussion 3 | Discussion 5 | Discussion 7 | Discussion 9 |
| 12:30 – 14:00 | Lunch | Lunch | Lunch | Lunch | Lunch |
| 14:00 – 15:20 | Lecture 2 | Lecture 4 | Lecture 6 | Lecture 8 | Lecture 10 |
| 15:20 – 15:30 | Break | Break | Break | Break | Break |
| 15:30 – 16:30 | Practical 2 | Practical 4 | Practical 6 | Practical 8 | Practical 10 |
| 16:30 – 17:00 | Discussion 2 | Discussion 4 | Discussion 6 | Discussion 8 | Discussion 10 |
Latest: Python 3.12.4
Follow the tutorial on Python in Google Colab for the Applied Text Mining course: link
- Python For Beginners
- The Python Language Reference
- Python 3.12.4 documentation
Colaboratory, or “Colab” for short, allows you to write and execute Python in your browser, with zero configuration required, free access to GPUs, and easy sharing.
[Intro](https://colab.research.google.com/notebooks/intro.ipynb)
Keyboard shortcuts:
This is Garry!
Garry works at Bol.com (a webshop in the Netherlands)
He works in the Customer Relationship Management department.
He uses Excel to read and search customer reviews, extract the aspects the reviews discuss, and identify the sentiment for each aspect.
Curious about his job? See two examples!
This is a nice book for both young and old. It gives beautiful life lessons in a fun way. Definitely worth the money!
+ Educational
+ Funny
+ Price
Nice story for older children.
+ Funny
- Readability

Garry likes his job a lot, but sometimes it is frustrating!
This is mainly because their company is expanding quickly!
Garry decides to hire Larry as his assistant.
Still, a lot to do for two people!
Garry has some budget left to hire another assistant for a couple of years!
He decides to hire Harry too!
Still, manual labeling using Excel is labor-intensive!
“the discovery by computer of new, previously unknown information, by automatically extracting information from different written resources” Hearst (1999)
Text mining is about looking for patterns in text, in a similar way that data mining can be loosely described as looking for patterns in data.
Text mining describes a set of linguistic, statistical, and machine learning techniques that model and structure the information content of textual sources. (Wikipedia)
Not all of these are appropriate at all times!
N-grams: a contiguous sequence of N tokens from a given piece of text
E.g., ‘Text mining is to identify useful information.’
Bigrams: ‘text_mining’, ‘mining_is’, ‘is_to’, ‘to_identify’, ‘identify_useful’, ‘useful_information’, ‘information_.’
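A minimal sketch of bigram extraction in plain Python (the tokenizer here is naive whitespace splitting with the final period pre-separated, an assumption for illustration):

```python
def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list, joined with underscores."""
    return ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Naive tokenization: lowercase, split on whitespace, treat the period as a token.
tokens = "text mining is to identify useful information .".split()
print(ngrams(tokens, 2))
# ['text_mining', 'mining_is', 'is_to', 'to_identify',
#  'identify_useful', 'useful_information', 'information_.']
```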
Represent by a string?
Represent by a list of sentences?
Represent by a vector?
A vector space is a collection of vectors
Represent documents by concept vectors
Each concept defines one dimension
k concepts define a high-dimensional space
Element of vector corresponds to concept weight
Distance between the vectors in this concept space
The process of converting text into numbers is called vectorization
Terms are generic features that can be extracted from text
Typically, terms are single words, keywords, n-grams, or phrases
Documents are represented as vectors of terms
Each dimension (concept) corresponds to a separate term
\[d = (w_1, ..., w_n)\]
Doc1: Text mining is to identify useful information.
Doc2: Useful information is mined from text.
Doc3: Apple is delicious.
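As a concrete sketch, these three documents can be turned into term vectors (assuming scikit-learn's CountVectorizer; the course materials may use a different vectorizer):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "Text mining is to identify useful information.",   # Doc1
    "Useful information is mined from text.",           # Doc2
    "Apple is delicious.",                              # Doc3
]

vectorizer = CountVectorizer()          # one dimension per term, raw counts
X = vectorizer.fit_transform(docs)      # sparse document-term matrix

print(vectorizer.get_feature_names_out())
print(X.toarray())
```

Each row of the matrix is a document vector \(d = (w_1, ..., w_n)\): shared terms such as “useful” and “information” place Doc1 and Doc2 close together in this space, while Doc3 shares only “is”.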
Idea: a term is more important if it occurs more frequently in a document
TF formulas
Let \(c(t,d)\) be the frequency count of term \(t\) in doc \(d\)
Raw TF: \(tf(t,d) = c(t,d)\)
Let \(n_{d,t}\) denote the number of times the \(t\)-th term appears in the \(d\)-th document.
\[TF_{d,t} = \frac{n_{d,t}}{\sum_i{n_{d,i}}}\]
Let \(N\) denote the number of documents and \(N_t\) denote the number of documents containing the \(t\)-th term.
\[IDF_t = \log\left(\frac{N}{N_t}\right)\]
TF-IDF weight:
\[w_{d,t} = TF_{d,t} \cdot IDF_t\]
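A minimal sketch computing these exact formulas over the three example documents (plain Python; library implementations such as scikit-learn's TfidfVectorizer use smoothed variants, so their numbers would differ):

```python
import math
from collections import Counter

docs = [
    "text mining is to identify useful information".split(),
    "useful information is mined from text".split(),
    "apple is delicious".split(),
]

N = len(docs)
# N_t: the number of documents containing term t
N_t = Counter(t for doc in docs for t in set(doc))

def tfidf(doc):
    counts = Counter(doc)                 # n_{d,t}
    total = len(doc)                      # sum_i n_{d,i}
    return {t: (n / total) * math.log(N / N_t[t]) for t, n in counts.items()}

for i, doc in enumerate(docs, start=1):
    print(f"Doc{i}:", {t: round(w, 3) for t, w in tfidf(doc).items()})
```

Note that “is” occurs in all three documents, so its IDF is \(\log(3/3) = 0\) and it contributes no weight.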
Euclidean distance
\(dist(d_i, d_j) = \sqrt{\sum_{t\in V}{[tf(t,d_i)idf(t) - tf(t, d_j)idf(t)]^2}}\)
Longer documents will be penalized by the extra words
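A toy check of this length effect (NumPy, with raw count vectors for simplicity, an illustrative assumption):

```python
import numpy as np

v = np.array([1.0, 2.0, 0.0, 1.0])   # count vector of a document
w = 2 * v                            # the same document repeated twice

print(np.linalg.norm(v - w))         # ~2.449, despite identical content
```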
We care more about how much the two vectors overlap
Cosine similarity
Angle between two vectors:
\(cosine(d_i, d_j) = \frac{V_{d_i}^T V_{d_j}}{\lVert V_{d_i}\rVert_2 \times \lVert V_{d_j}\rVert_2}\), where \(V_d\) is the TF-IDF vector of document \(d\)
Documents are normalized by length
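A minimal NumPy sketch, reusing the vectors from the Euclidean example above:

```python
import numpy as np

def cosine(u, v):
    """Length-normalized dot product: the cosine of the angle between u and v."""
    return (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

v = np.array([1.0, 2.0, 0.0, 1.0])
print(cosine(v, 2 * v))   # 1.0: repeating the document does not change the angle
```

Where Euclidean distance saw a gap of ~2.449 between a document and its duplicate, the cosine is exactly 1, which is why length-normalized similarity is preferred for comparing documents.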