You can access the course materials quickly from
Time | Monday | Tuesday | Wednesday | Thursday |
---|---|---|---|---|
9:00 – 10:30 | Lecture 1 | Lecture 3 | Lecture 5 | Lecture 7 |
10:30 – 10:45 | Break | Break | Break | Break |
10:45 – 11:45 | Practical 1 | Practical 3 | Practical 5 | Practical 7 |
11:45 – 12:30 | Discussion 1 | Discussion 3 | Discussion 5 | Discussion 7 |
12:30 – 14:00 | Lunch | Lunch | Lunch | Lunch |
14:00 – 15:30 | Lecture 2 | Lecture 4 | Lecture 6 | Lecture 8 |
15:30 – 15:45 | Break | Break | Break | Break |
15:45 – 16:30 | Practical 2 | Practical 4 | Practical 6 | Practical 8 |
16:30 – 17:00 | Discussion 2 | Discussion 4 | Discussion 6 | Discussion 8 |
This is Garry!
Garry works at Bol.com (a webshop in the Netherlands).
He works in the Customer Relationship Management department.
He uses Excel to read and search customers’ reviews, extract the aspects the reviews discuss, and identify the sentiment expressed about each aspect.
Curious about his job? See two examples!
This is a nice book for both young and old. It gives beautiful life lessons in a fun way. Definitely worth the money!
+ Educational
+ Funny
+ Price
Nice story for older children.
+ Funny
- Readability
Garry likes his job a lot, but sometimes it is frustrating!
This is mainly because their company is expanding quickly!
Garry decides to hire Larry as his assistant.
Still, a lot to do for two people!
Garry has some budget left to hire another assistant for a couple of years!
He decides to hire Harry too!
Still, manual labeling using Excel is labor-intensive!
“the discovery by computer of new, previously unknown information, by automatically extracting information from different written resources” Hearst (1999)
Text mining is about looking for patterns in text, in a similar way that data mining can be loosely described as looking for patterns in data.
Text mining describes a set of linguistic, statistical, and machine learning techniques that model and structure the information content of textual sources. (Wikipedia)
Automatically distinguishing political news from sports news
Authorship identification
Age/gender identification
Language identification
…
Regular expressions are one of the first and most important text mining techniques we must learn about!
Really clever “wild card” expressions for matching and parsing strings.
In computing, a regular expression, also referred to as “regex” or “regexp”, provides a concise and flexible means for matching strings of text, such as particular characters, words, or patterns of characters. A regular expression is written in a formal language that can be interpreted by a regular expression processor.
A formal language for specifying text strings
How can we search for any of these?
netherland
netherlands
Netherland
Netherlands
There are some rules on how to write REs…
The use of the brackets [] to specify a disjunction of characters:
The pipe symbol | is also for disjunction:
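Both kinds of disjunction can be tried out in base R. This is a minimal sketch; the vector `words` is made up for illustration and is not part of the course data.

```r
# Hypothetical example words, used only for illustration
words <- c("netherlands", "Netherlands", "cat", "dog", "bird")

# Brackets: accept either an upper- or a lower-case first letter
bracket_hits <- grepl("[nN]etherlands", words)
print(bracket_hits)  # TRUE TRUE FALSE FALSE FALSE

# Pipe: accept either of two whole alternatives
pipe_hits <- grepl("cat|dog", words)
print(pipe_hits)     # FALSE FALSE TRUE TRUE FALSE
```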
The use of the brackets [] plus the dash - to specify a range:
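A small sketch of ranges in base R, assuming a made-up vector `x`. Note that R's documentation warns that the interpretation of ranges can depend on the locale; the results in the comments are what is expected for plain ASCII text.

```r
# Hypothetical strings, used only for illustration
x <- c("Chapter 1", "chapter", "ROOM 42", "no digits here")

# [0-9] matches any single digit
has_digit <- grepl("[0-9]", x)
print(has_digit)  # TRUE FALSE TRUE FALSE

# [A-Z] matches any single upper-case ASCII letter
has_upper <- grepl("[A-Z]", x)
print(has_upper)  # TRUE FALSE TRUE FALSE
```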
The caret ^ for negation (when it is the first character inside brackets) or just to mean a literal ^:
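The two meanings of the caret inside brackets can be compared side by side. This is a sketch on a made-up vector `x`, not an example from the course data.

```r
# Hypothetical strings, used only for illustration
x <- c("abc", "ABC", "a^b")

# Caret first inside brackets: negation,
# i.e. any character that is NOT a lower-case letter
not_lower <- grepl("[^a-z]", x)
print(not_lower)      # FALSE TRUE TRUE

# Caret elsewhere inside brackets: a literal caret (here: "x" or "^")
literal_caret <- grepl("[x^]", x)
print(literal_caret)  # FALSE FALSE TRUE
```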
The question mark ? marks optionality of the previous expression:
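Combining brackets with the optional marker is enough to cover the four spellings of "Netherlands" listed earlier. A minimal sketch; the extra word "Neverland" is an invented negative example.

```r
# The four spellings from the question above, plus one hypothetical non-match
countries <- c("netherland", "netherlands", "Netherland", "Netherlands",
               "Neverland")

# [nN] allows either case for the first letter; s? makes the final "s" optional
hits <- grepl("[nN]etherlands?", countries)
print(hits)  # TRUE TRUE TRUE TRUE FALSE
```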
The use of the period . to specify any character:
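A short sketch of the period as a wild card, using made-up words:

```r
# Hypothetical words, used only for illustration
x <- c("begin", "began", "begun", "bgn")

# The period stands for exactly one character, whatever it is
hits <- grepl("beg.n", x)
print(hits)  # TRUE TRUE TRUE FALSE
```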
Aliases for common sets of characters:
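Three commonly used aliases are \d (digit), \w (word character), and \s (whitespace). The sketch below assumes a made-up sentence; note that the backslash itself must be doubled inside an R string.

```r
# Hypothetical sentence, used only for illustration
x <- "Room 42 opens at 9"

# \d+ : one or more digits
digits <- regmatches(x, gregexpr("\\d+", x))[[1]]
print(digits)    # "42" "9"

# \w+ : one or more word characters (letters, digits, underscore)
tokens <- regmatches(x, gregexpr("\\w+", x))[[1]]
print(tokens)    # "Room" "42" "opens" "at" "9"

# \s : a single whitespace character
no_space <- gsub("\\s", "", x)
print(no_space)  # "Room42opensat9"
```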
The backslash for escaping!
Patterns are greedy: by default, a regular expression matches the largest string it can, expanding to cover as much of the string as possible.
To enforce non-greedy matching, use another meaning of the ? qualifier: placed directly after a quantifier such as * or +, it makes the match as short as possible.
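The difference between greedy and non-greedy matching is easiest to see on a string with several possible end points. A minimal sketch on a made-up string:

```r
# Hypothetical string, used only for illustration
x <- "<b>bold</b> and <i>italic</i>"

# Greedy: .+ runs on to the LAST ">" in the string
greedy <- regmatches(x, regexpr("<.+>", x))
print(greedy)  # "<b>bold</b> and <i>italic</i>"

# Non-greedy: .+? stops at the FIRST ">" it can
lazy <- regmatches(x, regexpr("<.+?>", x))
print(lazy)    # "<b>"
```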
Some characters that need to be backslashed:
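The period and the dollar sign are two such characters. The sketch below, on made-up strings, shows what happens with and without the backslash. In R the backslash must be doubled inside a string; alternatively, `fixed = TRUE` treats the whole pattern as literal text.

```r
# Hypothetical strings, used only for illustration
x <- c("3.50", "3x50", "cost: $5")

# Unescaped period: matches any character, so "3x50" is matched too
unescaped <- grepl("3.50", x)
print(unescaped)      # TRUE TRUE FALSE

# Escaped period: matches only a literal "."
escaped <- grepl("3\\.50", x)
print(escaped)        # TRUE FALSE FALSE

# Escaped dollar sign: a literal "$" instead of the end-of-string anchor
dollar <- grepl("\\$5", x)
print(dollar)         # FALSE FALSE TRUE

# fixed = TRUE: no character has a special meaning
dollar_fixed <- grepl("$5", x, fixed = TRUE)
print(dollar_fixed)   # FALSE FALSE TRUE
```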
Very powerful and quite cryptic
Fun once you understand them
Regular expressions are a programming language with characters
It is kind of an “old school” language
The primary R functions for dealing with regular expressions are:
grep(), grepl(): Search for matches of a regular expression/pattern in a character vector
regexpr(), gregexpr(): Search a character vector for regular expression matches and return the indices where the match begins; useful in conjunction with regmatches()
sub(), gsub(): Search a character vector for regular expression matches and replace that match with another string
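The functions above can be compared on one small example. This is a sketch with a made-up vector `reviews` (loosely modeled on the review snippets earlier), not the actual company data.

```r
# Hypothetical reviews, used only for illustration
reviews <- c("Nice story for older children.",
             "Definitely worth the money!",
             "Nice book, nice price.")

# grep(): indices of the elements that contain a match
idx <- grep("[nN]ice", reviews)
print(idx)            # 1 3

# grepl(): one TRUE/FALSE per element
flags <- grepl("[nN]ice", reviews)
print(flags)          # TRUE FALSE TRUE

# regexpr(): start position of the FIRST match per element (-1 = no match)
first_pos <- as.integer(regexpr("[nN]ice", reviews))
print(first_pos)      # 1 -1 1

# gregexpr() + regmatches(): ALL matches per element
all_matches <- regmatches(reviews, gregexpr("[nN]ice", reviews))
print(all_matches[[3]])  # "Nice" "nice"

# sub() replaces only the first match, gsub() replaces every match
first_replaced <- sub("[nN]ice", "good", reviews)
all_replaced   <- gsub("[nN]ice", "good", reviews)
print(first_replaced[3])  # "good book, nice price."
print(all_replaced[3])    # "good book, good price."
```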
The stringr package provides a series of functions implementing much of the regular expression functionality in R but with a more consistent and rationalized interface.
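For comparison, a few stringr equivalents are sketched below. The example assumes the stringr package is installed (the code is skipped otherwise) and uses a made-up vector `reviews`; in stringr the string always comes first and the pattern second.

```r
# Hypothetical reviews, used only for illustration
reviews <- c("Nice story for older children.",
             "Definitely worth the money!",
             "Nice book, nice price.")

# Assumption: stringr is installed; skip the example if it is not
if (requireNamespace("stringr", quietly = TRUE)) {
  # Counterpart of grepl()
  detected <- stringr::str_detect(reviews, "[nN]ice")
  print(detected)        # TRUE FALSE TRUE

  # Counterpart of gregexpr() + regmatches()
  extracted <- stringr::str_extract_all(reviews, "[nN]ice")
  print(extracted[[3]])  # "Nice" "nice"

  # Counterpart of gsub()
  replaced <- stringr::str_replace_all(reviews, "[nN]ice", "good")
  print(replaced[3])     # "good book, good price."
}
```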
The pattern "the" misses capitalized examples such as The
The pattern "[tT]he" incorrectly returns words such as other or Netherlands
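Assuming the task is to find every instance of the word "the" (as the patterns that follow suggest), both problems can be reproduced with two naive patterns. A minimal sketch using the example sentence from these slides:

```r
txt <- "The other the Netherlands will then be without the"

# Naive attempt 1: lower-case only, so the capitalized "The" is missed
lower_only <- regmatches(txt, gregexpr("the", txt))[[1]]
print(length(lower_only))  # 5 (inside "other", "Netherlands", "then", plus two real ones)

# Naive attempt 2: both cases, but still matching inside longer words
both_cases <- regmatches(txt, gregexpr("[tT]he", txt))[[1]]
print(length(both_cases))  # 6, although the word "the" occurs only 3 times
```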
[^a-zA-Z][tT]he[^a-zA-Z]
Still not completely correct! What is missing?
txt <- "The other the Netherlands will then be without the"
r <- gregexpr("[^a-zA-Z][tT]he[^a-zA-Z]", txt)
print(regmatches(txt, r))
## [[1]]
## [1] " the "
r <- gregexpr("(^|[^a-zA-Z])[tT]he($|[^a-zA-Z])", txt)
print(regmatches(txt, r))
## [[1]]
## [1] "The " " the " " the"
The process we just went through was based on fixing two kinds of errors:
Matching strings that we should not have matched (there, Netherlands)
Not matching things that we should have matched (The)
In NLP we are always dealing with these kinds of errors.
Reducing the error rate for an application often involves:
Increasing precision (minimizing false positives)
Increasing recall (minimizing false negatives).
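Precision and recall can be computed directly from the counts of true positives, false positives, and false negatives. The sketch below evaluates the naive pattern "the" on a made-up, hand-labeled token list; the labels in `gold` are an assumption for illustration.

```r
# Hypothetical tokens and hand-made gold labels:
# is the token really the word "the"?
tokens <- c("The", "other", "the", "Netherlands", "then", "the")
gold   <- c(TRUE,  FALSE,   TRUE,  FALSE,         FALSE,  TRUE)

# Predictions of the naive lower-case pattern
pred <- grepl("the", tokens)

tp <- sum(pred & gold)    # correctly matched
fp <- sum(pred & !gold)   # matched, but should not have been
fn <- sum(!pred & gold)   # should have been matched, but was not

precision <- tp / (tp + fp)
recall    <- tp / (tp + fn)
print(precision)  # 0.4
print(recall)     # 0.6666667
```

Here the false positives (other, Netherlands, then) lower precision, while the missed "The" lowers recall.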
Suppose we want to build an application to help a user buy a car from textual catalogues. The user looks for any car cheaper than $10,000.00.
Assume we are using the following data:
txt <- c("Price of Tesla S is $8599.99.", "Audi Q4 is $7000.", "BMW X5 costs $900")
Which RE will help us to do this?
txt <- c("Price of Tesla S is $8599.99.", "Audi Q4 is $7000.", "BMW X5 costs $900")
r <- gregexpr("(^|\\W)\\$[0-9]{1,4}(\\.[0-9][0-9])?", txt)
print(regmatches(txt, r))
## [[1]]
## [1] " $8599.99"
##
## [[2]]
## [1] " $7000"
##
## [[3]]
## [1] " $900"
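One remaining weakness is worth checking: a pattern that only limits the number of digits would still match the first four digits of a more expensive price. The sketch below is one possible fix, not the course's official solution. It adds a negative lookahead, which requires `perl = TRUE`, and the fourth catalogue entry is invented to test it.

```r
# The three catalogue lines from above, plus one hypothetical expensive car
txt2 <- c("Price of Tesla S is $8599.99.", "Audi Q4 is $7000.",
          "BMW X5 costs $900", "Volvo XC90 is $12000.")

# (?![0-9]) : the price must NOT be followed by another digit,
# so "$12000" can no longer be matched as "$1200"
pattern <- "(^|\\W)\\$[0-9]{1,4}(\\.[0-9]{2})?(?![0-9])"
r2 <- gregexpr(pattern, txt2, perl = TRUE)
cheap <- regmatches(txt2, r2)
print(cheap)
# Elements 1 to 3 contain " $8599.99", " $7000" and " $900";
# element 4 is empty (character(0))
```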
During one of the coffee moments at the company, Garry was talking about the situation in the Customer Relationship Management department.
When Carrie, his colleague from the Data Science department, hears about the situation, she suggests that Garry use text mining!
She says: “Text mining is your friend; it can help you make the process much faster than Excel by filtering words and recommending labels.”
She continues: “Text mining is a subfield of AI and NLP and is related to data science, data mining and machine learning.”
After consulting with Larry and Harry, they decide to give text mining a try!