ChatGPT4o


Ask Delphi: Allen Institute for AI

Opinions on Plants and Animals through Time Van Dalfsen et al. 2024

Course Logistics

Course materials

Teachers

Qixiang

Luka

Hugh Mee

Ayoub

Program

| Time          | Monday       | Tuesday      | Wednesday    | Thursday     |
|---------------|--------------|--------------|--------------|--------------|
| 9:00 – 10:30  | Lecture 1    | Lecture 3    | Lecture 5    | Lecture 7    |
|               | Break        | Break        | Break        | Break        |
| 10:45 – 11:45 | Practical 1  | Practical 3  | Practical 5  | Practical 7  |
| 11:45 – 12:30 | Discussion 1 | Discussion 3 | Discussion 5 | Discussion 7 |
|               | Lunch        | Lunch        | Lunch        | Lunch        |
| 14:00 – 15:30 | Lecture 2    | Lecture 4    | Lecture 6    | Lecture 8    |
|               | Break        | Break        | Break        | Break        |
| 15:45 – 16:30 | Practical 2  | Practical 4  | Practical 6  | Practical 8  |
| 16:30 – 17:00 | Discussion 2 | Discussion 4 | Discussion 6 | Discussion 8 |

Goal of the course

  • Text data is everywhere!
  • A lot of the world’s data is in unstructured text format
  • The course teaches
    • text mining techniques
    • using R
    • on a variety of applications
    • in many domains.

What is Text Mining?

Example

  • This is Garry!

  • Garry works at Bol.com (a webshop in the Netherlands)

  • He works in the department of Customer relationship management.

  • He uses Excel to read and search customers’ reviews, extract aspects they wrote their reviews on, and identify their sentiments.

  • Curious about his job? See two examples!

This is a nice book for both young and old. It gives beautiful life lessons in a fun way. Definitely worth the money!

+ Educational

+ Funny

+ Price


Nice story for older children.

+ Funny

- Readability

Example

  • Garry likes his job a lot, but sometimes it is frustrating!

  • This is mainly because their company is expanding quickly!

  • Garry decides to hire Larry as his assistant.

Example

  • Still, a lot to do for two people!

  • Garry has some budget left to hire another assistant for a couple of years!

  • He decides to hire Harry too!

  • Still, manual labeling using Excel is labor-intensive!

Language is hard!

  • Different things can mean more or less the same (“data science” vs. “statistics”)
  • Context dependency (“You have very nice shoes”)
  • Same words with different meanings (“to sanction”, “bank”)
  • Lexical ambiguity (“we saw her duck”)
  • Irony, sarcasm (“That’s just what I needed today!”, “Great!”, “Well, what a surprise.”)
  • Figurative language (“He has a heart of stone”)
  • Negation (“not good” vs. “good”), spelling variations, jargon, abbreviations
  • All of the above differ across languages, and 99% of the work is on English!

Text Mining to the Rescue!

  • “the discovery by computer of new, previously unknown information, by automatically extracting information from different written resources” Hearst (1999)

  • Text mining is about looking for patterns in text, in a similar way that data mining can be loosely described as looking for patterns in data.

  • Text mining describes a set of linguistic, statistical, and machine learning techniques that model and structure the information content of textual sources. (Wikipedia)

  • We won’t solve linguistics …
  • In spite of the problems, text mining can be quite effective!

Applications

Text mining applications

Who wrote the Wilhelmus?

Text Classification

Which ICD-10 codes should I give this doctor’s note?

Sentiment Analysis / Opinion Mining

Statistical Machine Translation

Dialog Systems

Question Answering

Go beyond search

Which studies go in my systematic review?

And more …

  • Automatically classify political news from sports news

  • Authorship identification

  • Age/gender identification

  • Language Identification

Process & Tasks

Text mining process

Pattern discovery tasks in text

  • Text classification
  • Text clustering
  • Sentiment analysis
  • Feature selection
  • Topic modelling
  • Responsible text mining
  • Text summarization

And more in NLP

Regular Expressions

Regular expressions



In computing, a regular expression, also referred to as “regex” or “regexp”, provides a concise and flexible means for matching strings of text, such as particular characters, words, or patterns of characters. A regular expression is written in a formal language that can be interpreted by a regular expression processor.





http://en.wikipedia.org/wiki/Regular_expression

Regular expressions

  • A formal language for specifying text strings

  • How can we search for any of these?

    • netherland

    • netherlands

    • Netherland

    • Netherlands
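One possible answer, sketched in R under a few assumptions: the vector `words` is hypothetical, and the entry "Neverland" is added here only as a negative case. A bracket expression accepts either capitalization of the first letter, and an optional final "s" covers both endings.

```r
# Candidate spellings from the slide, plus one word that should not match
words <- c("netherland", "netherlands", "Netherland", "Netherlands", "Neverland")

# [nN] accepts either case for the first letter; s? makes the final "s" optional;
# ^ and $ require the whole string to match
pattern <- "^[nN]etherlands?$"

matches <- grepl(pattern, words)
print(matches)
## [1]  TRUE  TRUE  TRUE  TRUE FALSE
```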

How to write regular expressions

There are some rules on how to write REs…

1. Some simple regex searches

Example in R

  • Define the text string
    • text <- "How much wood would a woodchuck chuck if a woodchuck could chuck wood?"
  • Check if "woodchuck" is in the text
    • found <- grepl("woodchuck", text)
  • Print the result
    • print(found)

2. Disjunction

The use of the brackets [] to specify a disjunction of characters:

The pipe symbol | is also for disjunction:
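A minimal sketch of both kinds of disjunction in R. The vector `animals` is a made-up illustration, not course data.

```r
animals <- c("Woodchuck", "woodchuck", "groundhog", "woodpecker")

# Brackets: exactly one character out of the listed set
bracket_hits <- grepl("[wW]oodchuck", animals)

# Pipe: one whole alternative or the other (matching is case-sensitive)
pipe_hits <- grepl("groundhog|woodchuck", animals)

print(bracket_hits)
## [1]  TRUE  TRUE FALSE FALSE
print(pipe_hits)
## [1] FALSE  TRUE  TRUE FALSE
```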

3. Brackets and dash

The use of the brackets [] plus the dash - to specify a range:
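A short sketch of ranges in R, using a made-up vector `x`. It passes `perl = TRUE` because R's documentation warns that the interpretation of ranges can depend on the locale; with the Perl-compatible engine they follow code-point order.

```r
x <- c("Chapter 1", "chapter one", "A", "42")

# perl = TRUE: ranges are interpreted by code point, independent of the locale
has_digit <- grepl("[0-9]", x, perl = TRUE)  # any digit
has_upper <- grepl("[A-Z]", x, perl = TRUE)  # any upper-case ASCII letter
has_lower <- grepl("[a-z]", x, perl = TRUE)  # any lower-case ASCII letter

print(data.frame(x, has_digit, has_upper, has_lower))
```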

4. Negation

The caret ^ for negation or just to mean ^:
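A small sketch of the two roles of the caret, with a made-up vector `x`. As the first character inside brackets it negates the set; elsewhere inside brackets, or when escaped outside them, it is a literal caret.

```r
x <- c("abc", "ABC", "a^b", "123")

# Caret first inside brackets: any character that is NOT a lower-case letter
not_lower <- grepl("[^a-z]", x, perl = TRUE)

# Caret not first inside brackets: a literal "^" (here: either "a" or "^")
a_or_caret <- grepl("[a^]", x)

# Escaped caret outside brackets: the literal sequence a^b
literal_caret <- grepl("a\\^b", x)

print(not_lower)
## [1] FALSE  TRUE  TRUE  TRUE
print(a_or_caret)
## [1]  TRUE FALSE  TRUE FALSE
print(literal_caret)
## [1] FALSE FALSE  TRUE FALSE
```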

5. Question and period marks

The question mark ? marks optionality of the previous expression:

The use of the period . to specify any character:
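A quick sketch of both operators in R. The vectors `spellings` and `verbs` are illustrative only.

```r
# ? makes the preceding character optional
spellings <- c("color", "colour", "colouur")
optional_u <- grepl("^colou?r$", spellings)
print(optional_u)
## [1]  TRUE  TRUE FALSE

# . matches exactly one character of any kind, including a space
verbs <- c("begin", "began", "begun", "bgn", "beg n")
any_char <- grepl("^beg.n$", verbs)
print(any_char)
## [1]  TRUE  TRUE  TRUE FALSE  TRUE
```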

6. Anchors
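A minimal sketch of anchors in R, assuming the usual meanings: `^` for the start of the string, `$` for the end, and `\b` for a word boundary (used here with `perl = TRUE`). The vector `x` is made up for illustration.

```r
x <- c("The end", "In the end", "the other", "Other")

starts_with_the <- grepl("^[tT]he", x)                  # "the"/"The" at the start
ends_with_end   <- grepl("end$", x)                     # "end" at the very end
whole_word_the  <- grepl("\\bthe\\b", x, perl = TRUE)   # "the" as a separate word

print(starts_with_the)
## [1]  TRUE FALSE  TRUE FALSE
print(ends_with_end)
## [1]  TRUE  TRUE FALSE FALSE
print(whole_word_the)
## [1] FALSE  TRUE  TRUE FALSE
```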

7. Common sets

Aliases for common sets of characters:

The backslash for escaping!
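A hedged sketch of some common aliases in R: `\d` for digits, `\D` for non-digits, `\w` for word characters (letters, digits, underscore), and `\s` for whitespace. Inside an R string literal the backslash itself must be doubled. The string `x` is made up, and `perl = TRUE` is passed so the aliases are handled by the Perl-compatible engine.

```r
x <- "Room 42, floor_3!"

# Note the doubled backslashes: "\\d" in R source is the regex \d
digits <- regmatches(x, gregexpr("\\d+", x, perl = TRUE))[[1]]
words  <- regmatches(x, gregexpr("\\w+", x, perl = TRUE))[[1]]
spaces <- regmatches(x, gregexpr("\\s", x, perl = TRUE))[[1]]

# \D is the complement of \d: delete everything that is not a digit
only_digits <- gsub("\\D", "", x, perl = TRUE)

print(digits)
## [1] "42" "3"
print(words)
## [1] "Room"    "42"      "floor_3"
print(length(spaces))
## [1] 2
print(only_digits)
## [1] "423"
```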

8. Operators for counting

  • Patterns are greedy: In these cases regular expressions always match the largest string they can, expanding to cover as much of a string as they can.

  • Enforce non-greedy matching, using another meaning of the ? qualifier.

    • The operator *? is a Kleene star that matches as little text as possible.
    • The operator +? is a Kleene plus that matches as little text as possible.
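The counting operators and the greedy versus non-greedy distinction can be sketched as follows. The inputs `y` and `html` are made up, and `perl = TRUE` is used for the greedy and lazy matches.

```r
y <- c("b", "ab", "aab", "aaab")

zero_or_more <- grepl("^a*b$", y)     # * : zero or more "a"
one_or_more  <- grepl("^a+b$", y)     # + : one or more "a"
exactly_two  <- grepl("^a{2}b$", y)   # {2} : exactly two "a"
two_or_more  <- grepl("^a{2,}b$", y)  # {2,} : at least two "a"

html <- "<b>bold</b> and <i>italic</i>"

# Greedy: .+ expands as far as it can, up to the last ">"
greedy <- regmatches(html, regexpr("<.+>", html, perl = TRUE))

# Non-greedy: .+? stops at the first ">" it can
lazy <- regmatches(html, regexpr("<.+?>", html, perl = TRUE))

print(greedy)
## [1] "<b>bold</b> and <i>italic</i>"
print(lazy)
## [1] "<b>"
```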

9. Other

Some characters that need to be backslashed:
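A small sketch of why escaping matters, using a made-up vector `x`. An unescaped period matches any character, while `\\.` in R source matches a literal period. As an alternative, `fixed = TRUE` tells `grepl()` to treat the whole pattern as a plain string.

```r
x <- c("3.14", "3514", "a+b", "aab")

unescaped_dot <- grepl("3.14", x)     # "." matches any character, so "3514" matches too
escaped_dot   <- grepl("3\\.14", x)   # literal period only
escaped_plus  <- grepl("a\\+b", x)    # literal plus sign
fixed_plus    <- grepl("a+b", x, fixed = TRUE)  # no regex interpretation at all

print(unescaped_dot)
## [1]  TRUE  TRUE FALSE FALSE
print(escaped_dot)
## [1]  TRUE FALSE FALSE FALSE
print(escaped_plus)
## [1] FALSE FALSE  TRUE FALSE
```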

Operator precedence hierarchy
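A minimal sketch of operator precedence, assuming the usual order from highest to lowest: parentheses, counters (`*`, `+`, `?`, `{}`), sequences and anchors, and finally disjunction (`|`). The test strings are made up.

```r
x <- c("the", "thethe", "theee", "other")

# Counters bind tighter than sequences: the* means "th" followed by zero or more "e"
counter_on_e <- grepl("^the*$", x)

# Parentheses bind tightest: (the)* repeats the whole group
counter_on_group <- grepl("^(the)*$", x)

y <- c("then", "many", "they", "anything")

# Sequences and anchors bind tighter than |: this reads as (^the) OR (any$)
anchors_split <- grepl("^the|any$", y)

# Parentheses make the anchors apply to both alternatives
anchors_shared <- grepl("^(the|any)$", y)

print(counter_on_e)
## [1]  TRUE FALSE  TRUE FALSE
print(counter_on_group)
## [1]  TRUE  TRUE FALSE FALSE
print(anchors_split)
## [1]  TRUE  TRUE  TRUE FALSE
print(anchors_shared)
## [1] FALSE FALSE FALSE FALSE
```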

Understanding Regular Expressions

  • Very powerful and quite cryptic

  • Fun once you understand them

  • Regular expressions are a programming language with characters

  • It is kind of an “old school” language

In R

The primary R functions for dealing with regular expressions are:

  • grep(), grepl(): Search for matches of a regular expression/pattern in a character vector

  • regexpr(), gregexpr(): Search a character vector for regular expression matches and return the indices where the match begins; useful in conjunction with regmatches()

  • sub(), gsub(): Search a character vector for regular expression matches and replace that match with another string

  • The stringr package provides a series of functions implementing much of the regular expression functionality in R but with a more consistent and rationalized interface.
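The base R functions listed above can be compared side by side on a small example. The vector `reviews` is made up for illustration; only base R is used, so no package needs to be installed.

```r
reviews <- c("Nice book, nice price", "Boring story", "A NICE gift")

# grep(): indices of the elements that contain a match
idx <- grep("nice", reviews, ignore.case = TRUE)

# grepl(): one TRUE/FALSE per element
has_nice <- grepl("nice", reviews, ignore.case = TRUE)

# regexpr(): start position of the first match in each element (-1 if none)
first_pos <- as.integer(regexpr("nice", reviews, ignore.case = TRUE))

# sub() replaces the first match per element, gsub() replaces all matches
first_replaced <- sub("nice", "good", reviews, ignore.case = TRUE)
all_replaced   <- gsub("nice", "good", reviews, ignore.case = TRUE)

print(idx)
## [1] 1 3
print(first_pos)
## [1]  1 -1  3
print(first_replaced)
## [1] "good book, nice price" "Boring story"          "A good gift"
print(all_replaced)
## [1] "good book, good price" "Boring story"          "A good gift"
```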

Example

  • Find all instances of the word “the” in a text.
the

Misses capitalized examples



[tT]he

Incorrectly returns words such as other or Netherlands



[^a-zA-Z] [tT]he [^a-zA-Z]

Still not completely correct! What is missing?

Example

txt <- "The other the Netherlands will then be without the"
r   <- gregexpr("[^a-zA-Z][tT]he[^a-zA-Z]", txt)
print(regmatches(txt, r))
## [[1]]
## [1] " the "
r   <- gregexpr("(^|[^a-zA-Z])[tT]he($|[^a-zA-Z])", txt)
print(regmatches(txt, r))
## [[1]]
## [1] "The "  " the " " the"
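As an alternative sketch, assuming the Perl-compatible engine (`perl = TRUE`), the word-boundary alias `\b` expresses the same idea without consuming the neighbouring characters, so the matches come back without surrounding spaces.

```r
txt <- "The other the Netherlands will then be without the"

# \b matches the position between a word character and a non-word character
r <- gregexpr("\\b[tT]he\\b", txt, perl = TRUE)
the_matches <- regmatches(txt, r)[[1]]
print(the_matches)
## [1] "The" "the" "the"
```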

Errors

  • The process we just went through was based on fixing two kinds of errors

    • Matching strings that we should not have matched (there, Netherlands)

      • False positives (Type I)
    • Not matching things that we should have matched (The)

      • False negatives (Type II)

Errors cont.

  • In NLP we are always dealing with these kinds of errors.

  • Reducing the error rate for an application often involves:

    • Increasing precision (minimizing false positives)

    • Increasing recall (minimizing false negatives).
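To make the two measures concrete, here is a small worked sketch on the words of the earlier example sentence. The helper `precision_recall()` and the gold labels in `truth` are assumptions made for illustration; precision is TP / (TP + FP) and recall is TP / (TP + FN).

```r
words <- c("The", "other", "the", "Netherlands", "then", "the")

# Gold standard: which entries really are the word "the"
truth <- words %in% c("the", "The")

# Hypothetical helper: precision and recall from predicted and true labels
precision_recall <- function(pred, truth) {
  tp <- sum(pred & truth)    # correctly matched
  fp <- sum(pred & !truth)   # matched, but should not have been
  fn <- sum(!pred & truth)   # missed
  c(precision = tp / (tp + fp), recall = tp / (tp + fn))
}

naive  <- precision_recall(grepl("the", words), truth)
strict <- precision_recall(grepl("^[tT]he$", words), truth)

print(round(rbind(naive, strict), 2))
##        precision recall
## naive        0.4   0.67
## strict       1.0   1.00
```

The naive pattern produces three false positives (other, Netherlands, then) and one false negative (The); the stricter pattern removes both kinds of error on this small sample.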

Question

Suppose we want to build an application to help a user buy a car from textual catalogues. The user looks for any car cheaper than $10,000.00.

Assume we are using the following data: txt <- c("Price of Tesla S is $8599.99.", "Audi Q4 is $7000.", "BMW X5 costs $900")

Which RE will help us to do this?

  • (^|\W)\$[0-9]{0,4}(\.[0-9][0-9])*
  • (^|\W)\$[0-9]{0,3}(\.[0-9][0-9])+
  • (^|\W)\$[0-9]{0,4}(\.[0-9][0-9])?
  • (^|\W)\$[0-9][0-9][0-9][0-9](\.[0-9][0-9])*

Question

Solution

txt <- c("Price of Tesla S is $8599.99.", 
         "Audi Q4 is $7000.", 
         "BMW X5 costs $900") 
r <- gregexpr("(^|\\W)\\$[0-9]{0,4}(\\.[0-9][0-9])?", txt)
print(regmatches(txt, r))
## [[1]]
## [1] " $8599.99"
## 
## [[2]]
## [1] " $7000"
## 
## [[3]]
## [1] " $900"
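One caveat, sketched under assumptions: the fourth catalogue entry below is hypothetical and not part of the course data. Because the pattern does not say what may follow the digits, it also matches the first four digits of a longer price. A negative lookahead `(?![0-9])`, available with `perl = TRUE`, is one possible guard.

```r
txt <- c("Price of Tesla S is $8599.99.",
         "Audi Q4 is $7000.",
         "BMW X5 costs $900",
         "Porsche 911 costs $12000.")   # hypothetical entry above the budget

# Pattern from the solution: happily matches a prefix of "$12000"
loose <- "(^|\\W)\\$[0-9]{0,4}(\\.[0-9][0-9])?"

# Possible refinement: at least one digit, and no further digit after the match
strict <- "(^|\\W)\\$[0-9]{1,4}(\\.[0-9][0-9])?(?![0-9])"

loose_hits  <- regmatches(txt, gregexpr(loose, txt, perl = TRUE))
strict_hits <- regmatches(txt, gregexpr(strict, txt, perl = TRUE))

print(loose_hits[[4]])
## [1] " $1200"
print(strict_hits[[4]])
## character(0)
```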

Summary

  • Text data is everywhere!
  • Language is hard!
  • Sophisticated sequences of regular expressions are often the first model for any text processing tool
  • Regular expressions are a cryptic but powerful language for matching strings and extracting elements from those strings
  • The basic problem of text mining is that text is not a neat data set
  • One solution: text pre-processing

Next: Text preprocessing

  • is an approach for cleaning and noise removal of text data.
  • brings your text into a form that is analyzable for your task.
  • transforms text into a more digestible form so that machine learning algorithms can perform better.
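As a first taste, a minimal base R sketch of three common cleaning steps: lower-casing, removing punctuation, and splitting into tokens. The choice of steps is an assumption for illustration, not the pipeline taught in the next lecture; the sentence is taken from the review example above.

```r
review <- "Definitely worth the money!"

lowered  <- tolower(review)                      # normalise case
no_punct <- gsub("[[:punct:]]", "", lowered)     # strip punctuation
tokens   <- strsplit(no_punct, "\\s+")[[1]]      # split on whitespace

print(tokens)
## [1] "definitely" "worth"      "the"        "money"
```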

Practical 1

Are you curious about the end of the Example?

  • During one of the coffee moments at the company, Garry was talking about the situation at the department of Customer relationship management.

  • When Carrie, his colleague from the Data Science department, hears about the situation, she suggests that Garry use text mining!

  • She says: “Text mining is your friend; it can help you make the process way faster than Excel by filtering words and recommending labels.”

  • She continues: “Text mining is a subfield of AI and NLP and is related to data science, data mining and machine learning.”

  • After consulting with Larry and Harry, they decide to give text mining a try!

End