Introduction

In this practical, we will apply a topic modeling approach on news articles to cluster them into different gourps. Here, we are going to use the following packages:

library(tm)
library(tidyverse)
library(tidytext)
library(proxy)
library(topicmodels)
library(caret)

Read data


  1. The dataset that we are going to use is the BBC dataset from practical 3. This dataset consists of 2225 documents and 5 categories: business, entertainment, politics, sport, and tech. For text clustering and topic modeling, we will ignore the labels but we will use them while evaluating models. Load the Rda file of the news dataset.

load("data/news_dataset.rda")
head(df_final)
# Note that the next chunk (when using `DocumentTermMatrix`) will not work
# returns error invalid multibyte string 1512
# Two possible solutions: 
# 1) Omission of the whole observation
# df_final <- df_final[-1512,]
# 2) Omission of the string "\xa315.8m" from that specifc text
df_final[1512,][2] <- gsub("\xa315.8m", "", df_final[1512,][2])
# I prefer the second solution

  1. First we must preprocess the corpus. Create a document-term matrix from the news column of the dataframe. Complete the following preprocessing steps:
  • convert to lower
  • remove stop words
  • remove numbers
  • remove punctuation
  • convert dtm to a dataframe

docs <- VCorpus(VectorSource(df_final$Content))

dtm <- DocumentTermMatrix(docs,
           control = list(tolower = TRUE,
                          removeNumbers = TRUE,
                          removePunctuation = TRUE,
                          stopwords = TRUE
                         ))
# We remove A LOT of features. R is natively very weak with high dimensional data
dtm_cut <- removeSparseTerms(dtm,  sparse = 0.93)
# with sparse = 0.93 the dtm will end up with 359 terms; you can adjust this number based on your available memory

Topic Modeling


Latent Dirichlet allocation (LDA) is a particularly popular method for fitting a topic model. It treats each document as a mixture of topics, and each topic as a mixture of words. This allows documents to “overlap” each other in terms of content, rather than being separated into discrete groups, in a way that mirrors typical use of natural language.

  1. Use the LDA function from the topicmodels package and train an LDA model with 5 topics with the Gibbs sampling method.

# A LDA topic model with 5 topics; if it takes a lot of time for you to run this change k to 2
out_lda <- LDA(dtm_cut, k = 5, method= "Gibbs", control = list(seed = 321))
out_lda
## A LDA_Gibbs topic model with 5 topics.

  1. The tidy() method is originally from the broom package (Robinson 2017), for tidying model objects. The tidytext package provides this method for extracting the per-topic-per-word probabilities, called “beta”, from the LDA model. Use this function and check the beta probabilities for each term and topic.

lda_topics <- tidy(out_lda, matrix = "beta")
lda_topics

  1. Plot the top 20 terms within each topic. Try to label them and compare your labeling with the original categories in the dataset.

lda_top_terms <- lda_topics %>%
  group_by(topic) %>%
  slice_max(beta, n = 20) %>% # We use dplyr’s slice_max() to find the top 10 terms within each topic.
  ungroup() %>%
  arrange(topic, -beta)

lda_top_terms %>%
  mutate(term = reorder_within(term, beta, topic)) %>%
  ggplot(aes(beta, term, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free") +
  scale_y_reordered()


  1. Use the code below to present and save the terms and topics in a wide format.

beta_wide <- lda_topics %>%
  mutate(topic = paste0("topic", topic)) %>%
  pivot_wider(names_from = topic, values_from = beta) %>% 
  mutate(log_ratio21 = log2(topic2 / topic1)) %>% 
  mutate(log_ratio31 = log2(topic3 / topic1))%>% 
  mutate(log_ratio41 = log2(topic4 / topic1))%>% 
  mutate(log_ratio51 = log2(topic5 / topic1))

beta_wide

  1. Use the log ratios to visualize the words with the greatest differences between topic 1 and other topics.

# topic 1 versus topic 2
lda_top_terms1 <- beta_wide %>%
  slice_max(log_ratio21, n = 10) %>%
  arrange(term, -log_ratio21)

lda_top_terms2 <- beta_wide %>%
  slice_max(-log_ratio21, n = 10) %>%
  arrange(term, -log_ratio21)

lda_top_terms12 <- rbind(lda_top_terms1, lda_top_terms2)

# this is for ggplot to understand in which order to plot name on the x axis.
lda_top_terms12$term <- factor(lda_top_terms12$term, levels = lda_top_terms12$term[order(lda_top_terms12$log_ratio21)])

# Words with the greatest difference in beta between topic 2 and topic 1
lda_top_terms12 %>%
  ggplot(aes(log_ratio21, term, fill = (log_ratio21 > 0))) +
  geom_col(show.legend = FALSE) +
  scale_y_reordered() +
  theme_minimal()



# topic 1 versus topic 3
lda_top_terms1 <- beta_wide %>%
  slice_max(log_ratio31, n = 10) %>%
  arrange(term, -log_ratio31)

lda_top_terms2 <- beta_wide %>%
  slice_max(-log_ratio31, n = 10) %>%
  arrange(term, -log_ratio31)

lda_top_terms13 <- rbind(lda_top_terms1, lda_top_terms2)

# this is for ggplot to understand in which order to plot name on the x axis.
lda_top_terms13$term <- factor(lda_top_terms13$term, levels = lda_top_terms13$term[order(lda_top_terms13$log_ratio31)])

# Words with the greatest difference in beta between topic 2 and topic 1
lda_top_terms13 %>%
  ggplot(aes(log_ratio31, term, fill = (log_ratio31 > 0))) +
  geom_col(show.legend = FALSE) +
  scale_y_reordered() +
  theme_minimal()



# topic 1 versus topic 4
lda_top_terms1 <- beta_wide %>%
  slice_max(log_ratio41, n = 10) %>%
  arrange(term, -log_ratio41)

lda_top_terms2 <- beta_wide %>%
  slice_max(-log_ratio41, n = 10) %>%
  arrange(term, -log_ratio41)

lda_top_terms14 <- rbind(lda_top_terms1, lda_top_terms2)

lda_top_terms14[1,]$term <- 'SPELLING ERROR!'

# this is for ggplot to understand in which order to plot name on the x axis.
lda_top_terms14$term <- factor(lda_top_terms14$term, levels = lda_top_terms14$term[order(lda_top_terms14$log_ratio41)])

# Words with the greatest difference in beta between topic 2 and topic 1
lda_top_terms14 %>%
  ggplot(aes(log_ratio41, term, fill = (log_ratio41 > 0))) +
  geom_col(show.legend = FALSE) +
  scale_y_reordered() +
  theme_minimal()



# topic 1 versus topic 5
lda_top_terms1 <- beta_wide %>%
  slice_max(log_ratio51, n = 10) %>%
  arrange(term, -log_ratio51)

lda_top_terms2 <- beta_wide %>%
  slice_max(-log_ratio51, n = 10) %>%
  arrange(term, -log_ratio51)

lda_top_terms15 <- rbind(lda_top_terms1, lda_top_terms2)

# this is for ggplot to understand in which order to plot name on the x axis.
lda_top_terms15$term <- factor(lda_top_terms15$term, levels = lda_top_terms15$term[order(lda_top_terms15$log_ratio51)])

# Words with the greatest difference in beta between topic 2 and topic 1
lda_top_terms15 %>%
  ggplot(aes(log_ratio51, term, fill = (log_ratio51 > 0))) +
  geom_col(show.legend = FALSE) +
  scale_y_reordered() +
  theme_minimal()


  1. Besides estimating each topic as a mixture of words, LDA also models each document as a mixture of topics. We can examine the per-document-per-topic probabilities, called “gamma”, with the matrix = “gamma” argument to tidy(). Call this function for your LDA model and save the probabilities in a varibale named lda_documents.

lda_documents <- tidy(out_lda, matrix = "gamma")

  1. Check the topic probabilities for some of the documents (for example documents 1, 1000, 2000, 2225). Also look at the contents of some them.

lda_documents[lda_documents$document == 1,]
lda_documents[lda_documents$document == 1000,]
lda_documents[lda_documents$document == 2000,]
lda_documents[lda_documents$document == 2225,]

tidy(dtm_cut) %>%
  filter(document == 2225) %>%
  arrange(desc(count))

df_final[2225,]$Content
## [1] "Losing yourself in online gaming\n\nOnline role playing games are time-consuming, but enthralling flights from reality. But are some people taking their fantasy lives too seriously?\n\nWhen video game World of Warcraft hit the shops in Europe last week fans wrote in to the BBC website to express their delight - and to offer a warning. \"An addiction to a game like this is far more costly in time than any substance could impair - keep track of time,\" wrote Travis Anderson, in Texas. Some of the comments were humorous: \"This game is so good I'm not going to get it, there's no way I could limit the hours I'd spend playing it,\" wrote Charles MacIntyre, from England.\n\nBut some struck a more worrying tone about the Massively Multiplayer Online Role Playing Game (MMORPG): \"'You need to get out more' could be the motto of any MMORPG. Shame they are getting more popular, as you know this problem is just going to mushroom,\" wrote Stuart Stanton-Davies, in Huddersfield. Scare-mongering articles about \"addictive video games\" have existed since the days the first game of Pong stopped everyone from working at the Atari offices.\n\nGaming is like any other pastime - it can quickly become an unhealthy obsession, whether it is spending too much time in the gym, in front of the television, or reading poetry.\n\nUnfortunately, gaming and addiction is a far too easy association to make. However, stories about gamers spending 10 to 15 hours a day in front of some video games are becoming more frequent. And the impact that is having on their families is quite distressing for some.\n\nMassively multiplayer online role playing games - MMORPGs - allow thousands of gamers to share a common experience of sharing fantasy or science fiction worlds. The scope of these games - like Warcraft, EverQuest, Ultima among others - is epic, and exploration and adventure is almost infinite. Part of the \"problem\" is grinding - by which gamers have to perform long-winded, mindless tasks, to bring up their levels and gain access to more adventure. Such open-endedness brings with it a desire to keep playing; not for no reason is EverQuest (EQ) nicknamed EverCrack. E Hayot, writing in the culture blogzine Print Culture, said recently: \"I used to play the online role-playing game EverQuest a lot. \"By 'a lot', I mean probably 15 to 20 hours a week on average, and on weeks where I didn't have to work, as many as 30 or 40 hours.\"\n\nHe says that in the world of online gaming such behaviour \"wasn't that unusual; lots of people I knew in the game played EQ that much\". \"You lie; you don't go into work because you \"had stuff to do at home\"; you cancel or refuse invitations to dinner, you spend much less time watching TV (a good thing, presumably),\" he wrote, explaining how EverQuest took over his time. He quit the game, he says, because he realised life was more fun than EverQuest.\n\nLet us be clear - such obsession is rare. But the huge growth in online gaming means a growth in the numbers of people who take their passion for a hobby too far. Almost 400,000 people bought a copy of World of Warcraft in the first two days on sale earlier this month. Only a fraction will descend into obsessives. The thoughts of families and friends of gamers who have been affected by EverQuest can be found on one blog EverQuest Daily Grind. Jane, who runs the website, compiles a chronicle of heart-rending stories. \"I am actually convinced at this point that there are more than 'some' people who spend more times in MMOPRGs than in reality,\" she said. One unnamed correspondent - all are anonymous - wrote: \"On the rare nights when my husband does come to bed at the same time as I do, I find that I am so used to sleeping by myself that it is difficult to get to sleep with another body laying next to me.\n\n\"I can't talk to him while he is playing. There is absolutely no point as he doesn't hear me or is so distracted that I get a 'ummm... ya' a few minutes after I ask him a question.\"\n\n\"Gaming widows\" has become a comedic term for women who have been shut out by male gamers. But for some it is not in the least funny. Another correspondent wrote: \"I believe that he is addicted to the online gaming, and that is the cause of his depression and restlessness.\" And some of them are even sadder: \"Today our son was five days old. \"The sad truth is my husband spent 11 hours today playing his Warcraft game. He did not interact with our sweet tiny baby because there were important quests waiting online.\" Video game fans often complain that their hobby is misunderstood or marginalised. But as gaming becomes ever more mainstream, and games ever more immersive, there will be no hiding place for social problems.\n\n\n\nI wish 30-40 hours a week was unusual but I think it probably isn't. An 11 hour stretch isn't that surprising - I've known people to play 15+ hours at a stretch. I know of people who are spending their week's holiday from work playing Warcraft. I know of people who would play Ever[Crack] in shifts...waking at 3am to take over from their friends and resume waiting for an item they 'needed' to appear. I understand that the key sign of an addiction is if you alter your life around it rather than fit it into your life. By all standards many of us are addicts. So is the solution to force ourselves to stop playing..or do we just need to make real life a bit more interesting?\n\nSadly with all the talk of people becoming obsessed with gaming, I find myself longing to have the time to join them. I have been in a long term relationship for over 4 years - since that began, games have become more and more complex. And more and more so I find I have less and less time to play them, with and marriage and work being the main drag on my time.\n\nI think the line between playing a game a lot and a gaming addiction is really quite distinct. I play games a lot, definately over 20 hours a week, but I don't go missing work or other commitments in order to play games.\n\nI have, about a year ago, deleted every game on my computer. RPGs are the worst - the real world fades and all your worries sorround a new magic staff or mighty sword. Unlike books, or perhaps even TV, you gain absolutely nothing. When you stop playing you're at the same point as when you started; all the achievements of your 10 hour session are irretrievably locked in the game and, since you've gained nothing in the real world, you may as well pile on more achievement in the fake one.\n\nDespite having little monetary value, the \"rewards\" and encouragement offered by these MMORPGs is enough to hook games for hours daily. If only business could learn to leverage that very simply human need for easily measurable progress and recognition. Perhaps the unhealthily obsessed simply need more recognition for their achievements in reality?\n\nMy advice to gaming widows is \"if you can't beat 'em, join 'em\". That is, try playing it yourself. If he wants to play as well, well at least you'll be together somewhere...\n\nI was an addict and it cost me my relationship. I still play now, but without the guilt , hehe, How long have i played in one sitting? From morning till the early hours of the next day, the birds were singing out side and i had to hobble to the bath room cos my bladder was so full i was in pain, i would hardly eat, perhaps some toast, smoke endlessly and drink. Now, thankfully the fascination has worn off and I have a girlfriend but still no job. For the most part online gaming give me an adiction to illusory achievement, and as there is no end in sight you keep going for the mirage of the ultimate.\n\nObsessive behaviour is, of course, always cause for concern, but it always bothers me when articles about gaming talk in terms of \"reality\". Obviously, somebody who spends thirty hours a week playing EverQuest has a problem. This problem, however, has nothing to do with a dysfunctional sense of reality. An obsessive EQ player does not consider the game to be \"real\" any more than - for example - an obsessive automotive tinkerer considers their car to be human. If MMORPGs have a unique danger, in terms of encouraging obsessive behaviour, it is not that they create an absorbing virtual world, but rather that they can be easily accessed 24/7. The problem here does not lie with the nature of gaming, but with the nature of modern 24 hour culture.\n\nThe problem with these so called MMORPGS is that you can never really complete them, there's always another quest to do. A few of my friends have only had about 10 hours sleep since it was released friday...\n\nChampionship Manager consumed my life for years. One particular session started at about 2pm on a Sunday, paused for a brief sleep at 5am on the Monday and after visit to University for classes restarted at about midday for another 10 hour session. The people who tend to hark on about about the problems of \"hardcore gaming\" seem to be those who have rarely allowed themselves to become immersed in a game. I would expect their perspective to change if they were to do that.\n\nI used to be an EverQuest addict while I was in college. It came to the point where the gaming world felt more real than the real one. I failed alot of my courses and was able to barely graduate. I was lucky that I came to my senses when I did, others were less fortunate and dropped out of college. Now that I am holding a job, I avoid online RPGS like the plague.\n\nWhen I was made redundant I told my partner I had a new job for three months whilst every day I played EverQuest from 7:30am till 5:pm. When She came home I pretended I had just got in as well, hence justifying playing it all evening. I have since quit playing MMORPG and have a good job.\n\nWhen I got to the point where I was eating my dinner in front of the PC I realised things were getting silly so I'm trying not to spend so much time on there. It's not easy. I feel as if I've got a real addiction going on here.\n\nFor me the problem is that I love to complete a goal. Once it is completed that is it, I am finished, time to move on. I become obsessed to complete the goal, so from that standpoint it is an addiction. In a game where you will never complete an \"ultimate\" goal, well it would be like falling into a black pit. It is easier to escape into a controlled fantasy world than face reality at times - in other words the goal offered in the PC game are \"easier\" and more fun than the real world. Pretty scary implications if you think about it.\n\nI can't buy World of Warcraft as it would destroy my marrage, I just know it!!\n\nI played Star Wars Galaxies for about a year and can attest to the addictiveness of these games. They are all engineered in such a way that early on in the game you progress quickly, but this progress becomes exponentially slower, requiring more and more time to reach the next level. I'm sad to say that at the peak of my addiction I was spending entire weekends in front of my monitor, slowly building up my character, stopping only for food and toilet breaks. Thankfully I made a clean break, and actually managed to sell my Jedi account for £800 - which is my only sanity check in an otherwise completely unproductive time vacuum.\n\nSeven years ago, I began playing Ultima Online. This game dominated 2 years of my life. They were 2 wonderful years and I still have vivid memories of the experiences and friends I had. Online gaming can be a world of escapism where you can be yourself without fear of the thoughts of others. Something that cannot always be achieved in the day to day running of a normal life. Whilst I would warn against people giving to much of there life to these games, I believe they are a better way to spend your time than say watching TV.\n\nGaming is addictive and should be made a recognised addiction. When I was single I used to play upto eight hours a night after work every night for about a year, building up my stats, completing evermore quests and battling ogres. But somehow I found time to get out, even met someone and got married! Has my life changed? Hell no! I still cast spells and battle till the early hours of the morning. On with the fun!\n\nOnline gaming should be enjoyed just as much as you would enjoy watching television, or going to the cinema or the pub with your mates.\n\nMany people use recreational drugs on an occasional basis and are able to lead succesfull lives with families, relationships and good careers. A minority allow drugs to take over and destroy their lives and become addicted. According to this article the same is true of MMORPGs. The message to the government is clear, either legalise drugs, or outlaw online gaming!!\n\nSounds like there are some sad stories here - and I can believe them all. I play alot of Warcraft myself, and know full well how addictive it is. I am resolute that it will not take over my life. It certainly gets in the way though. I think that some people simply do not know how to draw this line, or lack the willpower to stop themselves stepping over it.\n\nI think I'm obsessed with gaming in general, I spend far too much time playing games like Everquest 2 and Football Manager rather than going out and interacting with real people and when I do try to, I'm always thinking in the back if my mind that I'd rather be in front of the computer winning the league with Cambridge United.\n\nI am obsessed with online role playing games. It's not so much quests but it has the adrenaline of a real life situation - goals to achieve etc. I spend about five hours per day online playing it and I rarely get more than four to five hours sleep before getting up for work the next morning...\n\nAs many of the players spend their time in MMORPGs rather than in front of the TV I fail to see how it will affect players social lives negatively. Furthermore these types of games contain a huge social aspect, whereas other games and some other pursuits (such as being a couch potato) the players could be indulging in are solitary by nature.\n\nThese games are like most things -- too much of anything is a bad thing, but as long as you can walk away from the computer to do other things too, they can be great fun.\n\nLiving in Korea at the moment, they have lots PC Bangs (Internet Cafes). Nearly most of South Koreans are addicted to online games, and one Korean died because of the lack of food and water he had through playing online games.\n\nI play xbox live every day. I find my self lying and rescheduling everything around my gaming fix. The longest I played was a 24 hour straight session. I know I play for to long but it's an obsession that I can't control. Can you reccomend a counsellor - this is not a wind up... but something I'm increasingly concerned with...\n\nMe and my mate play online for an hour or two a day, we're both aware of how much time can disappear by sitting in front of a TV, trying to 'frag' some individual. It's getting the balance between getting home and relasing the stress of a day by an hour or so gaming, and enjoying 'real' life...\n\nI bought the US version of World of Warcraft when it came out. The longest period I played was 23 hrs straight. I gave up the game after a month because it was so addictive, but have subsequently just bought the European version (couldn't help myself). In future, I'm going to regulate my time far more strictly. Great game!\n\nHaving played MMORPG games for some years I agree that these type of games can be life sucking. But my concern is for the younger generation of gamers that play for hours on end in an adult enviroment. Most MMORPG games you need a credit card to play but I dont think parents know just what they are letting there children into.\n\nUnless there is undeniable medical proof that staring at a computer screens for hours at a time can damage a person&#191;s health, you can expect this not to decline but to get worse.\n\nThese people are pathetic. They need to get off their machines and notice that our world is being swiftly overcome by issues and troubles that make the trifling worries of and \"online universe\" absolutely meaningless.\n\n24hours, when i was a kid at school and i was on half term, Ultima Online was the game, ahhhh them was the days ! LOL"

  1. Visualize the topic probabilities for example documents using boxplots.

# reorder titles in order of topic 1, topic 2, etc before plotting
lda_documents[lda_documents$document %in% c(1, 1000, 2000, 2225),] %>%
  mutate(document = reorder(document, gamma * topic)) %>%
  ggplot(aes(factor(topic), gamma)) +
  geom_boxplot() +
  facet_wrap(~ document) +
  labs(x = "topic", y = expression(gamma)) +
  theme_minimal()


  1. Check the predicted topics against the provided labels (column: Category) in the original data frame.

results <- 
lda_documents %>% 
  mutate(document = as.numeric(document)) %>% 
  group_by(document) %>%  
  filter(gamma == max(gamma)) %>% 
  arrange(document) %>% 
  ungroup() %>% 
  left_join(rowid_to_column(select(df_final, Category), "document")) %>% 
  mutate(topic = if_else(topic == 1, "entertainment",
                         if_else(topic == 2, "business",
                                 if_else(topic == 3, "politics",
                                         if_else(topic == 4, "sport", "tech")))))

confusionMatrix(factor(results$topic), factor(results$Category))
## Confusion Matrix and Statistics
## 
##                Reference
## Prediction      business entertainment politics sport tech
##   business           440            17       22     7   26
##   entertainment       15           307       11    66   18
##   politics            53            24      351    34   15
##   sport                4            25        6   410   31
##   tech                12            30       33    13  314
## 
## Overall Statistics
##                                          
##                Accuracy : 0.7977         
##                  95% CI : (0.7807, 0.814)
##     No Information Rate : 0.232          
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.7466         
##                                          
##  Mcnemar's Test P-Value : 8.137e-13      
## 
## Statistics by Class:
## 
##                      Class: business Class: entertainment Class: politics
## Sensitivity                   0.8397               0.7618          0.8298
## Specificity                   0.9591               0.9415          0.9323
## Pos Pred Value                0.8594               0.7362          0.7358
## Neg Pred Value                0.9526               0.9486          0.9602
## Prevalence                    0.2294               0.1764          0.1852
## Detection Rate                0.1926               0.1344          0.1537
## Detection Prevalence          0.2242               0.1826          0.2088
## Balanced Accuracy             0.8994               0.8517          0.8810
##                      Class: sport Class: tech
## Sensitivity                0.7736      0.7772
## Specificity                0.9624      0.9532
## Pos Pred Value             0.8613      0.7811
## Neg Pred Value             0.9336      0.9522
## Prevalence                 0.2320      0.1769
## Detection Rate             0.1795      0.1375
## Detection Prevalence       0.2084      0.1760
## Balanced Accuracy          0.8680      0.8652

Alternative LDA implementations

The LDA() function in the topicmodels package is only one implementation of the latent Dirichlet allocation algorithm. For example, the mallet package (Mimno 2013) implements a wrapper around the MALLET Java package for text classification tools, and the tidytext package provides tidiers for this model output as well. The textmineR package has extensive functionality for topic modeling. You can fit Latent Dirichlet Allocation (LDA), Correlated Topic Models (CTM), and Latent Semantic Analysis (LSA) from within textmineR (https://cran.r-project.org/web/packages/textmineR/vignettes/c_topic_modeling.html).


Summary


In this practical, we learned about:

  • Topic modeling
  • LDA

End of Practical