Getting repeated terms after Latent Dirichlet Allocation

Time: 2016-03-10 16:14:10

Tags: r text-mining

I am trying to implement Latent Dirichlet Allocation, but it returns some repeated terms. How can I get unique terms from LDA?

  

library(tm)
# Loading required package: NLP
myCorpus <- Corpus(VectorSource(tweets$text))
myCorpus <- tm_map(myCorpus, content_transformer(tolower))
removeURL <- function(x) gsub("http[^[:space:]]*", "", x)
myCorpus <- tm_map(myCorpus, content_transformer(removeURL))
removeNumPunct <- function(x) gsub("[^[:alpha:][:space:]]*", "", x)
myCorpus <- tm_map(myCorpus, content_transformer(removeNumPunct))
myStopwords <- c(stopwords('english'), "available", "via")
myStopwords <- setdiff(myStopwords, c("r", "big"))
myCorpus <- tm_map(myCorpus, removeWords, myStopwords)
myCorpus <- tm_map(myCorpus, stripWhitespace)
myCorpusCopy <- myCorpus
myCorpus <- tm_map(myCorpus, stemDocument)
library('SnowballC')
myCorpus <- tm_map(myCorpus, stemDocument)
dtm <- DocumentTermMatrix(myCorpus)
library("RTextTools", lib.loc = "~/R/win-library/3.2")
library("topicmodels", lib.loc = "~/R/win-library/3.2")
om1 <- LDA(dtm, 30)
terms(om1)

This is the output

2 Answers:

Answer 0: (score: 1)

According to https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation, in LDA each document is treated as a mixture of various topics. That is, for each document (tweet) we get the probability of the tweet belonging to each topic. The probabilities sum to 1.

Similarly, each topic is treated as a mixture of various terms (words). That is, for each topic we get the probability of each word belonging to that topic. These probabilities also sum to 1. Hence, for every word-topic combination there is an assigned probability, and the code terms(om1) fetches the word with the highest probability in each topic.

So in your case, the same word happens to have the highest probability in several topics. This is not an error.
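As an aside (not part of the original answer), the terms() helper in topicmodels also accepts the number of terms to return per topic, which makes this overlap easy to inspect:

# Illustrative sketch: list the five highest-probability terms per topic;
# the same stem can legitimately appear under more than one topic.
terms(om1, 5)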

The code below creates a TopicTermdf dataset containing the distribution of all words for each topic. Looking at that dataset will help you understand this better.

The code below is based on the following post: LDA with topicmodels, how can I see which topics different documents belong to?

Code:

# Reproducible data - From Coursera.org Johns Hopkins Data Science Specialization Capstone project, SwiftKey Challenge dataset

tweets <- c("How are you? Btw thanks for the RT. You gonna be in DC anytime soon? Love to see you. Been way, way too long.",
           "When you meet someone special... you'll know. Your heart will beat more rapidly and you'll smile for no reason.",
           "they've decided its more fun if I don't.",
           "So Tired D; Played Lazer Tag & Ran A LOT D; Ughh Going To Sleep Like In 5 Minutes ;)",
           "Words from a complete stranger! Made my birthday even better :)",
           "First Cubs game ever! Wrigley field is gorgeous. This is perfect. Go Cubs Go!",
           "i no! i get another day off from skool due to the wonderful snow (: and THIS wakes me up...damn thing",
           "I'm coo... Jus at work hella tired r u ever in cali",
           "The new sundrop commercial ...hehe love at first sight",
           "we need to reconnect THIS WEEK")


library(tm)
myCorpus <- Corpus(VectorSource(tweets))
myCorpus <- tm_map(myCorpus, content_transformer(tolower))
removeURL <- function(x) gsub("http[^[:space:]]*", "", x)
myCorpus <- tm_map(myCorpus, content_transformer(removeURL))
removeNumPunct <- function(x) gsub("[^[:alpha:][:space:]]*", "", x)
myCorpus <- tm_map(myCorpus, content_transformer(removeNumPunct))
myStopwords <- c(stopwords('english'), "available", "via")
myStopwords <- setdiff(myStopwords, c("r", "big"))
myCorpus <- tm_map(myCorpus, removeWords, myStopwords)
myCorpus <- tm_map(myCorpus, stripWhitespace)
myCorpusCopy <- myCorpus
myCorpus <- tm_map(myCorpus, stemDocument)
library('SnowballC')
myCorpus <- tm_map(myCorpus, stemDocument)
dtm<-DocumentTermMatrix(myCorpus)

library(RTextTools)
library(topicmodels)
om1<-LDA(dtm,3)
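
Note that the snippet above does not actually build the TopicTermdf data frame queried in the output below. A minimal sketch of one way to construct it, assuming the posterior() accessor from topicmodels (this line is a reconstruction, not part of the original answer):

# Assumed reconstruction: per-topic word distributions, one row per topic,
# one column per term, each row summing to 1
TopicTermdf <- as.data.frame(posterior(om1)$terms)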

Output:

> # Get the top word for each topic 
> terms(om1) 
Topic 1 Topic 2 Topic 3 
"youll"   "cub" "anoth" 
> 
> #Top word for each topic
> colnames(TopicTermdf)[apply(TopicTermdf,1,which.max)]
[1] "youll" "cub"   "anoth"

> 

Answer 1: (score: 0)

Try to find the optimal number of topics. To do this, build multiple LDA models with different numbers of topics and select the one with the highest coherence score. If you see the same keywords (terms) repeated across multiple topics, it probably means the value of k (the number of topics) is too large. Although it is written in Python, this link to LDA topic modeling shows a grid-search approach to finding the optimal number of topics.
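
topicmodels itself does not report a coherence score; a rough R sketch of the same idea, using perplexity as a stand-in criterion (the candidate values of k below are hypothetical):

# Sketch: fit LDA for several candidate values of k and compare perplexity
# (lower is better); perplexity is used here only as a proxy for coherence.
library(topicmodels)
ks <- c(2, 3, 5, 10)
models <- lapply(ks, function(k) LDA(dtm, k, control = list(seed = 1234)))
data.frame(k = ks, perplexity = sapply(models, perplexity))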