Problem running stm for topic modeling with a single covariate

Time: 2019-03-06 13:08:34

Tags: r topic-modeling

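I am trying to run an LDA topic modeling analysis with stm, but I have a problem with my metadata. The code seems to run, but my covariate (Age) is not being read, as shown in the example below.

I have some tweets (the docu column in the Excel file), each with an Age covariate whose value is either Young or Old.

Here is my data: http://www.mediafire.com/file/5eb9qe6gbg22o9i/dada.xlsx/file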

library(stm)
library(readxl)
library(quanteda)
library(stringr)
library(tm)


data <-  read_xlsx("C:/dada.xlsx") 

# Remove URLs
data$docu <- str_replace_all(data$docu, "https://t.co/[a-zA-Z0-9]*","")


data$docu <- gsub("@\\w+", " ", data$docu)  # Remove user names (all proper names if you're wise!)

data$docu <- iconv(data$docu, to = "ASCII", sub = " ")  # Convert to basic ASCII text to avoid silly characters
data$docu <- gsub("#\\w+", " ", data$docu)

data$docu <- gsub("http.+ |http.+$", " ", data$docu)  # Remove links

data$docu <- gsub("[[:punct:]]", " ", data$docu)  # Remove punctuation

data$docu <- gsub("[\r\n]", "", data$docu)  # Remove line breaks

data$docu <- tolower(data$docu)



#Remove Stopwords. "SMART" is in reference to english stopwords from the SMART information retrieval system and stopwords from other European Languages.
data$docu <- tm::removeWords(x = data$docu, c(stopwords(kind = "SMART")))

data$docu <- gsub(" +", " ", data$docu) # General spaces (should just do all whitespaces no?)

myCorpus <- corpus(data$docu)
docvars(myCorpus, "Age") <- as.factor(data$Age)


processed <- textProcessor(data$docu, metadata = data)

out <- prepDocuments(processed$documents, processed$vocab, processed$meta, lower.thresh = 2)

out$documents
out$meta
levels(out$meta)

First_STM <- stm(documents = out$documents, vocab = out$vocab,
                 K = 4, prevalence =~ Age ,
                 max.em.its = 25, data = out$meta,
                 init.type = "LDA", verbose = FALSE)

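As the code shows, I tried to define Age as a factor. I don't think that should be necessary, since running textProcessor might be enough. levels(out$meta) returns NULL, so when I run stm to get the actual topics I get a memory allocation error.

1 answer:

Answer 0 (score: 2)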

You set the metadata variable of your corpus to a factor in this line:

docvars(myCorpus, "Age") <- as.factor(data$Age)

But you do not use myCorpus any further. In the next steps you use the data frame data for preprocessing. Try defining Age as a factor in the data frame:

data$Age <- factor(data$Age)

and do it just before this step:

processed <- textProcessor(data$docu, metadata = data)

out <- prepDocuments(processed$documents, processed$vocab, processed$meta, lower.thresh = 2)

Then you can look at the levels like this:

levels(out$meta$Age)
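A minimal base-R sketch of why levels(out$meta) came back NULL while levels(out$meta$Age) does not. The small data frame here is a hypothetical stand-in for out$meta, not the asker's actual data, and the example does not call stm at all:

```r
# Toy metadata standing in for out$meta (hypothetical values for illustration)
meta <- data.frame(
  docu = c("first tweet", "second tweet", "third tweet"),
  Age  = c("Young", "Old", "Young"),
  stringsAsFactors = FALSE
)

# A data frame as a whole has no "levels" attribute, so this is NULL
levels_of_frame <- levels(meta)

# A character column has no levels either until it is converted
levels_before <- levels(meta$Age)

# Convert the covariate to a factor in the data frame itself
meta$Age <- factor(meta$Age)
levels_after <- levels(meta$Age)

print(levels_of_frame)  # NULL
print(levels_before)    # NULL
print(levels_after)     # "Old" "Young"
```

So a NULL from levels(out$meta) is expected whether or not Age is a factor. The column itself has to be inspected.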

I could not reproduce your memory allocation error, though. The stm call runs fine on my machine (Win 10 Pro, 8GB RAM).