I am trying to fetch tweets for a given keyword (e.g. "zomato") and run topic modeling on the fetched tweets. Below is the search function that fetches the tweets.
library(twitteR)  # assumes Twitter OAuth has already been set up

search <- function(searchterm) {
  # access tweets and create cumulative file
  list <- searchTwitter(searchterm, n = 25000)
  df <- twListToDF(list)
  df <- df[, order(names(df))]
  df$created <- strftime(df$created, '%Y-%m-%d')
  # note: paste() with its default separator puts a space in the file name
  file <- paste(searchterm, '_stack.csv')
  if (!file.exists(file)) write.csv(df, file = file, row.names = FALSE)
  # merge last access with cumulative file and remove duplicates
  stack <- read.csv(file = file)
  stack <- rbind(stack, df)
  stack <- subset(stack, !duplicated(stack$text))
  return(stack)
}
ZomatoResults <- search('Zomato')
After that I do the usual tweet cleaning and store the result in a variable "ZomatoCleaned" (I have not included that code here). Then I build the corpus and do the topic modeling as follows:
library(tm)           # corpus handling and document-term matrices
library(topicmodels)  # LDA / CTM
library(wordcloud)
library(RColorBrewer)
library(SnowballC)    # stemDocument

options(mc.cores = 1) # or whatever
tm_parLapply_engine(parallel::mclapply)

corpus <- Corpus(VectorSource(ZomatoCleaned)) # create corpus object
corpus <- tm_map(corpus, removeWords, stopwords("en"))
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, stemDocument)

pal <- brewer.pal(8, "Dark2")
dev.new(width = 1000, height = 1000, unit = "px")
wordcloud(corpus, min.freq = 2, max.words = 100, random.order = TRUE, col = pal)

dat <- DocumentTermMatrix(corpus)
dput(head(dat))

doc.lengths <- rowSums(as.matrix(DocumentTermMatrix(corpus)))
dtm <- DocumentTermMatrix(corpus[doc.lengths > 0])

# model <- LDA(dtm, 10) # go ahead and test a simple model if you want
SEED <- sample(1:1000000, 1) # pick a random seed for replication
k <- 10 # let's start with 10 topics

models <- list(
  CTM       = CTM(dtm, k = k, control = list(seed = SEED, var = list(tol = 10^-4), em = list(tol = 10^-3))),
  VEM       = LDA(dtm, k = k, control = list(seed = SEED)),
  VEM_Fixed = LDA(dtm, k = k, control = list(estimate.alpha = FALSE, seed = SEED)),
  Gibbs     = LDA(dtm, k = k, method = "Gibbs",
                  control = list(seed = SEED, burnin = 1000, thin = 100, iter = 1000))
)

lapply(models, terms, 10)
assignments <- sapply(models, topics)
head(assignments, n = 10)
Unfortunately, at the line

doc.lengths <- rowSums(as.matrix(DocumentTermMatrix(corpus)))

I hit the error "vector size specified is too large" or "cannot allocate vector of size 36.6 Gb". I am on an 8 GB RAM system running R 3.5.2 in RStudio. I have already run gc() and tried raising memory.limit(), but neither helped. Is there a workaround for handling this dataset? I know it is a memory problem, but please advise on how to deal with this situation.
O/P of dput(head(dat)):

structure(c(0, 1, 0, 0, 0, 0), weights = c("term frequency", "tf"),
    class = c("DocumentTermMatrix", "simple_triplet_matrix"))
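For what it's worth, I suspect the huge allocation comes from as.matrix(), which expands the sparse DocumentTermMatrix into a dense matrix. A sketch of the workaround I have been considering (assuming the slam package, which tm itself uses to store the DTM as a simple_triplet_matrix) is to take the row sums on the sparse representation directly and then subset the DTM, so the dense matrix is never built:

```r
library(slam)  # sparse-matrix utilities underlying tm's DocumentTermMatrix

dat <- DocumentTermMatrix(corpus)
# slam::row_sums() operates on the simple_triplet_matrix directly,
# so nothing is ever expanded to a dense matrix
doc.lengths <- slam::row_sums(dat)
dtm <- dat[doc.lengths > 0, ]
```

Would this be a reasonable way to avoid the allocation, or is there a better approach?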