Question

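I'm working with Twitter data and trying to find the frequency of bigrams whose first word is "the". I wrote a function that seems to do what I want, but it is extremely slow (originally I wanted the frequencies of all bigrams, but I gave up on that because of the speed). Is there a faster way to solve this? I've heard of the RWeka package, but I had trouble installing it and got an error (ERROR: dependencies 'RWekajars', 'rJava' are not available for package 'RWeka')...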

required libraries: tau and tcltk

bigramThe <- function(dataset,column) {
    bidata <- data.frame(x= character(0), y= numeric(0))
    pb <- tkProgressBar(title = "progress bar", min = 0,max = nrow(dataset), width = 300)
    for (i in 1:nrow(dataset)) {
        a <- column[i]
        bi<-textcnt(a, n = 2, method = "string")
        tweetbi <- data.frame(V1 = as.vector(names(bi)), V2 = as.numeric(bi))
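        # anchor the pattern so only bigrams that START with "the" are kept
        tweetbi$grepl<-grepl("^the ",tweetbi$V1)
        tweetbi<-tweetbi[which(tweetbi$grepl==TRUE),]
        bidata <- rbind(bidata, tweetbi)
        # multiply by 100 so the label shows a real percentage
        setTkProgressBar(pb, i, label=paste(round(100 * i/nrow(dataset), 0), "% done"))
    }
    aggbi<-aggregate(bidata$V2, by=list(bidata$V1), FUN=sum)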
    close(pb)
    return(aggbi)
}

I have nearly 500,000 rows of tweets stored in the column that I pass to the function. A sample dataset looks like this:

text                userid

tweet text 1           1
tweets text 2          2
the tweet text 3       3

Answer 1

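To use RWeka, first run sudo apt-get install openjdk-6-jdk (or install/reinstall the JDK through the GUI on Windows), then try reinstalling the package.

If that fails, use download.file to download the source .zip file and install from source, i.e. install.packages("RWeka.zip", type = "source", repos = NULL).

If you want to speed things up without switching to a different package, consider using multicore and rewriting your code to use an apply function that can take advantage of parallelism.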
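As a rough sketch of that last suggestion: the multicore functions now live in the parallel package that ships with R, so mclapply can be used without installing anything. The helper names parallel_counts and the_bigrams below are made up for illustration, and the_bigrams is only a simplified stand-in worker; a worker built around the textcnt call from the question should slot in the same way, assuming it returns a named count vector.

```r
library(parallel)  # ships with R; provides mclapply()

# Hypothetical helper: run `worker` on chunks of `texts` in parallel and
# sum the named counts that each chunk returns.
parallel_counts <- function(texts, worker, n_chunks = 4L, cores = 1L) {
  # Assign each element to one of n_chunks roughly equal groups.
  groups <- ceiling(seq_along(texts) * n_chunks / length(texts))
  chunks <- split(texts, groups)
  # mclapply() forks, so cores > 1 only works on Unix-alikes.
  partial <- mclapply(chunks, worker, mc.cores = cores)
  # Flatten the per-chunk named vectors and sum the counts per bigram.
  all_counts <- unlist(unname(partial))
  if (length(all_counts) == 0) return(integer(0))
  tapply(all_counts, names(all_counts), sum)
}

# Stand-in worker (an assumption, not from the question): counts
# "the <word>" matches and returns a named integer vector.
# Simplified: regex matches do not overlap, so "the the cat" yields
# only "the the".
the_bigrams <- function(texts) {
  texts <- tolower(texts)
  hits <- unlist(regmatches(texts,
                            gregexpr("\\bthe [a-z']+", texts, perl = TRUE)))
  tab <- table(hits)
  setNames(as.integer(tab), names(tab))
}

tweets <- c("the cat sat", "the cat and the dog",
            "no match here", "bathe cat")
res <- parallel_counts(tweets, the_bigrams, n_chunks = 2L, cores = 1L)
print(res)
```

Raise cores on Linux or macOS to actually run in parallel. On Windows mclapply only accepts mc.cores = 1, so a cluster-based function such as parLapply would be needed there.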

Answer 2

You can get rid of the evil loop structure by collapsing the text column into one long string:

# here column is the column name; assign the collapsed string so textcnt can use it
a <- paste(dataset[[column]], collapse=" *** ")
bi <- textcnt(a, n = 2, method = "string")

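I was expecting to also need subset(bi, function(x) !grepl("*", x))

But it turns out that the textcnt method doesn't include bigrams containing * in its output, so you're good to go.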
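If installing extra packages is a problem, the same general idea (tokenise everything, tabulate once, and avoid growing a data frame with rbind inside a loop, which is likely a large part of the slowdown) can be sketched in base R. This is a swapped-in approach rather than the textcnt call above. The function name count_the_bigrams and its simple tokenisation rule (lowercase letters, digits and apostrophes count as word characters) are assumptions for illustration.

```r
# Hypothetical helper: count bigrams whose first word is "the", base R only.
count_the_bigrams <- function(text) {
  # Split each tweet into lowercase tokens; splitting per tweet means
  # no bigram can span two tweets.
  tokens <- strsplit(tolower(text), "[^a-z0-9']+")
  bigrams <- unlist(lapply(tokens, function(w) {
    w <- w[nzchar(w)]                      # drop empty leading tokens
    if (length(w) < 2) return(character(0))
    first <- w[-length(w)]                 # every word except the last
    second <- w[-1]                        # every word except the first
    paste(first, second)[first == "the"]   # keep pairs starting with "the"
  }))
  if (length(bigrams) == 0) {
    return(data.frame(bigram = character(0), freq = integer(0)))
  }
  counts <- table(bigrams)                 # tabulate once, at the end
  out <- data.frame(bigram = names(counts), freq = as.vector(counts),
                    stringsAsFactors = FALSE)
  out <- out[order(-out$freq), ]           # most frequent first
  rownames(out) <- NULL
  out
}

tweets <- c("the cat sat", "The cat and the dog",
            "bathe cat", "the the end")
res <- count_the_bigrams(tweets)
print(res)
```

With the data from the question this would be called as count_the_bigrams(dataset$text). Because the tokenisation differs from tau's, the counts may not match textcnt exactly on text with unusual punctuation.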

Slow bigram frequency function in R
