Question

I have the following data frame:

sent1 = data.frame(Sentences=c("abundant bad abnormal activity was     accomodative due to 2-face people","strange exciting activity was due to great 2-face people"), user = c(1,2))

along with the following pos and neg word lists:

pos = c("abound" , "abounds", "abundant", "exciting", "great")
neg = c("2-face","abnormal", "strange", "bad", "weird")

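Then I have the code below, which splits each sentence into its individual words and matches them against the words in the pos and neg lexicons. A pos word scores 1 and a neg word scores -1.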

library(stringr)  # provides str_split
words = (str_split(unlist(sent1$Sentences)," "))

tmp <- data.frame()
tmn <- data.frame()

for (i in 1:nrow(sent1)) {
  for (j in 1:length(words[[i]])) {
    for (k in 1:length(pos)){
      if (words[[i]][j] == pos[k]) {

        tmn <- cbind(i,paste(words[[i]][j-1],words[[i]][j],words[[i]][j+1],sep=" "),1)
        tmp <- rbind(tmp,tmn)
      }
    }
    for (m in 1:length(neg)){
      if (words[[i]][j] == neg[m]) { 
        tmn <- cbind(i,paste(words[[i]][j-1],words[[i]][j],words[[i]][j+1],sep=" "),-1)
        tmp <- rbind(tmp,tmn)
      }
    }  
  }
}

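With 1,000 sentences this takes about 10 minutes... with 10 million rows I could go on vacation. Can you give me some advice on how to speed this approach up, or how to avoid the loops? Thank you very much in advance.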

Required output:

user      matched word and its neighbour              sentimentScore
1         abundant bad                                      1
1         abundant bad abnormal                            -1
1         bad abnormal activity                            -1
1         was accomodative due                              1
1         to 2-face people                                 -1
2         strange exciting                                 -1
2         strange exciting activity                         1
2         to great 2-face                                   1
2         great 2-face people                              -1

Answer 1

You can use the str_count function from stringr:

library(stringr)
# One alternative per positive word, each followed by a space: "abound |abounds |...|great "
posReg <- paste0(pos, " ", collapse="|")
str_count(sprintf("%s ", as.character(sent1$Sentences)), posReg)

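This builds a regular expression that matches any of the positive words followed by a space, then counts the matches of that regex in each sentence. I append a space to each sentence so that a keyword at the very end still matches. It doesn't handle punctuation and the like, so you'll need to take care of that yourself, but that wasn't part of the original question.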
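As a rough illustration of the same counting idea applied to both lexicons, here is a minimal base R sketch. It is not the code from this answer: `count_hits` is a hypothetical helper, the sample sentences are retyped with single spaces, and it pads each sentence with a space on both sides and uses lookarounds so that a keyword only matches as a whole space-delimited token.

```r
pos <- c("abound", "abounds", "abundant", "exciting", "great")
neg <- c("2-face", "abnormal", "strange", "bad", "weird")
sentences <- c(
  "abundant bad abnormal activity was accomodative due to 2-face people",
  "strange exciting activity was due to great 2-face people"
)

# Hypothetical helper: count lexicon words that appear as whole,
# space-delimited tokens in each sentence.
count_hits <- function(sentences, lexicon) {
  padded  <- paste0(" ", sentences, " ")
  # Lookbehind/lookahead require a space on each side without consuming it,
  # so two adjacent keywords are both counted.
  pattern <- paste0("(?<= )(", paste(lexicon, collapse = "|"), ")(?= )")
  lengths(regmatches(padded, gregexpr(pattern, padded, perl = TRUE)))
}

pos_count <- count_hits(sentences, pos)
neg_count <- count_hits(sentences, neg)
net_score <- pos_count - neg_count  # one net sentiment value per sentence
```

This gives only per-sentence counts, not the matched word with its neighbours that the question asks for. Lexicon entries containing regex metacharacters would need escaping before being pasted into the pattern.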

Answer 2

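The main drawbacks of your code are: (1) you don't preallocate your results and then fill them in; even if you aren't sure of the final length / nrow / ncol etc., it is better to allocate extra space up front. (2) You "read through" pos and neg for every single word, whereas you could use match (or %in%, which is built on it); for example, compare the performance: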

x = sample(letters, 1e3, T); table = letters[c(1:3, 10, 15)]
identical(x %in% table, unlist(lapply(x, function(y) any(y == table))))
#[1] TRUE
microbenchmark::microbenchmark(x %in% table, unlist(lapply(x, function(y) any(y == table))))
#Unit: microseconds
#                                           expr      min        lq   median      uq       max neval
#                                   x %in% table   31.320   32.0165   33.408   34.80    42.457   100
# unlist(lapply(x, function(y) any(y == table))) 1310.925 1388.5290 1429.768 1536.43 45416.740   100
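Point (1), preallocation, can be sketched in the same spirit. This is only an illustrative toy, not part of the original answer: `grow` and `prealloc` are hypothetical names, and the squared-index payload stands in for whatever result each iteration produces.

```r
n <- 1e4

# Growing the result: every c() call allocates a new, longer vector
# and copies everything collected so far.
grow <- function(n) {
  out <- c()
  for (i in seq_len(n)) out <- c(out, i^2)
  out
}

# Preallocating: allocate the full vector once, then fill it by index.
prealloc <- function(n) {
  out <- numeric(n)
  for (i in seq_len(n)) out[i] <- i^2
  out
}

same_result <- identical(grow(n), prealloc(n))
```

Both functions return the same vector; timing them with microbenchmark, as above, is one way to see how the difference behaves as n grows. The same reasoning applies to calling rbind on a data frame inside a loop.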

One way to approach the problem could be:

tmp = lapply(strsplit(as.character(sent1$Sentences), " "), 
             function(x) {
                p = which(x %in% pos)
                n = which(x %in% neg)
                data.frame(word = c(unlist(lapply(p, function(i) paste0(c(x[i - 1], x[i], x[i + 1]), collapse = " "))),
                                    unlist(lapply(n, function(i) paste0(c(x[i - 1], x[i], x[i + 1]), collapse = " ")))),
                           val = rep(c(1, -1), c(length(p), length(n))))
             }) 
cbind(user = rep(sent1$user, sapply(tmp, nrow)), do.call(rbind, tmp))            
#  user                      word val
#1    1              abundant bad   1
#2    1     abundant bad abnormal  -1
#3    1     bad abnormal activity  -1
#4    1          to 2-face people  -1
#5    2 strange exciting activity   1
#6    2           to great 2-face   1
#7    2          strange exciting  -1
#8    2       great 2-face people  -1
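If the per-sentence lapply is still too slow, one further option is to flatten all words into a single vector and match once. The following is a hedged sketch of that idea rather than a drop-in replacement: the variable names are made up for illustration, sentences are split on runs of spaces (an assumption, to avoid empty tokens), and a word that appears in both lexicons would score 0 and be skipped.

```r
sent1 <- data.frame(
  Sentences = c("abundant bad abnormal activity was accomodative due to 2-face people",
                "strange exciting activity was due to great 2-face people"),
  user = c(1, 2),
  stringsAsFactors = FALSE
)
pos <- c("abound", "abounds", "abundant", "exciting", "great")
neg <- c("2-face", "abnormal", "strange", "bad", "weird")

words   <- strsplit(as.character(sent1$Sentences), " +")  # split on runs of spaces
flat    <- unlist(words)                                  # all words, in order
sent_id <- rep(seq_along(words), lengths(words))          # sentence index of each word
n       <- length(flat)

# Score every word in one pass: 1 for pos, -1 for neg, 0 otherwise.
score <- (flat %in% pos) - (flat %in% neg)
hit   <- which(score != 0)

# Neighbour positions, clamped to the valid range; a neighbour is used
# only if it exists and belongs to the same sentence as the matched word.
prev_i  <- pmax(hit - 1, 1)
next_i  <- pmin(hit + 1, n)
prev_ok <- hit > 1 & sent_id[prev_i] == sent_id[hit]
next_ok <- hit < n & sent_id[next_i] == sent_id[hit]
prev_w  <- ifelse(prev_ok, flat[prev_i], "")
next_w  <- ifelse(next_ok, flat[next_i], "")

res <- data.frame(
  user = sent1$user[sent_id[hit]],
  word = trimws(paste(prev_w, flat[hit], next_w)),
  val  = score[hit],
  stringsAsFactors = FALSE
)
```

Unlike the lapply version, the rows here come out in order of appearance within each sentence rather than grouped by polarity, and a match at the end of a sentence gets no trailing neighbour instead of NA.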
