I have the following data frame:
sent1 = data.frame(Sentences=c("abundant bad abnormal activity was accomodative due to 2-face people","strange exciting activity was due to great 2-face people"), user = c(1,2))
together with the following pos and neg word lists:
pos = c("abound" , "abounds", "abundant", "exciting", "great")
neg = c("2-face","abnormal", "strange", "bad", "weird")
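Then I have the code below, which splits each sentence into its individual words and matches them against the words in the pos and neg dictionaries. A pos word scores 1 and a neg word scores -1.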
library(stringr)  # provides str_split()

# split every sentence into its words
words = (str_split(unlist(sent1$Sentences)," "))
tmp <- data.frame()
tmn <- data.frame()
for (i in 1:nrow(sent1)) {
  for (j in 1:length(words[[i]])) {
    # compare the current word against every positive word
    for (k in 1:length(pos)){
      if (words[[i]][j] == pos[k]) {
        tmn <- cbind(i,paste(words[[i]][j-1],words[[i]][j],words[[i]][j+1],sep=" "),1)
        tmp <- rbind(tmp,tmn)
      }
    }
    # compare the current word against every negative word
    for (m in 1:length(neg)){
      if (words[[i]][j] == neg[m]) {
        tmn <- cbind(i,paste(words[[i]][j-1],words[[i]][j],words[[i]][j+1],sep=" "),-1)
        tmp <- rbind(tmp,tmn)
      }
    }
  }
}
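With 1,000 sentences this takes about 10 minutes, and with 10 million rows I could go on vacation while it runs. Could you give me some advice on how to speed this approach up, or how to avoid the loops? Thank you very much in advance.
Required output: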
user  matched word and its neighbour  sentimentScore
1     abundant bad                     1
1     abundant bad abnormal           -1
1     bad abnormal activity           -1
1     was accomodative due             1
1     to 2-face people                -1
2     strange exciting                -1
2     strange exciting activity        1
2     to great 2-face                  1
2     great 2-face people             -1
Answer 0 (score: 0)
You can use the str_count function from stringr:
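library(stringr)
# one alternative per positive word, each followed by a single space
posReg <- paste0(pos, " ", collapse="|")
# append a space to every sentence so that a keyword at the very end still matches
str_count(sprintf("%s ", as.character(sent1$Sentences)), posReg)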
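This builds a regular expression that matches any of the positive words followed by a space, then counts the matches of that regular expression in each sentence. I append a space to each sentence so that a keyword at the very end of a sentence still matches. It does not handle punctuation and the like, so you will need to take care of that yourself, but that was not part of the original question.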
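As a rough illustration of the punctuation caveat, here is a minimal base R sketch that uses word boundaries (`\\b`) instead of a trailing space. The helper name `count_words` is hypothetical, and the second sample sentence was altered to end in ", great." so that a keyword sits next to punctuation. It assumes the dictionary words contain no regex metacharacters that would need escaping.

```r
pos <- c("abound", "abounds", "abundant", "exciting", "great")
neg <- c("2-face", "abnormal", "strange", "bad", "weird")

# Sample data; the second sentence is modified (assumption) to include punctuation
sentences <- c(
  "abundant bad abnormal activity was accomodative due to 2-face people",
  "strange exciting activity was due to 2-face people, great."
)

# Hypothetical helper: count whole-word occurrences of any dictionary word
count_words <- function(sentences, dict) {
  # \\b anchors every alternative at a word boundary, so "great." still counts
  pattern <- paste0("\\b(", paste(dict, collapse = "|"), ")\\b")
  hits <- gregexpr(pattern, sentences, perl = TRUE)
  # regmatches() returns one character vector of matches per sentence
  lengths(regmatches(sentences, hits))
}

pos_count <- count_words(sentences, pos)
neg_count <- count_words(sentences, neg)
```

Subtracting `neg_count` from `pos_count` would give a net score per sentence, although this only counts matches and does not return the neighbouring words the question asks for.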
Answer 1 (score: 0)
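The main drawbacks of your code are: (1) you do not preallocate your results and then fill them in; even if you are not sure of the final length / nrow / ncol etc., it is better to allocate extra space up front. (2) You "read through" pos and neg for every single word, whereas you could use match (or its wrapper %in%); for example, compare the performance: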
x = sample(letters, 1e3, T); table = letters[c(1:3, 10, 15)]
identical(x %in% table, unlist(lapply(x, function(y) any(y == table))))
#[1] TRUE
microbenchmark::microbenchmark(x %in% table, unlist(lapply(x, function(y) any(y == table))))
#Unit: microseconds
#                                           expr      min        lq   median      uq       max neval
#                                   x %in% table   31.320   32.0165   33.408   34.80    42.457   100
# unlist(lapply(x, function(y) any(y == table))) 1310.925 1388.5290 1429.768 1536.43 45416.740   100
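Point (1) can be illustrated in the same spirit. The following is only a sketch with made-up columns (`id`, `val`) and hypothetical function names (`grow`, `prealloc`): one version grows a data frame with rbind inside the loop, the other fills vectors that were allocated up front and builds the data frame once at the end.

```r
n <- 2000

# Growing: each rbind() copies everything accumulated so far
grow <- function(n) {
  out <- data.frame()
  for (i in seq_len(n)) {
    out <- rbind(out, data.frame(id = i, val = i * 2))
  }
  out
}

# Preallocating: reserve the vectors once, then fill them in place
prealloc <- function(n) {
  id <- integer(n)
  val <- numeric(n)
  for (i in seq_len(n)) {
    id[i] <- i
    val[i] <- i * 2
  }
  data.frame(id = id, val = val)
}

grown <- grow(n)
filled <- prealloc(n)

# Both versions produce the same values; compare the timings on your machine
system.time(grow(n))
system.time(prealloc(n))
```

When the final size is unknown, one option is to allocate more than you expect to need, keep a counter of the filled positions, and trim the unused tail afterwards.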
One way to solve this might be:
tmp = lapply(strsplit(as.character(sent1$Sentences), " "),
             function(x) {
               p = which(x %in% pos)   # positions of positive words
               n = which(x %in% neg)   # positions of negative words
               data.frame(word = c(unlist(lapply(p, function(i) paste0(c(x[i - 1], x[i], x[i + 1]), collapse = " "))),
                                   unlist(lapply(n, function(i) paste0(c(x[i - 1], x[i], x[i + 1]), collapse = " ")))),
                          val = rep(c(1, -1), c(length(p), length(n))))
             })
cbind(user = rep(sent1$user, sapply(tmp, nrow)), do.call(rbind, tmp))
#  user                      word val
#1    1              abundant bad   1
#2    1     abundant bad abnormal  -1
#3    1     bad abnormal activity  -1
#4    1          to 2-face people  -1
#5    2 strange exciting activity   1
#6    2           to great 2-face   1
#7    2          strange exciting  -1
#8    2       great 2-face people  -1
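If the per-sentence lapply is still too slow on millions of rows, one possible variation (a sketch, not part of the answer above) is to flatten all words into a single vector and call %in% once over it. The names `flat`, `sent_id`, `position` and `hit` are introduced here for illustration only. Unlike the code above, this version leaves the neighbour empty at a sentence edge and returns rows in sentence order rather than grouping positives before negatives.

```r
sentences <- c(
  "abundant bad abnormal activity was accomodative due to 2-face people",
  "strange exciting activity was due to great 2-face people"
)
user <- c(1, 2)
pos <- c("abound", "abounds", "abundant", "exciting", "great")
neg <- c("2-face", "abnormal", "strange", "bad", "weird")

words <- strsplit(sentences, " ", fixed = TRUE)
n_words <- lengths(words)                        # number of words per sentence
flat <- unlist(words)                            # all words in one vector
sent_id <- rep(seq_along(words), n_words)        # sentence index of each word
position <- sequence(n_words)                    # word index within its sentence

# One vectorised lookup per dictionary instead of one loop per word
score <- ifelse(flat %in% pos, 1, ifelse(flat %in% neg, -1, 0))
hit <- which(score != 0)

# Neighbours; an empty string is used at the start or end of a sentence
prev_word <- ifelse(position[hit] > 1, flat[pmax(hit - 1, 1)], "")
next_word <- ifelse(position[hit] < n_words[sent_id[hit]],
                    flat[pmin(hit + 1, length(flat))], "")

result <- data.frame(
  user = user[sent_id[hit]],
  word = trimws(paste(prev_word, flat[hit], next_word)),
  val = score[hit],
  stringsAsFactors = FALSE
)
```

The pmax and pmin calls only keep the indices inside the vector; the surrounding ifelse decides whether the neighbour is actually used, so a word from an adjacent sentence never leaks into the result.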