I have the following data.frame of sentences, plus posWords / negWords vectors that serve as dictionaries:
sent <- data.frame(words = c("just right size and i love this notebook", "benefits great laptop",
"wouldnt bad notebook", "very good quality", "orgtop",
"great improvement for that bad product but overall is not good", "notebook is not good but i love batterytop"), user = c(1,2,3,4,5,6,7),
stringsAsFactors = FALSE)
posWords <- c("great","improvement","love","great improvement","very good","good","right","very","benefits",
"extra","benefit","top","extraordinarily","extraordinary","super","benefits super","good","benefits great")
negWords <- c("hate","bad","not good","horrible")
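Then I have a script that matches the words in each sentence against posWords / negWords. When a word matches, the algorithm also picks the nearest neighbours of the matched pos/neg word, by which I mean +/- 2 words around the matched word within that sentence: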
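library(plyr)  # provides ldply()

counter <- 0
dataOut <- ldply(strsplit(as.character(sent$words), " "),
  function(x) {
    counter <<- counter + 1
    p <- which(x %in% posWords)  # positions of positive words
    n <- which(x %in% negWords)  # positions of negative words
    # Paste the matched word with up to 2 neighbours on each side.
    # The window is clipped to the sentence, because x[i - 2] with i = 1
    # is x[-1] (everything but the first word) and x[i + 2] past the end is NA.
    window <- function(i) paste(x[max(1, i - 2):min(length(x), i + 2)], collapse = " ")
    positive <- vapply(p, window, character(1))
    negative <- vapply(n, window, character(1))
    if (length(positive) > 0 || length(negative) > 0) {
      # data.frame() keeps val numeric; cbind() would coerce every column to character
      data.frame(user = counter, word = c(positive, negative),
                 val = rep(c(1, -1), c(length(p), length(n))), stringsAsFactors = FALSE)
    }
  })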
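The problem is that I want exact matching, to avoid cases such as "not good": the algorithm picks "good", scores it as positive (+1), and then selects its nearest neighbours (+/- 2 words around the matched word). Since I put "not good" in the negative dictionary, I need it to be scored as -1 instead, with the nearest neighbours searched around that multi-word term.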
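To make the problem concrete, here is a minimal illustration on a single sentence. The posWords vector is shortened here purely for readability; that shortening is an assumption of this example, not part of my real dictionary.

```r
# Tokenise one sentence the same way the script does
x <- strsplit("notebook is not good but i love batterytop", " ")[[1]]

negWords <- c("hate", "bad", "not good", "horrible")
posWords <- c("great", "good", "love")  # shortened dictionary, for illustration only

# "not good" never appears as a single token, so it cannot match
neg_hits <- x[x %in% negWords]
# "good" and "love" match on their own and would both be scored +1
pos_hits <- x[x %in% posWords]
```

Here neg_hits is empty, while pos_hits contains "good", which is exactly the wrong evaluation I want to avoid.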
The problematic part of my approach is this:
strsplit(as.character(sent$words), " ")
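It splits every sentence into single words, so a multi-word dictionary term such as "not good" can never be matched exactly, and that is the problem...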
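One direction I have thought about, as a rough sketch only: glue every multi-word dictionary term into a single token before splitting, so that strsplit() keeps it together. The helper protect_terms() below is hypothetical (my own name, not from any package), the dictionary is shortened, and the sketch assumes the terms contain no regex metacharacters. I am not sure this is a sound approach.

```r
# Hypothetical helper: replace the spaces inside each multi-word term with "_"
# so the term survives strsplit() as one token. Assumes plain-word terms.
protect_terms <- function(text, terms) {
  multi <- terms[grepl(" ", terms, fixed = TRUE)]
  multi <- multi[order(nchar(multi), decreasing = TRUE)]  # longest terms first
  for (term in multi) {
    pattern <- paste0("\\b", term, "\\b")                 # whole-word match only
    text <- gsub(pattern, gsub(" ", "_", term, fixed = TRUE), text, perl = TRUE)
  }
  text
}

negWords <- c("hate", "bad", "not good", "horrible")
posWords <- c("great", "improvement", "love", "great improvement", "very good", "good")
sentence <- "great improvement for that bad product but overall is not good"

x <- strsplit(protect_terms(sentence, c(posWords, negWords)), " ")[[1]]

# Compare against the dictionaries in the same "_"-joined form
p <- which(x %in% gsub(" ", "_", posWords, fixed = TRUE))
n <- which(x %in% gsub(" ", "_", negWords, fixed = TRUE))
```

With this, x[p] is "great_improvement" and x[n] is "bad" and "not_good", so "good" is no longer counted as positive on its own, and the +/- 2 word window could then be taken around p and n as before. When two multi-word terms overlap, the longer one is replaced first.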
Could anyone please help me? I really don't know how to do this. I would be very grateful for any help or advice. Many thanks in advance.