Exact matching of single words and multi-word terms from a dictionary in R

Asked: 2015-02-18 11:29:40

Tags: r

I have the following data.frame of sentences, plus posWords / negWords vectors that serve as dictionaries:

sent <- data.frame(words = c("just right size and i love this notebook", "benefits great laptop",
                         "wouldnt bad notebook", "very good quality", "orgtop",
                         "great improvement for that bad product but overall is not good",
                         "notebook is not good but i love batterytop"),
               user = c(1,2,3,4,5,6,7),
               stringsAsFactors = FALSE)

posWords <- c("great","improvement","love","great improvement","very good","good","right","very","benefits",
          "extra","benefit","top","extraordinarily","extraordinary","super","benefits super","good","benefits great")
negWords <- c("hate","bad","not good","horrible")

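Then I have a script that matches the words in each sentence against posWords / negWords. When a word matches, the algorithm also picks up the nearest neighbours of the matched word, by which I mean the +/- 2 words around the matched pos / neg word in that particular sentence: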

library(plyr)  # provides ldply()

counter <- 0
dataOut <- ldply(strsplit(as.character(sent$words), " "), 
             function(x) {
               counter <<- counter + 1
               p <- which(x %in% posWords)
               n <- which(x %in% negWords)
               # matched word plus up to 2 neighbours on each side; the window is
               # clamped to the sentence so x[0], x[-1] and NA are never pasted in
               ctx <- function(i) paste(x[max(1, i - 2):min(length(x), i + 2)], collapse = " ")
               positive <- vapply(p, ctx, character(1))
               negative <- vapply(n, ctx, character(1))
               if (length(positive) > 0 || length(negative) > 0) {
                 cbind(user = counter, word = c(positive, negative), val = rep(c(1, -1), c(length(p), length(n))))
               }
             })

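The problem is that I want exact matching, to avoid cases such as "not good": the algorithm picks "good", scores it as positive (+1), and then takes its nearest neighbours (+/- 2 words around that matched word). But since I have put "not good" into the negative dictionary, I need it to be scored as -1, and the nearest neighbours should then be taken around that multi-word term.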
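To make the failure concrete, here is a minimal illustration on one of the sentences above. The names `tokens`, `pos`, `neg`, `pos_hits` and `neg_hits` are made up for this example, and the dictionaries are cut down to a few entries:

```r
# Illustration only: token-by-token lookup on one sentence from `sent`
tokens <- strsplit("notebook is not good but i love batterytop", " ")[[1]]
pos <- c("good", "love")      # small subset of posWords
neg <- c("bad", "not good")   # small subset of negWords

# "good" is found on its own and would be scored +1
pos_hits <- tokens[tokens %in% pos]

# "not good" is never a single token after the split, so nothing matches
neg_hits <- tokens[tokens %in% neg]
```

Here `pos_hits` contains "good" and "love", while `neg_hits` is empty, which is exactly the wrong scoring for "not good".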

The part of my approach that causes the problem is this:

strsplit(as.character(sent$words), " ")

It splits every sentence into individual words, which is a problem because an exact match on a multi-word term is then impossible...
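One possible direction, sketched here under assumptions rather than as a tested solution: glue the multi-word dictionary terms together with "_" before splitting, so that each of them survives `strsplit()` as a single token. The helper `protect_terms()` and the variables `pos_u`, `neg_u` and `score` are hypothetical names. The sketch assumes the dictionary terms contain no regex metacharacters, and it resolves competing phrases simply by trying longer terms first.

```r
# Hypothetical helper: replace each multi-word dictionary term found in
# `text` by the same term joined with "_", longest terms first.
protect_terms <- function(text, dict) {
  multi <- dict[grepl(" ", dict, fixed = TRUE)]
  multi <- multi[order(nchar(multi), decreasing = TRUE)]
  for (term in multi) {
    pattern <- paste0("\\b", term, "\\b")   # whole-word match only
    text <- gsub(pattern, gsub(" ", "_", term, fixed = TRUE), text, perl = TRUE)
  }
  text
}

pos <- c("great", "improvement", "great improvement", "good", "love")
neg <- c("bad", "not good")

sentence <- "great improvement for that bad product but overall is not good"
tokens <- strsplit(protect_terms(sentence, c(pos, neg)), " ")[[1]]

# Put the dictionaries into the same "_" form, then do an exact token lookup;
# negative terms are checked first
pos_u <- gsub(" ", "_", pos, fixed = TRUE)
neg_u <- gsub(" ", "_", neg, fixed = TRUE)
score <- ifelse(tokens %in% neg_u, -1, ifelse(tokens %in% pos_u, 1, 0))

matched <- data.frame(word = tokens[score != 0], val = score[score != 0])
```

With these inputs "not_good" becomes one token scored -1, and "good" is no longer matched separately. The +/- 2 neighbour window could then be taken over `tokens`, so that the neighbours are counted around the whole multi-word term.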

Can anyone help me, please? I really don't know how to do this. I would be very grateful for any help or advice. Thanks in advance.

0 answers:

No answers yet