Stemming in R does not work as expected

Time: 2015-04-08 15:18:46

Tags: r tm snowball

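I am trying to do some very simple word stemming in R and am getting something very unexpected. In the code below, the `complete` variable is `NA`. Why can't I complete the stem of a simple word?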

library(tm) 
library(SnowballC)
dict <- c("easy")
stem <- stemDocument(dict, language = "english")
complete <- stemCompletion(stem, dictionary=dict)

Thanks!

2 answers:

Answer 0 (score: 1)

You can look at the internals of the stemCompletion() function with tm:::stemCompletion.

function (x, dictionary, type = c("prevalent", "first", "longest", "none", "random", "shortest")){
if(inherits(dictionary, "Corpus")) 
  dictionary <- unique(unlist(lapply(dictionary, words)))
type <- match.arg(type)
possibleCompletions <- lapply(x, function(w) grep(sprintf("^%s",w), dictionary, value = TRUE))
switch(type, first = {
  setNames(sapply(possibleCompletions, "[", 1), x)
}, longest = {
  ordering <- lapply(possibleCompletions, function(x) order(nchar(x), 
      decreasing = TRUE))
  possibleCompletions <- mapply(function(x, id) x[id], 
      possibleCompletions, ordering, SIMPLIFY = FALSE)
  setNames(sapply(possibleCompletions, "[", 1), x)
}, none = {
  setNames(x, x)
}, prevalent = {
  possibleCompletions <- lapply(possibleCompletions, function(x) sort(table(x), 
      decreasing = TRUE))
  n <- names(sapply(possibleCompletions, "[", 1))
  setNames(if (length(n)) n else rep(NA, length(x)), x)
}, random = {
  setNames(sapply(possibleCompletions, function(x) {
      if (length(x)) sample(x, 1) else NA
  }), x)
}, shortest = {
  ordering <- lapply(possibleCompletions, function(x) order(nchar(x)))
  possibleCompletions <- mapply(function(x, id) x[id], 
      possibleCompletions, ordering, SIMPLIFY = FALSE)
  setNames(sapply(possibleCompletions, "[", 1), x)
})

}

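The x argument is your stemmed terms, and dictionary is the unstemmed words. The only line that matters is the fifth; it does a simple regular-expression match for each stem against the list of dictionary terms.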

possibleCompletions <- lapply(x, function(w) grep(sprintf("^%s",w), dictionary, value = TRUE))

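So it fails because it cannot find "easi" at the start of "easy". If you also have "easiest" in your dictionary, both terms match, because there is now a dictionary word that begins with the same four letters.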
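
The matching step can be reproduced without loading tm. The following is a minimal sketch in base R that reuses the grep call from the function source above; `find_completions` is a hypothetical helper name introduced here for illustration and is not part of tm.

```r
# Base R only: reproduce the prefix match that stemCompletion() performs.
stem <- "easi"  # the stem that stemDocument() produces for "easy"

# Hypothetical helper (not part of tm): the same grep call as in the source.
find_completions <- function(stem, dictionary) {
  grep(sprintf("^%s", stem), dictionary, value = TRUE)
}

# Dictionary with only "easy": no word starts with "easi".
no_match <- find_completions(stem, c("easy"))

# Dictionary that also contains "easiest": one word starts with "easi".
with_match <- find_completions(stem, c("easy", "easiest"))

print(no_match)    # character(0)
print(with_match)  # "easiest"
```

The empty result in the first case is what leaves stemCompletion() with no candidate to return. Note that the stem is interpolated into the pattern as a regular expression, so a stem containing regex metacharacters may match differently than a literal prefix would.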

library(tm) 
library(SnowballC)
dict <- c("easy","easiest")
stem <- stemDocument(dict, language = "english")
complete <- stemCompletion(stem, dictionary=dict)
complete
#      easi   easiest 
# "easiest" "easiest" 

Answer 1 (score: 0)

wordStem() seems to do this:

library(tm)
library(SnowballC)
dict <- c("easy")
wordStem(dict)
# [1] "easi"