Question

I am trying to use stemCompletion to convert stems back into complete words.

Here is the code I am using:

library(tm)

txt <- c("Once we have a corpus we typically want to modify the documents in it",
         "e.g., stemming, stopword removal, et cetera.",
         "In tm, all this functionality is subsumed into the concept of a transformation.")

myCorpus <- Corpus(VectorSource(txt))
myCorpus <- tm_map(myCorpus, content_transformer(tolower))
myCorpus <- tm_map(myCorpus, removePunctuation)
myCorpusCopy <- myCorpus

# Remove common word endings (e.g., "ing", "es")
myCorpus.stemmed <- tm_map(myCorpus, stemDocument, language = "english")
myCorpus.unstemmed <- tm_map(myCorpus.stemmed, stemCompletion, dictionary=myCorpusCopy)

If I inspect the first element of the stemmed corpus, it is displayed correctly:

myCorpus.stemmed[[1]][1]
$content
[1] "onc we have a corpus we typic want to modifi the document in it"

However, if I inspect the first element of the unstemmed corpus, it returns garbage:

myCorpus.unstemmed[[1]][1]
$content
[1] NA

Why is the unstemmed corpus not showing the right content?

Answer 1

Why is the unstemmed corpus not showing the right content?

Since you have a simple corpus object, you are effectively calling

stemCompletion(
  x = c("once we have a corpus we typically want to modify the documents in it", 
        "eg stemming stopword removal et cetera", 
        "in tm all this functionality is subsumed into the concept of a transformation"),
  dictionary=myCorpusCopy
)

which yields

# once we have a corpus we typically want to modify the documents in it
# NA
# eg stemming stopword removal et cetera
# NA
# in tm all this functionality is subsumed into the concept of a transformation
# NA

since stemCompletion expects a character vector of stems as its first argument (c("once", "we", "have")), not a character vector of stemmed texts (c("once we have")).

If you want to complete the stems in your corpus, whatever that may be good for, you have to pass a character vector of single stems to stemCompletion (i.e., tokenize each text document into stems, complete the stems, then paste them back together).
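The tokenize, complete, and paste steps described above can be sketched as follows. This is a minimal illustration, not the answerer's own code: the names stemmed_text, dictionary, stems, completed, and completed_text are made up for the example, and the dictionary is a plain character vector rather than a corpus.

```r
library(tm)

# Illustrative inputs (assumed for this sketch): one stemmed text and the
# words of its unstemmed original, used as the completion dictionary.
stemmed_text <- "onc we have a corpus we typic want"
dictionary <- c("once", "we", "have", "a", "corpus", "typically", "want")

# 1. Tokenize the stemmed text into single stems.
stems <- unlist(strsplit(stemmed_text, " ", fixed = TRUE))

# 2. Complete each stem against the dictionary.
completed <- stemCompletion(stems, dictionary = dictionary, type = "shortest")

# 3. Paste the completed words back into one text.
completed_text <- paste(unname(completed), collapse = " ")
print(completed_text)
```

Passing stemmed_text to stemCompletion as a single string would instead look for dictionary words that begin with the whole sentence, which is why nothing gets completed in the question.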

Answer 2

I am only slightly familiar with tm, but doesn't stemCompletion require the tokens to be stems rather than already completed words?

Answer 3

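Thanks to Luke for the answer. I went looking for a function that could help convert the sample text into a character vector.

I came across this answer to another question, which provides a custom function that splits the text into individual words before applying stemCompletion.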

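stemCompletion_mod <- function(x, dict = dictCorpus) {
  # Split the document into single stems, complete each one, then rejoin them.
  stems <- unlist(strsplit(as.character(x), " "))
  completed <- stemCompletion(stems, dictionary = dict, type = "shortest")
  PlainTextDocument(stripWhitespace(paste(completed, collapse = " ")))
}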

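I used this function together with lapply to get a list containing the unstemmed versions. This returns the correct values, but not as a SimpleCorpus! I would need to do some processing on the output list to convert it to a SimpleCorpus.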

myCorpus.unstemmed <- lapply(myCorpus.stemmed, stemCompletion_mod, dict = myCorpusCopy)

> myCorpus.stemmed[[1]][1]
$content
[1] "onc we have a corpus we typic want to modifi the document in it"

> myCorpus.unstemmed[[1]][1]
$content
[1] "once we have a corpus we typically want to the documents in it"
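For the conversion of the lapply output back to a corpus, one possible approach is to pull the text out of each document and build a new corpus from the resulting character vector. The sketch below is an assumption about how this could be done, not something from the original answers; docs, texts, and rebuilt are hypothetical names, and the two documents stand in for the real list.

```r
library(tm)

# Stand-in for the list returned by lapply(): one PlainTextDocument per text.
docs <- list(
  PlainTextDocument("once we have a corpus"),
  PlainTextDocument("in tm all this functionality")
)

# Extract the text of each document as a single string.
texts <- vapply(docs, function(d) paste(content(d), collapse = " "), character(1))

# Build a new corpus from the character vector.
rebuilt <- Corpus(VectorSource(texts))
print(class(rebuilt))
```

In recent versions of tm, Corpus() applied to a VectorSource should give a SimpleCorpus, but it is worth checking the printed class for the installed version.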

I don't know why stemCompletion did not complete "modify". But that will be part of another question to explore.
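One plausible explanation, assuming stemCompletion only considers dictionary words that begin with the stem, is that "modify" does not start with "modifi", so there is nothing to complete it to. The base R sketch below only mimics that prefix check and does not call tm; dictionary, stems, and has_completion are illustrative names.

```r
# Words from the original text and the stems produced for them.
dictionary <- c("once", "typically", "modify", "documents")
stems <- c("onc", "typic", "modifi", "document")

# For each stem, check whether any dictionary word starts with it.
has_completion <- vapply(
  stems,
  function(s) any(startsWith(dictionary, s)),
  logical(1)
)
print(has_completion)
```

Under that assumption, stems whose last letter was changed by the stemmer (such as "y" to "i") would need a different dictionary or a custom lookup to be restored.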

stemCompletion is not working
