R: Text analysis - tm package - stemComplete error

Time: 2015-02-20 01:11:35

Tags: regex r text tm stemming

Machine: Windows 7, 64-bit. R version: R version 3.1.2 (2014-10-31) - "Pumpkin Helmet"

I am trying to prepare some text for an analysis I am working on, and everything works up to the 'stemCompletion' step. For more background, see below.

Packages:

tm
SnowballC
rJava
RWeka
RWekajars
NLP

Sample list of words

test <- as.vector(c('win', 'winner', 'wins', 'wins', 'winning'))

Convert to a corpus

Test_Corpus <- Corpus(VectorSource(test))

Text manipulation

Test_Corpus <- tm_map(Test_Corpus, content_transformer(tolower))
Test_Corpus <- tm_map(Test_Corpus, removePunctuation)
Test_Corpus <- tm_map(Test_Corpus, removeNumbers)

Stemming with the tm package

Test_stem <- tm_map(Test_Corpus, stemDocument, language = 'english')

Here are the results of the stemming above, which are correct so far:

win
winner
win
win
win

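Now here comes the problem. When I try to use Test_Corpus as the dictionary to convert the stems back to their proper form, I use the following code:

Test_complete <- tm_map(Test_stem, stemCompletion, Test_Corpus)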

Here is the message I receive:

Warning messages:

1: In grep(sprintf("^%s", w), dictionary, value = TRUE) :
argument 'pattern' has length > 1 and only the first element will be  used
2: In grep(sprintf("^%s", w), dictionary, value = TRUE) :
argument 'pattern' has length > 1 and only the first element will be used
3: In grep(sprintf("^%s", w), dictionary, value = TRUE) :
argument 'pattern' has length > 1 and only the first element will be used
4: In grep(sprintf("^%s", w), dictionary, value = TRUE) :
argument 'pattern' has length > 1 and only the first element will be used
5: In grep(sprintf("^%s", w), dictionary, value = TRUE) :
argument 'pattern' has length > 1 and only the first element will be used

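I have tried several things listed in previous posts by other people who had the same problem, with no luck. Here is the list of what I tried:

Updated Java
Used content_transformer
Used PlainTextDocument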

1 answer:

Answer 0: (score: 0)

I think you need to save Test_Corpus as a dictionary before the stemming process. You can try something like Test_Corpus <- corpus, then continue working with corpus and use it in Test_complete <- tm_map(corpus, stemCompletion).
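
The warning in the question says that the 'pattern' argument of grep had length > 1, which means stemCompletion was handed something other than a single stem at a time. A minimal sketch of the "keep the dictionary before stemming" idea follows. It works on plain character vectors rather than through tm_map, and it assumes the tm and SnowballC packages are installed. The variable names dictionary, stems and completed are illustrative and do not come from the question.

```r
library(tm)         # provides stemCompletion()
library(SnowballC)  # provides wordStem()

# Sample words from the question
test <- c('win', 'winner', 'wins', 'wins', 'winning')

# Keep the unstemmed words as the completion dictionary BEFORE stemming
dictionary <- test

# Stem each word; wordStem() takes and returns a character vector
stems <- wordStem(test, language = "english")

# Complete every stem against the saved dictionary.
# stemCompletion() expects a character vector of stems as its first argument.
completed <- stemCompletion(stems, dictionary = dictionary, type = "prevalent")

print(completed)
```

If the completed words need to go back into a corpus, one option is to rebuild it from the character vector, for example with Corpus(VectorSource(unname(completed))). Whether that fits depends on how the real documents are structured, since this sketch treats each document as a single word.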