Question

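While doing text mining, I get an error when removing stop words from a text corpus that contains 500 documents. I am using R 3.1.3 on Ubuntu 14.04 LTS with the tm (text mining) package 0.6-1. Here is the code; please help.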

unsup.corpus = Corpus(DirSource(directory.location, encoding = "UTF-8"),
                      readerControl = list(language = "en_US"))


document.collection = unsup.corpus    
document.collection = tm_map(document.collection, stripWhitespace, mc.cores = 1)    
document.collection = tm_map(document.collection, content_transformer(tolower), mc.cores = 1)    
document.collection = tm_map(document.collection, removeNumbers, mc.cores = 1)    
document.collection = tm_map(document.collection, removePunctuation, mc.cores = 1)

document.collection = tm_map(document.collection, removeWords, stopwords("english"), mc.cores = 1)

###### Error #
Error in gsub(sprintf("(*UCP)\\b(%s)\\b", paste(sort(words, decreasing = TRUE), : input string 21 is invalid UTF-8

Answer 1

One thing you can do is

document.collection = 
        tm_map(document.collection[-21], removeWords, stopwords("english"), mc.cores = 1)

This gets rid of the "string" with the problematic characters.

If you want to deal with the problem string on its own, you can call it directly with

document.collection[21]

and do some investigation into the specifics.
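The index in the error message refers to the character vector handed to gsub, which may not line up with the document index, so it can help to test the strings for invalid UTF-8 directly. Below is a minimal sketch in base R, assuming the text is available as a character vector. The vector `lines` is a made-up stand-in, not the asker's data, and the result of a UTF-8 to UTF-8 `iconv` call can vary with the platform's iconv implementation.

```r
# Toy stand-in for the lines of one document (hypothetical data). The second
# string holds a raw 0xFF byte, which can never occur in valid UTF-8.
lines <- c("the quick brown fox", "broken \xff byte", "plain text")

# iconv() returns NA for a string it cannot convert, so converting
# UTF-8 to UTF-8 doubles as a validity check.
bad <- is.na(iconv(lines, from = "UTF-8", to = "UTF-8"))
which(bad)  # positions of the strings that are not valid UTF-8

# With sub = "", the offending bytes are dropped instead of yielding NA.
cleaned <- iconv(lines, from = "UTF-8", to = "UTF-8", sub = "")
```

If this behaves the same way on your machine, the same repair could be applied to the whole corpus before removeWords, for example with tm_map(document.collection, content_transformer(function(x) iconv(x, "UTF-8", "UTF-8", sub = "")), mc.cores = 1), so that no document has to be dropped.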
