Question

I've added a few words to my stopword list, but when I process the corpus and look at the word frequencies, one of the words seems to be stuck.

myStopwords <- c(stopwords('english'), "glove", "kgi")
corp <- tm_map(corp, removeWords, myStopwords)

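I then create a TDM and run a word-frequency count, and up pops 'glove'. There are other variants, such as 'glovechina' and 'gloves', which I would expect to be there, but not 'glove' itself. Am I missing something?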

Sample rows from the CSV source:

KGI 999 SZ11 GLOVE LEATHER PROT LOW
KGI 10054BC10 GLOVE SZ 10.5
SAL ILPG10A1010H GLOVES LTHR

Code:

corp <- Corpus(DataframeSource(x))
corp <- tm_map(corp, tolower)
corp <- tm_map(corp, PlainTextDocument)
corp <- tm_map(corp, removePunctuation)
corp <- tm_map(corp, removeNumbers)
myStopwords <- c(stopwords('english'), "glove", "kgi")
corp <- tm_map(corp, removeWords, myStopwords)
corp <- tm_map(corp, stemDocument)
corp <- tm_map(corp, stripWhitespace)

tdm <- TermDocumentMatrix(corp)

# print terms
dimnames(tdm)$Terms
save(tdm, file="tdm.RData")
# frequent terms
which(apply(tdm, 1, sum) > 20)
findFreqTerms(tdm, lowfreq=20)

I added the stopwords, but they don't seem to take effect.

0 answers: