Word frequencies are inaccurate after stemming text

Time: 2017-02-20 18:51:26

Tags: r hunspell

Thank you for taking the time to read my post. I'm a newbie here, and this is my first R script, with some sample data.

library(tm)
library(hunspell)
library(stringr)

docs <- VCorpus(VectorSource('He is a nice player, She could be a better player. Playing basketball is fun. Well played! We could have played better. Wish we had better players!'))

input <- strsplit(as.character(docs), " ")
input <- unlist(input)
input <- hunspell_stem(input)
input <- word(input,-1)

input <- VCorpus(VectorSource(input))
docs <- input

docs <- tm_map(docs, content_transformer(tolower))
docs <- tm_map(docs, removeWords, stopwords("english"))
docs <- tm_map(docs, removePunctuation)
docs <- tm_map(docs, stripWhitespace)

dtm <- TermDocumentMatrix(docs)
m <- as.matrix(dtm)
sort(rowSums(m),decreasing=TRUE)

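This returns the following result:

  character0 48 better 3 play 3 basketball 1 description 1
  fun 1 head 1 hour 1 language 1 meta 1 min 1 nice 1
  origin 1 well 1 wish 1 year 1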

Expected result:

  better 3 play 3 basketball 1 fun 1 language 1 nice 1 well 1 wish 1

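I'm not sure where some of these words come from (character0, description, meta, language, etc.). Is there a way to get rid of them?

Basically, what I'm trying to do is apply stemming with hunspell to a corpus (the data source is a SQL Server table) and then display the result in a word cloud. Any help would be appreciated. GD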

1 Answer:

Answer 0 (score: 0)

Here's why the example from the comments fails:

library(tm) 
library(hunspell) 
hunspell_stem(strsplit('Thanks lukeA for your help!', "\\W")[[1]])
# [[1]]
# [1] "thank"
# 
# [[2]]
# character(0)
# 
# [[3]]
# [1] "for"
# 
# [[4]]
# [1] "your"
# 
# [[5]]
# [1] "help"
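
The character(0) entries above are one source of stray tokens. Another, in the question's script, is probably the as.character(docs) call: applied to a list-like object, as.character() deparses each nested element, so field names and empty values end up in the text. The sketch below uses only base R, and the nested list is a hypothetical stand-in for a corpus document, not the real tm structure:

```r
# Hypothetical stand-in for a corpus: a list holding one "document"
# with its text and some metadata fields (not the real tm classes).
fake_corpus <- list(
  list(content = "Well played",
       meta = list(author = character(0), language = "en"))
)

# as.character() on a list deparses each nested element, so the
# metadata structure becomes part of the resulting string.
flat <- as.character(fake_corpus)
tokens <- unlist(strsplit(flat, " "))
print(tokens)

# Reading the text field directly avoids the metadata tokens.
text_only <- fake_corpus[[1]]$content
print(unlist(strsplit(text_only, " ")))
```

With a real VCorpus, the text would be read through tm's own accessors (for example content(docs[[1]])) rather than with as.character() on the whole corpus.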

Here's one way to make it work:

docs <- VCorpus(VectorSource('Thanks lukeA for your help!')) 
myStem <- function(x) { 
  res <- hunspell_stem(x)
  idx <- which(lengths(res)==0)
  if (length(idx)>0)
    res[idx] <- x[idx]
  sapply(res, tail, 1) 
}
dtm <- TermDocumentMatrix(docs, control = list(stemming = myStem)) 
m <- as.matrix(dtm) 
sort(rowSums(m),decreasing=TRUE)
  # for help! lukea thank  your 
  #   1     1     1     1     1 

This returns the original token if no stem is found, and the last stem if there are several.
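
The fallback rule can be checked in isolation, without hunspell installed. In this minimal sketch, fake_stem and its lookup table are hypothetical stand-ins that only mimic the shape of hunspell_stem()'s return value (a list with one character vector per token); the stems in it are made up for illustration:

```r
# Hypothetical stand-in for hunspell_stem(): one character vector of
# candidate stems per token, character(0) when the word is unknown.
fake_stem <- function(tokens) {
  lookup <- list(
    thanks = "thank",
    played = c("played", "play"),  # several candidate stems
    help   = "help"
  )
  lapply(tokens, function(tok) {
    stems <- lookup[[tolower(tok)]]
    if (is.null(stems)) character(0) else stems
  })
}

# Same logic as myStem, with the stemmer passed in as an argument.
stem_with_fallback <- function(x, stemmer = fake_stem) {
  res <- stemmer(x)
  idx <- which(lengths(res) == 0)   # tokens with no stem at all
  if (length(idx) > 0)
    res[idx] <- x[idx]              # fall back to the original token
  # keep the last candidate stem for every token
  vapply(res, function(s) s[length(s)], character(1))
}

stemmed <- stem_with_fallback(c("Thanks", "lukeA", "played", "help"))
print(stemmed)
# [1] "thank" "lukeA" "play"  "help"
```

Using vapply() instead of sapply() is a small design choice: it guarantees a character vector with exactly one stem per token, even for empty input.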