tm package: stemCompletion not working

Asked: 2017-01-16 10:10:04

Tags: r text-mining tm stemming text-analysis

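I have some simple code that performs text analysis. Before creating the DTM, I apply stemCompletion. However, I don't understand the result: either I am doing something wrong, or this is simply how the function behaves.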

I referred to this link for help: text-mining-with-the-tm-package-word-stemming

The problem I see is that after stem completion my DTM shrinks and no longer contains the tokens at all (it returns only 'content' and 'meta').

My code and output:

texts <- c("i am member of the XYZ association",
           "apply for our open associate position", 
           "xyz memorial lecture takes place on wednesday", 
           "vote for the most popular lecturer")

myCorpus <- Corpus(VectorSource(texts))
myCorpus <- tm_map(myCorpus, content_transformer(tolower))
myCorpus <- tm_map(myCorpus, removePunctuation) 
myCorpus <- tm_map(myCorpus, removeNumbers)
removeURL <- function(x) gsub("http[[:alnum:]]*", "", x)
myCorpus <- tm_map(myCorpus, content_transformer(removeURL))  #??
myCorpusCopy <- myCorpus
myCorpus <- tm_map(myCorpus, stemDocument)

for (i in 1:4) {
  cat(paste("[[", i, "]] ", sep = ""))
  writeLines(as.character(myCorpus[[i]]))
}

Output:
  [[1]] i am member of the xyz associ
  [[2]] appli for our open associ posit
  [[3]] xyz memori lectur take place on wednesday
  [[4]] vote for the most popular lectur


myCorpus <- tm_map(myCorpus, stemCompletion, dictionary = myCorpusCopy)
for (i in 1:4) {
  cat(paste("[[", i, "]] ", sep = ""))
  writeLines(as.character(myCorpus[[i]]))
}

Output:
  [[1]] content
  meta
  [[2]] content
  meta
  [[3]] content
  meta
  [[4]] content
  meta

myCorpus <- tm_map(myCorpus, PlainTextDocument)

dtm <- DocumentTermMatrix(myCorpus, control = list(weighting = weightTf))
dtm
inspect(dtm)

Output:
  > inspect(dtm)
  <<DocumentTermMatrix (documents: 4, terms: 2)>>
    Non-/sparse entries: 8/0
  Sparsity           : 0%
  Maximal term length: 7
  Weighting          : term frequency (tf)

  Terms
  Docs           content meta
  character(0)       1    1
  character(0)       1    1
  character(0)       1    1
  character(0)       1    1

Expected output: stemming and stem completion both run successfully. I am using tm 0.6.

1 Answer:

Answer 0 (score: 0)

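You are using the function incorrectly. stemCompletion() expects a character vector of stems as its first argument, so it cannot be applied to whole documents through tm_map(). Here is how it works: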

library(tm)  # stemming = TRUE below also needs the SnowballC package installed
texts <- c("i am member of the XYZ association",
           "apply for our open associate position", 
           "xyz memorial lecture takes place on wednesday", 
           "vote for the most popular lecturer")
corp <- Corpus(VectorSource(texts))
tdm <- TermDocumentMatrix(corp, control = list(stemming = TRUE))
Terms(tdm)
#  [1] "appli"     "associ"    "for"       "lectur"    "member"    "memori"    "most"      "open"     
#  [9] "our"       "place"     "popular"   "posit"     "take"      "the"       "vote"      "wednesday"
# [17] "xyz" 
stemCompletion(Terms(tdm), corp)
# appli      associ         for      lectur      member      memori        most        open 
#    "" "associate"       "for"   "lecture"    "member"  "memorial"      "most"      "open" 
#   our       place     popular       posit        take         the        vote   wednesday 
# "our"     "place"   "popular"  "position"     "takes"       "the"      "vote" "wednesday" 
#   xyz 
# "xyz"
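
If the completed words are needed inside each document (for example, to build a DTM afterwards), one possible approach is to wrap stemCompletion() in a function that splits each text into stems, completes them, and pastes them back together. The following is only a sketch under some assumptions: `complete_text` is a hypothetical helper, not part of tm; the dictionary is built as a plain character vector from the original texts; stems without a completion (such as "appli" above) are kept as they are; and VCorpus() is used so that every document is a PlainTextDocument.

```r
library(tm)  # stemDocument() also needs the SnowballC package installed

texts <- c("i am member of the XYZ association",
           "apply for our open associate position",
           "xyz memorial lecture takes place on wednesday",
           "vote for the most popular lecturer")

# Hypothetical helper (not part of tm): complete every stem in each string.
# `text` is a character vector; `dictionary` is a character vector of full words.
complete_text <- function(text, dictionary) {
  vapply(text, function(one_text) {
    stems <- unlist(strsplit(one_text, "[[:space:]]+"))
    stems <- stems[nzchar(stems)]
    if (length(stems) == 0) return("")
    completed <- unname(stemCompletion(stems, dictionary = dictionary))
    # Assumption: fall back to the stem when no completion is found.
    not_found <- is.na(completed) | !nzchar(completed)
    completed[not_found] <- stems[not_found]
    paste(completed, collapse = " ")
  }, character(1), USE.NAMES = FALSE)
}

# Dictionary of unstemmed words, taken from the lower-cased original texts.
dictionary_words <- unique(unlist(strsplit(tolower(texts), "[[:space:]]+")))

corpus <- VCorpus(VectorSource(texts))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, stemDocument)
# content_transformer() keeps each element a document, so the DTM step works.
corpus <- tm_map(corpus, content_transformer(complete_text),
                 dictionary = dictionary_words)

# Collect the completed texts, mirroring the loop in the question.
completed_texts <- vapply(seq_along(texts),
                          function(i) as.character(corpus[[i]]),
                          character(1))
print(completed_texts)

dtm <- DocumentTermMatrix(corpus)
print(Terms(dtm))
```

Because the helper is wrapped in content_transformer(), the corpus still holds documents after completion, which should avoid the 'content' / 'meta' terms seen in the question. Which completion is chosen when several words share a stem (for example "lecture" and "lecturer") depends on the `type` argument of stemCompletion(), which this sketch leaves at its default.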