How do I count the frequency of specific words in my CSV file in R?

Time: 2018-06-28 09:11:24

Tags: r tm

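My CSV file contains only reviews (rows only, no columns). I want to count how often particular words, such as "love", occur in the file. I don't want the frequency of every word; I only want to know how many times these particular words appear in the CSV document. I have tried the code below, but it gives me the frequency of every word, which is not what I want. Can anyone help me count the frequency of a specific word or of a specific list of words?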

library(tm)

# Read the reviews and build a corpus
texts <- read.csv("./Data/fast food/Domino's/Domino's veg pizza.csv", stringsAsFactors = FALSE)
docs <- Corpus(VectorSource(texts))

# Replace selected characters with a space
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
docs <- tm_map(docs, toSpace, "/")
docs <- tm_map(docs, toSpace, "@")
docs <- tm_map(docs, toSpace, "\\|")

# Normalise the text
docs <- tm_map(docs, content_transformer(tolower))
docs <- tm_map(docs, removeNumbers)
docs <- tm_map(docs, removeWords, stopwords("english"))
docs <- tm_map(docs, removeWords, c("blabla1", "blabla2"))
docs <- tm_map(docs, removePunctuation)
docs <- tm_map(docs, stripWhitespace)

cor <- Corpus(VectorSource(texts$Reviews))

# Build the term-document matrix and list every term by frequency
dtm <- TermDocumentMatrix(docs)
m <- as.matrix(dtm)
v <- sort(rowSums(m), decreasing = TRUE)
d <- data.frame(word = names(v), freq = v)
head(d, 20)
findFreqTerms(dtm, 10)

1 Answer:

Answer 0: (score: 0)

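If you want to count how often specific words occur, you can use the dictionary option in the control list of DocumentTermMatrix (TermDocumentMatrix accepts it as well). Only the terms listed in the dictionary are counted.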

In your case:

my_words <- c("love", "like", "best")
dtm <- TermDocumentMatrix(docs, control = list(dictionary = my_words))
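To turn the per-document counts into a single total per word, the rows of the term-document matrix can be summed. The sketch below is only an illustration: the `reviews` vector is made-up sample data standing in for the review column of the CSV, and the cleaning is reduced to lower-casing and punctuation removal.

```r
library(tm)

# Hypothetical sample data standing in for the review column of the CSV
reviews <- c("I love this pizza",
             "Best pizza ever, love it",
             "Not what I like")

docs <- VCorpus(VectorSource(reviews))
docs <- tm_map(docs, content_transformer(tolower))
docs <- tm_map(docs, removePunctuation)

# Count only the words of interest
my_words <- c("love", "like", "best")
tdm <- TermDocumentMatrix(docs, control = list(dictionary = my_words))

# Terms are rows in a TermDocumentMatrix, so rowSums gives one total per word
totals <- rowSums(as.matrix(tdm))
print(totals)
```

Punctuation is removed before the matrix is built because a token such as "ever," with a trailing comma would otherwise not match a dictionary entry.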

Here is a reproducible example:

library(tm)

data("crude")
crude <- as.VCorpus(crude)
crude <- tm_map(crude, stripWhitespace)
crude <- tm_map(crude, removePunctuation)
crude <- tm_map(crude, content_transformer(tolower))
crude <- tm_map(crude, removeWords, stopwords("english"))

my_words <- c("oil", "contract")

my_dtm <- DocumentTermMatrix(crude, control = list(dictionary = my_words))

inspect(my_dtm)
<<DocumentTermMatrix (documents: 20, terms: 2)>>
Non-/sparse entries: 24/16
Sparsity           : 40%
Maximal term length: 8
Weighting          : term frequency (tf)
Sample             :
     Terms
Docs  contract oil
  127        2   5
  144        0  12
  236        0   7
  246        0   5
  248        0   9
  273        0   5
  349        0   4
  352        0   5
  353        0   4
  502        0   5
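The matrix above holds one row per document. If a single total per word is what is needed, the columns can be summed. The following sketch repeats the preparation steps so that it runs on its own; the name `totals` is only illustrative.

```r
library(tm)

data("crude")
crude <- tm_map(crude, stripWhitespace)
crude <- tm_map(crude, removePunctuation)
crude <- tm_map(crude, content_transformer(tolower))
crude <- tm_map(crude, removeWords, stopwords("english"))

my_words <- c("oil", "contract")
my_dtm <- DocumentTermMatrix(crude, control = list(dictionary = my_words))

# Terms are columns in a DocumentTermMatrix, so colSums gives one total per word
totals <- colSums(as.matrix(my_dtm))
print(totals)
```

With a TermDocumentMatrix the terms are rows instead, so `rowSums` would be used in place of `colSums`.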