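My CSV file contains only reviews (rows only, no columns). I want to count how often certain words, such as "love", occur in the file. I do not want the frequency of every word; I only want to know how many times these three words appear in the CSV document. I have tried the code below, but it gives me the frequency of every word, which is not what I want. Can anyone help me count the frequency of a specific word or a specific list of words?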
library(tm)

# Read the reviews; the review text is in the Reviews column
texts <- read.csv("./Data/fast food/Domino's/Domino's veg pizza.csv", stringsAsFactors = FALSE)
# Build the corpus from the review column so that each review is one document
docs <- Corpus(VectorSource(texts$Reviews))
# Replace separator characters with spaces
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
docs <- tm_map(docs, toSpace, "/")
docs <- tm_map(docs, toSpace, "@")
docs <- tm_map(docs, toSpace, "\\|")
# Normalise the text: lowercase, drop numbers, stop words and punctuation
docs <- tm_map(docs, content_transformer(tolower))
docs <- tm_map(docs, removeNumbers)
docs <- tm_map(docs, removeWords, stopwords("english"))
docs <- tm_map(docs, removeWords, c("blabla1", "blabla2"))
docs <- tm_map(docs, removePunctuation)
docs <- tm_map(docs, stripWhitespace)
# Term-document matrix over every term in the corpus
dtm <- TermDocumentMatrix(docs)
m <- as.matrix(dtm)
# Total frequency of each term, most frequent first
v <- sort(rowSums(m), decreasing = TRUE)
d <- data.frame(word = names(v), freq = v)
head(d, 20)
findFreqTerms(dtm, 10)
Answer 0 (score: 0)
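If you only want the frequency of specific words, you can use the dictionary option in the control argument of DocumentTermMatrix (TermDocumentMatrix accepts it in the same way).
In your case: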
my_words <- c("love", "like" , "best")
dtm <- TermDocumentMatrix(docs, control = list(dictionary = my_words))
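Because a TermDocumentMatrix stores terms as rows and documents as columns, the total count of each word across all reviews is its row sum. The following is a minimal sketch under stated assumptions: the `reviews` vector is made-up sample text (already lowercase and without punctuation), not the asker's data, and `tdm` and `totals` are illustrative names.

```r
library(tm)

# Hypothetical reviews standing in for the CSV column
# (already lowercase and free of punctuation)
reviews <- c("i love this pizza love it",
             "best pizza i like it",
             "not the best")
docs <- VCorpus(VectorSource(reviews))

# Only the words in the dictionary are counted
my_words <- c("love", "like", "best")
tdm <- TermDocumentMatrix(docs, control = list(dictionary = my_words))

# Terms are rows in a TermDocumentMatrix, so row sums give
# one corpus-wide total per dictionary word
totals <- rowSums(as.matrix(tdm))
print(totals)
```

With real review text, the cleaning steps from the question (lowercasing, removing punctuation) would normally run before the matrix is built, so that tokens such as "love," still match the dictionary entry.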
Here is a reproducible example:
library(tm)
data("crude")
crude <- as.VCorpus(crude)
crude <- tm_map(crude, stripWhitespace)
crude <- tm_map(crude, removePunctuation)
crude <- tm_map(crude, content_transformer(tolower))
crude <- tm_map(crude, removeWords, stopwords("english"))
my_words <- c("oil", "contract")
my_dtm <- DocumentTermMatrix(crude, control = list(dictionary = my_words))
inspect(my_dtm)
<<DocumentTermMatrix (documents: 20, terms: 2)>>
Non-/sparse entries: 24/16
Sparsity : 40%
Maximal term length: 8
Weighting : term frequency (tf)
Sample :
     Terms
Docs  contract oil
  127        2   5
  144        0  12
  236        0   7
  246        0   5
  248        0   9
  273        0   5
  349        0   4
  352        0   5
  353        0   4
  502        0   5
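The inspect() output lists counts per document and shows only a sample of the rows. To get a single total per word, which is what the question asks for, the columns of the DocumentTermMatrix can be summed. This is a sketch rather than part of the original answer: it repeats the preprocessing from the example so that it runs on its own, and `word_totals` is an illustrative name.

```r
library(tm)
data("crude")

# Same preprocessing as in the example above
crude <- tm_map(crude, stripWhitespace)
crude <- tm_map(crude, removePunctuation)
crude <- tm_map(crude, content_transformer(tolower))
crude <- tm_map(crude, removeWords, stopwords("english"))

my_words <- c("oil", "contract")
my_dtm <- DocumentTermMatrix(crude, control = list(dictionary = my_words))

# Documents are rows and terms are columns in a DocumentTermMatrix,
# so column sums give the total count of each word over all documents
word_totals <- colSums(as.matrix(my_dtm))
print(word_totals)
```

Converting with as.matrix() is fine for a small dictionary like this one. For very large corpora, summing the sparse matrix directly (for example with the slam package that tm builds on) would avoid creating a dense matrix.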