I have the following sample data frame:
comments date
1 i want to hear that 2010-11-01
2 lets get started 2008-03-25
3 i want to get started 2007-03-14
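I want to get the word frequencies across all documents, and also store the number of the document (1, 2 or 3) in which each word occurs.
The output should be a matrix with the words in one column, their frequencies in a second column, and the document number in a third column.
I tried the tm package, but it did not work for my case.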
Answer 0: (score: 1)
Using the tm package together with tidyr:
library(tm)
library(tidyr)
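df <- data.frame(id = c(1, 2, 3),
                 comments = c("that is that", "lets get started", "i want to get started"),
                 date = as.Date(c("2010-11-01", "2008-03-25", "2007-03-14")), stringsAsFactors = FALSE)
# Build a corpus with one document per comment
corpus <- Corpus(VectorSource(df$comments))
# Keep words of any length (by default tm drops words shorter than 3 characters)
dtm <- DocumentTermMatrix(corpus, control=list(wordLengths=c(1, Inf)))
# One row per document, one column per word, plus the id and date columns
my_data <- data.frame(as.matrix(dtm), id = df$id, date = df$date)
# Reshape to long format: one row per (document, word) pair
outcome <- gather(my_data, words, freq, -id, -date)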
head(outcome)
id date words freq
1 1 2010-11-01 get 0
2 2 2008-03-25 get 1
3 3 2007-03-14 get 1
4 1 2010-11-01 i 0
5 2 2008-03-25 i 0
6 3 2007-03-14 i 1
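The long table above also lists (document, word) pairs with a frequency of 0. If only the words that actually occur in each document are wanted, a minimal base R sketch along the following lines may help. It is not part of the original answer: it assumes a plain whitespace split via strsplit instead of tm, and the names tokens, long and counts are illustrative only.

```r
# Same toy comments as in the answer above
comments <- c("that is that", "lets get started", "i want to get started")

# Split each comment on whitespace: one character vector per document
tokens <- strsplit(comments, "\\s+")

# Long format: one row per token, tagged with its document number
long <- data.frame(
  id   = rep(seq_along(tokens), lengths(tokens)),
  word = unlist(tokens),
  stringsAsFactors = FALSE
)

# Cross-tabulate word by document; this also yields zero-count pairs
counts <- as.data.frame(table(word = long$word, id = long$id),
                        responseName = "freq", stringsAsFactors = FALSE)
counts$id <- as.integer(counts$id)

# Keep only the pairs that actually occur, ordered by document and word
counts <- counts[counts$freq > 0, ]
counts <- counts[order(counts$id, counts$word), ]
counts
```

Applying the same freq > 0 filter to the outcome data frame from the answer should have a similar effect, while keeping the date column.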
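Answer 1: (score: 1)
I have been using data.table together with stringi a lot lately, so I thought I would throw this out there. It is similar to a dplyr solution, but may give a nice speed boost on larger data sets.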
dat <- data.frame(
    comments = c("i want to hear that", "lets get started", "i want to get started"),
    date = as.Date(c("2010-11-01", "2008-03-25", "2007-03-14")), stringsAsFactors = FALSE
)
library(data.table); library(stringi)
# Convert to a data.table by reference
setDT(dat)
# Overall word frequencies: extract all words, then count rows per word
dat[, list(word = unlist(stri_extract_all_words(comments)))][,
    list(freq=.N), by = 'word'][order(word),]
## word freq
## 1: get 2
## 2: hear 1
## 3: i 2
## 4: lets 1
## 5: started 2
## 6: that 1
## 7: to 2
## 8: want 2
# Word frequencies per date: extract words within each date, then count per (date, word)
dat[, list(word = unlist(stri_extract_all_words(comments))), by="date"][,
    list(freq=.N), by = c('date', 'word')][order(date, word),]
## date word freq
## 1: 2007-03-14 get 1
## 2: 2007-03-14 i 1
## 3: 2007-03-14 started 1
## 4: 2007-03-14 to 1
## 5: 2007-03-14 want 1
## 6: 2008-03-25 get 1
## 7: 2008-03-25 lets 1
## 8: 2008-03-25 started 1
## 9: 2010-11-01 hear 1
## 10: 2010-11-01 i 1
## 11: 2010-11-01 that 1
## 12: 2010-11-01 to 1
## 13: 2010-11-01 want 1
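The question asks for the document number rather than the date. The same idea can be sketched with a row index as the grouping key. The id column and the name by_doc are assumptions added here for illustration, they are not part of the original answer.

```r
library(data.table)
library(stringi)

dat <- data.table(
  comments = c("i want to hear that", "lets get started", "i want to get started")
)

# Hypothetical document number: the row index of each comment
dat[, id := .I]

# Extract the words within each document, then count each (id, word) pair
by_doc <- dat[, list(word = unlist(stri_extract_all_words(comments))), by = "id"][
  , list(freq = .N), by = c("id", "word")][order(id, word)]
by_doc
```

Grouping by id instead of date also keeps documents apart when two comments happen to share the same date.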
Answer 2: (score: 0)
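library(dplyr)
library(tidyr)
library(stringi)
# Word counts per (word, date) pair
word__date =
  tibble(  # data_frame() is deprecated in favour of tibble()
    comments = c("i want to hear that", "lets get started", "i want to get started"),
    date = c("2010-11-01", "2008-03-25", "2007-03-14") %>% as.Date ) %>%
  # Split each comment on single spaces into a list column of words
  mutate(word = comments %>% stri_split_fixed(pattern = " ")) %>%
  # Expand the list column: one row per word
  unnest(word) %>%
  group_by(word, date) %>%
  summarize(count = n())
# Overall count per word: word__date is still grouped by word,
# so this sums over the dates
word =
  word__date %>%
  summarize(count = sum(count))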
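The second summarize works without another group_by because summarize drops only the last grouping variable (date), so the intermediate result is still grouped by word. A small sketch of that behaviour, using made-up toy rows and the illustrative names tokens, by_word_date and by_word:

```r
library(dplyr)

# Toy long-format data: one row per token (values are illustrative only)
tokens <- tibble(
  word = c("get", "get", "i", "i", "lets"),
  date = as.Date(c("2007-03-14", "2008-03-25", "2007-03-14",
                   "2010-11-01", "2008-03-25"))
)

# Count rows per (word, date); summarize peels off the last group (date)
by_word_date <- tokens %>%
  group_by(word, date) %>%
  summarize(count = n())

# Only "word" remains as a grouping variable
group_vars(by_word_date)

# Summing within the remaining groups gives one row per word
by_word <- by_word_date %>%
  summarize(count = sum(count))
by_word
```

Calling ungroup() or stating group_by(word) explicitly before the second summarize would make the intent more visible to readers.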