I want to create a document-term matrix using native R (without additional packages such as tm). The data is structured as follows:
Doc1: the test was to test the test
Doc2: we did prepare the exam to test the exam
Doc3: was the test the exam
Doc4: the exam we did prepare was to test the test
Doc5: we were successful so we all passed the exam
This is what I want to achieve:
  Term   Doc1 Doc2 Doc3 Doc4 Doc5 DF
1 all       0    0    0    0    1  1
2 did       0    1    0    1    0  2
3 exam      0    2    1    1    1  4
4 passed    0    0    0    0    1  1
Answer 0 (score: 1)
Here is one approach, but why not use the tm package?
## Your data
dat <- data.frame(
  doc = factor(c("Doc1", "Doc2", "Doc3", "Doc4", "Doc5")),
  text = c("the test was to test the test",
           "we did prepare the exam to test the exam",
           "was the test the exam",
           "the exam we did prepare was to test the test",
           "we were successful so we all passed the exam"),
  stringsAsFactors = FALSE
)
## Function to turn a list of character vectors into a table of counts
## (one row per list element, one column per unique term)
mtabulate <- function(vects) {
  lev <- sort(unique(unlist(vects)))
  dat <- do.call(rbind, lapply(vects, function(x, lev) {
    tabulate(factor(x, levels = lev, ordered = TRUE),
             nbins = length(lev))
  }, lev = lev))
  colnames(dat) <- lev  # lev is already sorted
  data.frame(dat, check.names = FALSE)
}
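## Split each document into lowercase words
out <- lapply(split(dat$text, dat$doc), function(x) {
  unlist(strsplit(tolower(x), " "))
})
## Transpose so that terms are rows and documents are columns
t(mtabulate(out))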
##            Doc1 Doc2 Doc3 Doc4 Doc5
## all           0    0    0    0    1
## did           0    1    0    1    0
## exam          0    2    1    1    1
## passed        0    0    0    0    1
## prepare       0    1    0    1    0
## so            0    0    0    0    1
## successful    0    0    0    0    1
## test          3    1    1    2    0
## the           2    2    2    2    1
## to            1    1    0    1    0
## was           1    0    1    1    0
## we            0    1    0    1    2
## were          0    0    0    0    1
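The matrix above has no DF column, which the question's target layout includes. Assuming DF means document frequency (the number of documents in which a term occurs at least once), it could be added in base R roughly as follows. This is only a sketch: the names `docs`, `tokens`, `terms`, `counts`, and `result` are made up for illustration, and the five documents are retyped here so the block runs on its own.

```r
## Sketch: term-by-document counts plus a document-frequency (DF) column.
## All object names below are illustrative, not from the original answer.
docs <- c(
  Doc1 = "the test was to test the test",
  Doc2 = "we did prepare the exam to test the exam",
  Doc3 = "was the test the exam",
  Doc4 = "the exam we did prepare was to test the test",
  Doc5 = "we were successful so we all passed the exam"
)

## Lowercase each document and split it on single spaces
tokens <- lapply(docs, function(x) strsplit(tolower(x), " ", fixed = TRUE)[[1]])

## Sorted vocabulary across all documents
terms <- sort(unique(unlist(tokens)))

## Count every term in every document; the result has one column per document
counts <- vapply(
  tokens,
  function(x) tabulate(match(x, terms), nbins = length(terms)),
  integer(length(terms))
)
rownames(counts) <- terms

## DF = number of documents with a non-zero count for the term
df <- rowSums(counts > 0)

result <- data.frame(
  Term = terms,
  counts,
  DF = unname(df),
  row.names = NULL,
  stringsAsFactors = FALSE
)
result
```

With the five documents above, the rows for all, did, exam, and passed should carry DF values of 1, 2, 4, and 1, matching the target shown in the question. Splitting on a single space is assumed to be enough for this data; text with punctuation or repeated whitespace would need a different split pattern.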