R tm: building a TermDocumentMatrix from a sparse matrix

Asked: 2015-08-20 18:38:27

Tags: r bash text-mining tm

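I have a collection of books in txt format and would like to apply some procedures from the tm R library to them. However, I prefer to clean the text in bash rather than in R, because it is much faster.

Suppose I can obtain a data.frame from bash such as: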

book term frequency
--------------------
1     the      10
1     zoo      2
2     animal   2
2     car      3
2     the      20

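I know that TermDocumentMatrices are really sparse matrices with metadata. In fact, I can create a sparse matrix from a TDM by passing the TDM's i, j and v entries as the i, j and x arguments of the sparseMatrix function. Please help me if you know how to do the reverse, that is, how to build a TDM from the three columns of the data.frame above. Thanks!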
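For reference, the forward direction described above (TDM to sparse matrix) might look like the following minimal sketch. The two-document corpus and the names `corp`, `tdm` and `sm` are made up for illustration, and the Matrix package is assumed to be installed.

```r
library(tm)
library(Matrix)

# Hypothetical toy corpus, used only to have a TDM to convert
corp <- VCorpus(VectorSource(c("the zoo the", "animal car the")))
tdm <- TermDocumentMatrix(corp)

# A TDM stores its non-zero cells as triplets:
#   i = term (row) index, j = document (column) index, v = count
sm <- sparseMatrix(
  i = tdm$i,
  j = tdm$j,
  x = tdm$v,
  dims = dim(tdm),
  dimnames = list(Terms(tdm), Docs(tdm))
)
sm
```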

1 Answer:

Answer 0: (score: 2)

You could try

library(tm)
library(reshape2)
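# Sample input in the layout shown in the question (element 2 is the dashed rule)
txt <- c(
  "book term frequency",
  "--------------------",
  "1     the      10",
  "1     zoo      2",
  "2     animal   2",
  "2     car      3",
  "2     the      20"
)
df <- read.table(header = TRUE, text = txt[-2])  # drop the dashed rule
# Reshape to wide form: one row per book, one column per term, 0 where a term is absent
dfwide <- dcast(data = df, book ~ term, value.var = "frequency", fill = 0)
mat <- as.matrix(dfwide[, -1])
# Label rows with the actual book ids and columns with the terms
dimnames(mat) <- list(book = as.character(dfwide$book), term = colnames(mat))
# A TDM has terms in rows and documents in columns, so transpose before converting
(tdm <- as.TermDocumentMatrix(t(mat), weighting = weightTf))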
# <<TermDocumentMatrix (terms: 4, documents: 2)>>
#   Non-/sparse entries: 5/3
# Sparsity           : 38%
# Maximal term length: 6
# Weighting          : term frequency (tf)

as.matrix(tdm)
#        Docs
# Terms     1  2
# animal    0  2
# car       0  3
# the      10 20
# zoo       2  0
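
The reshape above goes through a dense book-by-term matrix, which may get large with many books and a big vocabulary. As a possible alternative, here is a minimal sketch that builds the triplet representation directly from the three columns with `simple_triplet_matrix` from the slam package (which tm imports). The names `terms`, `docs`, `stm` and `tdm_sparse` are illustrative, and the data.frame is retyped by hand to keep the example self-contained.

```r
library(tm)
library(slam)

# Same data as in the question, in long (book, term, frequency) form
df <- data.frame(
  book = c(1, 1, 2, 2, 2),
  term = c("the", "zoo", "animal", "car", "the"),
  frequency = c(10, 2, 2, 3, 20),
  stringsAsFactors = FALSE
)

# Fix the row (term) and column (document) order once
terms <- sort(unique(df$term))
docs <- sort(unique(df$book))

# Each data.frame row becomes one non-zero cell: (term index, book index, count)
stm <- simple_triplet_matrix(
  i = match(df$term, terms),
  j = match(df$book, docs),
  v = df$frequency,
  nrow = length(terms),
  ncol = length(docs),
  dimnames = list(Terms = terms, Docs = as.character(docs))
)

# Attach the TDM class and term-frequency weighting without densifying
tdm_sparse <- as.TermDocumentMatrix(stm, weighting = weightTf)
as.matrix(tdm_sparse)
```

Calling `as.matrix()` at the end is only a check on small data; on a real corpus you would normally keep working with the sparse object.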