Question

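I want to use findAssocs from the tm package, but it only works when the corpus contains more than one document. What I have instead is a single-column data frame in which each row contains the text of a tweet. Is it possible to convert it into a corpus that treats each row as a new document?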

VCorpus (documents: 1, metadata (corpus/indexed): 0/0)
TermDocumentMatrix (terms: 71, documents: 1)

I have 10 rows of data and would like to convert them into

VCorpus (documents: 10, metadata (corpus/indexed): 0/0)
TermDocumentMatrix (terms: 71, documents: 10)

Answer 1

I suggest reading the tm vignette before you continue. The answer to your specific question follows.

Create sample data:

txt <- strsplit("I wanted to use the findAssocs of the tm package. but it works only when there are more than one documents in the corpus. I have a data frame table which has one column and each row has a tweet text. Is it possible to convert the into a corpus which takes each row as a new document?", split=" ")[[1]]
data <- data.frame(text=txt, stringsAsFactors=FALSE)
data[1:5, ]

Import your data into a "Source", turn your "Source" into a "Corpus", and then build a TDM from your "Corpus":

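library(tm)
# Each data frame row becomes one document. Recent tm releases expect
# DataframeSource input to have "doc_id" and "text" columns.
tdm <- TermDocumentMatrix(Corpus(DataframeSource(data)))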

show(tdm)
#A term-document matrix (35 terms, 58 documents)
#
#Non-/sparse entries: 43/1987
#Sparsity           : 98%
#Maximal term length: 10 
#Weighting          : term frequency (tf)

str(tdm)
#List of 6
# $ i       : int [1:43] 32 31 28 12 28 21 3 35 20 33 ...
# $ j       : int [1:43] 2 4 5 6 8 10 11 13 14 15 ...
# $ v       : num [1:43] 1 1 1 1 1 1 1 1 1 1 ...
# $ nrow    : int 35
# $ ncol    : int 58
# $ dimnames:List of 2
#  ..$ Terms: chr [1:35] "and" "are" "but" "column" ...
#  ..$ Docs : chr [1:58] "1" "2" "3" "4" ...
# - attr(*, "class")= chr [1:2] "TermDocumentMatrix" "simple_triplet_matrix"
# - attr(*, "Weighting")= chr [1:2] "term frequency" "tf"
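
As an alternative to DataframeSource, the same row-per-document idea can be sketched with VectorSource, which treats each element of a character vector as its own document. This is a minimal sketch, not the original answer's method: the `tweets` data frame and its contents are hypothetical stand-ins for the asker's data, and the correlation limit of 0.1 is an arbitrary choice.

```r
library(tm)

# Hypothetical stand-in for the single-column tweet data frame
tweets <- data.frame(
  text = c("the tm package builds a corpus",
           "each tweet becomes one document",
           "a corpus holds many documents"),
  stringsAsFactors = FALSE
)

# VectorSource treats every element of a character vector as one document
corpus <- VCorpus(VectorSource(tweets$text))
tdm <- TermDocumentMatrix(corpus)

length(corpus)  # number of documents, one per data frame row
ncol(tdm)       # the same count, seen as columns of the TDM

# With more than one document, term associations can be computed
findAssocs(tdm, "corpus", 0.1)
```

Because the column is passed as a plain character vector, this approach does not depend on how the data frame's columns are named.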
