Question

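I am trying to plot the TF-IDF of bigrams from a series of documents over a period of time, in order to detect trends in word importance. The texts come from a SQL Server dataset with two columns: one holds the event text that I want to tokenize, and the other holds the time period the text belongs to (January 10, 2010, and so on). I could query SQL several times and build a separate corpus for each period, but that is inefficient. I would rather run my query once and load everything into a single dataset and one unified corpus.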

I have some pseudocode, but I am not sure it is the right approach.

While Loop

Get subset of unified corpus for a given month
Convert the subset to dtm
Calculate tf-idf
Save tf-idf value to a list (hash table) with a key of (I am not sure yet)

Until last month

Plot the tf-idf for a given bi-gram over the month
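
One possible reading of the loop above, as a minimal base-R sketch rather than a tm-based solution. Everything named here is an assumption for illustration: the `events` data frame stands in for the SQL result, the `text` and `period` columns and the helpers `bigrams` and `tfidf_by_term` are hypothetical, tf-idf is taken as (count / document length) * log2(N / document frequency), and each bigram is summarised per period by its mean over that period's documents. `split()` does the subsetting by period, and the period label serves as the list key.

```r
# Hypothetical stand-in for the SQL result: one row per event
events <- data.frame(
  text   = c("server disk failure", "disk failure again",
             "network outage reported", "server disk failure"),
  period = c("2010-01", "2010-01", "2010-02", "2010-02"),
  stringsAsFactors = FALSE
)

# Split one string into lower-case word bigrams
bigrams <- function(s) {
  w <- strsplit(tolower(s), "[^a-z]+")[[1]]
  w <- w[nzchar(w)]
  if (length(w) < 2) return(character(0))
  paste(head(w, -1), tail(w, -1))
}

# Mean tf-idf per bigram over the documents of a single period
tfidf_by_term <- function(texts) {
  docs  <- lapply(texts, bigrams)
  terms <- sort(unique(unlist(docs)))
  # Term-by-document count matrix (rows: bigrams, columns: documents)
  counts <- vapply(docs,
                   function(d) as.numeric(table(factor(d, levels = terms))),
                   numeric(length(terms)))
  counts <- matrix(counts, nrow = length(terms), dimnames = list(terms, NULL))
  tf  <- sweep(counts, 2, pmax(colSums(counts), 1), "/")  # normalise by document length
  idf <- log2(ncol(counts) / rowSums(counts > 0))         # rarer bigrams weigh more
  rowMeans(tf * idf)                                      # one value per bigram
}

# One entry per period, keyed by the period label
tfidf_by_period <- lapply(split(events$text, events$period), tfidf_by_term)
```

With this layout, `tfidf_by_period[["2010-01"]]["server disk"]` looks up one bigram in one month. The same `split()` call could feed a tm corpus per period instead of the hand-rolled weighting.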

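This is what I have so far, and I have no idea how to proceed. How do I subset the unified corpus into individual corpora by time period? Or how do I associate the month with the corpus? And assuming the logic below is the right way to solve my problem, once I have a list of tf-idf values, how do I plot the tf-idf for a given bigram?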

Thanks

library(tm)     # VCorpus, tm_map, TermDocumentMatrix, weightTfIdf
library(RWeka)  # NGramTokenizer, Weka_control

# list_text is assumed to hold one character vector of event texts per period
# VCorpus (not Corpus) so that the custom tokenizer below is used
list_corpora <- lapply(list_text, function(txt) VCorpus(VectorSource(txt)))

skipWords <- function(x) removeWords(x, stopwords("english"))
# tm_reduce applies the list from right to left, so lower-casing runs first
funcs <- list(stripWhitespace, skipWords, removeNumbers, removePunctuation, content_transformer(tolower))
list_corpora <- lapply(list_corpora, function(corp) tm_map(corp, FUN = tm_reduce, tmFuns = funcs))

BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))

list_dtms <- lapply(list_corpora, function(corp) TermDocumentMatrix(corp, control = list(tokenize = BigramTokenizer)))

# weightTfIdf takes the term-document matrix, not the corpus
list_tfxidf <- lapply(list_dtms, weightTfIdf)
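
To get from a list of weighted matrices to something plottable, each period has to be collapsed to one number per bigram. A hedged sketch of that step follows. It uses plain matrices as hypothetical stand-ins for `as.matrix(list_tfxidf[[i]])` (rows are bigrams, columns are documents), assumes the list is named by period, and takes the mean over documents as the summary, which is only one possible choice. The helper names `bigram_score` and `bigram_trend` are made up for illustration.

```r
# Hypothetical stand-ins for the dense form of each period's weighted matrix
weights_by_period <- list(
  "2010-01" = matrix(c(0.0, 0.5, 0.4, 0.0), nrow = 2,
                     dimnames = list(c("disk failure", "server disk"), NULL)),
  "2010-02" = matrix(c(0.2, 0.6), nrow = 1,
                     dimnames = list("disk failure", NULL))
)

# Mean tf-idf of one bigram in one period; NA when the bigram does not occur
bigram_score <- function(m, bigram) {
  if (bigram %in% rownames(m)) mean(m[bigram, ]) else NA_real_
}

# Named numeric vector with one value per period, in list order
bigram_trend <- function(weights, bigram) {
  vapply(weights, bigram_score, numeric(1), bigram = bigram)
}

trend <- bigram_trend(weights_by_period, "disk failure")
```

Returning `NA` for periods where the bigram is absent keeps the vector aligned with the periods, so gaps show up as breaks in a plot instead of shifting later months.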

Plot the TF-IDF value of a bigram over time
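
A minimal sketch of the plotting step with base graphics, under the assumption that the scores for one bigram are already available as a numeric vector named by period in "YYYY-MM" form. The `trend` values and the helper `plot_bigram_trend` are hypothetical and the numbers are invented for illustration.

```r
# Hypothetical per-period scores for a single bigram
trend <- c("2010-01" = 0.20, "2010-02" = 0.40, "2010-03" = 0.10)

plot_bigram_trend <- function(trend, bigram) {
  # Turn "YYYY-MM" labels into Date objects so the x axis is a real time axis
  months <- as.Date(paste0(names(trend), "-01"))
  plot(months, trend, type = "b", xlab = "Month", ylab = "TF-IDF",
       main = sprintf('TF-IDF of "%s" over time', bigram))
  invisible(months)
}

plot_bigram_trend(trend, "disk failure")
```

Using dates rather than a 1..n index means uneven gaps between periods are drawn to scale.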

0 answers: