Question

I have a list of keywords in a text file:

tagSrc <- "20171107 keyword dictionary v2.txt"
tagDictionary <- readLines(tagSrc, encoding="UTF-8")

...and I have a .csv file containing incident reports, which I convert into a VCorpus and then a TermDocumentMatrix:

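library(tm)    # VCorpus, tm_map, TermDocumentMatrix
library(qdap)  # replace_abbreviation, replace_contraction

srcFile <- "20170831 Event Log April 17.csv"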
incidents <- read.csv(srcFile, stringsAsFactors = FALSE)
descriptions <- incidents$Description

desc_source <- VectorSource(descriptions)
desc_corpus <- VCorpus(desc_source)
desc_corpus <- tm_map(desc_corpus, removeNumbers)
desc_corpus <- tm_map(desc_corpus, removeWords, stopwords("en"))

#Various text-cleaning routines
desc_corpus <- tm_map(desc_corpus, content_transformer(replace_abbreviation))
desc_corpus <- tm_map(desc_corpus, content_transformer(replace_contraction))

desc_stem <- tm_map(desc_corpus, content_transformer(stemDocument), language="english")

#Here, the corpus is turned into a TDM
desc_dtm <- TermDocumentMatrix(desc_stem, control = list(dictionary = tagDictionary))

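The last line above gives a TDM that uses only the terms in the keyword list (tagDictionary). Is there a way to also reduce the set of documents, keeping only those that contain one or more of the keywords?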

Answer 1

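After a lot of searching, I found that the best approach is to reduce the number of documents before building the TDM. Running tm_filter over the corpus with a grep-based predicate picks out the matching documents: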

#collapse the search terms in `tagDictionary` into a single regex alternation:
allTags <- paste(tagDictionary, collapse = "|")

#filter the VCorpus (see code in OP), keeping documents that match any tag:
corp_subset <- tm_filter(desc_stem, function(i) any(grepl(allTags, content(i), ignore.case = TRUE)))

#create the TDM from the filtered VCorpus:
desc_dtm <- TermDocumentMatrix(corp_subset)
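
For anyone who wants to try this without the original files, here is a minimal sketch on made-up data. The three sample descriptions, the two keywords, and the names `toy_corpus`, `keep`, `hits`, `toy_subset`, `toy_tdm`, and `tdm_subset` are all hypothetical and not from the question. It differs from the code above in two ways that are my own assumptions about what is wanted: the pattern is wrapped in `\\b` word boundaries so that a keyword does not match inside a longer word, and `tm_index` is used to get the logical vector of matches. It also shows a second route, dropping all-zero columns from a dictionary-restricted TDM, which matches whole tokens rather than substrings.

```r
library(tm)

# Hypothetical stand-ins for incidents$Description and tagDictionary
descriptions <- c("pump failure in unit two",
                  "routine inspection completed",
                  "valve leak near the pump")
tagDictionary <- c("pump", "leak")

toy_corpus <- VCorpus(VectorSource(descriptions))

# One alternation pattern; \\b keeps "pump" from matching e.g. "pumpkin".
# Assumes the keywords contain no regex metacharacters.
allTags <- paste0("\\b(", paste(tagDictionary, collapse = "|"), ")\\b")

# Predicate: TRUE if any line of the document matches any keyword
keep <- function(doc) {
  any(grepl(allTags, content(doc), ignore.case = TRUE, perl = TRUE))
}

hits <- tm_index(toy_corpus, keep)        # logical vector, one entry per document
toy_subset <- tm_filter(toy_corpus, keep) # corpus containing only the matches

# Alternative: build the dictionary-restricted TDM first, then drop
# documents (columns) in which no dictionary term occurs.
toy_tdm <- TermDocumentMatrix(toy_corpus,
                              control = list(dictionary = tagDictionary))
tdm_subset <- toy_tdm[, slam::col_sums(toy_tdm) > 0]
```

The `hits` vector can also be used to subset the original data frame (something like `incidents[hits, ]`) so the other columns of the matching reports stay available. One thing to watch when filtering a stemmed corpus such as `desc_stem`: the keywords probably need to be stemmed the same way, otherwise a keyword like "failures" would not match the stemmed text.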
