R - Select documents from a VCorpus (or TermDocumentMatrix) that contain keywords

Time: 2017-11-09 12:09:03

Tags: r keyword

I have a list of keywords in a text file:

tagSrc <- "20171107 keyword dictionary v2.txt"
tagDictionary <- readLines(tagSrc, encoding="UTF-8")

...and I have a .csv file of incident reports, which I convert to a VCorpus and then to a TermDocumentMatrix:

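library(tm)    # VCorpus, tm_map, TermDocumentMatrix
library(qdap)  # assumed source of replace_abbreviation() and replace_contraction()

srcFile <- "20170831 Event Log April 17.csv"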
incidents <- read.csv(srcFile, stringsAsFactors = FALSE)
descriptions <- incidents$Description

desc_source <- VectorSource(descriptions)
desc_corpus <- VCorpus(desc_source)
desc_corpus <- tm_map(desc_corpus, removeNumbers)
desc_corpus <- tm_map(desc_corpus, removeWords, stopwords("en"))

#Various text-cleaning routines
desc_corpus <- tm_map(desc_corpus, content_transformer(replace_abbreviation))
desc_corpus <- tm_map(desc_corpus, content_transformer(replace_contraction))

desc_stem <- tm_map(desc_corpus, content_transformer(stemDocument), language="english")

#Here, the corpus is turned into a TDM
desc_dtm <- TermDocumentMatrix(desc_stem, control = list(dictionary = tagDictionary))

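The last line above gives a TDM that uses only the terms in the keyword list (tagDictionary). Is there a way to reduce the set of documents as well, keeping only those that contain one or more of the keywords?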

1 Answer:

Answer 0: (score: 0)

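After a lot of searching, I found that the best approach is to reduce the number of documents before building the TDM. Running tm_filter over the corpus with a grep search picks out the matching documents: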

#collapse the search terms in `tagDictionary` into a single string:
allTags<-paste(tagDictionary, collapse='|')

#filter the VCorpus (see code in OP):
corp_subset <- tm_filter(desc_stem, function(i) any(grepl(allTags, content(i), ignore.case = TRUE)))

#create the TDM from the filtered VCorpus:
desc_dtm <- TermDocumentMatrix(corp_subset)
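
For anyone who wants to try the filter without the original files, here is a minimal sketch on made-up data. The three descriptions and the two keywords are hypothetical stand-ins for incidents$Description and tagDictionary. The \\b word boundaries are an optional refinement that is not part of the code above: they stop a keyword from matching inside a longer word, but on an unstemmed corpus they also stop "pump" from matching "pumps", so leave them out if that matters.

```r
library(tm)

# Hypothetical toy data standing in for incidents$Description and tagDictionary
descriptions <- c("Pump failure reported in unit A",
                  "Routine inspection, nothing to note",
                  "Valve leak near the pump house")
tagDictionary <- c("pump", "leak")

desc_corpus <- VCorpus(VectorSource(descriptions))

# Collapse the keywords into one alternation pattern.
# The \\b anchors are an optional addition (see note above).
allTags <- paste0("\\b(", paste(tagDictionary, collapse = "|"), ")\\b")

# Predicate: TRUE if any line of the document matches any keyword
has_tag <- function(doc) any(grepl(allTags, content(doc), ignore.case = TRUE))

# tm_index returns the logical mask; tm_filter returns the subsetted corpus
keep <- tm_index(desc_corpus, has_tag)
corp_subset <- tm_filter(desc_corpus, has_tag)

length(corp_subset)  # 2 of the 3 toy documents are kept

# Build the TDM from the filtered corpus, restricted to the keywords
toy_tdm <- TermDocumentMatrix(corp_subset,
                              control = list(dictionary = tagDictionary))
```

Passing the dictionary again when building the TDM keeps the term rows limited to the keywords, as in the question. Without it, the filtered TDM contains every term from the remaining documents.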