I have two corpora containing similar words — similar enough that using setdiff doesn't really help my cause. So I turned to finding a way to extract a list or corpus (eventually to become a wordcloud) of the words that are more frequent (assuming something like this would have a threshold — so maybe 50% more frequent?) in corpus #1 compared to corpus #2.
This is everything I have so far:
install.packages("tm")
install.packages("SnowballC")
install.packages("wordcloud")
install.packages("RColorBrewer")
library(tm)
library(SnowballC)
library(wordcloud)
library(RColorBrewer)
UKDraft = read.csv("UKDraftScouting.csv", stringsAsFactors=FALSE)
corpus = Corpus(VectorSource(UKDraft$Report))
corpus = tm_map(corpus, tolower)
corpus = tm_map(corpus, PlainTextDocument)
corpus = tm_map(corpus, removePunctuation)
corpus = tm_map(corpus, removeWords, c("strengths", "weaknesses", "notes", "kentucky", "wildcats", stopwords("english")))
frequencies = DocumentTermMatrix(corpus)
allReports = as.data.frame(as.matrix(frequencies))
SECDraft = read.csv("SECMinusUKDraftScouting.csv", stringsAsFactors=FALSE)
SECcorpus = Corpus(VectorSource(SECDraft$Report))
SECcorpus = tm_map(SECcorpus, tolower)
SECcorpus = tm_map(SECcorpus, PlainTextDocument)
SECcorpus = tm_map(SECcorpus, removePunctuation)
SECcorpus = tm_map(SECcorpus, removeWords, c("strengths", "weaknesses", "notes", stopwords("english")))
SECfrequencies = DocumentTermMatrix(SECcorpus)
SECallReports = as.data.frame(as.matrix(SECfrequencies))
So if the word "wingspan" has a count of 100 in corpus #2 ('SECcorpus') but a count of 150 in corpus #1 ('corpus'), we would want that word in our resulting corpus/list.
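The comparison described above can be sketched in base R on top of the two term-frequency data frames already built (allReports and SECallReports). The tiny stand-in tables below are hypothetical, just to make the snippet self-contained:

```r
# Hypothetical stand-in term-frequency tables (rows = documents, columns = terms),
# in place of the real allReports / SECallReports built from the CSVs above.
allReports    <- data.frame(wingspan = c(100, 50), shooting = c(10, 5))
SECallReports <- data.frame(wingspan = c(60, 40), shooting = c(30, 20))

freq1 <- colSums(allReports)     # per-term totals for corpus #1
freq2 <- colSums(SECallReports)  # per-term totals for corpus #2

# compare only terms that occur in both corpora
common <- intersect(names(freq1), names(freq2))
ratio  <- freq1[common] / freq2[common]

# keep terms at least 1.5x (i.e. 50% or more) more frequent in corpus #1
moreFrequent <- names(ratio)[ratio >= 1.5]
moreFrequent  # "wingspan" (150 vs. 100)
```

The resulting character vector (here "wingspan", since 150/100 = 1.5 meets the threshold while 15/50 does not) could then be fed, with its frequencies, to wordcloud().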
Answer 0 (score: 3)
I can suggest a possibly easier approach, based on a new text-analysis package I developed with Paul Nulty. It's called quanteda, and it is available on CRAN and GitHub.
I don't have access to your texts, but this will work in a similar fashion for your example. You create a corpus of your two sets of documents, then add a document variable (using docvars), and then create a document-feature matrix grouped on the new document partition variable. The rest of the operations are straightforward; see the code below. Note that by default, dfm objects are sparse matrices, but subsetting on features is not yet implemented (next release!).
install.packages("quanteda")
library(quanteda)
# built-in character vector of 57 inaugural addresses
str(inaugTexts)
# create a corpus, with a partition variable to represent
# the two sets of texts you want to compare
inaugCorp <- corpus(inaugTexts,
docvars = data.frame(docset = c(rep(1, 29), rep(2, 28))),
notes = "Example made for stackoverflow")
# summarize the corpus
summary(inaugCorp, 5)
# toLower, removePunct are on by default
inaugDfm <- dfm(inaugCorp,
groups = "docset", # by docset instead of document
ignoredFeatures = c("strengths", "weaknesses", "notes", stopwords("english")),
matrixType = "dense")
# now compare frequencies and trim based on ratio threshold
ratioThreshold <- 1.5
featureRatio <- inaugDfm[2, ] / inaugDfm[1, ]
# to select where set 2 feature frequency is 1.5x set 1 feature frequency
inaugDfmReduced <- inaugDfm[2, featureRatio >= ratioThreshold]
# plot the wordcloud
plot(inaugDfmReduced)
I would recommend passing some options to wordcloud() (which plot.dfm() uses), perhaps to restrict the minimum number of features to be plotted.
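For instance (a sketch, not verified against your data; min.freq, max.words, and random.order are standard wordcloud() arguments, on the assumption that plot.dfm() forwards extra arguments through to wordcloud()):

```r
# A sketch: plot.dfm() calls wordcloud() underneath, so standard
# wordcloud() options such as these should pass through
plot(inaugDfmReduced,
     min.freq = 2,         # drop features occurring fewer than 2 times
     max.words = 50,       # cap the number of features plotted
     random.order = FALSE) # plot the most frequent features in the centre
```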
Very happy to assist with any queries you might have about using the quanteda package.
New:
This addresses your question directly. I don't have your files, so I cannot verify that it works. Also, if your R skills are limited, you might find this hard to follow; ditto if you have not looked at any of the documentation for quanteda (poor at the moment, admittedly).
I think what you need (based on your comments/queries) is the following:
# read in each corpus separately, directly into quanteda
mycorpus1 <- corpus(textfile("UKDraftScouting.csv", textField = "Report"))
mycorpus2 <- corpus(textfile("SECMinusUKDraftScouting.csv", textField = "Report"))
# assign docset variables to each corpus as appropriate
docvars(mycorpus1, "docset") <- 1
docvars(mycorpus2, "docset") <- 2
myCombinedCorpus <- mycorpus1 + mycorpus2
Then proceed with the dfm steps above, substituting myCombinedCorpus for inaugTexts.
Answer 1 (score: 0)
I am updating @Ken Benoit's answer, since it is several years old and the quanteda package has gone through some major changes in syntax.
The current version should be (April 2017):
str(inaugTexts)
# create a corpus, with a partition variable to represent
# the two sets of texts you want to compare
inaugCorp <- corpus(inaugTexts,
docvars = data.frame(docset = c(rep(1, 29), rep(2, 29))),
notes = "Example made for stackoverflow")
# summarize the corpus
summary(inaugCorp, 5)
inaugDfm <- dfm(inaugCorp,
groups = "docset", # by docset instead of document
remove = c("<p>", "http://", "www", stopwords("english")),
remove_punct = TRUE,
remove_numbers = TRUE,
stem = TRUE)
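The frequency-comparison and trimming steps from the original answer carry over in much the same way (a sketch, not verified against your data; textplot_wordcloud() is assumed here as the current quanteda replacement for the older plot() method on a dfm):

```r
# compare grouped frequencies and trim on the ratio threshold, as before;
# as.numeric() flattens each sparse dfm row into an ordinary vector
ratioThreshold <- 1.5
featureRatio <- as.numeric(inaugDfm[2, ]) / as.numeric(inaugDfm[1, ])

# keep features where group 2 is at least 1.5x as frequent as group 1
inaugDfmReduced <- inaugDfm[2, featureRatio >= ratioThreshold]

# plot the wordcloud with the current quanteda plotting function
textplot_wordcloud(inaugDfmReduced)
```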