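I have two corpora (which I convert into DocumentTermMatrices, then data frames, then wordclouds), one of which is a subset of the other. To be precise, one is a corpus of text about a single university, and the other is a corpus of text about all the universities in that conference.
Is there a way in R to extract only the words that are unique to the smaller set? Here is what I have been running for each corpus so far (this is the 'conference' corpus):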
Thanks, everyone!
Answer 0 (score: 1)
As in my reply to your other post, I would do this with the quanteda package. I can't test this because I don't have the .csv files, but it should work:
# install.packages("quanteda")
require(quanteda)
# read in each corpus separately, directly into quanteda
mycorpus1 <- corpus(textfile("UKDraftScouting.csv", textField = "report"))
mycorpus2 <- corpus(textfile("SECMinusUKDraftScouting.csv", textField = "report"))
# assign docset variables to each corpus as appropriate
docvars(mycorpus1, "docset") <- 1
docvars(mycorpus2, "docset") <- 2
myCombinedCorpus <- mycorpus1 + mycorpus2
myDfm <- dfm(myCombinedCorpus,
             groups = "docset", # by docset instead of document
             ignoredFeatures = c("strengths", "weaknesses", "notes", stopwords("english")),
             matrixType = "dense")
# create a logical vector indexing the features unique to corpus 1
uniqueToCorpus1 <- (myDfm[1, ] & !myDfm[2, ])
# this is the dfm with features unique to dfm1
myDfm[1, uniqueToCorpus1]
# list the word features as a character vector
features(myDfm[1, uniqueToCorpus1])
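
If it helps to see the underlying idea without any package, here is a minimal base R sketch of the same set-difference logic. It assumes each corpus is simply a character vector of documents; the vectors `uk_docs` and `sec_docs` and the helper `tokenize` are made up for illustration and are not part of the original data or of quanteda. The tokenization is deliberately crude (lowercase, split on non-letters), so treat it as a sanity check rather than a replacement for the dfm approach above.

```r
# Hypothetical toy data standing in for the "report" column of each .csv
uk_docs  <- c("Explosive athlete with elite length", "Raw shooter, elite motor")
sec_docs <- c("Solid shooter with good length", "Raw athlete, good motor")

# Hypothetical helper: lowercase, split on anything that is not a letter,
# drop empty strings, and keep each word once
tokenize <- function(docs) {
  tokens <- unlist(strsplit(tolower(docs), "[^a-z]+"))
  unique(tokens[nzchar(tokens)])
}

uk_vocab  <- tokenize(uk_docs)
sec_vocab <- tokenize(sec_docs)

# Words that occur in the single-university texts but never in the others
unique_to_uk <- setdiff(uk_vocab, sec_vocab)
print(unique_to_uk)
```

Note that, as in the answer above, the comparison set here is the conference without the single university. If the larger corpus still contained the smaller one, the set difference would always be empty.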