R - Subtracting one corpus's word set from a larger corpus to find unique words

Date: 2015-05-28 22:09:36

Tags: r corpus text-analysis word-cloud

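I have two corpora (which I convert to DocumentTermMatrices, then data frames, then wordclouds), one of which is a subset of the other. Specifically, one is a corpus of text about a single university, and the other is a corpus of text about all the universities in that conference.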

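Is there a way in R to extract only the words that are unique to the smaller word set? Here is what I've run so far for each corpus (this one is the 'conference' corpus).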

Thanks, everyone!

1 Answer:

Answer 0: (score: 1)

As in my reply to your other post, I would do this with the quanteda package. I can't test this because I don't have the .csv files, but it should work:

# install.packages("quanteda")
# note: this uses the 2015-era quanteda API; later releases renamed some of these calls
library(quanteda)

# read in each corpus separately, directly into quanteda
mycorpus1 <- corpus(textfile("UKDraftScouting.csv", textField = "report"))
mycorpus2 <- corpus(textfile("SECMinusUKDraftScouting.csv", textField = "report"))
# assign docset variables to each corpus as appropriate 
docvars(mycorpus1, "docset") <- 1 
docvars(mycorpus2, "docset") <- 2
myCombinedCorpus <- mycorpus1 + mycorpus2

# build one document-feature matrix with a row per docset
myDfm <- dfm(myCombinedCorpus,
             groups = "docset", # by docset instead of document
             ignoredFeatures = c("strengths", "weaknesses", "notes", stopwords("english")),
             matrixType = "dense")

# create a logical vector indexing the features unique to corpus 1
uniqueToCorpus1 <- (myDfm[1, ] & !myDfm[2, ])
# this is the dfm with only the features unique to corpus 1
myDfm[1, uniqueToCorpus1]
# list the word features as a character vector
features(myDfm[1, uniqueToCorpus1])
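
Since the code above can't be run without the .csv files, here is a rough sketch of the same logical-indexing step in base R on a toy count matrix. The words and counts are made up purely for illustration, and a plain matrix stands in for the dense dfm, so treat it as an approximation of the idea rather than of quanteda's actual behavior.

```r
# Toy term counts: rows are the two docsets, columns are word features.
# The words and counts are invented for illustration only.
counts <- matrix(
  c(3, 1, 0, 2,   # docset 1 (the single university)
    0, 4, 5, 0),  # docset 2 (the rest of the conference)
  nrow = 2, byrow = TRUE,
  dimnames = list(docset = c("1", "2"),
                  feature = c("wildcat", "athletic", "gator", "lexington"))
)

# TRUE where a word occurs in docset 1 and never in docset 2
unique_to_1 <- counts["1", ] > 0 & counts["2", ] == 0

# counts for the words unique to docset 1; drop = FALSE keeps the matrix shape
unique_counts <- counts["1", unique_to_1, drop = FALSE]

# the unique words themselves, as a character vector
unique_words <- colnames(counts)[unique_to_1]
print(unique_words)
```

Here "athletic" is dropped because it appears in both docsets, and "gator" is dropped because it never appears in docset 1, which leaves "wildcat" and "lexington". The explicit `> 0` and `== 0` comparisons do the same job as `&` and `!` on the raw counts, just a little more readably.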