Suppose I have the following:
x10 = data.frame(id = c(1,2,3),vars =c('top','down','top'),
text1=c('this is text','so is this','and this is too.'),
text2=c('we have more text here','and here too','and look at this, more text.'))
I would like to create a dfm / corpus in quanteda using the following:
x1 = corpus(x10,docid_field='id',text_field=c(3:4),tolower=T)
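Obviously this fails, because text_field accepts only a single column. Is there a better way to handle this than building two corpora? Can I build two and then merge them on id? Is that a thing?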
Answer 0 (score: 2)
First, let's recreate your data.frame so that the character values are not converted to factors:
x10 = data.frame(id = c(1,2,3), vars = c('top','down','top'),
text1 = c('this is text', 'so is this', 'and this is too.'),
text2 = c('we have more text here', 'and here too', 'and look at this, more text.'),
stringsAsFactors = FALSE)
Then we have two options.
First, "melt" the data so that there is only a single text column, then import it as a corpus. (An alternative is tidyr::gather().)
x10b <- reshape2::melt(x10, id.vars = c("id", "vars"),
measure.vars = c("text1", "text2"),
variable.name = "doc_id", value.name = "text")
# because corpus() takes document names from row names, by default
row.names(x10b) <- paste(x10b$doc_id, x10b$id, sep = "_")
x10b
#         id vars doc_id                         text
# text1_1  1  top  text1                 this is text
# text1_2  2 down  text1                   so is this
# text1_3  3  top  text1             and this is too.
# text2_1  1  top  text2       we have more text here
# text2_2  2 down  text2                 and here too
# text2_3  3  top  text2 and look at this, more text.
x10_corpus <- corpus(x10b)
summary(x10_corpus)
# Corpus consisting of 6 documents:
#
#     Text Types Tokens Sentences id vars doc_id
#  text1_1     3      3         1  1  top  text1
#  text1_2     3      3         1  2 down  text1
#  text1_3     5      5         1  3  top  text1
#  text2_1     5      5         1  1  top  text2
#  text2_2     3      3         1  2 down  text2
#  text2_3     8      8         1  3  top  text2
#
# Source: /Users/kbenoit/Dropbox (Personal)/GitHub/lse-my459/assignment-2/* on x86_64 by kbenoit
# Created: Tue Feb 6 19:06:07 2018
# Notes:
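The gather() alternative mentioned above could look roughly like the following. This is only a sketch: it assumes the tidyr package is installed, and the object name x10c is illustrative rather than part of the original answer. It covers the reshaping step only; the result has the same shape as the melted data and can be passed to corpus() in the same way.

```r
# Sketch only: assumes tidyr is installed; x10c is an illustrative name.
x10 <- data.frame(id = c(1, 2, 3), vars = c("top", "down", "top"),
                  text1 = c("this is text", "so is this", "and this is too."),
                  text2 = c("we have more text here", "and here too",
                            "and look at this, more text."),
                  stringsAsFactors = FALSE)

# Stack text1 and text2 into a single "text" column; "doc_id" records
# which original column each row came from.
x10c <- tidyr::gather(x10, key = "doc_id", value = "text", text1, text2)

# Use the same row-name scheme as in the melt() version: <column>_<id>
row.names(x10c) <- paste(x10c$doc_id, x10c$id, sep = "_")
x10c
```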
Second, we can create the two corpus objects separately and combine them with the + operator.
x10_corpus2 <-
corpus(x10[, -which(names(x10)=="text2")], text_field = "text1") +
corpus(x10[, -which(names(x10)=="text1")], text_field = "text2")
summary(x10_corpus2)
# Corpus consisting of 6 documents:
#
#    Text Types Tokens Sentences id vars
#   text1     3      3         1  1  top
#   text2     3      3         1  2 down
#   text3     5      5         1  3  top
#  text11     5      5         1  1  top
#  text21     3      3         1  2 down
#  text31     8      8         1  3  top
#
# Source: Combination of corpuses corpus(x10[, -which(names(x10) == "text2")], text_field = "text1") and corpus(x10[, -which(names(x10) == "text1")], text_field = "text2")
# Created: Tue Feb 6 19:14:14 2018
# Notes:
You can also reassign the document names at this stage with docnames(x10_corpus2) <-, so that they look more like those from the first method.
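As a rough illustration of renaming the documents, the sketch below builds names of the form <column>_<id>, matching the first method. The object names c1 and c2 are illustrative. Unlike the text above, it assigns the names to each corpus before combining them; that ordering is my own assumption, chosen because I believe newer quanteda releases may refuse to combine corpora whose document names clash.

```r
library(quanteda)

x10 <- data.frame(id = c(1, 2, 3), vars = c("top", "down", "top"),
                  text1 = c("this is text", "so is this", "and this is too."),
                  text2 = c("we have more text here", "and here too",
                            "and look at this, more text."),
                  stringsAsFactors = FALSE)

# Build one corpus per text column (c1 and c2 are illustrative names)
c1 <- corpus(x10[, c("id", "vars", "text1")], text_field = "text1")
c2 <- corpus(x10[, c("id", "vars", "text2")], text_field = "text2")

# Assumption: name the documents before combining, as <column>_<id>
docnames(c1) <- paste("text1", x10$id, sep = "_")
docnames(c2) <- paste("text2", x10$id, sep = "_")

# Combine the two corpora and inspect the resulting document names
x10_corpus2 <- c1 + c2
docnames(x10_corpus2)
```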