I have a quanteda corpus of 10 documents, several of which are by the same author. I store the author in a separate docvar column, myCorpus$documents[,"author"]:
> docvars(myCorpus)
author
206035 author1
269823 author2
304225 author1
422364 author2
<...snip..>
I am making a lexical dispersion plot with textplot_xray:
textplot_xray(
    kwic(myCorpus, "image"),
    kwic(myCorpus, "one"),
    kwic(myCorpus, "like"),
    kwic(myCorpus, "time"),
    kwic(myCorpus, "just"),
    scale = "absolute"
)
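How can I use myCorpus$documents[,"author"] as the document identifier instead of the document ID?
I am not trying to group the documents; I just want to identify each document by its author. I understand that doc IDs have to be unique, so I cannot simply rename them using docnames(myCorpus)<-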
Answer 0 (score: 1)
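The document names in the textplot are taken from the docnames of the corpus. In this case, you want to create new documents grouped by the author docvar. This can be done with the texts() extractor function and its groups argument.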
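To create a reproducible example, I will use the built-in data object data_char_sampletext, segment it into sentences to form new documents, and then simulate an author docvar.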
library("quanteda")
# quanteda version 1.0.0
myCorpus <- corpus(data_char_sampletext) %>%
    corpus_reshape(to = "sentences")
# make some duplicated author docvar values
set.seed(1)
docvars(myCorpus, "author") <-
    sample(c("author1", "author2", "author3"),
           size = ndoc(myCorpus), replace = TRUE)
This produces:
summary(myCorpus)
# Corpus consisting of 15 documents:
#
# Text Types Tokens Sentences author
# text1.1 23 23 1 author1
# text1.2 40 53 1 author2
# text1.3 48 63 1 author2
# text1.4 30 39 1 author3
# text1.5 20 25 1 author1
# text1.6 43 57 1 author3
# text1.7 13 15 1 author3
# text1.8 25 26 1 author2
# text1.9 9 9 1 author2
# text1.10 37 53 1 author1
# text1.11 32 41 1 author1
# text1.12 30 30 1 author1
# text1.13 28 35 1 author3
# text1.14 16 18 1 author2
# text1.15 32 42 1 author3
#
# Source: /Users/kbenoit/tmp/* on x86_64 by kbenoit
# Created: Fri Feb 16 18:03:13 2018
# Notes: corpus_reshape.corpus(., to = "sentences")
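Now we extract the texts as a character vector, grouping them by the author document variable. This yields a named character vector of length 3, whose names are the (unique) author identifiers.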
groupedtexts <- texts(myCorpus, groups = "author")
length(groupedtexts)
# [1] 3
names(groupedtexts)
# [1] "author1" "author2" "author3"
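As a rough illustration of what grouping by author does conceptually, the following base R sketch concatenates the texts of each author into one string. The toy texts, the author labels, and the single-space separator are assumptions made for illustration; this is not quanteda's own implementation, which may join the texts differently.

```r
# Toy data (hypothetical): four short documents and their authors
txts <- c(d1 = "First text.", d2 = "Second text.",
          d3 = "Third text.", d4 = "Fourth text.")
authors <- c("author1", "author2", "author1", "author2")

# Split the texts by author, then paste each author's texts into one string
grouped <- vapply(split(txts, authors), paste, character(1), collapse = " ")

length(grouped)  # one element per unique author
names(grouped)   # the author identifiers become the element names
```

The result is a named character vector with one element per author, which is the same shape as groupedtexts above.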
Then plot the grouped texts:
textplot_xray(
    kwic(groupedtexts, "and"),
    kwic(groupedtexts, "for")
)
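If the documents should stay separate, as the question asks, one possible alternative (not part of the approach above, and only a sketch) is to derive unique labels from the author names with base R's make.unique() and assign them as document names. The author vector and the "_" separator below are assumptions for illustration.

```r
# Hypothetical author values, mirroring the docvar shown in the question
authors <- c("author1", "author2", "author1", "author2")

# Append a numeric suffix to repeated names so that every label is unique
labels <- make.unique(authors, sep = "_")
labels

# Assumed usage with the corpus from the question (not run here):
# docnames(myCorpus) <- make.unique(docvars(myCorpus, "author"), sep = "_")
```

Each panel of the plot would then be labelled by author while the documents remain ungrouped; note that the first document of each author keeps the bare author name and only later ones get a suffix.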