Question

I have a quanteda corpus of 10 documents, several of which are by the same author. I store the author in a separate docvar column - myCorpus$documents[,"author"]

> docvars(myCorpus)

          author   
206035    author1   
269823    author2   
304225    author1   
422364    author2
<...snip..>

I'm making a lexical dispersion plot with textplot_xray:

textplot_xray(
            kwic(myCorpus, "image"),
            kwic(myCorpus, "one"),
            kwic(myCorpus, "like"),
            kwic(myCorpus, "time"),
            kwic(myCorpus, "just"),
            scale = "absolute"
          )

How can I use myCorpus$documents[,"author"] as the document identifier instead of the document ID?

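I'm not trying to group the documents; I just want to identify them by author. I found that doc IDs must be unique, so I can't simply use docnames(myCorpus) <- to rename the documents.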

Answer 1

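The textplot takes its document names from the docnames of the corpus. In this case, you want to create new documents grouped by the author docvar. This can be done with the texts() extractor function and its groups argument.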
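
To make the grouping idea concrete, here is a minimal base-R sketch of what grouping texts by a variable amounts to: texts that share an author are pasted into a single string named after that author. This is an illustration only, not quanteda's actual implementation; the vectors `txt` and `author` are made-up stand-ins, and the single-space separator is an assumption.

```r
# Hypothetical stand-in data: four short "documents" and their authors
txt <- c(d1 = "one image", d2 = "just like that",
         d3 = "time and again", d4 = "for now")
author <- c("author1", "author2", "author1", "author2")

# split() partitions the texts by author;
# paste(collapse = " ") then joins each partition into one string
grouped <- vapply(split(txt, author), paste, character(1), collapse = " ")

length(grouped)        # one element per distinct author
names(grouped)         # the (unique) author identifiers
grouped[["author1"]]   # the pooled text for author1
```

Because the names of the result are the distinct author values, they are unique by construction, which is what lets them serve as document names. As far as I know, later quanteda releases deprecate texts() and offer corpus_group() for a similar purpose, so check the documentation for the version you have installed.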

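To create a reproducible example, I'll use the built-in data object data_char_sampletext, segment it into sentences to form new documents, and then simulate an author docvar.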

library("quanteda")
# quanteda version 1.0.0

myCorpus <- corpus(data_char_sampletext) %>% 
    corpus_reshape(to = "sentences")
# make some duplicated author docvar values
set.seed(1)
docvars(myCorpus, "author") <- 
    sample(c("author1", "author2", "author3"), 
           size = ndoc(myCorpus), replace = TRUE)

This yields:

summary(myCorpus)
# Corpus consisting of 15 documents:
#     
#     Text Types Tokens Sentences  author
#  text1.1    23     23         1 author1
#  text1.2    40     53         1 author2
#  text1.3    48     63         1 author2
#  text1.4    30     39         1 author3
#  text1.5    20     25         1 author1
#  text1.6    43     57         1 author3
#  text1.7    13     15         1 author3
#  text1.8    25     26         1 author2
#  text1.9     9      9         1 author2
# text1.10    37     53         1 author1
# text1.11    32     41         1 author1
# text1.12    30     30         1 author1
# text1.13    28     35         1 author3
# text1.14    16     18         1 author2
# text1.15    32     42         1 author3
# 
# Source:  /Users/kbenoit/tmp/* on x86_64 by kbenoit
# Created: Fri Feb 16 18:03:13 2018
# Notes:   corpus_reshape.corpus(., to = "sentences")

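Now we extract the texts as a character vector, grouping them by the author document variable. This produces a named character vector of length 3, where the names are the (unique) author identifiers.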

groupedtexts <- texts(myCorpus, groups = "author")
length(groupedtexts)
# [1] 3
names(groupedtexts)
# [1] "author1" "author2" "author3"

And then the plot:

textplot_xray(
    kwic(groupedtexts, "and"),
    kwic(groupedtexts, "for")
)
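
The grouped plot pools each author's documents into one panel. If the aim is instead to keep the documents separate but label them by author, one possible alternative, which is not part of the answer above, is to build unique labels from the author values and assign those as docnames. The sketch below uses only base R; the `author` and `ids` vectors are hypothetical stand-ins for docvars(myCorpus, "author") and docnames(myCorpus).

```r
# Hypothetical stand-ins for docvars(myCorpus, "author") and docnames(myCorpus)
author <- c("author1", "author2", "author1", "author2")
ids <- c("206035", "269823", "304225", "422364")

# Option A: prefix each original ID with its author;
# the labels stay unique as long as the original IDs are unique
labels_a <- paste(author, ids, sep = "_")

# Option B: let make.unique() append a counter to repeated author names
labels_b <- make.unique(author, sep = "_")

labels_a
labels_b

# With a real corpus, the labels could then be assigned like this
# (an untested assumption, shown as a comment only):
# docnames(myCorpus) <- paste(docvars(myCorpus, "author"),
#                             docnames(myCorpus), sep = "_")
```

Option A keeps the original ID visible in each panel title, which makes it easy to trace a panel back to its source document; Option B gives shorter labels at the cost of that traceability.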

Quanteda textplot_xray grouping by non-unique docvar as documents
