How do I convert kwic output from the quanteda package into a corpus?

Time: 2016-05-25 18:55:07

Tags: r text-mining quanteda

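How can I convert the output of kwic into a corpus for further analysis? More specifically, I want to build a corpus from the words before and after the keyword (contextPre, contextPost) so that I can run sentiment analysis on them.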

1 answer:

Answer 0 (score: 0)

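The simplest approach: create one corpus from the pre-keyword context and one from the post-keyword context, tag each with a document variable (docvar) identifying the context, then combine the two corpora with the + operator.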

library(quanteda)  # stops with an error if quanteda is not installed
mykwic <- kwic(data_corpus_inaugural, "terror")

# make a corpus with the pre-word context
mycorpus <- corpus(mykwic$pre)
docvars(mycorpus, "context") <- "pre"

# make a corpus with the post-word context
mycorpus2 <- corpus(mykwic$post)
docvars(mycorpus2, "context") <- "post"

# combine the two corpora
mycorpus <- mycorpus + mycorpus2

summary(mycorpus)
# Corpus consisting of 16 documents.
# 
#    Text Types Tokens Sentences context
#   text1     5      5         1     pre
#   text2     4      5         1     pre
#   text3     5      5         1     pre
#   text4     5      5         1     pre
#   text5     5      5         1     pre
#   text6     5      5         1     pre
#   text7     5      5         1     pre
#   text8     5      5         1     pre
#  text11     4      5         1    post
#  text21     5      5         1    post
#  text31     5      5         1    post
#  text41     5      5         1    post
#  text51     5      5         1    post
#  text61     5      5         2    post
#  text71     5      5         2    post
#  text81     5      5         1    post
# 
# Source:  Combination of corpuses mycorpus and mycorpus2
# Created: Wed May 25 23:35:54 2016
# Notes:   
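
Once each document carries a context docvar, a dictionary lookup is one possible way to approach the sentiment step mentioned in the question. The following is only a sketch under assumptions: it uses a tiny hand-made corpus (the two texts and their document names are hypothetical stand-ins for the combined pre/post corpus), and it assumes a quanteda version that ships the data_dictionary_LSD2015 sentiment dictionary.

```r
library(quanteda)

# Hypothetical stand-in for the combined pre/post corpus built above
ctx <- corpus(
  c(pre1 = "a great and good thing", post1 = "was bad and terrible"),
  docvars = data.frame(context = c("pre", "post"))
)

# Keep only the contexts that precede the keyword
pre_only <- corpus_subset(ctx, context == "pre")

# Count matches against the negative and positive dictionary keys
# for every context document
sent <- dfm_lookup(dfm(tokens(ctx)),
                   dictionary = data_dictionary_LSD2015[1:2])
sent
```

The resulting dfm has one row per context document and one column per dictionary key, so the counts can be compared between the pre and post groups using the context docvar.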

Added:

As of v0.9.7-6, quanteda has a method for constructing a corpus directly from a kwic object. So this now works:

mykwic <- kwic(data_corpus_inaugural, "southern")
summary(corpus(mykwic))
# Corpus consisting of 28 documents.
# 
#        Text Types Tokens Sentences         docname position  keyword context
#   text1.pre     5      5         1      1797-Adams     1807 southern     pre
#   text2.pre     4      5         1      1825-Adams     2434 southern     pre
#   text3.pre     4      5         1    1861-Lincoln       98 Southern     pre
#   text4.pre     5      5         1    1865-Lincoln      283 southern     pre
#   text5.pre     5      5         1      1877-Hayes      378 Southern     pre
#   text6.pre     5      5         1      1877-Hayes      956 Southern     pre
#   text7.pre     5      5         1      1877-Hayes     1250 Southern     pre
#   text8.pre     5      5         1   1881-Garfield     1007 Southern     pre
#   text9.pre     4      5         1       1909-Taft     4029 Southern     pre
#  text10.pre     5      5         1       1909-Taft     4230 Southern     pre
#  text11.pre     5      5         1       1909-Taft     4350 Southern     pre
#  text12.pre     5      5         1       1909-Taft     4537 Southern     pre
#  text13.pre     5      5         1       1909-Taft     4597 Southern     pre
#  text14.pre     5      5         1 1953-Eisenhower     1226 southern     pre
#  text1.post     5      5         1      1797-Adams     1807 southern    post
#  text2.post     5      5         1      1825-Adams     2434 southern    post
#  text3.post     5      5         1    1861-Lincoln       98 Southern    post
#  text4.post     5      5         2    1865-Lincoln      283 southern    post
#  text5.post     5      5         2      1877-Hayes      378 Southern    post
#  text6.post     5      5         1      1877-Hayes      956 Southern    post
#  text7.post     5      5         1      1877-Hayes     1250 Southern    post
#  text8.post     5      5         2   1881-Garfield     1007 Southern    post
#  text9.post     5      5         2       1909-Taft     4029 Southern    post
# text10.post     5      5         1       1909-Taft     4230 Southern    post
# text11.post     5      5         1       1909-Taft     4350 Southern    post
# text12.post     5      5         1       1909-Taft     4537 Southern    post
# text13.post     5      5         1       1909-Taft     4597 Southern    post
# text14.post     5      5         1 1953-Eisenhower     1226 southern    post
# 
# Source:  Corpus created from kwic(x, keywords = "southern")
# Created: Thu May 26 09:47:19 2016
# Notes:
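
For readers on a more recent quanteda release, the same idea might look like the sketch below. It rests on assumptions not stated in the answer above: that kwic() in newer versions expects a tokens object rather than a corpus, and that the corpus method for kwic objects still splits each match into a pre and a post document by default. The number of matches can also differ from the output shown above, because data_corpus_inaugural has grown since 2016.

```r
library(quanteda)

# Assumption: newer quanteda versions want tokens, not a corpus, in kwic()
toks <- tokens(data_corpus_inaugural)
mykwic <- kwic(toks, pattern = "southern", window = 5)

# Build a corpus from the kwic object; by default each match is expected
# to yield one "pre" and one "post" document
kwic_corpus <- corpus(mykwic)

# Inspect the document variables attached to the new corpus
head(docvars(kwic_corpus))
```

If the context docvar is present, corpus_subset(kwic_corpus, context == "pre") should isolate the text before each keyword.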