In the dfm() output

Time: 2015-12-29 08:48:42

Tags: r quanteda

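I have a dataset with an ID number column and a text column, and I am running a LIWC analysis on the text data using the quanteda package. Here is an example of my data setup: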

mydata<-data.frame(
  id=c(19,101,43,12),
  text=c("No wonder, then, that ever gathering volume from the mere transit ",
         "So that in many cases such a panic did he finally strike, that few ",
         "But there were still other and more vital practical influences at work",
         "Not even at the present day has the original prestige of the Sperm Whale"),
  stringsAsFactors = FALSE
)

I have been able to run the LIWC analysis with:

scores <- dfm(as.character(mydata$text), dictionary = liwc)

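However, when I view the results (View(scores)), I find that the function does not reference the original ID numbers (19, 101, 43, 12) in the final output. Instead, a row.names column is added, but it contains non-descriptive identifiers (e.g., "text1", "text2").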

How can I get the dfm() function to include the ID numbers in its output? Thanks!

1 answer:

Answer 0: (score: 1)

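It sounds like you want the row names of the dfm object to be the ID numbers from mydata$id. This happens automatically if you declare these IDs as the document names of the texts. The easiest way to do this is to create a quanteda corpus object from your data.frame.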

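The corpus() call below assigns the document names from your id variable. Note: the "Text" column in the summary() output looks like a numeric value, but it is actually the document name of each text.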

require(quanteda)
myCorpus <- corpus(mydata[["text"]], docnames = mydata[["id"]])
summary(myCorpus)
# Corpus consisting of 4 documents.
# 
# Text Types Tokens Sentences
#   19    11     11         1
#  101    13     14         1
#   43    12     12         1
#   12    12     14         1
# 
# Source:  /Users/kbenoit/Dropbox/GitHub/quanteda/* on x86_64 by kbenoit
# Created: Tue Dec 29 11:54:00 2015
# Notes:   
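
In more recent quanteda releases, the corpus can also be built directly from the data.frame rather than from the text column. The following is only a sketch, assuming your installed version's corpus() method for data.frames accepts the docid_field and text_field arguments (check ?corpus to confirm). The object name myCorpus2 is chosen here for illustration, and the IDs are converted to character first so that they are valid document names.

```r
library(quanteda)

mydata <- data.frame(
  id = c(19, 101, 43, 12),
  text = c("No wonder, then, that ever gathering volume from the mere transit ",
           "So that in many cases such a panic did he finally strike, that few ",
           "But there were still other and more vital practical influences at work",
           "Not even at the present day has the original prestige of the Sperm Whale"),
  stringsAsFactors = FALSE
)

# Document names are character strings, so convert the numeric IDs up front
mydata$id <- as.character(mydata$id)

# Assumption: this quanteda version supports docid_field / text_field
myCorpus2 <- corpus(mydata, docid_field = "id", text_field = "text")

# The document names should now be the original IDs
docnames(myCorpus2)
```

Either way of building the corpus should leave the original IDs as the document names.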

From there, the document names automatically become the row labels in the dfm. (You can add the dictionary = argument for your LIWC application.)

myDfm <- dfm(myCorpus, verbose = FALSE)
head(myDfm)
# Document-feature matrix of: 4 documents, 45 features.
# (showing first 4 documents and first 6 features)
#      features
# docs  no wonder then that ever gathering
#   19   1      1    1    1    1         1
#   101  0      0    0    2    0         0
#   43   0      0    0    0    0         0
#   12   0      0    0    0    0         0
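
To illustrate the dictionary step while keeping the IDs as row labels, here is a hedged sketch that uses the tokens() and dfm_lookup() functions from newer quanteda versions instead of the dictionary = argument. The LIWC dictionary is not reproduced here, so toyDict is a made-up stand-in with two hypothetical categories; substitute your own liwc dictionary object.

```r
library(quanteda)

texts <- c("No wonder, then, that ever gathering volume from the mere transit ",
           "So that in many cases such a panic did he finally strike, that few ",
           "But there were still other and more vital practical influences at work",
           "Not even at the present day has the original prestige of the Sperm Whale")
ids <- c(19, 101, 43, 12)

# Declare the IDs as document names when creating the corpus
myCorpus <- corpus(texts, docnames = as.character(ids))

# Hypothetical toy dictionary standing in for the LIWC dictionary
toyDict <- dictionary(list(
  negation = c("no", "not"),
  time     = c("then", "still", "day", "ever")
))

# Tokenize, build the dfm, then collapse features into dictionary categories
toks   <- tokens(myCorpus, remove_punct = TRUE)
myDfm  <- dfm(toks)
scores <- dfm_lookup(myDfm, dictionary = toyDict)

# Row labels of the result are the original IDs, not "text1", "text2", ...
docnames(scores)
```

Because the document names are set on the corpus, they should carry through tokenization and the dictionary lookup without any extra bookkeeping.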