Question

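I want to produce a short summary of a corpus of PDF files. It should include the columns Text, Types, Tokens, and Sentences (as shown in the quanteda Quick Start Guide) and exclude all other columns. The docvars from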

names(docvars(corp_mk))

which gives me

[1] "author"        "datetimestamp" "description"   "heading"       "id"           
[6] "language"      "origin"

should not be shown in the summary.

I tried using "showmeta = FALSE" in the summary() command, but the output still includes all columns.

I get:

                Text Types Tokens Sentences       author       datetimestamp description
   MoKa_BA_LG_16.pdf  1194   8620       283 Pressestelle 2016-07-27 13:01:04            
  MoKa_BBK_DO_18.pdf   810   2643        56      spalgen 2018-07-03 09:00:13        <NA>
 MoKa_BBK_DUE_18.pdf  1327   6219        97      Suttkus 2018-01-24 14:44:37        <NA>

I want:

                Text Types Tokens Sentences
   MoKa_BA_LG_16.pdf  1194   8620       283
  MoKa_BBK_DO_18.pdf   810   2643        56
 MoKa_BBK_DUE_18.pdf  1327   6219        97

Do I have to extract the columns from the corpus before summarizing, or can this be done with a quanteda command?

Answer 1

The summary.corpus() method silently returns the data.frame that gets printed. So if you want just the text summary columns, subset it as follows:

library("quanteda")
## Package version: 1.4.3
## Parallel computing: 2 of 12 threads used.
## See https://quanteda.io for tutorials and examples.

summary(data_corpus_irishbudget2010)[, c("Text", "Types", "Tokens", "Sentences")]
##                         Text Types Tokens Sentences
## 1        Lenihan, Brian (FF)  1953   8641       374
## 2       Bruton, Richard (FG)  1040   4446       217
## 3         Burton, Joan (LAB)  1624   6393       307
## 4        Morgan, Arthur (SF)  1595   7107       343
## 5          Cowen, Brian (FF)  1629   6599       250
## 6           Kenny, Enda (FG)  1148   4232       153
## 7      ODonnell, Kieran (FG)   678   2297       133
## 8       Gilmore, Eamon (LAB)  1181   4177       201
## 9     Higgins, Michael (LAB)   488   1286        44
## 10       Quinn, Ruairi (LAB)   439   1284        59
## 11     Gormley, John (Green)   401   1030        49
## 12       Ryan, Eamon (Green)   510   1643        90
## 13     Cuffe, Ciaran (Green)   442   1240        45
## 14 OCaolain, Caoimhghin (SF)  1188   4044       176
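
If you would rather not hard-code the column names, one possible variation is to drop every column whose name matches a docvar. The sketch below assumes a small made-up corpus (corp_toy, with invented author and language docvars) standing in for corp_mk; the object names smry and smry_counts are only illustrative.

```r
library("quanteda")

# Hypothetical toy corpus with two docvars, standing in for corp_mk
corp_toy <- corpus(
  c(doc1 = "This is a sentence. Here is another one.",
    doc2 = "A single short text."),
  docvars = data.frame(author = c("A", "B"), language = c("en", "en"))
)

# Store the summary, then keep only the columns that are not docvars
smry <- summary(corp_toy)
smry_counts <- smry[, !(names(smry) %in% names(docvars(corp_toy)))]
smry_counts
```

This should leave only the Text, Types, Tokens, and Sentences columns, whatever docvars the corpus happens to carry.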
