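I would like to produce a short summary of a set of PDF files. I want to include only the columns Text, Types, Tokens and Sentences (as shown in the quanteda quick start guide) and exclude all other columns. From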
names(docvars(corp_mk))
I know that the columns
[1] "author" "datetimestamp" "description" "heading" "id"
[6] "language" "origin"
should not appear in the summary.
I tried passing "showmeta = FALSE" to the summary() call, but the output still includes all columns.
I get:
Text Types Tokens Sentences author datetimestamp description
MoKa_BA_LG_16.pdf 1194 8620 283 Pressestelle 2016-07-27 13:01:04
MoKa_BBK_DO_18.pdf 810 2643 56 spalgen 2018-07-03 09:00:13 <NA>
MoKa_BBK_DUE_18.pdf 1327 6219 97 Suttkus 2018-01-24 14:44:37 <NA>
I want:
Text Types Tokens Sentences
MoKa_BA_LG_16.pdf 1194 8620 283
MoKa_BBK_DO_18.pdf 810 2643 56
MoKa_BBK_DUE_18.pdf 1327 6219 97
Do I have to extract the columns from the corpus before summarizing, or can this be done with a quanteda command?
Answer 0: (score: 1)
The summary.corpus() method invisibly returns the data.frame that it prints. So if you only want the text summary columns, subset it as follows:
library("quanteda")
## Package version: 1.4.3
## Parallel computing: 2 of 12 threads used.
## See https://quanteda.io for tutorials and examples.
summary(data_corpus_irishbudget2010)[, c("Text", "Types", "Tokens", "Sentences")]
## Text Types Tokens Sentences
## 1 Lenihan, Brian (FF) 1953 8641 374
## 2 Bruton, Richard (FG) 1040 4446 217
## 3 Burton, Joan (LAB) 1624 6393 307
## 4 Morgan, Arthur (SF) 1595 7107 343
## 5 Cowen, Brian (FF) 1629 6599 250
## 6 Kenny, Enda (FG) 1148 4232 153
## 7 ODonnell, Kieran (FG) 678 2297 133
## 8 Gilmore, Eamon (LAB) 1181 4177 201
## 9 Higgins, Michael (LAB) 488 1286 44
## 10 Quinn, Ruairi (LAB) 439 1284 59
## 11 Gormley, John (Green) 401 1030 49
## 12 Ryan, Eamon (Green) 510 1643 90
## 13 Cuffe, Ciaran (Green) 442 1240 45
## 14 OCaolain, Caoimhghin (SF) 1188 4044 176
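The same column subsetting should carry over to the corpus from the question. The following is a minimal sketch rather than a tested solution for corp_mk: keep_text_stats() is a hypothetical helper (not part of quanteda), and the data.frame s is a hand-built stand-in for summary(corp_mk), copied from the output shown in the question so that the example runs without the PDF files.

```r
# Hypothetical helper (not a quanteda function): keep only the text
# statistics columns of a summary data.frame and drop any docvars columns.
keep_text_stats <- function(s, cols = c("Text", "Types", "Tokens", "Sentences")) {
  # intersect() keeps the order of `cols` and ignores columns that are absent
  s[, intersect(cols, names(s)), drop = FALSE]
}

# Stand-in for summary(corp_mk), rebuilt by hand from the question's output
# (only the "author" docvar is reproduced here).
s <- data.frame(
  Text = c("MoKa_BA_LG_16.pdf", "MoKa_BBK_DO_18.pdf", "MoKa_BBK_DUE_18.pdf"),
  Types = c(1194, 810, 1327),
  Tokens = c(8620, 2643, 6219),
  Sentences = c(283, 56, 97),
  author = c("Pressestelle", "spalgen", "Suttkus"),
  stringsAsFactors = FALSE
)

stats <- keep_text_stats(s)
print(stats)

# With the real corpus, assuming quanteda is loaded and corp_mk exists:
# keep_text_stats(summary(corp_mk, n = ndoc(corp_mk)))
```

Note that summary() on a corpus only covers the first 100 documents by default (n = 100), so for a larger collection of PDFs it may be worth passing n = ndoc(corp_mk) as in the commented line.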