Segmenting a corpus in quanteda

Time: 2018-03-27 05:22:07

Tags: r quanteda

I have a single text file that contains many speeches. The file has two variables, one for the speech_id and another for the text of the speech, separated by a pipe (|). I am trying to break the text into smaller documents using the corpus_segment function in quanteda.

The .txt file looks like this:

[sample file contents missing]

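I have tried various iterations but cannot seem to get it to work. I also tried reading the file in with the readtext function from the readtext package, with no luck. Any help is greatly appreciated.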

1 answer:

Answer 0: (score: 0)

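corpus_segment() should work just fine. (This is based on quanteda >= 1.0.0.) Here I have assumed that every speech ID is 10 digits followed by a | character. Note that readtext will read this .txt file without trouble, but the result should be a single "document" consisting of one line.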

library("quanteda")

txt <- "Speech_id|speech1140000001|This is the first speech.1140000002|The second 
speech starts here.1140000003|This is the third speech.1140000004|The fourth 
speaker says this."

corp <- corpus(txt)

corpseg <- corpus_segment(corp, pattern = "\\d{10}\\|", valuetype = "regex")
texts(corpseg)
##                     text1.1                            text1.2 
## "This is the first speech." "The second \nspeech starts here." 
##                     text1.3                            text1.4 
## "This is the third speech."  "The fourth \nspeaker says this." 

That does it, but we can tidy things up a bit further by moving the extracted pattern into the document names.

# move the tag to docname after removing "|"
docnames(corpseg) <- 
    stringi::stri_replace_all_fixed(docvars(corpseg, "pattern"), "|", "")
# remove the pattern as a docvar
docvars(corpseg, "pattern") <- NULL

summary(corpseg)
## Corpus consisting of 4 documents:
##     
##       Text Types Tokens Sentences
## 1140000001     6      6         1
## 1140000002     6      6         1
## 1140000003     6      6         1
## 1140000004     6      6         1
## 
## Source: /Users/kbenoit/Dropbox (Personal)/tmp/ascharacter/* on x86_64 by kbenoit
## Created: Tue Mar 27 07:41:05 2018
## Notes: corpus_segment.corpus(corp, pattern = "\\d{10}\\|", valuetype = "regex")
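
To make it clearer what the "\\d{10}\\|" pattern is matching, here is a minimal base R sketch of the same splitting idea without quanteda. It is not how corpus_segment() works internally, only an illustration under the same assumption as above (each speech is preceded by a 10-digit ID and a pipe). The names tag_regex, tags, pieces and speeches are made up for this example, and the sample string is a shortened version of the one used above.

```r
# Shortened sample: a header followed by three ID-tagged speeches on one line
txt <- "Speech_id|speech1140000001|This is the first speech.1140000002|The second speech starts here.1140000003|This is the third speech."

# Assumed tag format: exactly 10 digits followed by a literal pipe
tag_regex <- "\\d{10}\\|"

# Extract every tag that matches the pattern
tags <- regmatches(txt, gregexpr(tag_regex, txt, perl = TRUE))[[1]]

# Split the string on the tags; the first piece is the header before the first ID
pieces <- strsplit(txt, tag_regex, perl = TRUE)[[1]]
speeches <- trimws(pieces[-1])

# Name each speech by its ID, dropping the trailing pipe
names(speeches) <- sub("|", "", tags, fixed = TRUE)

print(speeches)
```

The result is a named character vector, so if a corpus is wanted afterwards it should be possible to pass it to corpus(), which takes document names from the vector names.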