I have a single text file that contains many speeches. The file has two variables, one for the speech_id and one for the text of the speech, separated by a pipe (|). I am trying to use the corpus_segment function from quanteda to break the text into smaller documents.
The .txt file looks like this:
I have tried various iterations but cannot get it to work. I also tried reading the file with the readtext function from the readtext package, but had no luck. Any help is greatly appreciated.
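Answer 0 (score: 0)
corpus_segment() should work fine here. (This is based on quanteda >= 1.0.0.) I am assuming that every speech ID is 10 digits followed by a | character. Note that readtext will read this .txt file, but the whole file will come in as a single "document" in one row.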
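Before handing the pattern to corpus_segment(), it can help to check what the regular expression actually matches. The following is a minimal base R sketch, not part of the original answer; the variable names tags, pieces, and speeches are illustrative, and it assumes the same sample string and the same "10 digits followed by |" rule.

```r
# Sample string in the layout assumed above: a header, then ID|text records.
txt <- "Speech_id|speech1140000001|This is the first speech.1140000002|The second
speech starts here.1140000003|This is the third speech.1140000004|The fourth
speaker says this."

# The tag pattern: exactly ten digits followed by a literal pipe.
tag_pattern <- "\\d{10}\\|"

# Extract every tag the pattern matches.
tags <- regmatches(txt, gregexpr(tag_pattern, txt, perl = TRUE))[[1]]

# Split the string on the same pattern; the first piece is the header
# that precedes the first tag, so it is dropped below.
pieces <- strsplit(txt, tag_pattern, perl = TRUE)[[1]]

# Name each speech by its ID with the trailing "|" removed.
speeches <- setNames(trimws(pieces[-1]), sub("|", "", tags, fixed = TRUE))

length(speeches)   # 4
names(speeches)
```

If the tags or the number of pieces look wrong here, the pattern needs adjusting before it is used for segmentation.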
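library("quanteda")
# sample input: a header, then four speeches, each tagged with a 10-digit ID and "|"
txt <- "Speech_id|speech1140000001|This is the first speech.1140000002|The second
speech starts here.1140000003|This is the third speech.1140000004|The fourth
speaker says this."
corp <- corpus(txt)
# split wherever a 10-digit ID followed by "|" occurs; the matched tag is stored in the "pattern" docvar
corpseg <- corpus_segment(corp, pattern = "\\d{10}\\|", valuetype = "regex")
# texts() is the quanteda 1.x accessor; in quanteda 3 and later, as.character(corpseg) replaces it
texts(corpseg)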
##                            text1.1                            text1.2
##        "This is the first speech." "The second \nspeech starts here."
##                            text1.3                            text1.4
##        "This is the third speech."  "The fourth \nspeaker says this."
That gets us there, but we can tidy it further by moving the extracted pattern into the document names.
# move the tag to docname after removing "|"
docnames(corpseg) <-
  stringi::stri_replace_all_fixed(docvars(corpseg, "pattern"), "|", "")
# remove the pattern as a docvar
docvars(corpseg, "pattern") <- NULL
summary(corpseg)
## Corpus consisting of 4 documents:
##
##        Text Types Tokens Sentences
##  1140000001     6      6         1
##  1140000002     6      6         1
##  1140000003     6      6         1
##  1140000004     6      6         1
##
## Source: /Users/kbenoit/Dropbox (Personal)/tmp/ascharacter/* on x86_64 by kbenoit
## Created: Tue Mar 27 07:41:05 2018
## Notes: corpus_segment.corpus(corp, pattern = "\\d{10}\\|", valuetype = "regex")
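To apply the same approach to a file on disk rather than an inline string, the file first has to end up as one character string. Here is a minimal base R sketch under stated assumptions: the file name speeches.txt is hypothetical, a temporary file stands in for the real data, and the one-record-per-line layout is a guess, since the question's sample file is not shown.

```r
# Hypothetical file; a temporary copy stands in for the real speeches file.
# The layout (header line, then ID|text records) is an assumption.
path <- file.path(tempdir(), "speeches.txt")
writeLines(c("Speech_id|speech",
             "1140000001|This is the first speech.",
             "1140000002|The second",
             "speech starts here."), path)

# Read all lines and join them into a single string, keeping the
# line breaks that fall inside a speech.
lines <- readLines(path)
txt <- paste(lines, collapse = "\n")

length(txt)   # 1: the whole file is now one string
```

The resulting txt can then be passed to corpus() and segmented with corpus_segment() exactly as in the answer above.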