Question

I am working with several large text documents, so I would like to use a function to process each of them. An example function looks like this:

library(stringr)
library(stringi)
library(quanteda)

txtOverview<-function(x){

  txt_file<-x
  txt_size<-object.size(txt_file) #Getting text size
  txt_file_ascii<-stri_trans_general(txt_file,"latin-ascii") 
  txt_stats<-stri_stats_general(txt_file_ascii) 
  txt_words<-stringr::str_count(txt_file_ascii,"\\S+")
  txt_words<-sum(txt_words)
  txt_corpus<-corpus(txt_file_ascii) 
  txt_corpus<-corpus_reshape(txt_corpus) 
  corpus_list<-c("txt_size"=txt_size,"lines"=txt_stats[1],"words"=txt_words,"characters"=txt_stats[3],"corpus"=txt_corpus)
  return(corpus_list)
}

I then call the function:

test_txt1<-c("here is an example", "I am trying to learn corpus", "Why would a function return an integer?")
test_txt2<-c("sky is blue", "orange is orange", "why do chickens want to cross the road?")
test_txt3<-c("lunch was ok", "carbon hydrates taste good", "Why would lasagna have six layers?")
x<-txtOverview(test_txt1)
y<-txtOverview(test_txt2)
z<-txtOverview(test_txt3)

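Next, I want to extract the corpus from each returned list and combine the corpora using quanteda. Inspecting x, I found that x[5], or x$corpus.documents, holds the text. However, there are also elements named "corpus.metadata", "corpus.settings" and "corpus.tokens". Trying the following statement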

w<-x[5]+y[5]+z[5]

returns the error

non-numeric argument to binary operator
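
One possible reading of the names above, offered as an assumption rather than a confirmed diagnosis: if the corpus object is list-like, then building the return value with c() splices its components into the outer list and prefixes their names, which would produce elements such as "corpus.documents". Single-bracket indexing like x[5] also returns a one-element list rather than the element itself, and lists cannot be added with +. The sketch below uses only base R and a hypothetical toy_corpus stand-in (not quanteda's real structure) to show the difference between c() and list():

```r
# Hypothetical stand-in for a list-like corpus object; this is NOT
# quanteda's actual internal structure, just an illustration.
toy_corpus <- list(documents = c("sky is blue", "orange is orange"),
                   metadata  = "created today")

# c() splices the inner list into the outer one and prefixes the names.
flat <- c("words" = 6, "corpus" = toy_corpus)
print(names(flat))

# list() keeps the inner object intact as a single named element.
nested <- list("words" = 6, "corpus" = toy_corpus)
print(names(nested))

# With list(), $ or [[ ]] gives back the original object unchanged.
print(identical(nested$corpus, toy_corpus))
```

If this assumption holds for the function above, returning list(...) instead of c(...) and extracting with x$corpus or x[["corpus"]] would hand back the whole corpus object. Whether the resulting corpora can then be combined directly may depend on the installed quanteda version and on the document names being unique.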

R: How can I retrieve a corpus from a list of elements returned by a function?

0 answers: