I am working with several large text documents, so I would like to use a function to process each of them. A sample function looks like this:
library(stringr)
library(stringi)
library(quanteda)
txtOverview <- function(x) {
  txt_file <- x
  txt_size <- object.size(txt_file) # Getting text size
  txt_file_ascii <- stri_trans_general(txt_file, "latin-ascii")
  txt_stats <- stri_stats_general(txt_file_ascii)
  txt_words <- stringr::str_count(txt_file_ascii, "\\S+")
  txt_words <- sum(txt_words)
  txt_corpus <- corpus(txt_file_ascii)
  txt_corpus <- corpus_reshape(txt_corpus)
  corpus_list <- c("txt_size" = txt_size, "lines" = txt_stats[1], "words" = txt_words,
                   "characters" = txt_stats[3], "corpus" = txt_corpus)
  return(corpus_list)
}
Then I call the function:
test_txt1<-c("here is an example", "I am trying to learn corpus", "Why would a function return an integer?")
test_txt2<-c("sky is blue", "orange is orange", "why do chickens want to cross the road?")
test_txt3<-c("lunch was ok", "carbon hydrates taste good", "Why would lasagna have six layers?")
x<-txtOverview(test_txt1)
y<-txtOverview(test_txt2)
z<-txtOverview(test_txt3)
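Next, I want to extract the corpus from each list and combine them using quanteda. Inspecting x, I found that x[5], or x$corpus.documents, is the text part. However, there are also "corpus.metadata", "corpus.settings", and "corpus.tokens". Trying the following statement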
w<-x[5]+y[5]+z[5]
returns the error:
non-numeric argument to binary operator
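A possible explanation, sketched with base R only: the names "corpus.documents", "corpus.metadata" and so on suggest that c() flattened a list-like corpus object into the result, so x[5] is a one-element list rather than a corpus, and `+` is not defined for plain lists. The sketch below assumes the corpus was stored internally as a list in the quanteda version used here; `fake_corpus` is a hypothetical stand-in, not a real quanteda object.

```r
# Hypothetical stand-in for a list-based corpus object (an assumption,
# NOT a real quanteda corpus).
fake_corpus <- list(
  documents = c("sky is blue", "orange is orange"),
  metadata  = "example",
  settings  = "defaults",
  tokens    = "none"
)

# c() splices the elements of a list argument into the result,
# prefixing each name with the argument name.
flat <- c(txt_size = 120, corpus = fake_corpus)
print(names(flat))

# list() keeps the object intact as a single element.
kept <- list(txt_size = 120, corpus = fake_corpus)
print(names(kept))

# [[ ]] returns the element itself; [ ] returns a one-element list.
whole <- kept[["corpus"]]

# Adding two one-element lists fails, as in the question.
err <- tryCatch(flat[2] + flat[2], error = function(e) conditionMessage(e))
print(err)
```

If the function built its result with list() instead of c(), x[["corpus"]] would presumably return the whole corpus. quanteda documents `+` and c() methods for combining corpus objects, though whether the document names must first be made unique may depend on the installed version; that part is untested here.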