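I am trying to analyse the n-grams of a corpus stored in a data.table. I want to count all 1-grams (or 2-, 3-, 4-grams) and store them together with their counts and the rows of the data.table in which they occur. I managed this with sapply: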
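# library() attaches one package per call, so each package gets its own line
library(data.table)
library(stringi)
library(tau)

smallCorpus <- data.table(id = 1:3,
  corpus = c("<s> exactly how long do you want a you tube videos to be anyway </s>",
             "<s> google scrapped the early version of its smart glasses in january </s>",
             "<s> exactly how long do you want a you tube videos to be anyway </s> <s> today we are announcing the success of our integration test </s>"),
  key = "id")

genNgramTable <- function(cC, n) {
  # count the n-grams over the whole corpus
  Count <- textcnt(cC[, corpus], n = n, split = " ", method = "string", decreasing = TRUE)
  Ngram <- data.table(gram = names(Count), count = Count, key = "gram")
  # for each n-gram, the row numbers of the corpus that contain it;
  # simplify = FALSE keeps the result a list even if every n-gram has one match
  listOfOcc <- sapply(Ngram[, gram],
    function(gram, corpus) { which(stri_detect_fixed(corpus, " " %s+% gram %s+% " ")) },
    corpus = cC[, corpus], simplify = FALSE)
  Ngram[, Fkey := listOfOcc]
  Ngram
}
gram1 <- genNgramTable(smallCorpus, 1L)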
My question is: can this be done with a data.table call instead (I am hoping it would be faster)? I tried:
genNgramTable <- function(cC, n) {
  Count <- textcnt(cC[, corpus], n = n, split = " ", method = "string", decreasing = TRUE)
  Ngram <- data.table(gram = names(Count), count = Count, key = "gram")
  Ngram <- Ngram[, Fkey := which(stri_detect_fixed(cC[, corpus], " " %s+% gram %s+% " "))]
}
It produces this warning:
Warning message:
In `[.data.table`(Ngram, , `:=`(Fkey, which(stri_detect_fixed(cC[, :
Supplied 17 items to be assigned to 33 items of column 'Fkey' (recycled leaving remainder of 16 items).
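It also gives me only a single number in each cell of the Fkey column. Moreover, those numbers fall outside the range of my row numbers (1:3).
I would be grateful if someone could explain why.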
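A possible explanation, sketched on toy data rather than on the corpus above: without `by`, `gram` inside `j` refers to the whole column, and `stri_detect_fixed()` is vectorised over both the strings and the patterns. The documents are therefore recycled and paired element by element with the n-grams, so `which()` returns positions along the n-gram vector, not row numbers of the corpus. That would be consistent with 17 items being supplied for 33 rows. The names `docs`, `grams`, and `rows` are hypothetical, and the `by = gram` form is only one way to get one test per n-gram.

```r
library(data.table)
library(stringi)

# Toy data (hypothetical): 3 documents and 6 candidate 1-grams
docs  <- c(" a d ", " b c ", " c d ")
grams <- data.table(gram = c("a", "b", "c", "d", "e", "f"))

# Without `by`: docs is recycled to length 6 and paired with gram[i],
# so the indices run over 1:6 (the grams), not over 1:3 (the documents)
paired <- which(stri_detect_fixed(docs, " " %s+% grams[, gram] %s+% " "))

# With `by = gram`: `gram` is a single value per group, so each n-gram is
# tested against every document; list(list(...)) stores the matches in a
# list column
grams[, rows := list(list(which(stri_detect_fixed(docs, " " %s+% gram %s+% " ")))),
      by = gram]
```

Here `paired` contains the index 4, which lies outside the three document rows, while `rows` holds, for each n-gram, the documents that contain it (empty for n-grams that never occur).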