data.table syntax equivalent to sapply?

Asked: 2015-07-31 12:14:21

Tags: r data.table n-gram

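I am trying to analyze the n-grams of a corpus stored in a data.table. I want to count all 1-grams (or 2-, 3-, 4-grams) and store each one together with its count and the rows of the data.table in which it occurs. I managed this with sapply: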

library(data.table)
library(stringi)
library(tau)

smallCorpus<-data.table(id = 1:3,
                        corpus = c("<s> exactly how long do you want a you tube videos to be anyway </s>","<s> google scrapped the early version of its smart glasses in january </s>","<s> exactly how long do you want a you tube videos to be anyway </s> <s> today we are announcing the success of our integration test </s>"),
                        key="id")

genNgramTable<-function(cC,n){
    # count every n-gram in the corpus column
    Count<- textcnt(cC[,corpus],n=n,split=" ",method="string",decreasing=TRUE)
    Ngram<-data.table(gram=names(Count),count=Count,key="gram")
    # for each n-gram, find the rows of cC whose text contains it
    listOfOcc<-sapply(Ngram[,gram],
                      function(gram,corpus){which(stri_detect_fixed(corpus," "%s+%gram%s+%" "))},
                      cC[,corpus])
    # store the matching row numbers in a list column
    Ngram<-Ngram[,Fkey:=listOfOcc]
}

gram1<-genNgramTable(smallCorpus,1L)
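
One caveat about the sapply version, shown as a small base-R sketch: sapply simplifies its result to a vector or matrix whenever every element has the same length, so it only returns a list (which a list column needs) when the occurrence vectors differ in length. The `rows` vector and the `occ` helper are hypothetical stand-ins for the corpus and the detection step; `simplify = FALSE` (or lapply) always keeps a list.

```r
# Hypothetical stand-ins for the corpus rows and the detection step.
rows <- c("a b", "b c", "a c")
occ  <- function(word) which(grepl(word, rows, fixed = TRUE))

# Both words occur in exactly two rows, so sapply collapses to a matrix.
simplified <- sapply(c("a", "b"), occ)

# simplify = FALSE always returns a named list, one element per word.
kept_list <- sapply(c("a", "b"), occ, simplify = FALSE)
```

With the small corpus above the occurrence vectors happen to differ in length, so the question's code does get a list back.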

My question is: can this be done with a data.table call instead (I would expect that to be faster)? I tried:

genNgramTable<-function(cC,n){
    Count<- textcnt(cC[,corpus],n=n,split=" ",method="string",decreasing=TRUE)
    Ngram<-data.table(gram=names(Count),count=Count,key="gram")
    Ngram<-Ngram[,Fkey:=which(stri_detect_fixed(cC[,corpus]," "%s+%gram%s+%" "))]
}

It issues a warning:

Warning message:
In `[.data.table`(Ngram, , `:=`(Fkey, which(stri_detect_fixed(cC[,  :
  Supplied 17 items to be assigned to 33 items of column 'Fkey' (recycled leaving remainder of 16 items).

and gives me only a single number per row in the Fkey column. Moreover, that number is outside the range of my row numbers (1:3).

I would be grateful if someone could explain why.
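
A likely explanation, offered as a sketch rather than a confirmed answer: without `by`, `gram` inside `j` is the entire 33-element column, and stringi recycles the 3-element corpus vector against it. `which()` therefore returns 17 positions in 1:33 instead of row numbers in 1:3, and `:=` recycles those 17 values down the column. Grouping by `gram` and wrapping the result in a nested `list()` gives one integer vector per row. The toy corpus and the `keyby` count (standing in for `tau::textcnt`) are assumptions made to keep the example small.

```r
library(data.table)
library(stringi)

# Toy stand-in for smallCorpus (shortened, hypothetical sentences).
toy <- data.table(id = 1:3,
                  corpus = c("<s> how long do you want </s>",
                             "<s> google scrapped the smart glasses </s>",
                             "<s> how long </s> <s> today we test </s>"),
                  key = "id")

# Plain data.table count of 1-grams; a substitute for tau::textcnt here.
tokens <- unlist(strsplit(toy$corpus, " ", fixed = TRUE))
Ngram  <- data.table(gram = tokens)[, .(count = .N), keyby = gram]

# Without `by`, the full gram column is matched element-wise against the
# recycled corpus vector, so the indices refer to positions in Ngram.
recycled <- which(stri_detect_fixed(toy$corpus, " " %s+% Ngram$gram %s+% " "))

# With `by = gram`, `gram` is a single string inside j; the nested list()
# stores one integer vector of row numbers per n-gram.
Ngram[, Fkey := list(list(which(stri_detect_fixed(toy$corpus,
                                                  " " %s+% gram %s+% " ")))),
      by = gram]
```

`Ngram[gram == "how", Fkey]` then holds the rows of `toy` that contain " how ". Whether this is actually faster than sapply is not something the sketch establishes; it would need to be timed on the real corpus.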

0 answers:

No answers yet