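What is the fastest way to obtain factor frequencies in R?

Asked: 2014-11-11 00:02:48

Tags: r frequency

I am trying to filter out low-frequency factor levels from a dataset. The problem looks like this: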

require(digest)
require(ff)
require(ffbase)

test.vector.ffdf = as.ffdf(as.ff(as.factor(sample(sapply(1:1000, digest), 50000000, replace = TRUE))))

get.frequency=function(i,column){   
  freq = sum(test.vector.ffdf[,column] == i)/length(test.vector.ffdf[,column])
  print(paste0(i,' ',freq))
  freq
}

column = 1
sapply(unique(test.vector.ffdf[,column]),get.frequency, column = column)

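As you can see, this takes a very long time, and I have many columns to process this way, each with thousands of factor levels. Is there a faster way to retrieve the frequencies?

Clarification: in this example, the print() inside the function is only there to show progress, and sapply is used to obtain a vector of frequencies that can then be filtered, i.e. [i where freq < 0.001].

2 answers:

Answer 0: (score: 1)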

require(digest)
require(ff)
require(ffbase)

test.vector.ffdf = as.ffdf(as.ff(as.factor(sample(sapply(1:10, digest), 50000000, replace = TRUE))))
test.vector.ffdf$one <- ff(1L, length = nrow(test.vector.ffdf))
system.time(binned_sum(x = test.vector.ffdf$one, bin = test.vector.ffdf$x))
# user  system elapsed 
# 1.463   0.372   1.835 
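
The call above sums a column of ones within each bin, where the bins are the factor levels, so the result is a per-level count. For a factor small enough to fit in memory, a rough base-R analogue of that idea (a sketch only, not how ffbase implements binned_sum) is to tabulate the factor's integer codes. The toy factor `f` below is an assumption made for illustration.

```r
# Toy in-memory factor standing in for one column (hypothetical data).
f <- factor(rep(c("a", "b", "c"), times = c(3, 2, 1)))

# A factor stores integer codes 1..nlevels(f); tabulate() counts each code.
counts <- tabulate(as.integer(f), nbins = nlevels(f))
names(counts) <- levels(f)

# Relative frequency per level.
freq <- counts / length(f)
print(freq)
```

Dividing the counts by the number of rows gives the relative frequencies the question asks for.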

Answer 1: (score: 0)

I tried several of the approaches from "What is the fastest way to obtain frequencies of integers in a vector?", which of course deal with integers rather than characters:

require(digest)
require(ff)
require(ffbase)

test.vector.ffdf = as.ffdf(as.ff(as.factor(sample(sapply(1:10, digest), 50000000, replace = TRUE))))

get.frequency=function(i,column){   
  freq = sum(test.vector.ffdf[,column] == i)/length(test.vector.ffdf[,column])
  #print(paste0(i,' ',freq))
  freq
}

column = 1

x = test.vector.ffdf[,column]

system.time(table(x))
#   user  system elapsed 
#  3.548   0.000   3.561 

system.time(sapply(unique(test.vector.ffdf[,column]),get.frequency, column = column))
#   user  system elapsed 
# 39.049   5.127  44.322 

system.time({cdf<-cbind(sort(x),seq_along(x)); cdf<-cdf[!duplicated(cdf[,1]),2]; c(cdf[-1],length(x)+1)-cdf})
#   user  system elapsed 
#217.060   2.851 220.865 

Edit: added the solution from the answer above so that it can be compared on the same system:

test.vector.ffdf$one <- ff(1L, length = nrow(test.vector.ffdf))
system.time(binned_sum(x = test.vector.ffdf$one, bin = test.vector.ffdf$x))
#   user  system elapsed 
#  0.731   0.283   1.018

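So of the approaches I tried, table looks like the clear winner, and unlike my solution it is not affected by the number of factor levels. In the timings above, binned_sum from the other answer was faster still on the same system (about 1.0 s versus 3.6 s elapsed).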
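
To connect this back to the original goal of dropping low-frequency levels, here is a minimal sketch using table on a small in-memory factor. The toy data, the 0.001 threshold (taken from the question's clarification), and the variable names are assumptions for illustration; an ff-backed column would need to be handled in chunks or with ffbase instead.

```r
# Toy factor with one deliberately rare level (hypothetical data).
x <- factor(rep(c("a", "b", "c", "rare"), times = c(5000, 3000, 1995, 5)))

# Relative frequency of each level.
freq <- prop.table(table(x))

# Levels whose share of the rows falls below the threshold.
threshold <- 0.001
rare_levels <- names(freq)[as.vector(freq) < threshold]

# Keep only rows with common levels and drop the now-unused levels.
common <- droplevels(x[!x %in% rare_levels])
print(rare_levels)
print(table(common))
```

Because table makes a single pass over the column, this avoids the one-scan-per-level cost of the sapply approach.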