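R: speeding up multiple if statements

Time: 2015-04-20 10:39:28

Tags: r switch-statement

I have a data frame (mydata) with 220k rows, and for each row I need to evaluate 8 if statements on one column (BRLABELS). A simple if/else if routine took about 5 minutes, and I just want to speed it up.

I tried using the switch function instead. First I defined it: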

group_label<-function(x){
  switch(x,"15-19"=1,"20-24"=1,"25-29"=2,"30-34"=2,"35-39"=3,"40-44"=3,
         "45-49"=4,"50-54"=4,"55-59"=5,"60-64"=5,"ISCED 0"=6,"ISCED 1"=6,"ISCED 2"=6,"ISCED 3"=7,"ISCED 4"=7,"ISCED 5"=8,"ISCED 6"=8,0)}

Then I used it in a for loop:

for ( i in 1:k){
  x<-mydata$BRLABELS[i]
  mydata$group[i]<-group_label(x)}

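The confusing part is that this approach took about 15 minutes, even though in theory switch is supposed to be well suited to replacing multiple if statements.

Can someone explain why this happens and suggest an efficient alternative?

2 answers:

Answer 0: (score: 4)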

You can copy/paste the mapping from your switch call into a named vector:

new_values <- c("15-19"=1,"20-24"=1,"25-29"=2,"30-34"=2,"35-39"=3,"40-44"=3, "45-49"=4,"50-54"=4,"55-59"=5,"60-64"=5,"ISCED 0"=6,"ISCED 1"=6,"ISCED 2"=6,"ISCED 3"=7,"ISCED 4"=7,"ISCED 5"=8,"ISCED 6"=8,0)

and update the values with:

mydata$BRLABELS <- new_values[mydata$BRLABELS]

I assume BRLABELS is not a factor (otherwise your code would not work).

Update: timing test
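One detail worth checking in the lookup above: the unnamed trailing 0 copied from the switch call does not act as a default in a named-vector lookup, because indexing by name returns NA for any label that is not in the table. The following is a minimal sketch of one way to handle that and the factor case. The shortened table and the toy `labels` vector are assumptions made up for illustration.

```r
# Shortened lookup table (illustrative subset), without the unnamed trailing 0.
new_values <- c("15-19" = 1, "20-24" = 1, "25-29" = 2, "ISCED 0" = 6)

# Hypothetical toy input: a factor with one label that is missing from the table.
labels <- factor(c("15-19", "ISCED 0", "unknown", "25-29"))

# Convert to character first so indexing is by name, not by the factor's integer codes.
group <- unname(new_values[as.character(labels)])

# Labels that are not in the table come back as NA; map them to the default 0.
group[is.na(group)] <- 0

print(group)
```

If BRLABELS is already a character column, the `as.character()` call is harmless, so it can be left in as a safeguard.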

group_label<-function(x){
  switch(x,"15-19"=1,"20-24"=1,"25-29"=2,"30-34"=2,"35-39"=3,"40-44"=3,
         "45-49"=4,"50-54"=4,"55-59"=5,"60-64"=5,"ISCED 0"=6,"ISCED 1"=6,"ISCED 2"=6,"ISCED 3"=7,"ISCED 4"=7,"ISCED 5"=8,"ISCED 6"=8,0)}

new_values <- c("15-19"=1,"20-24"=1,"25-29"=2,"30-34"=2,"35-39"=3,"40-44"=3, "45-49"=4,"50-54"=4,"55-59"=5,"60-64"=5,"ISCED 0"=6,"ISCED 1"=6,"ISCED 2"=6,"ISCED 3"=7,"ISCED 4"=7,"ISCED 5"=8,"ISCED 6"=8,0)

mydata <- 
  data.frame(
    BRLABELS = 
      sample(c("15-19","20-24","25-29","30-34","35-39","40-44",
               "45-49","50-54","55-59","60-64","ISCED 0","ISCED 1","ISCED 2","ISCED 3",
               "ISCED 4","ISCED 5","ISCED 6"), 
             10000, replace = TRUE ), 
    stringsAsFactors = FALSE)


mydata2 <- mydata

library(microbenchmark)

microbenchmark(times = 5,
  for_loop = for ( i in 1:nrow(mydata)){
    x<-mydata$BRLABELS[i]
    mydata$group[i]<-group_label(x)},
  direct = mydata2$group <- new_values[mydata2$BRLABELS]
  )


#     Unit: microseconds
#     expr            min         lq        mean     median         uq        max neval cld
#     for_loop 737247.663 765056.444 781973.1502 769505.576 814000.738 824055.330     5   b
#     direct      325.432    326.715    375.2092    344.249    387.012    492.638     5  a 

Answer 1: (score: 1)

In the end I used the recode function from the "car" package, which James mentioned.

library(car)  # provides recode()
mydata$BRLABELS<-recode(mydata$BRLABELS,"c('15-19','20-24')='15-24';c('25-29','30-34')='25-34';c('35-39','40-44')='35-44'; c('45-49','50-54')='45-54';c('55-59','60-64')='55-64';c('ISCED 0','ISCED 1','ISCED 2')='ISCED 0-2';c('ISCED 3','ISCED 4')='ISCED 3-4';c('ISCED 5','ISCED 6')='ISCED 5-6'; else ='0'")

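It is more user-friendly than the for/if loop, and the difference in run time is large. Finally, I used the plyr package to sum up the values over the columns I wanted (which was the ultimate goal).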

library(plyr)  # provides ddply() and summarise()
ddply(mydata, ~ GEO + VAR + ANSWER + LABELS + BREAKDOWN + BRLABELS, summarise, VALUE = sum(VALUE))
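For readers without plyr, the same kind of grouped sum can be sketched in base R with `aggregate`. The toy data frame below is made up for illustration and uses only three of the columns named above. It is an assumed equivalent, not the code that was actually run.

```r
# Hypothetical toy data with already-recoded BRLABELS values.
toy <- data.frame(
  GEO = c("AT", "AT", "BE", "BE"),
  BRLABELS = c("15-24", "15-24", "15-24", "25-34"),
  VALUE = c(1, 2, 3, 4),
  stringsAsFactors = FALSE
)

# Sum VALUE within each combination of the grouping columns.
totals <- aggregate(VALUE ~ GEO + BRLABELS, data = toy, FUN = sum)

print(totals)
```

With the full data set, the remaining grouping columns (VAR, ANSWER, LABELS, BREAKDOWN) would be added to the right-hand side of the formula.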

Thanks for the help, everyone.