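I have a data frame (mydata) with 220k rows, and for each row I need to run 8 if statements against one column (BRLABELS). A simple if/else if approach took about 5 minutes, and I just want to speed it up.
I tried using the switch function instead. First I defined it: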
group_label<-function(x){
switch(x,"15-19"=1,"20-24"=1,"25-29"=2,"30-34"=2,"35-39"=3,"40-44"=3,
"45-49"=4,"50-54"=4,"55-59"=5,"60-64"=5,"ISCED 0"=6,"ISCED 1"=6,"ISCED 2"=6,"ISCED 3"=7,"ISCED 4"=7,"ISCED 5"=8,"ISCED 6"=8,0)}
and then used it in a for loop:
for ( i in 1:k){
x<-mydata$BRLABELS[i]
mydata$group[i]<-group_label(x)}
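The confusing part is that this approach took about 15 minutes, even though in theory switch is the better fit for multiple if statements.
Can someone explain why this happens and suggest an efficient alternative?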
Answer 0: (score: 4)
You can copy/paste the arguments from your switch call into a named vector:
new_values <- c("15-19"=1,"20-24"=1,"25-29"=2,"30-34"=2,"35-39"=3,"40-44"=3, "45-49"=4,"50-54"=4,"55-59"=5,"60-64"=5,"ISCED 0"=6,"ISCED 1"=6,"ISCED 2"=6,"ISCED 3"=7,"ISCED 4"=7,"ISCED 5"=8,"ISCED 6"=8,0)
and update the values with:
mydata$BRLABELS <- new_values[mydata$BRLABELS]
I am assuming BRLABELS is not a factor (otherwise your code wouldn't work).
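If BRLABELS could be a factor, or could contain labels that are not in the lookup, the indexing above needs two small guards. The following is a minimal sketch under those assumptions, using a shortened table and a made-up `labels` vector (both hypothetical, not from the question). Indexing with a factor goes by its integer codes rather than its labels, so it is converted with `as.character()` first. Note also that the trailing unnamed `0` in `new_values` does not act as a default the way it does in `switch`: a name that is not found simply returns `NA`, so the default has to be filled in explicitly.

```r
# Lookup table: label -> group (a shortened stand-in for new_values).
lookup <- c("15-19" = 1, "20-24" = 1, "25-29" = 2, "ISCED 0" = 6)

# Hypothetical input stored as a factor, including one label not in the table.
labels <- factor(c("20-24", "ISCED 0", "unknown", "15-19"))

# as.character() makes the lookup go by label instead of by factor code;
# unname() drops the names carried over from the lookup vector.
group <- unname(lookup[as.character(labels)])

# Labels missing from the table come back as NA; map them to 0,
# mirroring the default branch of the switch() version.
group[is.na(group)] <- 0

print(group)
```

The whole column is translated in one vectorized indexing step, which is the reason this approach avoids the per-row cost of the loop.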
group_label<-function(x){
switch(x,"15-19"=1,"20-24"=1,"25-29"=2,"30-34"=2,"35-39"=3,"40-44"=3,
"45-49"=4,"50-54"=4,"55-59"=5,"60-64"=5,"ISCED 0"=6,"ISCED 1"=6,"ISCED 2"=6,"ISCED 3"=7,"ISCED 4"=7,"ISCED 5"=8,"ISCED 6"=8,0)}
new_values <- c("15-19"=1,"20-24"=1,"25-29"=2,"30-34"=2,"35-39"=3,"40-44"=3, "45-49"=4,"50-54"=4,"55-59"=5,"60-64"=5,"ISCED 0"=6,"ISCED 1"=6,"ISCED 2"=6,"ISCED 3"=7,"ISCED 4"=7,"ISCED 5"=8,"ISCED 6"=8,0)
mydata <-
data.frame(
BRLABELS =
sample(c("15-19","20-24","25-29","30-34","35-39","40-44",
"45-49","50-54","55-59","60-64","ISCED 0","ISCED 1","ISCED 2","ISCED 3",
"ISCED 4","ISCED 5","ISCED 6"),
10000, replace = TRUE ),
stringsAsFactors = FALSE)
mydata2 <- mydata
library(microbenchmark)
microbenchmark(times = 5,
for_loop = for ( i in 1:nrow(mydata)){
x<-mydata$BRLABELS[i]
mydata$group[i]<-group_label(x)},
direct = mydata2$group <- new_values[mydata2$BRLABELS]
)
# Unit: microseconds
#      expr        min         lq        mean     median         uq        max neval cld
#  for_loop 737247.663 765056.444 781973.1502 769505.576 814000.738 824055.330     5   b
#    direct    325.432    326.715    375.2092    344.249    387.012    492.638     5  a
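As for why the loop is so slow: a likely contributor is not `switch` itself but the assignment `mydata$group[i] <- ...`, which modifies the data frame once per row. A sketch of a middle ground, assuming you want to keep a `switch`-based function, is to collect the results in a plain vector and assign the column once. The table is shortened and the `toy` data frame is hypothetical.

```r
# Shortened version of group_label; the unnamed last argument is the default.
group_label <- function(x) {
  switch(x,
         "15-19" = 1, "20-24" = 1, "25-29" = 2, "30-34" = 2,
         0)
}

# Hypothetical toy data standing in for mydata.
toy <- data.frame(BRLABELS = c("15-19", "30-34", "ISCED 9", "20-24"),
                  stringsAsFactors = FALSE)

# vapply() calls group_label once per element and returns a numeric vector;
# the data frame is then modified a single time instead of once per row.
toy$group <- vapply(toy$BRLABELS, group_label, numeric(1), USE.NAMES = FALSE)

print(toy)
```

This still calls an R function for every element, so the named-vector lookup should remain the faster option. It is only meant to separate the cost of `switch` from the cost of the repeated data frame assignment.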
Answer 1: (score: 1)
In the end I used the recode function from the "car" package, which James mentioned.
mydata$BRLABELS<-recode(mydata$BRLABELS,"c('15-19','20-24')='15-24';c('25-29','30-34')='25-34';c('35-39','40-44')='35-44'; c('45-49','50-54')='45-54';c('55-59','60-64')='55-64';c('ISCED 0','ISCED 1','ISCED 2')='ISCED 0-2';c('ISCED 3','ISCED 4')='ISCED 3-4';c('ISCED 5','ISCED 6')='ISCED 5-6'; else ='0'")
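It is friendlier than the for/if loop, and the difference in running time is large. Finally, I used the plyr package to sum the column I wanted within each group (which was the final goal).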
ddply(mydata,~GEO +VAR +ANSWER +LABELS +BREAKDOWN +BRLABELS ,summarise,VALUE=sum(VALUE))
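For readers without plyr, the same kind of grouped sum can be sketched in base R with `aggregate()`. This is a swapped-in alternative, not the method used above, and the `toy` data frame below is hypothetical: it borrows only three of the column names from the `ddply` call.

```r
# Hypothetical toy data using a subset of the columns from the ddply call.
toy <- data.frame(
  GEO      = c("AT", "AT", "AT", "BE"),
  BRLABELS = c("15-24", "15-24", "25-34", "15-24"),
  VALUE    = c(10, 5, 7, 3),
  stringsAsFactors = FALSE
)

# Sum VALUE within each GEO/BRLABELS combination, one output row per group.
totals <- aggregate(VALUE ~ GEO + BRLABELS, data = toy, FUN = sum)

print(totals)
```

One difference to keep in mind: the formula interface of `aggregate()` drops rows containing `NA` by default, so the results may not match `ddply` exactly on data with missing values.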
Thanks to everyone for the help.