我有两个数据帧。对于第一个数据帧(df)中的每一行,第二个数据帧(设计)中有三个对应的行。代码要做的是将df中的每一行都与设计中的三个对应行匹配,然后将三个对应的行以及其他一些所需变量附加到新的数据框中。
到目前为止,我的代码是:
df1 <- NULL
for(i in 1:nrow(df)){
x <- c(design[which(df$version[i] == design$version & df$task[i] == design$task) , ])
for(j in seq_along(x[[3]])){
set <- NULL
set <- cbind(t1 = 0,
t2 = 0,
t3 = 0,
t4 = 0,
resp_id = df$resp_id[i],
block = df$version[i],
task = x$task[j],
concept = x$concept[j],
brand = x$brand[j],
branch_type = x$type_of_branch[j],
branch_prox = x$branch_prox[j],
atm_prox = x$atm_prox[j],
atm_location_phy = x$atm_location_phys[j],
atm_fees = x$atm_fees[j],
service = x$service[j],
monthly_charge = x$monthly_charge[j],
checking_w_interest = x$checking_w_interest[j],
overdraft_prot = x$overdraft_prot[j],
benefits = x$benefits[j],
none = 0,
pick = ifelse(df$dc1[i] == x[[3]][[j]], 1, 0))
df1 <- data.frame(rbind(df1, set))
}
}
您可能只是看一下就能知道,这段代码非常慢,我需要大大降低这段代码的运行速度。
在第一个数据帧中有55000多个观察值,因此我一直在使用数据表(为了提高速度),并尝试使用lapply遍历列表x的每个元素(x长3个列表,每个列表中有15个元素)。我为此的代码是:
df1 <- data.table(t1 = numeric(),
t2 = numeric(),
t3 = numeric(),
t4 = numeric(),
resp_id = numeric(),
block = numeric(),
task = numeric(),
concept = numeric(),
brand = numeric(),
branch_type = numeric(),
branch_prox = numeric(),
atm_prox = numeric(),
atm_location_phys = numeric(),
atm_location_digi = numeric(),
atm_fees = numeric(),
service = numeric(),
monthly_charge = numeric(),
checking_w_interest = numeric(),
overdraft_prot = numeric(),
benefits = numeric(),
none = numeric(),
pick = numeric())
df2 <- data.table()
for(i in 1:nrow(df)){
set <- NULL
x <- data.table(design[which(df$version[i] == design$version & df$task[i] == design$task) , ])
set <- list(x[1], x[2], x[3])
df1 <- data.table(do.call(rbind, lapply(seq_along(1:3), function(y){
set.temp <- list(t1 = 0,
t2 = 0,
t3 = 0,
t4 = 0,
resp_id = df$resp_id[i],
block = df$version[i],
task = set[[y]]$task,
concept = set[[y]]$concept,
brand = set[[y]]$brand,
branch_type = set[[y]]$type_of_branch,
branch_prox = set[[y]] $branch_prox,
atm_prox = set[[y]]$atm_prox,
atm_location_phys = set[[y]]$atm_location_phys,
atm_location_digi = set[[y]]$atm_location_digi,
atm_fees = set[[y]]$atm_fees,
service = set[[y]]$service,
monthly_charge = set[[y]]$monthly_charge,
checking_w_interest = set[[y]]$checking_w_interest,
overdraft_prot = set[[y]]$overdraft_prot,
benefits = set[[y]]$benefits,
none = 0,
pick = ifelse(df$dc1[i] == set[[y]]$concept, 1, 0)) })))
df2 <- rbind(df2, df1)
}
第一组代码花费了一个多小时的时间才能运行。第二个代码块仍在运行,但可能需要大约45分钟。
如果您可以权衡一下,并提供一些有关我可以在哪里加速代码的指针,我将不胜感激。
答案 0 :(得分:0)
下面的代码怎么样?
rbindlist(lapply(1:nrow(df), function(i) {
x <- setDT(design[which(df$version[i] == design$version & df$task[i] == design$task), ])
set <- list(x[1], x[2], x[3])
df1 <- rbindlist(lapply(seq_along(1:3), function(y){
data.table(
t1 = 0,
t2 = 0,
t3 = 0,
t4 = 0,
resp_id = df$resp_id[i],
block = df$version[i],
task = set[[y]]$task,
concept = set[[y]]$concept,
brand = set[[y]]$brand,
branch_type = set[[y]]$type_of_branch,
branch_prox = set[[y]] $branch_prox,
atm_prox = set[[y]]$atm_prox,
atm_location_phys = set[[y]]$atm_location_phys,
atm_location_digi = set[[y]]$atm_location_digi,
atm_fees = set[[y]]$atm_fees,
service = set[[y]]$service,
monthly_charge = set[[y]]$monthly_charge,
checking_w_interest = set[[y]]$checking_w_interest,
overdraft_prot = set[[y]]$overdraft_prot,
benefits = set[[y]]$benefits,
none = 0,
pick = ifelse(df$dc1[i] == set[[y]]$concept, 1, 0)
)
}))
return(df1)
}))
rbindlist减少了处理时间。