优化R代码以将行绑定到数据帧

时间:2019-12-18 23:57:47

标签: r data.table

我有两个数据帧。对于第一个数据帧(df)中的每一行,第二个数据帧(设计)中有三个对应的行。代码要做的是将df中的每一行都与设计中的三个对应行匹配,然后将三个对应的行以及其他一些所需变量附加到新的数据框中。

到目前为止,我的代码是:

df1 <- NULL
for(i in 1:nrow(df)){
  x <- c(design[which(df$version[i] == design$version & df$task[i] == design$task) , ])
  for(j in seq_along(x[[3]])){
    set <- NULL
    set <- cbind(t1 = 0,
                 t2 = 0,
                 t3 = 0,
                 t4 = 0,
                 resp_id = df$resp_id[i],
                 block = df$version[i],
                 task = x$task[j],
                 concept = x$concept[j],
                 brand = x$brand[j],
                 branch_type = x$type_of_branch[j], 
                 branch_prox = x$branch_prox[j], 
                 atm_prox = x$atm_prox[j],
                 atm_location_phy = x$atm_location_phys[j],
                 atm_fees = x$atm_fees[j],
                 service = x$service[j],
                 monthly_charge = x$monthly_charge[j],
                 checking_w_interest = x$checking_w_interest[j],
                 overdraft_prot = x$overdraft_prot[j],
                 benefits = x$benefits[j],
                 none = 0,
                 pick = ifelse(df$dc1[i] == x[[3]][[j]], 1, 0))
    df1 <- data.frame(rbind(df1, set))
  }
}

您可能只是看一下就能知道,这段代码非常慢,我需要大大降低这段代码的运行速度。

在第一个数据帧中有55000多个观察值,因此我一直在使用数据表(为了提高速度),并尝试使用lapply遍历列表x的每个元素(x长3个列表,每个列表中有15个元素)。我为此的代码是:

df1 <- data.table(t1 = numeric(),
                  t2 = numeric(),
                  t3 = numeric(),
                  t4 = numeric(),
                  resp_id = numeric(),
                  block = numeric(),
                  task = numeric(),
                  concept = numeric(),
                  brand = numeric(),
                  branch_type = numeric(),
                  branch_prox = numeric(),
                  atm_prox = numeric(),
                  atm_location_phys = numeric(),
                  atm_location_digi = numeric(),
                  atm_fees = numeric(),
                  service = numeric(),
                  monthly_charge = numeric(),
                  checking_w_interest = numeric(),
                  overdraft_prot = numeric(),
                  benefits = numeric(),
                  none = numeric(),
                  pick = numeric())
df2 <- data.table()

for(i in 1:nrow(df)){
    set <- NULL
    x <- data.table(design[which(df$version[i] == design$version & df$task[i] == design$task) , ])

    set <- list(x[1], x[2], x[3])
    df1 <- data.table(do.call(rbind, lapply(seq_along(1:3), function(y){
        set.temp <- list(t1 = 0,
                         t2 = 0,
                         t3 = 0,
                         t4 = 0,
                         resp_id = df$resp_id[i],
                         block = df$version[i],
                         task = set[[y]]$task,
                         concept = set[[y]]$concept,
                         brand = set[[y]]$brand,
                         branch_type = set[[y]]$type_of_branch,
                         branch_prox = set[[y]] $branch_prox,
                         atm_prox = set[[y]]$atm_prox,
                         atm_location_phys = set[[y]]$atm_location_phys,
                         atm_location_digi = set[[y]]$atm_location_digi,
                         atm_fees = set[[y]]$atm_fees,
                         service = set[[y]]$service,
                         monthly_charge = set[[y]]$monthly_charge,
                         checking_w_interest = set[[y]]$checking_w_interest,
                         overdraft_prot = set[[y]]$overdraft_prot,
                         benefits = set[[y]]$benefits,
                         none = 0,
                         pick = ifelse(df$dc1[i] == set[[y]]$concept, 1, 0)) })))
    df2 <- rbind(df2, df1)
}

第一组代码花费了一个多小时的时间才能运行。第二个代码块仍在运行,但可能需要大约45分钟。

如果您可以权衡一下,并提供一些有关我可以在哪里加速代码的指针,我将不胜感激。

1 个答案:

答案 0 :(得分:0)

下面的代码怎么样?

rbindlist(lapply(1:nrow(df), function(i) {
  x <- setDT(design[which(df$version[i] == design$version & df$task[i] == design$task), ])
  set <- list(x[1], x[2], x[3])
  df1 <- rbindlist(lapply(seq_along(1:3), function(y){
    data.table(
      t1 = 0,
      t2 = 0,
      t3 = 0,
      t4 = 0,
      resp_id = df$resp_id[i],
      block = df$version[i],
      task = set[[y]]$task,
      concept = set[[y]]$concept,
      brand = set[[y]]$brand,
      branch_type = set[[y]]$type_of_branch,
      branch_prox = set[[y]] $branch_prox,
      atm_prox = set[[y]]$atm_prox,
      atm_location_phys = set[[y]]$atm_location_phys,
      atm_location_digi = set[[y]]$atm_location_digi,
      atm_fees = set[[y]]$atm_fees,
      service = set[[y]]$service,
      monthly_charge = set[[y]]$monthly_charge,
      checking_w_interest = set[[y]]$checking_w_interest,
      overdraft_prot = set[[y]]$overdraft_prot,
      benefits = set[[y]]$benefits,
      none = 0,
      pick = ifelse(df$dc1[i] == set[[y]]$concept, 1, 0)
    )
  }))
  return(df1)
}))

rbindlist减少了处理时间。