Why is computing the value for every row faster than checking whether it has already been computed?

Date: 2016-12-07 13:47:40

Tags: r performance dplyr lapply

I am working with a data frame of more than 200,000 rows and several columns. Let's use a sample dummy version, df:

set.seed(1)
timeslot  <- as.integer(abs(runif(10000, min = 1, max = 1000)))
ID        <- LETTERS[abs(as.integer(rnorm(10000, 2)**3)) %% 9 + 1]
variable1 <- as.integer(rnorm(10000, 2)**3)
df <- data.frame(timeslot, ID, variable1)
df <- df[order(df$timeslot, df$ID), ]

I also compute a column, called min1, that checks whether the row's ID also appears somewhere in the previous timeslot:

# for each row: 0 if it is in the first timeslot, otherwise 1/0 depending on
# whether the same ID appears anywhere in the previous timeslot
df$min1 <- sapply(seq(nrow(df)), function(x)
{
  if(df[x, "timeslot"] == 1){0} else {
    max(df[x, "ID"] %in% df[df$timeslot == df[x,"timeslot"] - 1,"ID"])}
})

This all works fine and gives the following head(df) / tail(df):

     timeslot ID variable1 min1
4919        1  A        15    0
2329        1  C        48    0
7359        1  C         1    0
1978        1  E         6    0
2883        1  F         7    0
7448        1  F        21    0
-------------------------------
8462      998  F         1    1
1724      998  H         2    0
989       999  A         7    1
2589      999  D        12    1
3473      999  D         0    1
780       999  I         5    0

I want to perform some calculations on variable1, grouped by unique timeslot + ID. One of those calculations is funfac:

funfac <- function(x, y) {
  total <- 0                            # local accumulator (no global state needed)
  for (i in x) total <- total + i ** y
  (abs(total / length(x))) ** (1 / y)
}
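
Written out, funfac is just the power mean of x with exponent y (the mean of x^y, then the y-th root). A vectorized one-liner doing the same thing (funfac_vec is my name for it, not part of the original code) would be:

# power-mean form of funfac: mean of x^y, then the y-th root (funfac_vec is a hypothetical name)
funfac_vec <- function(x, y) (abs(mean(x ** y))) ** (1 / y)

For example, funfac_vec(c(1, 2, 3), 5) and funfac(c(1, 2, 3), 5) both give 92^(1/5).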

However, now comes the difficult part: for each ID in a given timeslot I want to run the calculation over all rows with the same ID in that timeslot and the previous one. So if there are 3 D's in timeslot '2' and 2 D's in timeslot '1', the calculation should run over all 5 D's. My min1 column helps identify whether the ID is present in the previous timeslot at all; if it is not, the calculation should return NA.
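
To make the grouping concrete, here is a small illustrative sketch (the choice of ID "D" and timeslot 2 is arbitrary) of the rows such a calculation covers:

# illustrative only: all rows for ID "D" in timeslot 2 plus the preceding timeslot 1
x <- df[df$ID == "D" & df$timeslot %in% c(1, 2), "variable1"]
funfac(x, 5)   # only meaningful if "D" actually occurs in timeslot 1; otherwise the result should be NA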

At first I did this with the following code:

lp5 = c()
for (j in 1:nrow(df)){
  if (df[j,"min1"] == 0){lp5 = c(lp5,NA)} else {   # ID absent from previous timeslot -> NA
    total = 0
    # all rows with the same ID in this timeslot and the previous one
    x = df[which((df[,"timeslot"] == df[j,"timeslot"] | df[,"timeslot"] == (df[j,"timeslot"]-1)) & df[,"ID"]==(df[j,"ID"])),"variable1"]
    for (i in x){
      i = (i ** 5);
      total <- total + i
    }
    lp5 = c(lp5,((abs(total/(length(x))))**(1/(5))))
  }
}
# keep one value per unique timeslot + ID combination
tempdf = data.frame(df[,"timeslot"],df[,"ID"], lp5)
lp5 = tempdf[!duplicated(tempdf[,1:2]),][,3]
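
As an aside, the lp5 vector above is grown with c() on every iteration; a pre-allocated variant of the same loop (a sketch, not part of my original code) avoids that repeated copying:

# same logic, but with lp5 pre-allocated instead of grown via c()
lp5 <- rep(NA_real_, nrow(df))
for (j in 1:nrow(df)) {
  if (df[j, "min1"] == 1) {
    x <- df[df$timeslot %in% c(df[j, "timeslot"], df[j, "timeslot"] - 1) &
              df$ID == df[j, "ID"], "variable1"]
    lp5[j] <- (abs(sum(x ** 5) / length(x))) ** (1 / 5)
  }
}

The same deduplication step via tempdf would still follow.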

Thinking that I was doing a lot of duplicate computation, I figured: why not check whether the calculation has already been done? I did this by appending each unique timeslot + ID combination (together with its computed value) to a data frame, and checking each time whether the value was already in that data frame:

lp5DF = data.frame("timeslot" = numeric(0), "ID" = character(0), "lp5" = numeric(0))
for (j in 1:nrow(df)){
  # skip this row if its timeslot + ID combination is already in lp5DF
  if (duplicated(rbind(lp5DF[,1:2],data.frame(timeslot=df[j,"timeslot"], ID=df[j,"ID"])))[nrow(lp5DF)+1]) {next} else{
    if (df[j,"min1"] == 0){lp5DF = rbind(lp5DF, data.frame("timeslot" = df[j,"timeslot"], "ID" = df[j,"ID"], "lp5" = NA))} else {
      total = 0
      x = df[which((df[,"timeslot"] == df[j,"timeslot"] | df[,"timeslot"] == (df[j,"timeslot"]-1)) & df[,"ID"]==(df[j,"ID"])),"variable1"]
      for (i in x){
        (i <- i ** 5);total <- total + i
      }
      lp5DF = rbind(lp5DF, data.frame("timeslot" = df[j,"timeslot"], "ID" = df[j,"ID"], "lp5" = ((abs(total/(length(x))))**(1/5))))
    }
  }
}

The output (head/tail) of lp5DF would be:

  timeslot ID lp5
1        1  A  NA
2        1  B  NA
3        1  C  NA
4        1  D  NA
5        1  E  NA
6        1  F  NA
-------------------------
7738      999  B 14.83423
7739      999  C 14.80149
7740      999  E       NA
7741      999  F 49.48538
7742      999  G 23.05222
7743      999  H       NA

And: lp5DF[,3] == lp5

However, checking like this turns out to be much slower (6.5 times slower in my case). Since I have to run this kind of calculation many times over many rows (and the data frame may well grow later in the project), both of my approaches are too slow. Why is the second one so much slower, and is there a way to speed this up? Maybe something using lapply or the dplyr package?
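
For reference, one way to measure such a comparison is to wrap each loop in a function and time it (approach_rowwise and approach_lookup below are hypothetical names, not in the code above):

# approach_rowwise() and approach_lookup() are assumed to wrap the two loops above
t1 <- system.time(res1 <- approach_rowwise(df))
t2 <- system.time(res2 <- approach_lookup(df))
unname(t2["elapsed"] / t1["elapsed"])   # ratio of elapsed times (about 6.5 in my case)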

1 Answer:

Answer 0 (score: 2)

There is a lot to optimize here. Try learning data-manipulation packages such as dplyr and data.table.

min1 can be computed using the technique from here:

library(dplyr)
dfs <- split(df$ID, df$timeslot)                  # IDs present in each timeslot
df$min1 <- unlist(mapply(`%in%`, dfs, lag(dfs)))  # is each ID also in the previous timeslot?
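
A quick sanity check against the sapply() version from the question (sketch only; min1_loop is a hypothetical copy of the original 0/1 column, saved before it is overwritten):

# min1_loop is assumed to hold the original sapply()-based 0/1 column
all(as.integer(df$min1) == min1_loop)   # expect TRUE if the two definitions agree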

lp5 is a bit trickier, but manageable:

df1 <- df %>%
  group_by(timeslot, ID) %>%
  summarise(min1 = all(min1), s = sum(variable1^5), n = n()) %>%   # per timeslot + ID sums and counts
  group_by(ID) %>%
  mutate(s1 = s + lag(s), n1 = n + lag(n),                         # add the previous timeslot's sums per ID
         lp5 = ifelse(min1, abs((s1/n1)^(1/5)), NA))
lp5 <- df1$lp5
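
Because df is already sorted by timeslot and then ID, the grouped rows of df1 should line up one-to-one with the deduplicated lp5 from the question, so a hedged comparison (lp5_loop being a hypothetical copy of that earlier result) is:

# lp5_loop is assumed to be the deduplicated lp5 produced by the question's first loop
all.equal(df1$lp5, lp5_loop)   # expect TRUE; NAs should fall in the same positions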

The data.table equivalent:

library(data.table)
setDT(df)
dt1 <- df[, .(min1 = all(min1), s = sum(variable1^5), n = .N), by = .(timeslot, ID)]  # per-group sums and counts
dt1[, `:=`(s1 = s + shift(s), n1 = n + shift(n)), by = ID]   # add the previous timeslot's sums per ID
dt1[min1 == TRUE, lp5 := abs((s1/n1)^(1/5)), by = ID]        # lp5 only where the ID existed before
lp5 <- dt1$lp5
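
To compare the two grouped versions on the full data, a minimal benchmarking sketch (the wrapper functions dplyr_version() and datatable_version() are placeholders for the two pipelines above, and the microbenchmark package is assumed to be installed):

library(microbenchmark)

# dplyr_version() and datatable_version() are hypothetical wrappers around the
# two pipelines above, each returning the lp5 vector for a given data frame
microbenchmark(
  dplyr      = dplyr_version(df),
  data.table = datatable_version(df),
  times = 10
)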