Efficiently handling values repeated by group with data.table

Posted: 2019-09-27 23:04:17

Tags: r, data.table

What is the preferred way to get a single value from a column (variable) whose values repeat within a group (i.e., every row of the group holds the same value)? Should I use variable[1], or should I include the variable in the by statement and use .BY$variable? Assume I want the return value to contain variable as a column.
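To make the two idioms concrete, here is a minimal sketch on a toy table (my own illustration; the names g, v, and gv are made up):

library(data.table)
dt <- data.table(g = c(1,1,2,2), v = 1:4)
dt[, gv := sum(v), by = g]                  # gv repeats within each group

dt[, list(out = gv[1]), by = g]             # idiom 1: subset the repeated column
dt[, list(out = .BY$gv), by = list(g, gv)]  # idiom 2: read it from .BY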

From the tests below it is pretty clear that adding extra variables to the by statement slows things down, even after netting out the cost of keying on the new variable (or of using a cheat to tell data.table that no additional keying is necessary). Why do additional, already-keyed by variables slow things down?

I suppose I had hoped that including already-keyed by variables would be a convenient syntactic trick for including those variables in the returned data.table without naming them explicitly in the j statement, but that seems inadvisable, since some kind of overhead is associated with the additional variables even when they are already keyed. So my question is: what causes this overhead?

Some example data:

library(data.table)
n <- 1e8
y <- data.table(sample(1:5,n,replace=TRUE),rnorm(n),rnorm(n))
y[,sumV2:=sum(V2),keyby=V1]

Timings show that the approach using variable[1] (here, sumV2[1]) is faster.

x <- copy(y)
system.time(x[, list(out=sum(V3*V2)/sumV2[1],sumV2[1]),keyby=V1])
system.time(x[, list(out=sum(V3*V2)/.BY$sumV2),keyby=list(V1,sumV2)])

I suppose that isn't surprising, since data.table has no way of knowing that the groups defined by setkey(V1) and setkey(V1,sumV2) are in fact identical.

What does surprise me is that even when the data.table is keyed by setkey(V1,sumV2) (and we entirely ignore the time needed to set that new key), using sumV2[1] is still faster. Why is that?

x <- copy(y)
setkey(x,V1,sumV2)
system.time(x[, list(out=sum(V3*V2)/sumV2[1],sumV2[1]),by=V1])
system.time(x[, list(out=sum(V3*V2)/.BY$sumV2),by=list(V1,sumV2)])

Additionally, the time needed for setkey(x,V1,sumV2) is non-negligible. Is there any way to trick data.table into skipping the actual re-keying of x by telling it the key isn't really changing in any meaningful way?

x <- copy(y)
system.time(setkey(x,V1,sumV2))

Answering my own question: it seems we can skip the sort when setting the key simply by assigning the "sorted" attribute. Is that allowed? Will it break things?

x <- copy(y)
system.time({
  setattr(x, "sorted", c("V1","sumV2"))
  x[, list(out=sum(V3*V2)/.BY$sumV2),by=list(V1,sumV2)]
})
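One thing worth noting before relying on this: the cheat can go wrong if the claim is false. Here is a minimal sketch (my own) of what setting "sorted" on data that is not actually sorted would do:

bad <- data.table(V1 = c(2, 1, 2, 1), V2 = 1:4)  # not actually sorted by V1
setattr(bad, "sorted", "V1")                     # falsely claim it is keyed
bad[, list(n = .N), by = V1]
# group boundaries are taken from runs of equal consecutive key values,
# so V1 == 2 and V1 == 1 would each come back as two separate groups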

Whether this is bad practice or likely to break things beyond the caution sketched above, I don't know. But cheating with setattr is considerably faster than keying explicitly:

x <- copy(y)
system.time({
  setkey(x,V1,sumV2)
  x[, list(out=sum(V3*V2)/.BY$sumV2),by=list(V1,sumV2)]
})

But even the setattr trick combined with using sumV2 in the by statement is still not as fast as leaving sumV2 out of the by statement entirely:

x <- copy(y)
system.time(x[, list(out=sum(V3*V2)/sumV2[1],sumV2[1]),keyby=V1])

It seems to me that keying via the attribute and using sumV2 as a length-1 by variable within each group ought to be faster than keying on V1 only and using sumV2[1]. If sumV2 is not specified as a by variable, the entire vector of repeated values in sumV2 has to be generated for each group before being subset down to sumV2[1]. Compare this to the case where sumV2 is a by variable, where only a length-1 vector for sumV2 exists in each group. Obviously my reasoning here is wrong. Can anyone explain why? Why is sumV2[1] the fastest option, even compared to making sumV2 a by variable after the setattr cheat?

As an aside, I was surprised to learn that attr<- is no slower than setattr (both are instantaneous, implying there is no copying at all). This runs contrary to my understanding that base R foo<- functions copy their data.

x <- copy(y)
system.time(setattr(x, "sorted", c("V1","sumV2")))
x <- copy(y)
system.time(attr(x,"sorted") <- c("V1","sumV2"))
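One way to check for copies directly is base R's tracemem(), which prints a message whenever the object is duplicated (a sketch):

x <- copy(y)
tracemem(x)
attr(x, "sorted") <- c("V1","sumV2")  # if nothing is printed, no duplication occurred
untracemem(x)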

sessionInfo() relevant to this question:

data.table version 1.12.2
R version 3.5.3

1 Answer:

Answer (score: 0):

OK, I don't have a great technical answer, but I think I've figured this out conceptually with the help of options(datatable.verbose=TRUE).
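For reference, the verbose output can be enabled globally or per call (a minimal sketch):

options(datatable.verbose = TRUE)   # global: every [.data.table call reports internals
# or for a single call:
# x[, list(s = sum(V2)), by = V1, verbose = TRUE]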

Create the data:

library(data.table)
n <- 1e8

y_unkeyed_5groups <- data.table(sample(1:5,n,replace=TRUE),rnorm(n),rnorm(n))
y_unkeyed_5groups[,sumV2:=sum(V2),keyby=V1]
y_unkeyed_10000groups <- data.table(sample(1:10000,n,replace=TRUE),rnorm(n),rnorm(n))
y_unkeyed_10000groups[,sumV2:=sum(V2),keyby=V1]

The slow run:

x <- copy(y)
system.time({
  setattr(x, "sorted", c("V1","sumV2"))
  x[, list(out=sum(V3*V2)/.BY$sumV2),by=list(V1,sumV2)]
})
# Detected that j uses these columns: V3,V2 
# Finding groups using uniqlist on key ... 1.050s elapsed (1.050s cpu) 
# Finding group sizes from the positions (can be avoided to save RAM) ... 0.000s elapsed (0.000s cpu) 
# lapply optimization is on, j unchanged as 'list(sum(V3 * V2)/.BY$sumV2)'
# GForce is on, left j unchanged
# Old mean optimization is on, left j unchanged.
# Making each group and running j (GForce FALSE) ... 
# memcpy contiguous groups took 0.305s for 6 groups
# eval(j) took 0.254s for 6 calls
# 0.560s elapsed (0.510s cpu) 
# user  system elapsed 
# 1.81    0.09    1.72 

The fast run:

x <- copy(y)
system.time(x[, list(out=sum(V3*V2)/sumV2[1],sumV2[1]),keyby=V1])
# Detected that j uses these columns: V3,V2,sumV2 
# Finding groups using uniqlist on key ... 0.060s elapsed (0.070s cpu) 
# Finding group sizes from the positions (can be avoided to save RAM) ... 0.000s elapsed (0.000s cpu) 
# lapply optimization is on, j unchanged as 'list(sum(V3 * V2)/sumV2[1], sumV2[1])'
# GForce is on, left j unchanged
# Old mean optimization is on, left j unchanged.
# Making each group and running j (GForce FALSE) ... 
# memcpy contiguous groups took 0.328s for 6 groups
# eval(j) took 0.291s for 6 calls
# 0.610s elapsed (0.580s cpu) 
# user  system elapsed 
# 1.08    0.08    0.82 

The finding groups portion accounts for the difference. My guess at what's happening here: setting a key really is just sorting (I should have guessed that from how the attribute is named!) and does nothing to actually define where groups begin and end. So even though data.table knows sumV2 is sorted, it doesn't know that the values are all the same, and it still has to find where the groups within sumV2 begin and end.
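Consistent with this, the key itself is stored as nothing more than a character vector of column names in the "sorted" attribute; no group boundaries are recorded anywhere (a sketch):

x <- copy(y)
setkey(x, V1, sumV2)
key(x)              # "V1" "sumV2"
attr(x, "sorted")   # the same character vector; key() just reads this attribute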

My guess is that it would be technically possible to write data.table so that the key not only sorts but also stores the start/end rows of each group for the keyed variables, but that this could cost a lot of memory on data.tables with many groups.
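To illustrate what storing boundaries might look like, here is a hedged sketch (my own illustration, not anything data.table does internally) that records each group's first and last row position once, using .I:

x <- copy(y)
setkey(x, V1)
bounds <- x[, list(start = .I[1], end = .I[.N]), by = V1]  # one row per group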

Knowing this, the recommendation would seem to be: don't repeat the same by statement over and over; instead, do everything you need in a single by statement. That is probably good advice in general, but less so for small numbers of groups. See the following counterexample:

I rewrote this in what I thought would be the fastest possible way using data.table (a single by statement, making use of GForce):

library(data.table)
n <- 1e8
y_unkeyed_5groups <- data.table(sample(1:5,n, replace=TRUE),rnorm(n),rnorm(n))
y_unkeyed_10000groups <- data.table(sample(1:10000,n, replace=TRUE),rnorm(n),rnorm(n))

x <- copy(y_unkeyed_5groups)
system.time({
  x[, product:=V3*V2]  # precompute the product so both aggregates below are plain sums
  outDT <- x[,list(sumV2=sum(V2),sumProduct=sum(product)),keyby=V1]  # one by statement, GForce-eligible
  outDT[,`:=`(out=sumProduct/sumV2,sumProduct=NULL) ]
  setkey(x,V1)
  x[outDT,sumV2:=sumV2,all=TRUE]  # join the per-group sums back onto x
  x[,product:=NULL]
  outDT
})

# Detected that j uses these columns: V3,V2 
# Assigning to all 100000000 rows
# Direct plonk of unnamed RHS, no copy.
# Detected that j uses these columns: V2,product 
# Finding groups using forderv ... 0.350s elapsed (0.810s cpu) 
# Finding group sizes from the positions (can be avoided to save RAM) ... 0.000s elapsed (0.000s cpu) 
# lapply optimization is on, j unchanged as 'list(sum(V2), sum(product))'
# GForce optimized j to 'list(gsum(V2), gsum(product))'
# Making each group and running j (GForce TRUE) ... 1.610s elapsed (4.550s cpu) 
# Detected that j uses these columns: sumProduct,sumV2 
# Assigning to all 5 rows
# RHS for item 1 has been duplicated because NAMED is 3, but then is being plonked. length(values)==2; length(cols)==2)
# forder took 0.98 sec
# reorder took 3.35 sec
# Starting bmerge ...done in 0.000s elapsed (0.000s cpu) 
# Detected that j uses these columns: sumV2 
# Assigning to 100000000 row subset of 100000000 rows
# Detected that j uses these columns: product 
# Assigning to all 100000000 rows
# user  system elapsed 
# 11.00    1.75    5.33 


x2 <- copy(y_unkeyed_5groups)
system.time({
  x2[,sumV2:=sum(V2),keyby=V1]  # first by statement: attach the group sums
  outDT2 <- x2[, list(sumV2=sumV2[1],out=sum(V3*V2)/sumV2[1]),keyby=V1]  # second by statement
})
# Detected that j uses these columns: V2 
# Finding groups using forderv ... 0.310s elapsed (0.700s cpu) 
# Finding group sizes from the positions (can be avoided to save RAM) ... 0.000s elapsed (0.000s cpu) 
# lapply optimization is on, j unchanged as 'sum(V2)'
# Old mean optimization is on, left j unchanged.
# Making each group and running j (GForce FALSE) ... 
# collecting discontiguous groups took 0.714s for 5 groups
# eval(j) took 0.079s for 5 calls
# 1.210s elapsed (1.160s cpu) 
# setkey() after the := with keyby= ... forder took 1.03 sec
# reorder took 3.21 sec
# 1.600s elapsed (3.700s cpu) 
# Detected that j uses these columns: sumV2,V3,V2 
# Finding groups using uniqlist on key ... 0.070s elapsed (0.070s cpu) 
# Finding group sizes from the positions (can be avoided to save RAM) ... 0.000s elapsed (0.000s cpu) 
# lapply optimization is on, j unchanged as 'list(sumV2[1], sum(V3 * V2)/sumV2[1])'
# GForce is on, left j unchanged
# Old mean optimization is on, left j unchanged.
# Making each group and running j (GForce FALSE) ... 
# memcpy contiguous groups took 0.347s for 5 groups
# eval(j) took 0.265s for 5 calls
# 0.630s elapsed (0.620s cpu) 
# user  system elapsed 
# 6.57    0.98    3.99 

all.equal(x,x2)
# TRUE
all.equal(outDT,outDT2)
# TRUE

Well, it turns out that with only 5 groups, the efficiency gained by not repeating by statements and by using GForce doesn't matter much. But for larger numbers of groups it really does make a difference (although I haven't written this in a way that separates the benefit of using only one by statement without GForce from the benefit of using GForce with multiple by statements):

x <- copy(y_unkeyed_10000groups)
system.time({
  x[, product:=V3*V2]
  outDT <- x[,list(sumV2=sum(V2),sumProduct=sum(product)),keyby=V1]
  outDT[,`:=`(out=sumProduct/sumV2,sumProduct=NULL) ]
  setkey(x,V1)
  x[outDT,sumV2:=sumV2,all=TRUE]
  x[,product:=NULL]
  outDT
})
# 
# Detected that j uses these columns: V3,V2 
# Assigning to all 100000000 rows
# Direct plonk of unnamed RHS, no copy.
# Detected that j uses these columns: V2,product 
# Finding groups using forderv ... 0.740s elapsed (1.220s cpu) 
# Finding group sizes from the positions (can be avoided to save RAM) ... 0.000s elapsed (0.000s cpu) 
# lapply optimization is on, j unchanged as 'list(sum(V2), sum(product))'
# GForce optimized j to 'list(gsum(V2), gsum(product))'
# Making each group and running j (GForce TRUE) ... 0.810s elapsed (2.390s cpu) 
# Detected that j uses these columns: sumProduct,sumV2 
# Assigning to all 10000 rows
# RHS for item 1 has been duplicated because NAMED is 3, but then is being plonked. length(values)==2; length(cols)==2)
# forder took 1.97 sec
# reorder took 11.95 sec
# Starting bmerge ...done in 0.000s elapsed (0.000s cpu) 
# Detected that j uses these columns: sumV2 
# Assigning to 100000000 row subset of 100000000 rows
# Detected that j uses these columns: product 
# Assigning to all 100000000 rows
# user  system elapsed 
# 18.37    2.30    7.31 

x2 <- copy(y_unkeyed_10000groups)
system.time({
  x2[,sumV2:=sum(V2),keyby=V1]
  outDT2 <- x2[, list(sumV2=sumV2[1],out=sum(V3*V2)/sumV2[1]),keyby=V1]
})

# Detected that j uses these columns: V2 
# Finding groups using forderv ... 0.770s elapsed (1.490s cpu) 
# Finding group sizes from the positions (can be avoided to save RAM) ... 0.000s elapsed (0.000s cpu) 
# lapply optimization is on, j unchanged as 'sum(V2)'
# Old mean optimization is on, left j unchanged.
# Making each group and running j (GForce FALSE) ... 
# collecting discontiguous groups took 1.792s for 10000 groups
# eval(j) took 0.111s for 10000 calls
# 3.960s elapsed (3.890s cpu) 
# setkey() after the := with keyby= ... forder took 1.62 sec
# reorder took 13.69 sec
# 4.660s elapsed (14.4s cpu) 
# Detected that j uses these columns: sumV2,V3,V2 
# Finding groups using uniqlist on key ... 0.070s elapsed (0.070s cpu) 
# Finding group sizes from the positions (can be avoided to save RAM) ... 0.000s elapsed (0.000s cpu) 
# lapply optimization is on, j unchanged as 'list(sumV2[1], sum(V3 * V2)/sumV2[1])'
# GForce is on, left j unchanged
# Old mean optimization is on, left j unchanged.
# Making each group and running j (GForce FALSE) ... 
# memcpy contiguous groups took 0.395s for 10000 groups
# eval(j) took 0.284s for 10000 calls
# 0.690s elapsed (0.650s cpu) 
# user  system elapsed 
# 20.49    1.67   10.19 

all.equal(x,x2)
# TRUE
all.equal(outDT,outDT2)
# TRUE

More generally: data.table is blazingly fast, but to extract the fastest, most efficient computation that makes the best use of the underlying C code, you need to pay special attention to data.table's internals. I recently learned about GForce optimization in data.table: when there is a by statement, specific forms of j statement (involving simple functions like mean and sum) are parsed and executed directly in C.
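A quick way to see whether GForce kicks in for a given j is to watch the verbose output (a sketch; the exact message text may differ across versions):

dt <- data.table(g = sample(1:5, 1e6, replace = TRUE), v = rnorm(1e6))
dt[, list(m = mean(v), s = sum(v)), by = g, verbose = TRUE]
# look for a line like: GForce optimized j to 'list(gmean(v), gsum(v))'
dt[, list(r = sum(v)/mean(v)), by = g, verbose = TRUE]
# a compound expression like this is left unoptimized (j unchanged)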