Question

I have a data table like this:

library(data.table)
dt <- data.table(ID = rep(1:6, each = 2), Set = c("a", "b"), Value = rnorm(12))

I would like a data.table solution that produces this result:

dt.test=data.table(ID=dt[Set=="a",Value,by=ID][order(ID),ID],Value=dt[Set=="a",Value,by=ID][order(ID),Value]-dt[Set=="b",Value,by=ID][order(ID),Value])

But I would rather not call dt twice just to compute Value - Value.

Any help would be greatly appreciated.

Answer 1

You can do it like this:

# reshape the data table
df.test <- dcast(data = dt, ID ~ Set, value.var = 'Value')

# create new column
df.test <- df.test[, Values := a - b][,.(ID, Values)]
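To make the reshape step concrete, here is a minimal sketch on a small table with fixed values. The values and the names `wide` and `res` are made up for illustration and are not from the original post; the sketch only assumes that every ID has exactly one "a" row and one "b" row.

```r
library(data.table)

# Hypothetical fixed values (not from the original post) so the result is reproducible
dt <- data.table(ID    = rep(1:3, each = 2),
                 Set   = rep(c("a", "b"), times = 3),
                 Value = c(5, 2, 1, 4, 7, 7))

# Wide form: one row per ID, one column per level of Set (columns a and b)
wide <- dcast(dt, ID ~ Set, value.var = "Value")

# Subtract the two columns and keep only ID and the difference
res <- wide[, .(ID, Values = a - b)]
print(res)
# Values is 3, -3, 0 for IDs 1, 2, 3
```

Because dcast matches rows by the value of Set, this approach does not depend on the order of the "a" and "b" rows within an ID.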

Answer 2

The fastest solution so far:

dt[ dt[, .I[1], ID ]$V1][, `:=`(Set = NULL, Value = Value - dt[dt[, .I[2], ID ]$V1]$Value)][]
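The one-liner above is dense, so the following sketch splits the same idea into steps. It is an illustration under two assumptions: the fixed values are invented, and within each ID the "a" row comes before the "b" row, since `.I[1]` and `.I[2]` select rows by position and not by the value of Set. The names `first_rows`, `second_rows`, and `res` are hypothetical.

```r
library(data.table)

# Hypothetical fixed values (not from the original post)
dt <- data.table(ID    = rep(1:3, each = 2),
                 Set   = rep(c("a", "b"), times = 3),
                 Value = c(5, 2, 1, 4, 7, 7))

# .I holds the row numbers of dt; taking .I[1] per group gives the
# row number of the first row of each ID, .I[2] the second row
first_rows  <- dt[, .I[1], by = ID]$V1   # 1, 3, 5
second_rows <- dt[, .I[2], by = ID]$V1   # 2, 4, 6

# Subtract the second row's Value from the first row's Value per ID
res <- data.table(ID    = dt$ID[first_rows],
                  Value = dt$Value[first_rows] - dt$Value[second_rows])
print(res)
# Value is 3, -3, 0 for IDs 1, 2, 3
```

If the rows are not already ordered that way, sorting first with `setorder(dt, ID, Set)` would be one way to make the positional lookup safe.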

Benchmark

microbenchmark::microbenchmark(
  ricardo.solution = data.table(ID=dt[Set=="a",Value,by=ID][order(ID),ID],Value=dt[Set=="a",Value,by=ID][order(ID),Value]-dt[Set=="b",Value,by=ID][order(ID),Value]),
  YOLO.solution = {df.test <- dcast(data = dt, ID ~ Set, value.var = 'Value')
                   df.test <- df.test[, Values := a - b][,.(ID, Values)]},
  this.answer = dt[ dt[, .I[1], ID ]$V1][, `:=`(Set = NULL, Value = Value - dt[dt[, .I[2], ID ]$V1]$Value)][]
)

# Unit: milliseconds
#             expr      min       lq     mean   median       uq       max neval
# ricardo.solution 5.197409 5.338703 6.024780 5.526993 5.748874 45.018507   100
#    YOLO.solution 2.617082 2.719060 3.006002 2.784585 2.948021 17.891175   100
#      this.answer 1.432812 1.479057 1.525045 1.502857 1.539461  1.896759   100

Answer 3

Use shift to work with the value of the same column in a subsequent row:

dt[, .(Value = Value - shift(Value, type = "lead")), by = ID][!is.na(Value), ]

which yields

   ID      Value
1:  1  0.7455129
2:  2 -0.1529171
3:  3 -1.0823840
4:  4  1.5533354
5:  5 -1.2234450
6:  6 -0.5912473
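To see why the `!is.na(Value)` filter is needed, here is a small sketch of what `shift(..., type = "lead")` returns for a single group. The values are invented for illustration; as with the positional approach, it assumes the "a" row precedes the "b" row within each ID.

```r
library(data.table)

# One group's values: the "a" row first, then the "b" row (hypothetical numbers)
x <- c(5, 2)

# A lead shift moves each value up by one position and pads the end with NA
lead_x <- shift(x, type = "lead")   # 2, NA

# So the first element is a - b and the last element is always NA
diff_x <- x - lead_x                # 3, NA

# Applied per ID, each group yields one real difference and one NA row to drop
dt <- data.table(ID    = rep(1:3, each = 2),
                 Set   = rep(c("a", "b"), times = 3),
                 Value = c(5, 2, 1, 4, 7, 7))
res <- dt[, .(Value = Value - shift(Value, type = "lead")), by = ID][!is.na(Value)]
print(res)
# Value is 3, -3, 0 for IDs 1, 2, 3
```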

Benchmark (thanks to @Wimpel for providing the benchmark code!):

microbenchmark::microbenchmark(
  ricardo.solution = data.table(ID=dt[Set=="a",Value,by=ID][order(ID),ID],Value=dt[Set=="a",Value,by=ID][order(ID),Value]-dt[Set=="b",Value,by=ID][order(ID),Value]),
  YOLO.solution = {df.test <- dcast(data = dt, ID ~ Set, value.var = 'Value')
  df.test <- df.test[, Values := a - b][,.(ID, Values)]},
  this.answer = dt[ dt[, .I[1], ID ]$V1][, `:=`(Set = NULL, Value = Value - dt[dt[, .I[2], ID ]$V1]$Value)][],
  shift = dt[, .(Value = Value - shift(Value, type = "lead")), by = ID][!is.na(Value), ]
)

shows

Unit: microseconds
             expr      min       lq      mean   median        uq      max neval
 ricardo.solution 4580.061 4992.562 5328.8262 5285.678 5473.8425 7620.749   100
    YOLO.solution 2272.906 2493.514 2739.5670 2635.498 2779.4650 4996.913   100
      this.answer 1343.026 1446.103 1546.7915 1517.141 1630.6945 2374.776   100
            shift  559.784  601.365  666.5094  629.343  665.7745 2201.881   100

Note that a benchmark on such a small amount of input data is not very representative...
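One way to follow up on that caveat is to rerun the comparison on a larger table. The sketch below is only a starting point: the size `n`, the table `big`, and the helper names `via_dcast` and `via_shift` are hypothetical, and no timings are claimed here since they depend on the machine and package versions.

```r
library(data.table)
library(microbenchmark)

n <- 1e4  # hypothetical number of IDs; increase to stress the approaches further

# Same layout as the question's table: two rows per ID, "a" before "b"
big <- data.table(ID    = rep(seq_len(n), each = 2),
                  Set   = rep(c("a", "b"), times = n),
                  Value = rnorm(2 * n))

# The dcast approach and the shift approach wrapped as functions
via_dcast <- function(d) {
  dcast(d, ID ~ Set, value.var = "Value")[, .(ID, Value = a - b)]
}
via_shift <- function(d) {
  d[, .(Value = Value - shift(Value, type = "lead")), by = ID][!is.na(Value)]
}

# Check that both approaches agree before timing them
stopifnot(isTRUE(all.equal(via_dcast(big)$Value, via_shift(big)$Value)))

bm <- microbenchmark(dcast = via_dcast(big), shift = via_shift(big), times = 10)
print(bm)
```

The ranking seen on 12 rows may not carry over, because the per-group overhead of `by = ID` grows with the number of groups.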

Apply j to two sets of i and by in data.table
