I have a data table like this:
library(data.table)
dt=data.table(ID=rep(1:6,each=2),Set=c("a","b"),Value=rnorm(12))
I would like a data.table solution that produces this result:
dt.test=data.table(ID=dt[Set=="a",Value,by=ID][order(ID),ID],Value=dt[Set=="a",Value,by=ID][order(ID),Value]-dt[Set=="b",Value,by=ID][order(ID),Value])
But I don't want to have to subset dt twice just to compute Value - Value.
Any help would be much appreciated.
Answer 0 (score: 2)
You can do it like this:
# reshape the data table
df.test <- dcast(data = dt, ID ~ Set, value.var = 'Value')
# create new column
df.test <- df.test[, Values := a - b][,.(ID, Values)]
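To make the two steps of the reshape approach easier to follow, here is a minimal sketch on a small deterministic table. The name `dt_demo` and its values are made up for illustration and are not from the question. The sketch assumes every ID has exactly one "a" row and one "b" row.

```r
library(data.table)

# Hypothetical demo data: 3 IDs, one "a" row and one "b" row each
dt_demo <- data.table(ID = rep(1:3, each = 2),
                      Set = rep(c("a", "b"), times = 3),
                      Value = c(5, 2, 7, 10, 1, 1))

# Step 1: reshape to wide format, one row per ID and one column per Set level
wide <- dcast(dt_demo, ID ~ Set, value.var = "Value")

# Step 2: subtract the two columns and keep only ID and the difference
res_dcast <- wide[, .(ID, Values = a - b)]
print(res_dcast)
```

Because the subtraction refers to the columns `a` and `b` by name, this version does not depend on the row order inside each ID.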
Answer 1 (score: 2)
The fastest solution so far:
dt[ dt[, .I[1], ID ]$V1][, `:=`(Set = NULL, Value = Value - dt[dt[, .I[2], ID ]$V1]$Value)][]
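The one-liner above is dense, so here is a rough step-by-step sketch of the same idea. It relies on an assumption that the one-liner also makes implicitly: each ID has exactly two rows, with the "a" row before the "b" row. The names `dt_demo`, `first_rows`, `second_rows` and `res_idx` are illustrative only.

```r
library(data.table)

# Hypothetical demo data: 3 IDs, "a" row first and "b" row second in each
dt_demo <- data.table(ID = rep(1:3, each = 2),
                      Set = rep(c("a", "b"), times = 3),
                      Value = c(5, 2, 7, 10, 1, 1))

# .I holds row numbers of the full table; within each group,
# .I[1] is the first row of that ID and .I[2] is the second
first_rows  <- dt_demo[, .I[1], by = ID]$V1
second_rows <- dt_demo[, .I[2], by = ID]$V1

# Subtract the second-row values from the first-row values, ID by ID
res_idx <- data.table(ID = dt_demo$ID[first_rows],
                      Value = dt_demo$Value[first_rows] - dt_demo$Value[second_rows])
print(res_idx)
```

If the rows were not ordered "a" then "b" within each ID, sorting first (for example with `setorder(dt_demo, ID, Set)`) would be needed for this positional approach to give the intended sign.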
Benchmark
microbenchmark::microbenchmark(
ricardo.solution = data.table(ID=dt[Set=="a",Value,by=ID][order(ID),ID],Value=dt[Set=="a",Value,by=ID][order(ID),Value]-dt[Set=="b",Value,by=ID][order(ID),Value]),
YOLO.solution = {df.test <- dcast(data = dt, ID ~ Set, value.var = 'Value')
df.test <- df.test[, Values := a - b][,.(ID, Values)]},
this.answer = dt[ dt[, .I[1], ID ]$V1][, `:=`(Set = NULL, Value = Value - dt[dt[, .I[2], ID ]$V1]$Value)][]
)
# Unit: milliseconds
# expr min lq mean median uq max neval
# ricardo.solution 5.197409 5.338703 6.024780 5.526993 5.748874 45.018507 100
# YOLO.solution 2.617082 2.719060 3.006002 2.784585 2.948021 17.891175 100
# this.answer 1.432812 1.479057 1.525045 1.502857 1.539461 1.896759 100
Answer 2 (score: 1)
Use shift to work with values of the same column in consecutive rows:
dt[, .(Value = Value - shift(Value, type = "lead")), by = ID][!is.na(Value), ]
which yields
ID Value
1: 1 0.7455129
2: 2 -0.1529171
3: 3 -1.0823840
4: 4 1.5533354
5: 5 -1.2234450
6: 6 -0.5912473
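To show what `shift(Value, type = "lead")` contributes, the sketch below keeps the intermediate column visible before subtracting. The demo table and the column name `Next` are made up for illustration. As above, it assumes two rows per ID with the "a" row first.

```r
library(data.table)

# Hypothetical demo data: 3 IDs, "a" row first and "b" row second in each
dt_demo <- data.table(ID = rep(1:3, each = 2),
                      Set = rep(c("a", "b"), times = 3),
                      Value = c(5, 2, 7, 10, 1, 1))

# Within each ID, Next is the Value of the following row;
# the last row of every group has no following row, so Next is NA there
shifted <- dt_demo[, .(Value, Next = shift(Value, type = "lead")), by = ID]

# Keep only rows that have a successor and compute the difference
res_shift <- shifted[!is.na(Next), .(ID, Value = Value - Next)]
print(res_shift)
```

Filtering on `Next` rather than on the final difference means a genuinely missing Value in the data would not be confused with the padding NA that `shift` introduces.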
Benchmark (thanks to @Wimpel for providing the code base!):
microbenchmark::microbenchmark(
ricardo.solution = data.table(ID=dt[Set=="a",Value,by=ID][order(ID),ID],Value=dt[Set=="a",Value,by=ID][order(ID),Value]-dt[Set=="b",Value,by=ID][order(ID),Value]),
YOLO.solution = {df.test <- dcast(data = dt, ID ~ Set, value.var = 'Value')
df.test <- df.test[, Values := a - b][,.(ID, Values)]},
this.answer = dt[ dt[, .I[1], ID ]$V1][, `:=`(Set = NULL, Value = Value - dt[dt[, .I[2], ID ]$V1]$Value)][],
shift = dt[, .(Value = Value - shift(Value, type = "lead")), by = ID][!is.na(Value), ]
)
which shows
Unit: microseconds
expr min lq mean median uq max neval
ricardo.solution 4580.061 4992.562 5328.8262 5285.678 5473.8425 7620.749 100
YOLO.solution 2272.906 2493.514 2739.5670 2635.498 2779.4650 4996.913 100
this.answer 1343.026 1446.103 1546.7915 1517.141 1630.6945 2374.776 100
shift 559.784 601.365 666.5094 629.343 665.7745 2201.881 100
Note that benchmarks on such a small amount of input data are not very representative...
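One way to get a more representative comparison is to rerun the timings on a larger table. The following is only a sketch: the size `n_ids`, the seed, and the choice of base R's `system.time` instead of microbenchmark are assumptions made here, and no timing results are claimed.

```r
library(data.table)

set.seed(42)      # arbitrary seed, only for reproducibility
n_ids <- 1e5      # assumed size; adjust to your machine

# Same layout as the question's data, just with many more IDs
big <- data.table(ID = rep(seq_len(n_ids), each = 2),
                  Set = rep(c("a", "b"), times = n_ids),
                  Value = rnorm(2 * n_ids))

# Time the reshape approach
t_dcast <- system.time(
  r_dcast <- dcast(big, ID ~ Set, value.var = "Value")[, .(ID, Value = a - b)]
)

# Time the shift approach
t_shift <- system.time(
  r_shift <- big[, .(Value = Value - shift(Value, type = "lead")), by = ID][!is.na(Value)]
)

# Both approaches should agree before their timings are compared
stopifnot(isTRUE(all.equal(r_dcast$Value, r_shift$Value)))

# Elapsed seconds for each approach
print(c(dcast = t_dcast[["elapsed"]], shift = t_shift[["elapsed"]]))
```

A single `system.time` call is noisy, so repeating the run or passing the same expressions to `microbenchmark::microbenchmark` with a small `times` value would give steadier numbers.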