in R

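Time: 2018-01-15 02:54:17

Tags: r data.table

I have two tables, one with time-series data (dat) and another with some reference points (pts), for a bunch of different observations (time.group and well). See the minimal example tables: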

library(data.table)
set.seed(5)
dat = data.table ( time.group = c (rep ("base", 42), rep ("4h", 42)), 
                   well = c (rep ("A1", 20), rep ("B1", 22), rep ("A1", 19), rep ("B1", 23)),
                   frame = c(1:20, 1:22, 1:19, 1:23),
                   signal = runif (84, 0, 1) )

pts = data.table (time.group = c (rep ("base", 2), rep ("4h", 2)),
                  well = rep (c ("A1", "B1"), 2),
                  frame.start = c (3, 4, 3, 6),
                  frame.stop = c (17, 18, 12, 19) )

head (dat)
   time.group well frame    signal
1:       base   A1     1 0.2002145
2:       base   A1     2 0.6852186
3:       base   A1     3 0.9168758
4:       base   A1     4 0.2843995
5:       base   A1     5 0.1046501
6:       base   A1     6 0.7010575

head (pts)
   time.group well frame.start frame.stop
1:       base   A1           3         17
2:       base   B1           4         18
3:         4h   A1           3         12
4:         4h   B1           6         19

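I want to extract, for each time.group and well, the frame at which signal in the dat table is highest, looking only at frames between frame.start and frame.stop from the pts table.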
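To pin down the target result, here is a deliberately slow reference version that scans dat once per row of pts. It is only a sketch for checking faster approaches against; peak_frame and ref are hypothetical names that are not part of the original post, and the toy data is repeated so the block runs on its own.

```r
library(data.table)

# Same toy data as in the question.
set.seed(5)
dat = data.table(time.group = c(rep("base", 42), rep("4h", 42)),
                 well = c(rep("A1", 20), rep("B1", 22), rep("A1", 19), rep("B1", 23)),
                 frame = c(1:20, 1:22, 1:19, 1:23),
                 signal = runif(84, 0, 1))
pts = data.table(time.group = c(rep("base", 2), rep("4h", 2)),
                 well = rep(c("A1", "B1"), 2),
                 frame.start = c(3, 4, 3, 6),
                 frame.stop = c(17, 18, 12, 19))

# Hypothetical helper: for one (time.group, well, window) combination,
# subset dat and return the frame where signal peaks inside the window.
peak_frame = function(tg, w, lo, hi) {
  sub = dat[time.group == tg & well == w & frame >= lo & frame <= hi]
  sub$frame[which.max(sub$signal)]
}

# Apply the helper to every row of pts (work on a copy so pts is untouched).
ref = copy(pts)
ref[, plus := mapply(peak_frame, time.group, well, frame.start, frame.stop,
                     USE.NAMES = FALSE)]
ref[, .(time.group, well, plus)]
```

This does one full filter of dat per reference point, so it will not scale to large data, but it makes a handy ground truth for the join-based versions.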

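What is the most efficient way to do this, given that I have large data sets with a large number of time.groups and wells, as well as several other "signal"-like data columns?

These are the strategies I have come up with so far:

Example 1: This works, but I think it is redundant/slow because it essentially has to group twice: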

dat [pts, .(time.group, well, frame = x.frame, signal), # returns dat's frame column (desired)
   on = .(time.group, well, frame >= frame.start, frame <= frame.stop) # non-equi join, groups once
 ][ ,
    .SD [which.max (signal), .(plus = frame)], # extracting frame at max (signal)
    by = .(time.group, well)] # groups again
>>>>>
   time.group well plus
1:       base   A1    9
2:       base   B1    8
3:         4h   A1   12
4:         4h   B1    8

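Example 2: Here, if I add the i.plus column to the first frame column (minus 1), I get the correct number, but I can't actually do that: it errors out because the join output has two columns named "frame".

Also, it would not work if the frames did not start at 1 in every group: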

dat [pts,
       on = .(time.group, well, frame >= frame.start, frame <= frame.stop), # non-equi join
     .(i.plus = which.max (signal)), # if I add i.plus and the first column frame, -1, it gives what I want, but there are two columns named frame
     by = .EACHI
     ]
>>>>>>
   time.group well frame frame i.plus
1:       base   A1     3    17      7
2:       base   B1     4    18      5
3:         4h   A1     3    12     10
4:         4h   B1     6    19      3
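The offset arithmetic in Example 2 (window start + position - 1) can be checked in isolation. The small sketch below uses made-up vectors, not the data from the question, and suggests that the trick relies on frames being consecutive integers inside the window, whereas looking up the frame by position does not.

```r
# Made-up example: frames with gaps (step of 2), not from the original post.
frames = c(2, 4, 6, 8, 10)
signal = c(0.1, 0.3, 0.9, 0.2, 0.5)
start  = 4
stop   = 10

in_window = frames >= start & frames <= stop
pos = which.max(signal[in_window])    # position inside the window: 2

by_offset = start + pos - 1           # 5, which is not an actual frame
by_lookup = frames[in_window][pos]    # 6, the frame where signal peaks
```

With consecutive frames the two results agree, which is why the numbers in Example 2 can be repaired by hand for the toy data.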

Example 3: This also works and gives the same table as Example 1, but it seems like a lot of code:

tmp = 
dat [pts,
     on = .(time.group, well, frame >= frame.start, frame <= frame.stop),
     .(plus = .I [which.max (signal)] ), # returns row indices from the original data.table (dat)
     by = .EACHI][["plus"]] 

dat [tmp, .(time.group, well, plus = frame)] # extract from original table

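Example 4: And this does not return the original frame column from dat but the columns from pts instead, so I can't access the frame corresponding to max(signal) in dat: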

dat [pts,
       on = .(time.group, well, frame >= frame.start, frame <= frame.stop), # non-equi join
     .SD [which.max (signal) ], # does not return original frame column (x.frame), so I can't extract it
     by = .EACHI
     ]
>>>>>>>>
   time.group well frame frame    signal
1:       base   A1     3    17 0.9565001
2:       base   B1     4    18 0.9659641
3:         4h   A1     3    12 0.9758776
4:         4h   B1     6    19 0.9304595

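I'm not sure whether I should approach this from a completely different angle and try joining pts onto dat instead. I don't know! Any insight into a more elegant way to accomplish this would be much appreciated!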
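One way to read "joining pts onto dat" is an ordinary equi-join on time.group and well, followed by a plain filter on the window. The sketch below is only one possible take, not a benchmarked recommendation; joined and res are hypothetical names, and the toy data is repeated so the block is self-contained.

```r
library(data.table)

# Same toy data as in the question.
set.seed(5)
dat = data.table(time.group = c(rep("base", 42), rep("4h", 42)),
                 well = c(rep("A1", 20), rep("B1", 22), rep("A1", 19), rep("B1", 23)),
                 frame = c(1:20, 1:22, 1:19, 1:23),
                 signal = runif(84, 0, 1))
pts = data.table(time.group = c(rep("base", 2), rep("4h", 2)),
                 well = rep(c("A1", "B1"), 2),
                 frame.start = c(3, 4, 3, 6),
                 frame.stop = c(17, 18, 12, 19))

# Equi-join on the grouping keys only: frame.start and frame.stop are
# appended as ordinary columns and dat's frame column keeps its own values.
joined = dat[pts, on = .(time.group, well)]

# Filter to the window, then take the frame at the signal maximum per group.
res = joined[frame >= frame.start & frame <= frame.stop,
             .(plus = frame[which.max(signal)]),
             by = .(time.group, well)]
res
```

The trade-off is that the full join is materialized before filtering, so on large tables the non-equi join may well use less memory; the upside is that there is no ambiguity about which frame column is being read.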

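I should also note that coming up with an optimal strategy matters a lot, because I will be doing these kinds of data extractions many times, so I have been racking my brain over it for a while now :(

Thanks!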

1 answer:

Answer 0: (score: 2)

Is this what you are looking for?

dat[pts, on = .(time.group, well, frame >= frame.start, frame <= frame.stop),
     .(plus = x.frame[which.max(signal)]),
     by = .EACHI]
#    time.group well frame frame plus
# 1:       base   A1     3    17    9
# 2:       base   B1     4    18    8
# 3:         4h   A1     3    12   12
# 4:         4h   B1     6    19    8

For some reason, using frame instead of x.frame, i.e. frame[which.max(signal)], returns all NA, which I think is a bug. Could you file an issue linking to this post? Thanks.
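Since the result of the non-equi join carries two columns that are both named frame (they hold the frame.start and frame.stop values from pts), it may be worth renaming them right away so later code can refer to them unambiguously. A minimal sketch, assuming the same toy data as in the question; res is a hypothetical name.

```r
library(data.table)

# Same toy data as in the question.
set.seed(5)
dat = data.table(time.group = c(rep("base", 42), rep("4h", 42)),
                 well = c(rep("A1", 20), rep("B1", 22), rep("A1", 19), rep("B1", 23)),
                 frame = c(1:20, 1:22, 1:19, 1:23),
                 signal = runif(84, 0, 1))
pts = data.table(time.group = c(rep("base", 2), rep("4h", 2)),
                 well = rep(c("A1", "B1"), 2),
                 frame.start = c(3, 4, 3, 6),
                 frame.stop = c(17, 18, 12, 19))

# The join from the answer: x.frame refers to dat's own frame column.
res = dat[pts, on = .(time.group, well, frame >= frame.start, frame <= frame.stop),
          .(plus = x.frame[which.max(signal)]),
          by = .EACHI]

# Replace all column names at once so the two "frame" columns become distinct.
setnames(res, c("time.group", "well", "frame.start", "frame.stop", "plus"))
res
```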