Select the range ID with the highest score using foverlaps in R

Date: 2017-01-21 13:47:47

Tags: r dataframe data.table

I would like to use foverlaps to select the range ID with the highest value in a separate column (Score), in a setting where a value lies within several overlapping ranges. Although I am fairly familiar with the basic settings of the package, I could not find a way to do this.

Here is a small example

>df1
AthleteID  Distance
Athlete1   5
Athlete2   10
Athlete3   25

>df2
CheckpointID   Start   End Score
Checkpoint1    1       8   2
Checkpoint2    7       12  4
Checkpoint3    9       15  6
Checkpoint4    16      26  8
Checkpoint5    20      30  10

Based on the above, the final data.frame should look like this

>df1
AthleteID  Distance   Score  CheckpointID
Athlete1   5          2      Checkpoint1
Athlete2   10         6      Checkpoint3
Athlete3   25         10     Checkpoint5

=========================

Edit

One last question: I would also like to know how to use different checkpoint scores (with the same intervals) depending on the athlete ID. Here is a modified score table

>df2
CheckpointID   AthleteID   Start   End Score
Checkpoint1    Athlete1    1       8   2
Checkpoint2    Athlete1    7       12  4
Checkpoint3    Athlete1    9       15  6
Checkpoint4    Athlete1    16      26  8
Checkpoint5    Athlete1    20      30  10
Checkpoint1    Athlete2    1       8   3
Checkpoint2    Athlete2    7       12  5
Checkpoint3    Athlete2    9       15  7
Checkpoint4    Athlete2    16      26  9
Checkpoint5    Athlete2    20      30  11
Checkpoint1    Athlete3    1       8   1
Checkpoint2    Athlete3    7       12  3
Checkpoint3    Athlete3    9       15  5
Checkpoint4    Athlete3    16      26  7
Checkpoint5    Athlete3    20      30  11

So the final result would look like this

>df1
AthleteID  Distance   Score  CheckpointID
Athlete1   5          2      Checkpoint1
Athlete2   10         7      Checkpoint3
Athlete3   25         11     Checkpoint5

2 Answers:

Answer 0 (score: 6)

You can also do this with the newly implemented non-equi joins in data.table, which should be more straightforward:

y[x, on = .(Start <= Distance, End >= Distance), mult = "last", 
    .(AthleteID, Distance, Score, CheckpointID)]

where

x=fread("AthleteID  Distance
        Athlete1   5
        Athlete2   10
        Athlete3   25
        ")
y=fread("CheckpointID   Start   End Score
    Checkpoint1    1       8   2
    Checkpoint2    7       12  4
    Checkpoint3    9       15  6
    Checkpoint4    16      26  8
    Checkpoint5    20      30  10
    ")

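Note that mult = "last" picks the last matching row of y, which is the highest-scoring one here only because y is listed in increasing Score order. As a hedged sketch (not part of the original answer), one way to avoid depending on row order is to keep every matching checkpoint and then take the row with the largest Score per athlete. The names matches and best are illustrative, and the tables are rebuilt with data.table() only to keep the example self-contained.

```r
library(data.table)

# Same data as in the answer above
x <- data.table(AthleteID = paste0("Athlete", 1:3),
                Distance  = c(5L, 10L, 25L))
y <- data.table(CheckpointID = paste0("Checkpoint", 1:5),
                Start = c(1L, 7L, 9L, 16L, 20L),
                End   = c(8L, 12L, 15L, 26L, 30L),
                Score = c(2L, 4L, 6L, 8L, 10L))

# Non-equi join that keeps ALL checkpoints whose [Start, End] contains Distance.
# The i. prefix refers to columns of x (the table passed as i).
matches <- y[x, on = .(Start <= Distance, End >= Distance),
             .(AthleteID = i.AthleteID, Distance = i.Distance,
               Score, CheckpointID)]

# For each athlete, keep the matching checkpoint with the highest Score.
# Athletes with no matching checkpoint (Score is NA) are dropped by which.max().
best <- matches[, .SD[which.max(Score)], by = AthleteID]
print(best)
```

With the example data this selects Checkpoint1, Checkpoint3 and Checkpoint5, the same rows as the mult = "last" version, but it would still do so if the rows of y were shuffled.
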
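Answer 1 (score: 3)

You can use foverlaps like this. The key is to duplicate the Distance column in df1 to create an artificial interval whose start equals its end. Then use foverlaps to join df1 and df2, finding the rows where [Distance, Distance2 (= Distance)] falls within [Start, End] of df2, and keep the last match.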

library(data.table)

df1 <- fread("
AthleteID  Distance
Athlete1   5
Athlete2   10
Athlete3   25
")

df2 <- fread("
CheckpointID   Start   End Score
Checkpoint1    1       8   2
Checkpoint2    7       12  4
Checkpoint3    9       15  6
Checkpoint4    16      26  8
Checkpoint5    20      30  10
")

# Need a duplicated temp column as end of interval
df1[, Distance2 := Distance]
#>    AthleteID Distance Distance2
#> 1:  Athlete1        5         5
#> 2:  Athlete2       10        10
#> 3:  Athlete3       25        25

# y must be keyed in foverlaps
setkey(df2, Start, End)

# use type within and mult last, then select column
foverlaps(df1, df2, by.x = c("Distance", "Distance2"), mult = "last", type = "within")[, .(AthleteID, Distance, Score, CheckpointID)]
#>    AthleteID Distance Score CheckpointID
#> 1:  Athlete1        5     2  Checkpoint1
#> 2:  Athlete2       10     6  Checkpoint3
#> 3:  Athlete3       25    10  Checkpoint5

# Delete temp column in df1
df1[, Distance2 := NULL]
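
Neither answer covers the edit in the question, where each athlete has their own checkpoint scores. A minimal sketch, assuming the modified df2 from the question (here called y2, an illustrative name): add AthleteID as an equality condition to the non-equi join, then take the highest Score per athlete. This is one possible approach rather than a method given in the thread.

```r
library(data.table)

x <- data.table(AthleteID = paste0("Athlete", 1:3),
                Distance  = c(5L, 10L, 25L))

# Modified score table from the question: same intervals, per-athlete scores
y2 <- data.table(
  CheckpointID = rep(paste0("Checkpoint", 1:5), times = 3),
  AthleteID    = rep(paste0("Athlete", 1:3), each = 5),
  Start        = rep(c(1L, 7L, 9L, 16L, 20L), times = 3),
  End          = rep(c(8L, 12L, 15L, 26L, 30L), times = 3),
  Score        = c(2L, 4L, 6L, 8L, 10L,   # Athlete1
                   3L, 5L, 7L, 9L, 11L,   # Athlete2
                   1L, 3L, 5L, 7L, 11L)   # Athlete3
)

# Match on AthleteID (equality) and on the interval containing Distance
matches <- y2[x, on = .(AthleteID, Start <= Distance, End >= Distance),
              .(AthleteID, Distance = i.Distance, Score, CheckpointID)]

# Keep the highest-scoring matching checkpoint for each athlete
best_by_athlete <- matches[, .SD[which.max(Score)], by = AthleteID]
print(best_by_athlete)
```

For the example data this gives Checkpoint1 (score 2), Checkpoint3 (score 7) and Checkpoint5 (score 11), which agrees with the result the question asks for.
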