How to efficiently use which() to compare a column against a row in R?

Asked: 2016-05-16 23:32:52

Tags: r

Data

I have a dataset covering multiple vehicles, each with a unique ID, Vehicle.ID2. Below is part of the data for just one vehicle:

df <- structure(list(Vehicle.ID2 = c("4-2", "4-2", "4-2", "4-2", "4-2", 
"4-2", "4-2", "4-2", "4-2", "4-2", "4-2", "4-2", "4-2", "4-2", 
"4-2", "4-2", "4-2", "4-2", "4-2", "4-2"), Time = c(3, 3.2, 3.4, 
3.6, 3.8, 4, 4.2, 4.4, 4.6, 4.8, 5, 5.2, 5.4, 5.6, 5.8, 6, 6.2, 
6.4, 6.6, 6.8), yposition = c(3.451, 7.357, 11.264, 15.171, 19.077, 
22.984, 26.89, 30.797, 34.704, 38.61, 42.517, 46.423, 50.33, 
54.236, 58.143, 62.05, 65.956, 69.863, 73.769, 77.676), LeadVehyposition2 = c(55.043, 
NA, 64.098, 68.626, 73.153, 77.681, 82.209, 86.736, 91.264, 95.791, 
100.319, 104.847, 109.374, 113.902, 118.429, 122.957, 127.485, 
132.012, 136.54, 141.067)), .Names = c("Vehicle.ID2", "Time", 
"yposition", "LeadVehyposition2"), class = c("tbl_df", "data.frame"
), row.names = c(NA, -20L))

What I want to do

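I want to compare LeadVehyposition2 with yposition in df and output the Time at which yposition becomes greater than or equal to LeadVehyposition2. For one vehicle, I can do this for the first value in LeadVehyposition2 with the following code:

df$Time[head(which(df$yposition >= 55.043), 1)]
[1] 5.8

Here, 55.043 is the first value in LeadVehyposition2, and I compared it against all values in yposition. I want to do the same for every value in LeadVehyposition2. The following code, intended for the whole dataset (multiple vehicle IDs), does not work:

library(dplyr)
mydata %>%
  group_by(Vehicle.ID2) %>%
  mutate(Time.PET = Time[head(which(yposition >= LeadVehyposition2), 1)]) %>%
  ungroup()

Question:

The problem is that the second piece of code compares yposition and LeadVehyposition2 only row by row. The goal, however, is to hold each LeadVehyposition2 value fixed and compare it against the entire yposition column. How can I solve this?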
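The row-by-row behaviour described above can be reproduced on a toy example. This is only a minimal sketch; the vectors ypos and lead are made-up values, not taken from the dataset.

```r
# Made-up values for illustration only (not from the question's data).
ypos <- c(10, 20, 30, 40)
lead <- c(25, 35, 45, 55)

# Element-wise: ypos[i] is compared only with lead[i], so nothing matches here.
rowwise <- which(ypos >= lead)

# One fixed lead value against the whole ypos column, which is what is wanted.
fixed <- which(ypos >= lead[1])

rowwise  # integer(0)
fixed    # 3 4
```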

3 answers:

Answer 0 (score: 3)

Here is a possible way to do this in base R:
df$Time[sapply(df$LeadVehyposition2, function(p) min(which(df$yposition >= p)))]
 [1] 5.8  NA 6.2 6.4 6.6  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA

Or:

with(df, Time[sapply(LeadVehyposition2, function(p) min(which(yposition >= p)))])
 [1] 5.8  NA 6.2 6.4 6.6  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA

To handle the per-vehicle grouping from the question:

df <- df[order(df$Vehicle.ID2, df$Time), ]
do.call(c, sapply(split(df, df$Vehicle.ID2), function(df) 
        with(df, Time[sapply(LeadVehyposition2, function(p) min(which(yposition >= p)))])))
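Note that min(which(...)) returns Inf with a warning when nothing matches, and the NA in the output comes from indexing Time with Inf. A possible variant, sketched here under assumptions, takes the first element of which() instead, which yields NA directly. The helper name first_time_reached and the toy data frame are hypothetical and only serve the illustration.

```r
# Hypothetical helper: for each lead position, the first time at which
# ypos >= lead. which(...)[1] is NA when lead is NA or is never reached.
first_time_reached <- function(time, ypos, lead) {
  vapply(lead, function(p) time[which(ypos >= p)[1]], numeric(1))
}

# Made-up data for two vehicles, already ordered by id and time.
toy <- data.frame(
  id   = c("a", "a", "a", "b", "b", "b"),
  time = c(1, 2, 3, 1, 2, 3),
  ypos = c(10, 20, 30, 5, 15, 25),
  lead = c(20, 35, NA, 15, 5, 30)
)

# Apply the helper per vehicle and put the results back in row order.
toy$Time.PET <- unsplit(
  lapply(split(toy, toy$id),
         function(g) first_time_reached(g$time, g$ypos, g$lead)),
  toy$id
)

toy$Time.PET  # 2 NA NA 2 1 NA
```

unsplit() restores the original row order, so the result can be assigned straight back as a column even when vehicles are interleaved.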

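Answer 1 (score: 3)

A data.table approach is to join df to itself and then, for each LeadVehyposition2, take the minimum Time among the rows where the difference between yposition and LeadVehyposition2 is positive. Note that the filter below uses a strict > 0, so an exact tie does not count as a match.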

library(data.table)
setDT(df)

res <- df[df[, .(Vehicle.ID2, Time, yposition)],
          on = c("Vehicle.ID2"), allow.cartesian = TRUE
          ][i.yposition - LeadVehyposition2 > 0,
            .(min(i.Time)),
            by = .(Vehicle.ID2, Time, LeadVehyposition2)]
res
#     Vehicle.ID2 Time LeadVehyposition2  V1
# 1:         4-2  3.0            55.043 5.8
# 2:         4-2  3.4            64.098 6.2
# 3:         4-2  3.6            68.626 6.4
# 4:         4-2  3.8            73.153 6.6

Joining this back onto df adds the extra column to the original data:

res[df, on = c("Vehicle.ID2","Time","LeadVehyposition2")]

#      Vehicle.ID2 Time LeadVehyposition2  V1 yposition
#  1:         4-2  3.0            55.043 5.8     3.451
#  2:         4-2  3.2                NA  NA     7.357
#  3:         4-2  3.4            64.098 6.2    11.264
#  4:         4-2  3.6            68.626 6.4    15.171
#  5:         4-2  3.8            73.153 6.6    19.077
#  6:         4-2  4.0            77.681  NA    22.984
# ...
# 17:         4-2  6.2           127.485  NA    65.956
# 18:         4-2  6.4           132.012  NA    69.863
# 19:         4-2  6.6           136.540  NA    73.769
# 20:         4-2  6.8           141.067  NA    77.676

Answer 2 (score: 2)

You can use rolling joins:

library(data.table)
setDT(df)

# create an index to be used for matching
df[, idx := 1:.N, by = Vehicle.ID2]

# find the matching index using rolling joins
df[, idx.m := .SD[.SD, on = c('Vehicle.ID2', yposition = 'LeadVehyposition2'), roll = T,
                 idx + 1]][1:5]
#   Vehicle.ID2 Time yposition LeadVehyposition2 idx idx.m
#1:         4-2  3.0     3.451            55.043   1    15
#2:         4-2  3.2     7.357                NA   2    NA
#3:         4-2  3.4    11.264            64.098   3    17
#4:         4-2  3.6    15.171            68.626   4    18
#5:         4-2  3.8    19.077            73.153   5    19

# get the time for each match
df[, Time.PET := Time[idx.m], by = Vehicle.ID2][1:5]
#   Vehicle.ID2 Time yposition LeadVehyposition2 idx idx.m Time.PET
#1:         4-2  3.0     3.451            55.043   1    15      5.8
#2:         4-2  3.2     7.357                NA   2    NA       NA
#3:         4-2  3.4    11.264            64.098   3    17      6.2
#4:         4-2  3.6    15.171            68.626   4    18      6.4
#5:         4-2  3.8    19.077            73.153   5    19      6.6

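If yposition and LeadVehyposition2 can be exactly equal, I suggest adding a very small (positive) jitter to yposition so that the approach above works correctly. Without it, an exact match rolls onto the tied row itself, and idx + 1 then points one row too far.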
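The tie case can be illustrated without data.table. The following is only a base R analogue of the lookup, not the rolling join itself: findInterval() with left.open = TRUE counts the ypos values strictly below each lead value, so adding 1 gives the first index where ypos >= lead, ties included. The toy vectors are made up, and ypos is assumed to be sorted in increasing order.

```r
# Made-up values; ypos must be sorted in increasing order.
time <- c(1, 2, 3, 4)
ypos <- c(10, 20, 30, 40)
lead <- c(20, 25, NA, 45)   # 20 ties exactly with ypos[2]

# Count of ypos values strictly less than each lead value, plus one:
# the first position where ypos >= lead. NA stays NA.
idx <- findInterval(lead, ypos, left.open = TRUE) + 1L

# An index past the end of time (lead never reached) gives NA.
res <- time[idx]
res  # 2 3 NA NA
```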

Another option, using the latest development version of data.table, which adds non-equi joins, would be:

library(data.table)
setDT(df)

df[df, on = .(Vehicle.ID2, yposition >= LeadVehyposition2), Time[1], by = .EACHI][1:5]
#   Vehicle.ID2 yposition  V1
#1:         4-2    55.043 5.8
#2:         4-2        NA  NA
#3:         4-2    64.098 6.2
#4:         4-2    68.626 6.4
#5:         4-2    73.153 6.6

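What this does is self-join df on the columns where Vehicle.ID2 is equal and yposition is greater than or equal to LeadVehyposition2, and then take the first Time for each "i" (a.k.a. the first argument of [.data.table).

You can of course assign it as a column:

df[, Time.PET := .SD[.SD, on = .(Vehicle.ID2, yposition >= LeadVehyposition2), Time[1], by = .EACHI]$V1]

Note: both answers assume that yposition is sorted in increasing order.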
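One way to check that assumption before running either approach is to test each vehicle's positions with is.unsorted(). This is a small sketch on made-up data; the column names id and ypos are placeholders for Vehicle.ID2 and yposition.

```r
# Made-up data: vehicle "a" is sorted by position, vehicle "b" is not.
toy <- data.frame(
  id   = c("a", "a", "a", "b", "b", "b"),
  ypos = c(10, 20, 30, 5, 25, 15)
)

# TRUE for every vehicle whose positions are not in increasing order.
unsorted <- tapply(toy$ypos, toy$id, is.unsorted)

names(unsorted)[unsorted]  # "b"
```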