An efficient way to look up data frame row numbers when the lookup condition varies by row

Time: 2016-01-08 02:49:28

Tags: r search indexing

We are modelling the delay at a server that can serve only one customer at a time. Suppose we have two data frames, agg_data and ind_data:

> agg_data
  minute service_minute
1      0              1
2     60              3
3    120              2
4    180              3
5    240              2
6    300              4

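agg_data gives, for each hour, the service time between two consecutive customers. For example, between minutes 60 and 120 (the second hour from the start) we can serve a new customer every 3 minutes, so we can serve 20 customers in that hour.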

ind_data gives the arrival time of each customer:

         Arrival
1             51
2             63
3            120
4            121
5            125
6            129
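For anyone who wants to reproduce the example, here is a minimal sketch that re-enters the printed values. Storing the columns as plain numeric vectors is an assumption, and capacity_per_hour is a hypothetical name used only to check the "20 customers per hour" arithmetic above.

```r
# Rebuild the two example data frames from the values printed above.
# Assumption: columns are plain numeric vectors.
agg_data <- data.frame(
  minute         = c(0, 60, 120, 180, 240, 300),  # start of each hour, in minutes
  service_minute = c(1, 3, 2, 3, 2, 4)            # minutes needed per customer in that hour
)
ind_data <- data.frame(Arrival = c(51, 63, 120, 121, 125, 129))

# Hypothetical helper value: how many customers fit into each 60-minute hour.
capacity_per_hour <- 60 / agg_data$service_minute
capacity_per_hour[2]  # second hour: 60 / 3 = 20 customers
```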

I need to generate a departure time for each customer, taking into account the service_minute values in agg_data.

The output should look like this:

         Arrival              Dep
1             51               52
2             63               66
3            120              122
4            121              124
5            125              127
6            129              131

Here is my current code, which is correct but very inefficient:

ind_data$Dep = rep(0, nrow(ind_data))
# After the service time, the first customer can leave the system with no delay
# Service time is taken as that of the hour when the customer arrives
ind_data$Dep[1] = ind_data$Arrival[1] + agg_data[max(which(agg_data$minute<=ind_data$Arrival[1])),'service_minute']

# For customers after the first one, 
# if they arrive when there is no delay (arrival time > departure time of the previous customer), 
# then the service time is that of the hour when they arrive and 
# departure time is arrival time + service time; 
# if they arrive when there is delay (arrival time < departure time of the previous customer), 
# then the service time is that of the hour when the previous customer leaves the system and 
# the departure time is the departure time of the previous customer + service time.

for (i in 2:nrow(ind_data)){
  ind_data$Dep[i] = max(
    ind_data$Dep[i-1] + agg_data[max(which(agg_data$minute<=ind_data$Dep[i-1])),'service_minute'],
    ind_data$Arrival[i] + agg_data[max(which(agg_data$minute<=ind_data$Arrival[i])),'service_minute']
  )
}

I think the slow step is the search in agg_data for the correct service time. Is there a more efficient algorithm?

Thanks.

1 answer:

Answer 0: (score: 2)

This should be fairly efficient. It is a very simple lookup problem with an obvious vectorized solution:

out <- data.frame(Arrival = ind_data$Arrival,
         # findInterval() gives, for each arrival, the index of the agg_data row
         # whose minute is the latest one <= that arrival
         Dep = ind_data$Arrival + agg_data$service_minute[
                              findInterval(ind_data$Arrival, agg_data$minute)]
 )

> out
  Arrival Dep
1      51  52
2      63  66
3     120 122
4     121 123
5     125 127
6     129 131
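To make the lookup step concrete, here is a small sketch of what findInterval() returns for the example values, checked against the max(which(...)) search used in the question. The vector names minute, arrival, idx and idx_loop are illustrative only. Note that findInterval() expects its second argument to be sorted in non-decreasing order, which holds for the hour boundaries here.

```r
# Example values copied from the question.
minute  <- c(0, 60, 120, 180, 240, 300)
arrival <- c(51, 63, 120, 121, 125, 129)

# One vectorized call: index i such that minute[i] <= arrival < minute[i + 1].
idx <- findInterval(arrival, minute)

# The same lookup done one arrival at a time, as in the question's loop.
idx_loop <- sapply(arrival, function(a) max(which(minute <= a)))

# Both approaches pick the same agg_data row for every arrival.
same_rows <- identical(idx, as.integer(idx_loop))
```

An arrival at exactly 120 falls into the third interval, because the intervals are closed on the left by default.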

I believe my code is more correct than your example. I think your example has an obvious error.
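Row 4 is the only place the two outputs differ (123 versus 124). Under the rule stated in the question's comments, customer 4 arrives at 121 while customer 3 is still being served until 122, so the departure is 122 + 2 = 124. The plain lookup gives 121 + 2 = 123 because it does not model the queue. The following is a sketch, not the answer author's method, that keeps the question's recursion but swaps each max(which(...)) search for findInterval(). The helper name service_at is hypothetical, and the sketch assumes that agg_data$minute is sorted and that no time falls before its first entry.

```r
agg_data <- data.frame(minute = c(0, 60, 120, 180, 240, 300),
                       service_minute = c(1, 3, 2, 3, 2, 4))
ind_data <- data.frame(Arrival = c(51, 63, 120, 121, 125, 129))

# Hypothetical helper: service time in force at time(s) t.
# Assumes agg_data$minute is sorted and t >= agg_data$minute[1].
service_at <- function(t) {
  agg_data$service_minute[findInterval(t, agg_data$minute)]
}

n <- nrow(ind_data)
arrival <- ind_data$Arrival

# Departure if the server were idle on arrival, computed for all rows at once.
free_dep <- arrival + service_at(arrival)

dep <- numeric(n)
dep[1] <- free_dep[1]
# Each departure depends on the previous one, so this part stays a loop.
for (i in seq_len(n)[-1]) {
  queued_dep <- dep[i - 1] + service_at(dep[i - 1])  # server still busy
  dep[i] <- max(queued_dep, free_dep[i])
}
ind_data$Dep <- dep
```

On the example data this reproduces the Dep column from the question (52, 66, 122, 124, 127, 131). The loop itself remains, but the work per iteration no longer scans all of agg_data with which().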