My data file contains data of the following form:
DriveNo Date and Time
12 2017-01-31 23:00:00 //Start time of a trip for Driver12
134 2017-01-31 23:00:01
12 2017-01-31 23:10:00 //End time ( 10 min trip)
345 (some date/time)
12 2017-01-31 23:20:00 //Start Time
12 2017-01-31 23:35:00 //End Time (15 min trip)
.
.
.
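(millions of similar rows)
The data has about 3 million rows in total. Now I need to get the total time driven for each of my drivers (there are about 500 drivers). My ideal output looks like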
DriveNo TotalTimeDriven
12 35mins
134 ........(in days/hours/mins)
.
.
(and so on for all other drivers)
Above, DriveNo 12 has four entries, indicating the start and end of two rides. Is there an efficient way to do this in R?
Answer 0 (score: 2)
A data.table solution:
library(data.table)
# Sample data
df <- data.table(DriveNo = c(12, 134, 12, 134),
                 Time = c("2017-01-31 23:00:00", "2017-01-31 23:00:01",
                          "2017-01-31 23:10:00", "2017-01-31 23:20:01"))
# Span between each driver's first and last timestamp
df[, duration := max(as.POSIXct(Time)) - min(as.POSIXct(Time)), by = DriveNo]
df
DriveNo Time duration
1: 12 2017-01-31 23:00:00 10 mins
2: 134 2017-01-31 23:00:01 20 mins
3: 12 2017-01-31 23:10:00 10 mins
4: 134 2017-01-31 23:20:01 20 mins
Answer 1 (score: 1)
range returns the minimum and maximum, and diff subtracts successive elements of a vector, so you can do
aggregate(DateTime ~ DriveNo, df, function(x){diff(range(x))})
## DriveNo DateTime
## 1 12 10
## 2 134 0
or in dplyr,
library(dplyr)
df %>% group_by(DriveNo) %>% summarise(TimeDriven = diff(range(DateTime)))
## # A tibble: 2 × 2
## DriveNo TimeDriven
## <int> <time>
## 1 12 10 mins
## 2 134 0 mins
or in data.table,
library(data.table)
setDT(df)[, .(TimeDriven = diff(range(DateTime))), by = DriveNo]
## DriveNo TimeDriven
## 1: 12 10 mins
## 2: 134 0 mins
To change the units, it may be simpler to call difftime directly.
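As a minimal sketch of that suggestion (the two timestamps and the variable names below are made up for illustration, not taken from the question's file), base R's difftime takes a units argument, and an existing difftime can be converted with units<-:

```r
# Hypothetical sample: the start and end of one 10-minute trip
x <- as.POSIXct(c("2017-01-31 23:00:00", "2017-01-31 23:10:00"), tz = "UTC")

# diff(range(x)) lets R pick the units automatically
auto_units <- diff(range(x))

# difftime() lets you fix the units explicitly
in_secs  <- difftime(max(x), min(x), units = "secs")
in_hours <- difftime(max(x), min(x), units = "hours")

# units<- converts an existing difftime to other units
in_mins <- in_secs
units(in_mins) <- "mins"
```

Fixing the units explicitly is probably the safer choice when summarising by group, since automatically chosen units can differ from one group to the next.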
Data
df <- structure(list(DriveNo = c(12L, 134L, 12L), DateTime = structure(c(1485921600,
1485921601, 1485922200), class = c("POSIXct", "POSIXt"), tzone = "")), class = "data.frame", row.names = c(NA,
-3L), .Names = c("DriveNo", "DateTime"))
For the edit, you can use difftime and sum: create a variable that identifies starts and stops, reshape, and summarise.
library(tidyverse)
set.seed(47)
# tibble() replaces the deprecated data_frame()
drives <- tibble(DriveNo = sample(rep(1:5, 4)),
                 DateTime = seq(as.POSIXct("2017-04-13 12:00:00"),
                                by = '10 min', length.out = 20))
drives %>% str()
#> Classes 'tbl_df', 'tbl' and 'data.frame': 20 obs. of 2 variables:
#> $ DriveNo : int 5 3 4 3 5 1 1 2 3 5 ...
#> $ DateTime: POSIXct, format: "2017-04-13 12:00:00" "2017-04-13 12:10:00" ...
elapsed <- drives %>%
    group_by(DriveNo) %>%
    mutate(event = rep(c('start', 'stop'), n() / 2),
           i = cumsum(event == 'start')) %>%
    spread(event, DateTime) %>%
    summarise(TimeDriven = sum(difftime(stop, start, units = 'mins')))
elapsed
#> # A tibble: 5 × 2
#> DriveNo TimeDriven
#> <int> <time>
#> 1 1 60 mins
#> 2 2 110 mins
#> 3 3 120 mins
#> 4 4 130 mins
#> 5 5 80 mins
Indexing with a recycled Boolean vector would be faster, but in dplyr the date-times get unclassed at some point. In data.table,
library(data.table)
set.seed(47)
drives <- data.table(DriveNo = sample(rep(1:5, 4)),
                     DateTime = seq(as.POSIXct("2017-04-13 12:00:00"),
                                    by = '10 min', length.out = 20))
elapsed <- drives[, .(TimeDriven = sum(difftime(DateTime[c(FALSE, TRUE)],
                                                DateTime[c(TRUE, FALSE)],
                                                units = 'mins'))),
                  keyby = DriveNo]
elapsed
#> DriveNo TimeDriven
#> 1: 1 60 mins
#> 2: 2 110 mins
#> 3: 3 120 mins
#> 4: 4 130 mins
#> 5: 5 80 mins
or in base R,
set.seed(47)
drives <- data.frame(DriveNo = sample(rep(1:5, 4)),
                     DateTime = seq(as.POSIXct("2017-04-13 12:00:00"),
                                    by = '10 min', length.out = 20))
elapsed <- aggregate(DateTime ~ DriveNo, drives,
                     function(x){sum(difftime(x[c(FALSE, TRUE)], x[c(TRUE, FALSE)], units = 'mins'))})
elapsed
#> DriveNo DateTime
#> 1 1 60
#> 2 2 110
#> 3 3 120
#> 4 4 130
#> 5 5 80
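All of these approaches would likely run into problems if a driver has an odd number of timestamps, which should be impossible under the stated assumptions. If it does happen, more data cleaning is required.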
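One way such a check could look, as a rough base R sketch: the three-row data frame and the names counts, odd_drivers and clean are hypothetical, and dropping the flagged drivers is only one possible policy, not something the question prescribes.

```r
# Hypothetical data: driver 12 has one complete trip,
# driver 134 has a single unmatched timestamp
drives <- data.frame(
  DriveNo  = c(12L, 134L, 12L),
  DateTime = as.POSIXct(c("2017-01-31 23:00:00",
                          "2017-01-31 23:00:01",
                          "2017-01-31 23:10:00"), tz = "UTC")
)

# Count rows per driver and flag drivers with an odd count
counts <- table(drives$DriveNo)
odd_drivers <- as.integer(names(counts)[counts %% 2 == 1])

# Assumed policy: set the flagged drivers aside before summing trips
clean <- drives[!drives$DriveNo %in% odd_drivers, ]
```

The start/stop pairing above also relies on each driver's rows being in chronological order, so sorting by DateTime before pairing is worth considering if the file is not already ordered.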