Split a data frame by a conditional interval

Time: 2020-04-08 03:01:27

Tags: r dataframe

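I have a data frame containing animal IDs and timestamps (this is simplified GPS data). The data frame is sorted by date/time. I want to create a column that assigns a trip number to each row. A new trip starts whenever the gap between one timestamp and the next is greater than 28800 seconds (8 hours).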

# Some sample data
timestamp <- as.POSIXct(c("18/01/2020 06:43:38", "18/01/2020 06:44:14", "18/01/2020 16:45:07", "18/01/2020 16:46:07"), tz = "UTC", format = "%d/%m/%Y %H:%M:%S")
data <- data.frame("ID" = c("a","b","c","d"), "timestamp" = timestamp)

#ORIGINAL DATAFRAME
#  ID           timestamp
#1  a 2020-01-18 06:43:38
#2  b 2020-01-18 06:44:14
#3  c 2020-01-18 16:45:07
#4  d 2020-01-18 16:46:07

data$interval <- data$timestamp - dplyr::lag(data$timestamp, n = 1L) # time difference between consecutive points (dplyr::lag shifts the vector by one row)
data$trip <- c(1,1,2,2) # THIS IS THE LINE I NEED HELP WITH

#DATAFRAME I WANT IN THE END
#  ID           timestamp   interval trip
#1  a 2020-01-18 06:43:38    NA secs    1
#2  b 2020-01-18 06:44:14    36 secs    1
#3  c 2020-01-18 16:45:07 36053 secs    2
#4  d 2020-01-18 16:46:07    60 secs    2

Alternatively, subsetting the data into a list would also work (see the example below).

$`1`
  ID           timestamp interval 
1  a 2020-01-18 06:43:38  NA secs    
2  b 2020-01-18 06:44:14  36 secs    

$`2`
  ID           timestamp   interval 
3  c 2020-01-18 16:45:07 36053 secs    
4  d 2020-01-18 16:46:07    60 secs    

I'm struggling to explain myself, so I hope this makes sense!

2 answers:

Answer 0: (score: 2)

Another way to do this, using data.table:

library(data.table)
setDT(data)[, interval := difftime(timestamp, shift(timestamp), units = "secs")][
            ,     trip := 1 + cumsum(ifelse(is.na(interval > 28800), 0, interval > 28800))][]

#>    ID           timestamp   interval trip
#> 1:  a 2020-01-18 06:43:38    NA secs    1
#> 2:  b 2020-01-18 06:44:14    36 secs    1
#> 3:  c 2020-01-18 16:45:07 36053 secs    2
#> 4:  d 2020-01-18 16:46:07    60 secs    2
# Split into a list of data.tables, one per trip
split(data, by = "trip", keep.by = FALSE)

#> $`1`
#>    ID           timestamp interval
#> 1:  a 2020-01-18 06:43:38  NA secs
#> 2:  b 2020-01-18 06:44:14  36 secs
#> 
#> $`2`
#>    ID           timestamp   interval
#> 1:  c 2020-01-18 16:45:07 36053 secs
#> 2:  d 2020-01-18 16:46:07    60 secs

Answer 1: (score: 1)

You can use diff and cumsum:

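# Force seconds explicitly; diff() on POSIXct otherwise picks its units automatically
data$interval <- c(NA, as.numeric(diff(data$timestamp), units = "secs"))
# A new trip starts at the first row and wherever the gap exceeds 28800 seconds
data$trips <- cumsum(c(TRUE, data$interval[-1] > 28800))
data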

#  ID           timestamp trips interval
#1  a 2020-01-18 06:43:38     1       NA
#2  b 2020-01-18 06:44:14     1       36
#3  c 2020-01-18 16:45:07     2    36053
#4  d 2020-01-18 16:46:07     2       60
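
To see why this works, it can help to run the pieces one at a time. The following is a minimal sketch on the question's sample timestamps; the names gap_secs, starts_trip and trip are only illustrative and are not part of the answer above.

```r
# Sample timestamps from the question
timestamp <- as.POSIXct(
  c("18/01/2020 06:43:38", "18/01/2020 06:44:14",
    "18/01/2020 16:45:07", "18/01/2020 16:46:07"),
  tz = "UTC", format = "%d/%m/%Y %H:%M:%S"
)

# Gaps between consecutive rows, converted to seconds
gap_secs <- as.numeric(diff(timestamp), units = "secs")
gap_secs
# 36 36053 60

# TRUE marks a row that starts a new trip; the first row always does
starts_trip <- c(TRUE, gap_secs > 28800)
starts_trip
# TRUE FALSE TRUE FALSE

# cumsum counts the TRUEs seen so far, which gives the trip number
trip <- cumsum(starts_trip)
trip
# 1 1 2 2
```

Each TRUE bumps the running total by one, so every row between two large gaps ends up with the same number.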

You can then use split to split the data by trips.

split(data, data$trips)

The same logic using dplyr:

library(dplyr)

data %>%
  mutate(interval = difftime(timestamp, lag(timestamp), units = "secs"),
         trips = cumsum(c(TRUE, interval[-1] > 28800)))
  # To split the data, append this to the pipe:
  # %>% group_split(trips)
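
One caveat applies to every approach in this thread: unless the units are given explicitly, diff() and difftime() choose them automatically from the smallest gap, so a comparison against 28800 can silently be made in minutes or hours instead of seconds. A small sketch with made-up timestamps (not taken from the question) shows the effect; the names stamps and gaps are only illustrative.

```r
# Three made-up timestamps: a 2-minute gap followed by a 10-hour gap
stamps <- as.POSIXct(
  c("2020-01-18 06:00:00", "2020-01-18 06:02:00", "2020-01-18 16:02:00"),
  tz = "UTC"
)

gaps <- diff(stamps)

# The smallest gap is at least 60 seconds, so the units are minutes
units(gaps)
# "mins"

# Comparing the raw numbers against a threshold in seconds misses the long gap
as.numeric(gaps) > 28800
# FALSE FALSE

# Converting to seconds first gives the intended result
as.numeric(gaps, units = "secs") > 28800
# FALSE TRUE
```

Passing units = "secs" to difftime(), or converting with as.numeric(x, units = "secs"), keeps the threshold and the data in the same unit.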