Running count of records in a time series based on a condition

Time: 2019-03-26 13:38:40

Tags: r data.table

I have a toy dataset with timestamps and prices, as follows:

time <- c(as.POSIXlt("2017-02-03 09:00:01"),
          as.POSIXlt("2017-02-03 09:00:03"),
          as.POSIXlt("2017-02-03 09:00:06"),
          as.POSIXlt("2017-02-03 09:00:09"),
          as.POSIXlt("2017-02-03 09:00:10"),
          as.POSIXlt("2017-02-03 09:00:20"),
          as.POSIXlt("2017-02-03 09:00:23"),
          as.POSIXlt("2017-02-03 09:00:34"),
          as.POSIXlt("2017-02-03 09:00:44"),
          as.POSIXlt("2017-02-03 09:01:07"))

price <- c(100, 100, 100, 99, 98, 99, 100, 101, 101, 100)

data <- data.frame(time, price)

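I need to walk through the data.frame, find runs of consecutive records with the same price, count the records in each run, and compute the duration (in seconds) from the first to the last record of each run.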

So, for the example above, the result would be:

start, end, price, nbr_records, duration_sec
2017-02-03 09:00:01, 2017-02-03 09:00:06, 100, 3, 5
2017-02-03 09:00:09, 2017-02-03 09:00:09, 99, 1, 0
2017-02-03 09:00:10, 2017-02-03 09:00:10, 98, 1, 0
2017-02-03 09:00:20, 2017-02-03 09:00:20, 99, 1, 0
2017-02-03 09:00:23, 2017-02-03 09:00:23, 100, 1, 0
2017-02-03 09:00:34, 2017-02-03 09:00:44, 101, 2, 10
2017-02-03 09:01:07, 2017-02-03 09:01:07, 100, 1, 0

A fast data.table solution would be ideal, since I have a lot of records. Thanks!

1 answer:

Answer 0 (score: 1)

I deleted my comment; on a second read I understood what you're trying to do.

It's pretty simple using rleid() from data.table.

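To see why rleid() fits here, it may help to look at what it returns on its own. The following is a minimal sketch on the question's price vector; the variable name `run_id` is just an illustrative choice, not something from the original post.

```r
library(data.table)

price <- c(100, 100, 100, 99, 98, 99, 100, 101, 101, 100)

## rleid() assigns one integer id per run of identical consecutive values.
## The id increments every time the value changes, so a price that comes
## back later (100 here) starts a new run instead of rejoining the old one.
run_id <- rleid(price)
print(run_id)
#  [1] 1 1 1 2 3 4 5 6 6 7

## Base R's rle() gives the matching run lengths, for comparison.
print(rle(price)$lengths)
# [1] 3 1 1 1 1 2 1
```

Grouping by this id rather than by price itself is what keeps the three separate runs of 100 apart.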
library(data.table)

## Note: store times as POSIXct instead of POSIXlt for drastic performance improvement
time <- c(as.POSIXct("2017-02-03 09:00:01"),
          as.POSIXct("2017-02-03 09:00:03"),
          as.POSIXct("2017-02-03 09:00:06"),
          as.POSIXct("2017-02-03 09:00:09"),
          as.POSIXct("2017-02-03 09:00:10"),
          as.POSIXct("2017-02-03 09:00:20"),
          as.POSIXct("2017-02-03 09:00:23"),
          as.POSIXct("2017-02-03 09:00:34"),
          as.POSIXct("2017-02-03 09:00:44"),
          as.POSIXct("2017-02-03 09:01:07"))

price <- c(100, 100, 100, 99, 98, 99, 100, 101, 101, 100)

data <- data.frame(time, price)

## Convert to a data.table
setDT(data)

## Build the summary in one pass. `rleid(price)` generates a run id on the fly
## that increments whenever `price` changes, so consecutive rows with the same
## price share a group. `.N` is a data.table special symbol holding the number
## of rows in the current group. See ?`special-symbols` to learn more.
Summary <- data[, .(start = first(time),
                    end = last(time),
                    nbr_records = .N,
                    duration_sec = as.numeric(last(time)) - as.numeric(first(time))
                    ), by = .(Counter = data.table::rleid(price))]

## Drop the Counter variable assuming you don't need it
Summary[,Counter := NULL]

## Results
print(Summary)

#                  start                 end nbr_records duration_sec
# 1: 2017-02-03 09:00:01 2017-02-03 09:00:06           3            5
# 2: 2017-02-03 09:00:09 2017-02-03 09:00:09           1            0
# 3: 2017-02-03 09:00:10 2017-02-03 09:00:10           1            0
# 4: 2017-02-03 09:00:20 2017-02-03 09:00:20           1            0
# 5: 2017-02-03 09:00:23 2017-02-03 09:00:23           1            0
# 6: 2017-02-03 09:00:34 2017-02-03 09:00:44           2           10
# 7: 2017-02-03 09:01:07 2017-02-03 09:01:07           1            0
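
The summary above leaves out the price column that the question's expected output includes. A possible variant that carries it along is sketched below. The names `dt`, `summary_dt`, and `run_id` are illustrative, the explicit `tz = "UTC"` is an assumption made only so the example behaves the same on any machine, and `difftime(..., units = "secs")` is swapped in for the `as.numeric()` subtraction to make the unit explicit.

```r
library(data.table)

## Build the table directly; as.POSIXct() is vectorised over character input.
dt <- data.table(
  time = as.POSIXct(c("2017-02-03 09:00:01", "2017-02-03 09:00:03",
                      "2017-02-03 09:00:06", "2017-02-03 09:00:09",
                      "2017-02-03 09:00:10", "2017-02-03 09:00:20",
                      "2017-02-03 09:00:23", "2017-02-03 09:00:34",
                      "2017-02-03 09:00:44", "2017-02-03 09:01:07"),
                    tz = "UTC"),  # assumed time zone, for reproducibility
  price = c(100, 100, 100, 99, 98, 99, 100, 101, 101, 100)
)

## One row per run of identical consecutive prices.
## time[1L] / time[.N] are the first and last timestamps within the run;
## price[1L] is the run's (constant) price.
summary_dt <- dt[, .(start        = time[1L],
                     end          = time[.N],
                     price        = price[1L],
                     nbr_records  = .N,
                     duration_sec = as.numeric(difftime(time[.N], time[1L],
                                                        units = "secs"))),
                 by = .(run_id = rleid(price))]

## Drop the helper grouping column.
summary_dt[, run_id := NULL]

print(summary_dt)
```

This assumes the rows are already sorted by time; if they might not be, running `setorder(dt, time)` before the summary would be a reasonable precaution.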