我有一个带有时间戳和价格的玩具数据集,如下所示:
time <- c(as.POSIXlt("2017-02-03 09:00:01"),
as.POSIXlt("2017-02-03 09:00:03"),
as.POSIXlt("2017-02-03 09:00:06"),
as.POSIXlt("2017-02-03 09:00:09"),
as.POSIXlt("2017-02-03 09:00:10"),
as.POSIXlt("2017-02-03 09:00:20"),
as.POSIXlt("2017-02-03 09:00:23"),
as.POSIXlt("2017-02-03 09:00:34"),
as.POSIXlt("2017-02-03 09:00:44"),
as.POSIXlt("2017-02-03 09:01:07"))
price <- c(100, 100, 100, 99, 98, 99, 100, 101, 101, 100)
data <- data.frame(time, price)
我需要遍历data.frame,以相同的价格查找连续的记录序列,计算这些序列中的记录数量,并计算从相同价格的第一个成员到最后一个成员的持续时间(以秒为单位)顺序。
因此,对于上面的示例,结果是:
start, end, price, nbr_records, duration_sec
2017-02-03 09:00:01, 2017-02-03 09:00:03, 100, 3, 5
2017-02-03 09:00:09, 2017-02-03 09:00:09, 99, 1, 0
2017-02-03 09:00:10, 2017-02-03 09:00:10, 98, 1, 0
2017-02-03 09:00:20, 2017-02-03 09:00:20, 99, 1, 0
2017-02-03 09:00:23, 2017-02-03 09:00:23, 100, 1, 0
2017-02-03 09:00:34, 2017-02-03 09:00:44, 101, 2, 10
2017-02-03 09:01:07, 2017-02-03 09:01:07, 100, 1, 0
最好有一个快速的data.table解决方案,因为我有很多记录。 谢谢!
答案 0 :(得分:1)
我删除了我的评论,第二遍阅读后我明白了你要做什么。
使用rleid()
中的data.table
library(data.table)
## Note: store times as POSIXct instead of POSIXlt for drastic performance improvement
time <- c(as.POSIXct("2017-02-03 09:00:01"),
as.POSIXct("2017-02-03 09:00:03"),
as.POSIXct("2017-02-03 09:00:06"),
as.POSIXct("2017-02-03 09:00:09"),
as.POSIXct("2017-02-03 09:00:10"),
as.POSIXct("2017-02-03 09:00:20"),
as.POSIXct("2017-02-03 09:00:23"),
as.POSIXct("2017-02-03 09:00:34"),
as.POSIXct("2017-02-03 09:00:44"),
as.POSIXct("2017-02-03 09:01:07"))
price <- c(100, 100, 100, 99, 98, 99, 100, 101, 101, 100)
data <- data.frame(time, price)
## Convert to a data.table
setDT(data)
## Create a summary using a generated counter on the fly with
## the `rleid` function from data.table to group consecutive
## sequences together and then operate by group. the `.N`
## operator is another special symbol in data.table
## that we can use to return the number of rows in each group
## here. See ?special-symbols to learn more
Summary <- data[, .(start = first(time),
end = last(time),
nbr_records = .N,
duration_sec = as.numeric(last(time)) - as.numeric(first(time))
), by = .(Counter = data.table::rleid(price))]
## Drop the Counter variable assuming you don't need it
Summary[,Counter := NULL]
## Results
print(Summary)
# start end nbr_records duration_sec
# 1: 2017-02-03 09:00:01 2017-02-03 09:00:06 3 5
# 2: 2017-02-03 09:00:09 2017-02-03 09:00:09 1 0
# 3: 2017-02-03 09:00:10 2017-02-03 09:00:10 1 0
# 4: 2017-02-03 09:00:20 2017-02-03 09:00:20 1 0
# 5: 2017-02-03 09:00:23 2017-02-03 09:00:23 1 0
# 6: 2017-02-03 09:00:34 2017-02-03 09:00:44 2 10
# 7: 2017-02-03 09:01:07 2017-02-03 09:01:07 1 0