I'm looking for a way to aggregate data based on timestamps. Here is my sample data:
library(data.table)
new <- data.table(
  date = as.POSIXct(c("2016-03-06 12:23:00", "2016-03-07 12:21:00", "2016-03-08 12:26:00", "2016-03-09 12:30:00", "2016-03-10 12:50:00",
                      "2016-03-06 12:20:00", "2016-03-07 12:20:00", "2016-03-08 12:20:00", "2016-03-09 12:20:00", "2016-03-10 12:20:00")),
  data.count = c(1, 7, 10, 15, 12, 11, 23, 35, 21, 11)
)
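What I want to obtain is the sum of data.count grouped by each date and its two preceding dates (or the preceding n dates; I chose two because the sample is small):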
date       previous_date count
2016-03-09 2016-03-07    30
2016-03-09 2016-03-08    45
2016-03-09 2016-03-09    36
2016-03-10 2016-03-08    45
2016-03-10 2016-03-09    36
2016-03-10 2016-03-10    33
So in the sample output for 2016-03-10, for example, there are three rows: one with the count for 2016-03-10 itself and two for its preceding dates, 2016-03-09 and 2016-03-08.
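For reference, the count values in that table are per-day sums of data.count. A minimal sketch that reproduces them from the sample data follows; the tz = "UTC" argument and the name daily are additions for this example and not part of the original post, chosen so the result does not depend on the session time zone:

```r
library(data.table)

# Sample data from the post, parsed explicitly as UTC (an assumption)
new <- data.table(
  date = as.POSIXct(c(
    "2016-03-06 12:23:00", "2016-03-07 12:21:00", "2016-03-08 12:26:00",
    "2016-03-09 12:30:00", "2016-03-10 12:50:00",
    "2016-03-06 12:20:00", "2016-03-07 12:20:00", "2016-03-08 12:20:00",
    "2016-03-09 12:20:00", "2016-03-10 12:20:00"
  ), tz = "UTC"),
  data.count = c(1, 7, 10, 15, 12, 11, 23, 35, 21, 11)
)

# Drop the time of day and sum data.count per calendar day
daily <- new[, .(count = sum(data.count)), by = .(day = as.Date(date, tz = "UTC"))]
setorder(daily, day)
print(daily)
```

With the sample data this yields 12, 30, 45, 36 and 23 for 2016-03-06 through 2016-03-10.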
Attempt 1:
This was my first approach:
library(dplyr)
for (i in 1:length(unique(as.Date(new$date)))) {
  # store one result per unique date in a variable named after that date
  assign(paste0(unique(as.Date(new$date))[i]), new %>%
    group_by(unique(as.Date(new$date))[i], as.Date(new$date)) %>%
    dplyr::summarise(count = sum(data.count)) %>%
    filter(.[[1]] > .[[2]]))
}
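This gives me, for each unique date, a grouping against every other unique date in the data set. However, I haven't figured out how to restrict the result to the last 2 or last n dates.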
Attempt 2:
This is my second attempt, with which I was finally able to get close to the result I want:
for (i in 1:length(unique(as.Date(new$date)))) {
  assign(paste0(unique(as.Date(new$date))[i]), new %>%
    group_by(unique(as.Date(new$date))[i], as.Date(new$date)) %>%
    dplyr::summarise(count = sum(data.count)) %>%
    filter(.[[1]] >= .[[2]]) %>%
    arrange(desc(.[[2]])) %>%
    top_n(3))
}
However, there seems to be a problem with the date 2016-03-10, for which I get:
2016-03-10 2016-03-09 36
2016-03-10 2016-03-08 45
2016-03-10 2016-03-07 30
instead of:
2016-03-10 2016-03-10 23
2016-03-10 2016-03-09 36
2016-03-10 2016-03-08 45
top_n() seems to have a problem with the sums. Running the same pipeline without it:
new %>%
  group_by(unique(as.Date(new$date))[5], as.Date(new$date)) %>%
  dplyr::summarise(count = sum(data.count)) %>%
  filter(.[[1]] >= .[[2]]) %>%
  arrange(desc(.[[2]]))
This gives me:
2016-03-10 2016-03-10 23
2016-03-10 2016-03-09 36
2016-03-10 2016-03-08 45
2016-03-10 2016-03-07 30
2016-03-10 2016-03-06 12
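This is perfectly fine, except that with top_n() the following row is not returned:
2016-03-10 2016-03-10 23
Ideally it should be, since that would be consistent with the other dates. If you can spot the problem here, please let me know.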
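A likely explanation, offered as a sketch rather than a definitive diagnosis: when top_n() is called without a wt argument, it ranks rows by the last column of the table, which here is count. top_n(3) therefore keeps the three largest sums instead of the three most recent dates. The example below rebuilds the five rows shown above by hand (the data frame daily and the variable names are illustrative, not from the post) and compares top_n(3) with taking the first three rows after sorting:

```r
library(dplyr)

# The five per-day rows for bucket 2016-03-10, copied from the output above
daily <- data.frame(
  previous_date = as.Date(c("2016-03-10", "2016-03-09", "2016-03-08",
                            "2016-03-07", "2016-03-06")),
  count = c(23, 36, 45, 30, 12)
)

# Without wt, top_n() ranks by the last column (count), not by row position
by_count <- daily %>% top_n(3)

# Ranking explicitly by date keeps the three most recent dates
by_date <- daily %>% top_n(3, previous_date)

# Equivalent here: sort newest first and take the first three rows
first_three <- daily %>% arrange(desc(previous_date)) %>% head(3)

print(by_count)
print(first_three)
```

Here by_count holds the rows for 2016-03-09, 2016-03-08 and 2016-03-07 (counts 36, 45, 30), which matches the unexpected output, while by_date and first_three both start at 2016-03-10.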
Answer 0 (score: 0)
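I must admit I am quite confused about the expected result after the question was changed (and the expected result you mention is not what I get, but perhaps you only updated the data in new without adjusting the result...).
Nevertheless, I guess this is what you want (you can configure the number of previous days in step 2):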
library(data.table)

# 1. Aggregate counts per date (ignoring the time)
new[, date.only := as.Date(date)] # get rid of time
new.agg <- new[, .(count = .N, sum = sum(data.count)), by = date.only]

# 2. Calculate lower date boundary
number.of.prev.days <- 2 # configure your date range (bucket size) here
new.agg[, previous_date := as.Date(date.only - number.of.prev.days)]

# 3. Create one row per date that contributes to a date bucket
new.agg.items <- new.agg[new.agg,
                         .(date.only = x.date.only, sum),
                         on = .(date.only <= date.only, date.only >= previous_date),
                         by = .EACHI]
# col names cosmetics required due to counter-intuitive non-equi join column names...
# See: https://github.com/Rdatatable/data.table/issues/1700
setnames(new.agg.items, 1, "bucket.from.date")
setnames(new.agg.items, 2, "bucket.to.date")

# 4. order descending (required for a cumsum going back in time!)
new.agg.items <- new.agg.items[order(bucket.from.date, -date.only)]

# 5. Cumsum over the groups (date buckets) finally
result <- new.agg.items[, .(from.date = date.only, sum = cumsum(sum)),
                        by = .(bucket.from.date, bucket.to.date)]
This results in:
> new.agg
    date.only count sum previous_date
1: 2016-03-06     2  12    2016-03-04
2: 2016-03-07     2  30    2016-03-05
3: 2016-03-08     2  45    2016-03-06
4: 2016-03-09     2  36    2016-03-07
5: 2016-03-10     2  23    2016-03-08
> new.agg.items
    bucket.from.date bucket.to.date  date.only sum
 1:       2016-03-06     2016-03-04 2016-03-06  12
 2:       2016-03-07     2016-03-05 2016-03-07  30
 3:       2016-03-07     2016-03-05 2016-03-06  12
 4:       2016-03-08     2016-03-06 2016-03-08  45
 5:       2016-03-08     2016-03-06 2016-03-07  30
 6:       2016-03-08     2016-03-06 2016-03-06  12
 7:       2016-03-09     2016-03-07 2016-03-09  36
 8:       2016-03-09     2016-03-07 2016-03-08  45
 9:       2016-03-09     2016-03-07 2016-03-07  30
10:       2016-03-10     2016-03-08 2016-03-10  23
11:       2016-03-10     2016-03-08 2016-03-09  36
12:       2016-03-10     2016-03-08 2016-03-08  45
> result
    bucket.from.date bucket.to.date  from.date sum
 1:       2016-03-06     2016-03-04 2016-03-06  12
 2:       2016-03-07     2016-03-05 2016-03-07  30
 3:       2016-03-07     2016-03-05 2016-03-06  42
 4:       2016-03-08     2016-03-06 2016-03-08  45
 5:       2016-03-08     2016-03-06 2016-03-07  75
 6:       2016-03-08     2016-03-06 2016-03-06  87
 7:       2016-03-09     2016-03-07 2016-03-09  36
 8:       2016-03-09     2016-03-07 2016-03-08  81
 9:       2016-03-09     2016-03-07 2016-03-07 111
10:       2016-03-10     2016-03-08 2016-03-10  23
11:       2016-03-10     2016-03-08 2016-03-09  59
12:       2016-03-10     2016-03-08 2016-03-08 104
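The sum column in result is a running total within each bucket, accumulated from the newest date backwards, which is why step 4 sorts in descending order first. A minimal sketch on a single bucket, using the three rows of the 2016-03-10 bucket from new.agg.items (the names bucket and running are made up for this illustration):

```r
library(data.table)

# Rows of the 2016-03-10 bucket, deliberately given in unsorted order
bucket <- data.table(
  date.only = as.Date(c("2016-03-08", "2016-03-10", "2016-03-09")),
  sum = c(45, 23, 36)
)

# Sort newest first so the running total accumulates backwards in time
setorder(bucket, -date.only)

# Running total over the sorted rows
bucket[, running := cumsum(sum)]
print(bucket)
```

The running column comes out as 23, 59 and 104, the same values as rows 10 to 12 of result.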
If you prefer a result with the same columns and names the OP asked for:
# Optionally: Rename and choose columns as the OP asked for
result[, .(date = bucket.from.date, previous_date = from.date, count = sum)]
which gives:
          date previous_date count
 1: 2016-03-06    2016-03-06    12
 2: 2016-03-07    2016-03-07    30
 3: 2016-03-07    2016-03-06    42
 4: 2016-03-08    2016-03-08    45
 5: 2016-03-08    2016-03-07    75
 6: 2016-03-08    2016-03-06    87
 7: 2016-03-09    2016-03-09    36
 8: 2016-03-09    2016-03-08    81
 9: 2016-03-09    2016-03-07   111
10: 2016-03-10    2016-03-10    23
11: 2016-03-10    2016-03-09    59
12: 2016-03-10    2016-03-08   104
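The count column here holds running totals per bucket. If the per-day sums from the question's expected output are wanted instead, one possible alternative is to filter the daily totals once per bucket date. This is only a sketch under the same calendar-day window assumption as step 2; the names daily, n_prev and per_day are hypothetical, and tz = "UTC" is added so the result does not depend on the session time zone:

```r
library(data.table)

# Sample data from the question, parsed explicitly as UTC (an assumption)
new <- data.table(
  date = as.POSIXct(c(
    "2016-03-06 12:23:00", "2016-03-07 12:21:00", "2016-03-08 12:26:00",
    "2016-03-09 12:30:00", "2016-03-10 12:50:00",
    "2016-03-06 12:20:00", "2016-03-07 12:20:00", "2016-03-08 12:20:00",
    "2016-03-09 12:20:00", "2016-03-10 12:20:00"
  ), tz = "UTC"),
  data.count = c(1, 7, 10, 15, 12, 11, 23, 35, 21, 11)
)

n_prev <- 2  # hypothetical name: number of previous calendar days per bucket

# Per-day totals, ignoring the time of day
daily <- new[, .(count = sum(data.count)), by = .(day = as.Date(date, tz = "UTC"))]
setorder(daily, day)

# For each bucket date d, keep the daily totals in the window [d - n_prev, d]
per_day <- rbindlist(lapply(seq_len(nrow(daily)), function(i) {
  d <- daily$day[i]
  daily[day >= d - n_prev & day <= d,
        .(date = d, previous_date = day, count)]
}))
print(per_day)
```

For 2016-03-09 this gives the counts 30, 45 and 36, matching the question's expected output; no cumulative sum is applied.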