Aggregating data over a look-back window based on timestamps

Time: 2017-04-22 15:30:30

Tags: r data.table dplyr sqldf

I am looking for a way to aggregate data based on timestamps. Here is my sample data:

library(data.table)
library(dplyr)  # needed for the attempts further down
new <- data.table( date = as.POSIXct( c( "2016-03-06 12:23:00", "2016-03-07 12:21:00", "2016-03-08 12:26:00", "2016-03-09 12:30:00", "2016-03-10 12:50:00",
                                         "2016-03-06 12:20:00", "2016-03-07 12:20:00", "2016-03-08 12:20:00", "2016-03-09 12:20:00", "2016-03-10 12:20:00" ) ),
                   data.count = c( 1, 7, 10, 15, 12, 11, 23, 35, 21, 11 ) )

What I want is the sum of data.count for each date and for each of its previous two dates (or previous n dates; I chose two because the sample is small):

 date     previous_date  count
2016-03-09 2016-03-07       30
2016-03-09 2016-03-08       45
2016-03-09 2016-03-09       36
2016-03-10 2016-03-08       45
2016-03-10 2016-03-09       36
2016-03-10 2016-03-10       33

So in the sample output, 2016-03-10 for instance has three rows: one with the count for 2016-03-10 itself and two for its preceding dates, 2016-03-09 and 2016-03-08.

Attempt 1:

Here is my first approach:

for (i in 1:length(unique(as.Date(new$date))))
{
  assign(paste0(unique(as.Date(new$date))[i]), new%>%
  group_by(unique(as.Date(new$date))[i],as.Date(new$date)) %>%
    dplyr::summarise(count= sum(data.count))%>% 
     filter(.[[1]] > .[[2]]))

}

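For each unique date, this gives me a table grouped against all the other unique dates in the dataset. But I have not figured out how to restrict the result to the last 2 (or last n) dates.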

Attempt 2:

This is my second attempt, which finally produced the result I was after:

for (i in 1:length(unique(as.Date(new$date))))
{

    assign(paste0(unique(as.Date(new$date))[i]), new%>%
             group_by(unique(as.Date(new$date))[i],as.Date(new$date)) %>%
             dplyr::summarise(count= sum(data.count))%>% 
             filter(.[[1]] >= .[[2]])%>%
             arrange(desc(.[[2]]))%>%
             top_n(3))

}

However, there seems to be a problem for the date 2016-03-10:

2016-03-10          2016-03-09    36
2016-03-10          2016-03-08    45
2016-03-10          2016-03-07    30

It does not return:

2016-03-10          2016-03-10    23
2016-03-10          2016-03-09    36
2016-03-10          2016-03-08    45

top_n() seems to be where the problem with the sums arises. The same pipeline without it:

 new%>%
     group_by(unique(as.Date(new$date))[5],as.Date(new$date)) %>%
     dplyr::summarise(count= sum(data.count))%>% 
     filter(.[[1]] >= .[[2]])%>%
     arrange(desc(.[[2]]))

This gives me:

2016-03-10          2016-03-10    23
2016-03-10          2016-03-09    36
2016-03-10          2016-03-08    45
2016-03-10          2016-03-07    30
2016-03-10          2016-03-06    12

which is exactly right; it is only with top_n() that this row is not returned:

2016-03-10          2016-03-10    23 

Ideally it should be, since that would be consistent with the other dates. If you can spot the problem here, please let me know.
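A plausible cause, offered as an assumption rather than something confirmed in this thread: when top_n() is called without a wt argument, it ranks rows by the last column of the table (here count) and ignores the preceding arrange(). The three largest counts for 2016-03-10 are 45, 36 and 30, which is exactly the set of rows shown above. The sketch below reproduces this on the summarised values from the question; the data frame name daily and the column name day are hypothetical.

```r
library(dplyr)

# Summarised values for 2016-03-10, copied from the output in the question
daily <- data.frame(
  day   = as.Date("2016-03-10") - 0:4,
  count = c(23, 36, 45, 30, 12)
)

# No wt given: top_n() ranks by the last column (count), not by row position
by_count <- daily %>% top_n(3)

# Taking the first three rows after sorting by date keeps 2016-03-10
by_position <- daily %>% arrange(desc(day)) %>% head(3)

# Alternatively, tell top_n() explicitly which column to rank by
by_date <- daily %>% top_n(3, wt = day)

print(by_count)
print(by_position)
```

If this is indeed the cause, replacing top_n(3) with head(3), or passing the date column as wt, should bring the 2016-03-10 row back.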

1 answer:

Answer 0 (score: 0)

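I must admit that I am quite confused about the expected result after the question was changed (and the expected result you mention is not what I get, but maybe you only updated the data in new without adjusting the result...).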

However, I guess this is what you want (you can configure the number of previous days in step 2):

library(data.table)

# 1. Aggregate counts per date (ignoring the time)
new[, date.only := as.Date(date)]  # get rid of time
new.agg <- new[, .(count = .N, sum = sum(data.count) ), by = date.only]

# 2. Calculate lower date boundary
number.of.prev.days <- 2  # configure your date range (bucket size) here
new.agg[, previous_date := as.Date(date.only - number.of.prev.days)]

# 3. Create one row per date that contributes to a date bucket
new.agg.items <- new.agg[new.agg, .(date.only = x.date.only, sum), on = .(date.only <= date.only, date.only >= previous_date), by = .EACHI]

# col names cosmetics required due to counter-intuitive non-equi join column names...
# See: https://github.com/Rdatatable/data.table/issues/1700
setnames(new.agg.items, 1, "bucket.from.date")
setnames(new.agg.items, 2, "bucket.to.date")

# 4. order descending (required for a cumsum going back in time!)
new.agg.items <- new.agg.items[order(bucket.from.date, -date.only)]

# 5. Cumsum over the groups (date buckets) finally
result <- new.agg.items[, .(from.date = date.only, sum = cumsum(sum)), by = .(bucket.from.date, bucket.to.date)]

This results in:

> new.agg
    date.only count sum previous_date
1: 2016-03-06     2  12    2016-03-04
2: 2016-03-07     2  30    2016-03-05
3: 2016-03-08     2  45    2016-03-06
4: 2016-03-09     2  36    2016-03-07
5: 2016-03-10     2  23    2016-03-08

> new.agg.items
    bucket.from.date bucket.to.date  date.only sum
 1:       2016-03-06     2016-03-04 2016-03-06  12
 2:       2016-03-07     2016-03-05 2016-03-07  30
 3:       2016-03-07     2016-03-05 2016-03-06  12
 4:       2016-03-08     2016-03-06 2016-03-08  45
 5:       2016-03-08     2016-03-06 2016-03-07  30
 6:       2016-03-08     2016-03-06 2016-03-06  12
 7:       2016-03-09     2016-03-07 2016-03-09  36
 8:       2016-03-09     2016-03-07 2016-03-08  45
 9:       2016-03-09     2016-03-07 2016-03-07  30
10:       2016-03-10     2016-03-08 2016-03-10  23
11:       2016-03-10     2016-03-08 2016-03-09  36
12:       2016-03-10     2016-03-08 2016-03-08  45

> result
    bucket.from.date bucket.to.date  from.date sum
 1:       2016-03-06     2016-03-04 2016-03-06  12
 2:       2016-03-07     2016-03-05 2016-03-07  30
 3:       2016-03-07     2016-03-05 2016-03-06  42
 4:       2016-03-08     2016-03-06 2016-03-08  45
 5:       2016-03-08     2016-03-06 2016-03-07  75
 6:       2016-03-08     2016-03-06 2016-03-06  87
 7:       2016-03-09     2016-03-07 2016-03-09  36
 8:       2016-03-09     2016-03-07 2016-03-08  81
 9:       2016-03-09     2016-03-07 2016-03-07 111
10:       2016-03-10     2016-03-08 2016-03-10  23
11:       2016-03-10     2016-03-08 2016-03-09  59
12:       2016-03-10     2016-03-08 2016-03-08 104
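The positional renaming in step 3 deserves a short illustration. In a non-equi join, data.table names the join columns of the result after x's column, but fills them with the values from i, which is why both bucket columns initially come out as date.only. The sketch below shows this on made-up data; the tables x and i and the columns day, value, upper and lower are hypothetical and not part of the answer above.

```r
library(data.table)

# Hypothetical toy tables, only for illustrating the column naming
x <- data.table(day = 1:5, value = c(10, 20, 30, 40, 50))
i <- data.table(upper = c(3L, 5L))
i[, lower := upper - 1L]

# For every row of i, select the rows of x with lower <= day <= upper
res <- x[i, .(day = x.day, value),
         on = .(day <= upper, day >= lower), by = .EACHI]

# The first two result columns are named after x's join column ("day"),
# yet they hold i's upper and lower bounds, so they are renamed by position
setnames(res, 1:2, c("upper", "lower"))
print(res)
```

Using x.day inside j is what recovers the actual day from x, in the same way that x.date.only is used in step 3.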

If you prefer a result with the same columns and names that the OP asked for:

# Optionally: Rename and choose columns as the OP asked for
result[, .(date = bucket.from.date, previous_date = from.date, count = sum)]

which gives:

         date previous_date count
 1: 2016-03-06    2016-03-06    12
 2: 2016-03-07    2016-03-07    30
 3: 2016-03-07    2016-03-06    42
 4: 2016-03-08    2016-03-08    45
 5: 2016-03-08    2016-03-07    75
 6: 2016-03-08    2016-03-06    87
 7: 2016-03-09    2016-03-09    36
 8: 2016-03-09    2016-03-08    81
 9: 2016-03-09    2016-03-07   111
10: 2016-03-10    2016-03-10    23
11: 2016-03-10    2016-03-09    59
12: 2016-03-10    2016-03-08   104
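The table in the question lists the total of each look-back day rather than a running total, so in case that is what is wanted, here is a minimal sketch that skips the cumsum and uses a plain equi-join. It assumes that "previous n dates" means the previous n calendar days, as the steps above do. The names daily, pairs, n_prev and out are my own, and the explicit tz = "UTC" is an assumption added to keep the date conversion reproducible.

```r
library(data.table)

new <- data.table(
  date = as.POSIXct(c("2016-03-06 12:23:00", "2016-03-07 12:21:00",
                      "2016-03-08 12:26:00", "2016-03-09 12:30:00",
                      "2016-03-10 12:50:00", "2016-03-06 12:20:00",
                      "2016-03-07 12:20:00", "2016-03-08 12:20:00",
                      "2016-03-09 12:20:00", "2016-03-10 12:20:00"),
                    tz = "UTC"),
  data.count = c(1, 7, 10, 15, 12, 11, 23, 35, 21, 11)
)

n_prev <- 2L  # number of previous calendar days to look back

# Total per calendar day (time of day is dropped)
daily <- new[, .(count = sum(data.count)), by = .(day = as.Date(date))]

# One row per (date, look-back day): the date itself plus n_prev days before it
pairs <- daily[, .(previous_date = day - 0:n_prev), by = day]
setnames(pairs, "day", "date")

# Attach each look-back day's own total; days without any data are dropped
out <- daily[pairs, on = .(day = previous_date), nomatch = NULL]
out <- out[, .(date, previous_date = day, count)]
setorder(out, date, -previous_date)
print(out)
```

For 2016-03-10 this yields the counts 23, 36 and 45, and for 2016-03-09 the counts 36, 45 and 30, which matches the layout requested in the question apart from the value 33 discussed above.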