How to count unique rows over the most recent n days

Time: 2018-10-08 05:38:05

Tags: r tidyverse

Say I want to count, for each day, the unique IDs seen over the most recent 15 days. Here is the code:

library(tidyverse)
library(lubridate)
set.seed(1)
# 300 random rows spread over 100 consecutive days, with ids drawn from a to z
eg <- tibble(day = sample(seq(ymd('2018-01-01'), length.out = 100, by = 'day'), 300, replace = TRUE),
             id = sample(letters[1:26], 300, replace = TRUE),
             value = rnorm(300))

eg %>% 
  group_by(day) %>% 
  summarise(uniqu_id = n_distinct(id),
            # placeholder: this is the column I do not know how to compute
            recent_15_days_unique_id = 'howto',
            day_total = sum(value))

The result is

# A tibble: 95 x 4
   day        uniqu_id recent_15_days_unique_id day_total
   <date>        <int> <chr>                        <dbl>
 1 2018-01-01        3 how                         -1.38 
 2 2018-01-02        3 how                          2.01 
 3 2018-01-03        3 how                          1.57 
 4 2018-01-04        6 how                         -1.64 
 5 2018-01-05        2 how                         -0.293
 6 2018-01-06        4 how                         -2.08 

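For the "recent_15_days_unique_id" column, the first row should count the unique IDs between "day-15" and "day" (i.e. between '2017-12-17' and '2018-01-01'), and the second row those between '2017-12-18' and '2018-01-02'. It is a bit like a 'rollsum' function, but for counting distinct values instead of summing.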
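To make the window concrete, here is a minimal sketch for a single target day. The data set `toy` and the variable names are made up for illustration and are not part of the question; the window is assumed to include both end points, as described above.

```r
library(dplyr)

# Hypothetical toy data, invented for illustration only
toy <- tibble(
  day = as.Date(c("2018-01-01", "2018-01-10", "2018-01-16",
                  "2018-01-17", "2018-01-17")),
  id  = c("a", "b", "a", "c", "b")
)

target <- as.Date("2018-01-17")

# Rows whose day lies in [target - 15, target], both ends inclusive.
# 2018-01-01 is 16 days before the target, so that row falls outside.
in_window <- toy$day >= target - 15 & toy$day <= target

# Count each id once, however many rows it has inside the window
recent_ids <- n_distinct(toy$id[in_window])
recent_ids
```

Note that id "a" is still counted even though its first row is outside the window, because it appears again on 2018-01-16.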

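1 answer:

Answer 0: (score: 1)

We can ungroup and then, for each day, build the sequence of dates from day - 15 to day (both ends included, so 16 calendar dates) and count all unique ids in eg that fall within that period.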

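library(dplyr)

eg %>% 
   group_by(day) %>% 
   summarise(uniqu_id = n_distinct(id),
             day_total = sum(value)) %>%
   ungroup() %>%
   # evaluate the next mutate() one row (one day) at a time
   rowwise() %>%
   # look back into the full data set eg, not only the current day's rows
   mutate(recent_15_days_unique_id = 
    n_distinct(eg$id[eg$day %in% seq(day - 15, day, by = "1 day")]))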



 #   day        uniqu_id day_total recent_15_days_unique_id
 # <date>        <int>     <dbl>                    <int>
 #1 2018-01-02        2    0.170                         2
 #2 2018-01-03        2   -0.460                         3
 #3 2018-01-04        1   -1.53                          3
 #4 2018-01-05        2    1.67                          5
 #5 2018-01-06        2    1.52                          6
 #6 2018-01-07        4   -1.62                         10
 #7 2018-01-08        2   -0.0190                       12
 #8 2018-01-09        1   -0.573                        12
 #9 2018-01-10        2   -0.220                        13
#10 2018-01-11        7   -1.73                         14

Using the same logic, we can also compute it separately with sapply:

# one row per day with the per-day summaries
new_eg <- eg %>% 
         group_by(day) %>% 
         summarise(uniqu_id = n_distinct(id),
                   day_total = sum(value)) %>%
         ungroup()


# compare Date with Date directly; no numeric conversion is needed
sapply(new_eg$day, function(x) 
   n_distinct(eg$id[eg$day %in% seq(x - 15, x, by = "1 day")]))

#[1]  2  3  3  5  6 10 12 12 13 14 15 16 17 17 18 20 21 22 22 20 20 21 21 .....
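
Since the title asks about the most recent n days in general, the same idea could be wrapped in a small helper with the window length as a parameter. This is only a sketch: `count_recent_unique` and its arguments are hypothetical names, not from the answer above, and it swaps the `%in% seq(...)` test for a plain range comparison, which should select the same rows for whole-day dates.

```r
library(dplyr)

# Hypothetical helper (not from the original answer): for each element of
# `days`, count the distinct `ids` whose `id_days` lie in [day - n, day].
count_recent_unique <- function(days, id_days, ids, n = 15) {
  vapply(
    seq_along(days),
    # index with days[i] so the Date class is preserved
    function(i) n_distinct(ids[id_days >= days[i] - n & id_days <= days[i]]),
    integer(1)
  )
}

# Toy data, invented for illustration only
toy <- tibble(
  day = as.Date(c("2018-01-01", "2018-01-02", "2018-01-02", "2018-01-20")),
  id  = c("a", "a", "b", "c")
)

toy_days <- sort(unique(toy$day))
res_15 <- count_recent_unique(toy_days, toy$day, toy$id, n = 15)
res_30 <- count_recent_unique(toy_days, toy$day, toy$id, n = 30)
```

With the data from the question, the call would presumably look like `count_recent_unique(new_eg$day, eg$day, eg$id)`. The work still grows with the number of days times the number of rows, which is fine for small data such as this example.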