如何根据过去7天的活动为每个用户创建滚动平均值?

时间:2019-06-21 21:44:33

标签: r group-by data.table time-series rolling-computation

我一直在看过去的帖子,但似乎找不到符合我需要的东西。 目标:对于每个用户,我希望获得他们之前7天的活动平均值(不计算当前观察值)。有些人在该窗口内没有任何活动(可以),其他人将有很多活动。

我一直在使用dplyr来按用户分组,但无法弄清楚如何获取每个时间戳并捕获该时间戳之前所有活动的平均值,从而获得每个人的滚动平均值。这是一个很大的数据集,因此需要高效。我确信datatable可以做到这一点,但是我发现代码很难解释,尽管它要快得多。

User  Stamp          activity   Score    

1     2019-06-20     "Car"      4500
1     2019-06-18     "Car"      600
1     2019-06-15     "Walk"     650
1     2019-06-21     "Ride"     790
2     2019-06-21     "Car"      800    
2     2019-06-23     "Car"      500
3     2019-06-11     "Walk"     900
4     2019-06-15     "Walk"     200   
4     2019-06-12     "Walk"     900

需要成为这样的人。我们会根据时间戳为每个用户提供滚动比例和滚动方式,但不包括该时间戳记。

User  Stamp          activity   Score   proportion_walk   mean_score    

1     2019-06-20     "Car"      4500    .5                 625
1     2019-06-18     "Car"      600     1                  650
1     2019-06-15     "Walk"     650     0                   0
1     2019-06-21     "Ride"     790     .33                   1916.33
2     2019-06-21     "Car"      800     0                   0
2     2019-06-23     "Car"      500     0                   800
3     2019-06-11     "Walk"     900     0                   0
4     2019-06-15     "Walk"     200     1                   900
4     2019-06-12     "Walk"     900     1                   900

3 个答案:

答案 0 :(得分:1)

可以尝试:

library(data.table)

df <- setDT(df)[, Stamp := as.Date(Stamp)][
  , `:=` (mean_score = sapply(Stamp, 
                              function(x) 
                                mean(Score[between(Stamp, x - 7, x - 1)])
  ),
  proportion_walk = sapply(Stamp, 
                           function(x) 
                             round(mean(
                               activity[between(Stamp, x - 7, x - 1)] == 'Walk'
                               ),2)
  )
  ), by = User][
    is.nan(mean_score), `:=` (mean_score = 0, proportion_walk = 0)]

输出:

   User      Stamp activity Score mean_score proportion_walk
1:    1 2019-06-20      Car  4500    625.000            0.50
2:    1 2019-06-18      Car   600    650.000            1.00
3:    1 2019-06-15     Walk   650      0.000            0.00
4:    1 2019-06-21     Ride   790   1916.667            0.33
5:    2 2019-06-21      Car   800      0.000            0.00
6:    2 2019-06-23      Car   500    800.000            0.00
7:    3 2019-06-11     Walk   900      0.000            0.00
8:    4 2019-06-15     Walk   200    900.000            1.00
9:    4 2019-06-12     Walk   900      0.000            0.00

对于proportion_walk,我认为根据您的描述,您的输出中有错别字。否则请改写;例如,2019-06-20不能有0.33,因为要落后2天,其中之一是Walk

答案 1 :(得分:0)

最后使用“注释”中的数据,按照指示的条件进行左自连接,对所有匹配的行取Score的平均值,否则取0。

library(sqldf)

sqldf("select a.*, 
         coalesce(avg(b.Activity == 'Walk'), 0) as Proportion_Walk, 
         coalesce(avg(b.Score), 0) as Mean
  from DF as a 
  left join DF as b on a.User = b.User and 
                       b.Stamp < a.Stamp and b.Stamp >= a.Stamp - 7
  group by a.rowid")

给予:

  User      Stamp activity Score Proportion_Walk     Mean
1    1 2019-06-20      Car  4500       0.5000000  625.000
2    1 2019-06-18      Car   600       1.0000000  650.000
3    1 2019-06-15     Walk   650       0.0000000    0.000
4    1 2019-06-21     Ride   790       0.3333333 1916.667
5    2 2019-06-21      Car   800       0.0000000    0.000
6    2 2019-06-23      Car   500       0.0000000  800.000
7    3 2019-06-11     Walk   900       0.0000000    0.000
8    4 2019-06-15     Walk   200       1.0000000  900.000
9    4 2019-06-12     Walk   900       0.0000000    0.000

注意

可复制形式的数据:

Lines <- 'User  Stamp          activity   Score    
1     2019-06-20     "Car"      4500
1     2019-06-18     "Car"      600
1     2019-06-15     "Walk"     650
1     2019-06-21     "Ride"     790
2     2019-06-21     "Car"      800    
2     2019-06-23     "Car"      500
3     2019-06-11     "Walk"     900
4     2019-06-15     "Walk"     200   
4     2019-06-12     "Walk"     900'

DF <- read.table(text = Lines, header = TRUE)
DF$Stamp <- as.Date(DF$Stamp)

答案 2 :(得分:0)

library("dplyr")
library("purr")
DF %>%
  group_by(User) %>%
  mutate(mean_score = map_dbl(Stamp, 
                              ~mean(Score[(Stamp > . - 7) & (Stamp < .)]))) %>%
  mutate(mean_score =ifelse(is.nan(mean_score), 0, mean_score))

输出

  User Stamp      activity Score mean_score
  <int> <date>     <fct>    <int>      <dbl>
1     1 2019-06-20 Car       4500       625 
2     1 2019-06-18 Car        600       650 
3     1 2019-06-15 Walk       650         0 
4     1 2019-06-21 Ride       790      1917.
5     2 2019-06-21 Car        800         0 
6     2 2019-06-23 Car        500       800 
7     3 2019-06-11 Walk       900         0 
8     4 2019-06-15 Walk       200       900 
9     4 2019-06-12 Walk       900         0