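I have been looking through past posts but can't seem to find anything that fits what I need. Goal: for each user, I want the average of their activity over the preceding 7 days (not counting the current observation). Some people will have no activity in that window (that's fine); others will have a lot.
I have been using dplyr to group by user, but I can't figure out how to take each timestamp and capture the average of all activity before that timestamp, so as to get a rolling average for each person. This is a large dataset, so it needs to be efficient. I'm sure data.table can do this, but I find its code hard to interpret, even though it is much faster.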
User Stamp activity Score
1 2019-06-20 "Car" 4500
1 2019-06-18 "Car" 600
1 2019-06-15 "Walk" 650
1 2019-06-21 "Ride" 790
2 2019-06-21 "Car" 800
2 2019-06-23 "Car" 500
3 2019-06-11 "Walk" 900
4 2019-06-15 "Walk" 200
4 2019-06-12 "Walk" 900
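It needs to become something like this, where each user has a rolling proportion and a rolling mean as of each timestamp, not including that timestamp itself.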
User Stamp activity Score proportion_walk mean_score
1 2019-06-20 "Car" 4500 .5 625
1 2019-06-18 "Car" 600 1 650
1 2019-06-15 "Walk" 650 0 0
1 2019-06-21 "Ride" 790 .33 1916.33
2 2019-06-21 "Car" 800 0 0
2 2019-06-23 "Car" 500 0 800
3 2019-06-11 "Walk" 900 0 0
4 2019-06-15 "Walk" 200 1 900
4 2019-06-12 "Walk" 900 1 900
Answer 0 (score: 1)
You can try:
library(data.table)

# Convert to data.table and make sure Stamp is a Date
df <- setDT(df)[, Stamp := as.Date(Stamp)][
  # For each row, average over the same user's rows dated x - 7 through x - 1
  , `:=`(
    mean_score = sapply(Stamp, function(x)
      mean(Score[between(Stamp, x - 7, x - 1)])),
    proportion_walk = sapply(Stamp, function(x)
      round(mean(activity[between(Stamp, x - 7, x - 1)] == 'Walk'), 2))
  ), by = User][
  # An empty window gives NaN; report it as 0
  is.nan(mean_score), `:=`(mean_score = 0, proportion_walk = 0)]
Output:
User Stamp activity Score mean_score proportion_walk
1: 1 2019-06-20 Car 4500 625.000 0.50
2: 1 2019-06-18 Car 600 650.000 1.00
3: 1 2019-06-15 Walk 650 0.000 0.00
4: 1 2019-06-21 Ride 790 1916.667 0.33
5: 2 2019-06-21 Car 800 0.000 0.00
6: 2 2019-06-23 Car 500 800.000 0.00
7: 3 2019-06-11 Walk 900 0.000 0.00
8: 4 2019-06-15 Walk 200 900.000 1.00
9: 4 2019-06-12 Walk 900 0.000 0.00
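As for proportion_walk, I think there is a typo in your expected output, given your description. Otherwise please rephrase; for example, 2019-06-20 cannot have 0.33, because there are 2 earlier days in its window and one of them is Walk, which gives 0.5.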
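The sapply calls above rescan each user's rows once per timestamp, which may become slow on a large table. One alternative is a non-equi self-join, sketched below under the assumption that your data.table version supports inequality conditions in `on` (added in 1.9.8). The names DT, win, lo, hi and res are illustrative and not part of the original answer, and the proportion is left unrounded.

```r
library(data.table)

# Sample data from the question
DT <- data.table(
  User     = c(1, 1, 1, 1, 2, 2, 3, 4, 4),
  Stamp    = as.Date(c("2019-06-20", "2019-06-18", "2019-06-15", "2019-06-21",
                       "2019-06-21", "2019-06-23", "2019-06-11", "2019-06-15",
                       "2019-06-12")),
  activity = c("Car", "Car", "Walk", "Ride", "Car", "Car", "Walk", "Walk", "Walk"),
  Score    = c(4500, 600, 650, 790, 800, 500, 900, 200, 900)
)

# One lookup row per observation: the window [Stamp - 7, Stamp - 1]
win <- DT[, .(User, lo = Stamp - 7, hi = Stamp - 1)]

# For each window, aggregate the rows of DT for the same user whose Stamp
# falls inside it. Windows with no matching row yield NA.
res <- DT[win, on = .(User, Stamp >= lo, Stamp <= hi),
          .(mean_score = mean(Score),
            proportion_walk = mean(activity == "Walk")),
          by = .EACHI]

# res has one row per row of win, in the same order as DT
DT[, `:=`(mean_score = res$mean_score, proportion_walk = res$proportion_walk)]

# Empty windows: report 0 instead of NA
DT[is.na(mean_score), `:=`(mean_score = 0, proportion_walk = 0)]
```

On the sample data, user 1 on 2019-06-20 gets mean_score 625 and proportion_walk 0.5, the same values as the sapply version.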
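Answer 1 (score: 0)
Using the data shown in reproducible form at the end of this answer, perform a left self-join on the indicated conditions, taking the mean of Score over all matching rows, or 0 if there are none.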
library(sqldf)

sqldf("select a.*,
         coalesce(avg(b.Activity == 'Walk'), 0) as Proportion_Walk,
         coalesce(avg(b.Score), 0) as Mean
       from DF as a
       left join DF as b on a.User = b.User and
         b.Stamp < a.Stamp and b.Stamp >= a.Stamp - 7
       group by a.rowid")
giving:
User Stamp activity Score Proportion_Walk Mean
1 1 2019-06-20 Car 4500 0.5000000 625.000
2 1 2019-06-18 Car 600 1.0000000 650.000
3 1 2019-06-15 Walk 650 0.0000000 0.000
4 1 2019-06-21 Ride 790 0.3333333 1916.667
5 2 2019-06-21 Car 800 0.0000000 0.000
6 2 2019-06-23 Car 500 0.0000000 800.000
7 3 2019-06-11 Walk 900 0.0000000 0.000
8 4 2019-06-15 Walk 200 1.0000000 900.000
9 4 2019-06-12 Walk 900 0.0000000 0.000
The data in reproducible form:
Lines <- 'User Stamp activity Score
1 2019-06-20 "Car" 4500
1 2019-06-18 "Car" 600
1 2019-06-15 "Walk" 650
1 2019-06-21 "Ride" 790
2 2019-06-21 "Car" 800
2 2019-06-23 "Car" 500
3 2019-06-11 "Walk" 900
4 2019-06-15 "Walk" 200
4 2019-06-12 "Walk" 900'
DF <- read.table(text = Lines, header = TRUE)
DF$Stamp <- as.Date(DF$Stamp)
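To see what the left self-join does without SQL, the same steps can be sketched in base R with merge() and aggregate(). This is only an illustration of the join logic, assuming each (User, Stamp) pair is unique as in the sample data; the names pairs, agg and out are made up for this sketch, and the result is sorted by User and Stamp rather than kept in the original row order.

```r
# Sample data from the question
DF <- data.frame(
  User     = c(1, 1, 1, 1, 2, 2, 3, 4, 4),
  Stamp    = as.Date(c("2019-06-20", "2019-06-18", "2019-06-15", "2019-06-21",
                       "2019-06-21", "2019-06-23", "2019-06-11", "2019-06-15",
                       "2019-06-12")),
  activity = c("Car", "Car", "Walk", "Ride", "Car", "Car", "Walk", "Walk", "Walk"),
  Score    = c(4500, 600, 650, 790, 800, 500, 900, 200, 900),
  stringsAsFactors = FALSE
)

# Self-join on User: every row paired with every row of the same user
pairs <- merge(DF, DF, by = "User", suffixes = c("", ".prev"))

# Keep only pairs where the second row falls in the 7 days before the first
pairs <- pairs[pairs$Stamp.prev < pairs$Stamp &
               pairs$Stamp.prev >= pairs$Stamp - 7, ]
pairs$walk.prev <- as.numeric(pairs$activity.prev == "Walk")

# Average the earlier rows per (User, Stamp)
agg <- aggregate(pairs[c("Score.prev", "walk.prev")],
                 by = pairs[c("User", "Stamp")], FUN = mean)
names(agg) <- c("User", "Stamp", "Mean", "Proportion_Walk")

# Left join back so rows with an empty window are kept, then fill with 0
out <- merge(DF, agg, by = c("User", "Stamp"), all.x = TRUE)
out$Mean[is.na(out$Mean)] <- 0
out$Proportion_Walk[is.na(out$Proportion_Walk)] <- 0
```

The final left merge plays the role of the SQL left join plus coalesce: rows with no earlier activity in the window survive and get 0.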
Answer 2 (score: 0)
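library("dplyr")
library("purrr")

DF %>%
  group_by(User) %>%
  # For each stamp, average the same user's scores from stamp - 7 up to the day before
  mutate(mean_score = map_dbl(Stamp,
    ~ mean(Score[(Stamp >= . - 7) & (Stamp < .)]))) %>%
  # An empty window gives NaN; report it as 0
  mutate(mean_score = ifelse(is.nan(mean_score), 0, mean_score))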
User Stamp activity Score mean_score
<int> <date> <fct> <int> <dbl>
1 1 2019-06-20 Car 4500 625
2 1 2019-06-18 Car 600 650
3 1 2019-06-15 Walk 650 0
4 1 2019-06-21 Ride 790 1917.
5 2 2019-06-21 Car 800 0
6 2 2019-06-23 Car 500 800
7 3 2019-06-11 Walk 900 0
8 4 2019-06-15 Walk 200 900
9 4 2019-06-12 Walk 900 0
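This answer computes only mean_score. A possible extension for proportion_walk, following the same map_dbl pattern, is sketched below. It assumes the window runs from 7 days before each stamp up to the day before it, and that an empty window should be reported as 0, as in the other answers. The name res is illustrative.

```r
library(dplyr)
library(purrr)

# Sample data from the question
DF <- data.frame(
  User     = c(1, 1, 1, 1, 2, 2, 3, 4, 4),
  Stamp    = as.Date(c("2019-06-20", "2019-06-18", "2019-06-15", "2019-06-21",
                       "2019-06-21", "2019-06-23", "2019-06-11", "2019-06-15",
                       "2019-06-12")),
  activity = c("Car", "Car", "Walk", "Ride", "Car", "Car", "Walk", "Walk", "Walk"),
  Score    = c(4500, 600, 650, 790, 800, 500, 900, 200, 900),
  stringsAsFactors = FALSE
)

res <- DF %>%
  group_by(User) %>%
  # Share of the same user's earlier rows in the window that are "Walk"
  mutate(proportion_walk = map_dbl(Stamp,
    ~ mean(activity[(Stamp >= .x - 7) & (Stamp < .x)] == "Walk"))) %>%
  # An empty window gives NaN; report it as 0
  mutate(proportion_walk = ifelse(is.nan(proportion_walk), 0, proportion_walk)) %>%
  ungroup()
```

The values are left unrounded here; wrap the mean in round(..., 2) if two decimals are preferred.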