I have the following data:
> a = data.frame(date = rep(c("20160101", "20160201", "20160301", "20160401"), 4),
+ person = c(rep("Bill", 4), rep("Jim", 4), rep("Sarah", 4), rep("Katie", 4)),
+ purchased_product = c(0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0),
+ ever_purchased_previously = c(0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 1))
> a
date person purchased_product ever_purchased_previously
1 20160101 Bill 0 0
2 20160201 Bill 0 0
3 20160301 Bill 1 0
4 20160401 Bill 0 1
5 20160101 Jim 0 0
6 20160201 Jim 0 0
7 20160301 Jim 0 0
8 20160401 Jim 1 0
9 20160101 Sarah 1 0
10 20160201 Sarah 0 1
11 20160301 Sarah 1 1
12 20160401 Sarah 1 1
13 20160101 Katie 0 0
14 20160201 Katie 1 0
15 20160301 Katie 0 1
16 20160401 Katie 0 1
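I would like to compute the ever_purchased_previously column from the purchased_product column, but it needs to be computed by group (in this case, by "person"). Also note that ever_purchased_previously equals 1 only in the months after a purchase (i.e. not in the same month). You can assume the data is sorted by date.
I have been trying to come up with a solution that looks at the minimum date where purchased_product = 1, and I also tried the na.locf function from the zoo package, but no luck so far.
Any help is much appreciated, thanks.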
Answer 0: (score: 1)
If the purchased_product column contains only 0s and 1s, cummax is a better choice than na.locf:
#data.table way
library(data.table)
setDT(a)
a[, ever:=cummax(shift(purchased_product, fill=0)), by=person]
#dplyr way
library(dplyr)
a %>%
group_by(person) %>%
mutate(ever=cummax(lag(purchased_product, default=0)))
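The same lag-then-cummax idea can be sketched in base R without extra packages. This is only an illustration under a few assumptions: the data is rebuilt from the question, the helper name lagged_cummax is hypothetical, and ever is reused as the new column name to match the answer above.

```r
# Rebuild the question's data so the sketch is self-contained.
a <- data.frame(
  date = rep(c("20160101", "20160201", "20160301", "20160401"), 4),
  person = rep(c("Bill", "Jim", "Sarah", "Katie"), each = 4),
  purchased_product = c(0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0)
)

# Hypothetical helper: shift the purchases down one row (padding with 0),
# then take the running maximum so the flag stays at 1 once it is set.
lagged_cummax <- function(x) cummax(c(0, head(x, -1)))

# ave() applies the helper within each person and keeps the row order.
a$ever <- ave(a$purchased_product, a$person, FUN = lagged_cummax)
a
```

For Sarah, the purchases 1, 0, 1, 1 are shifted to 0, 1, 0, 1, and the running maximum turns that into 0, 1, 1, 1, which is the column the question asks for.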
Answer 1: (score: 0)
You don't even need to use the date column, since the rows are already sorted by date:
for (i in seq_len(nrow(a))) {
  n <- min(which(a$person == a$person[i]))  # first row of this person
  # This person's rows before row i (empty when i is their first row)
  prior <- a$purchased_product[seq_len(i - n) + n - 1]
  a$ever_purchased_previously[i] <- as.numeric(sum(prior) > 0)
}
a
date person purchased_product ever_purchased_previously
1 20160101 Bill 0 0
2 20160201 Bill 0 0
3 20160301 Bill 1 0
4 20160401 Bill 0 1
5 20160101 Jim 0 0
6 20160201 Jim 0 0
7 20160301 Jim 0 0
8 20160401 Jim 1 0
9 20160101 Sarah 1 0
10 20160201 Sarah 0 1
11 20160301 Sarah 1 1
12 20160401 Sarah 1 1
13 20160101 Katie 0 0
14 20160201 Katie 1 0
15 20160301 Katie 0 1
16 20160401 Katie 0 1
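The idea of counting a person's purchases in earlier rows can also be written without an explicit loop. The following is a rough sketch, assuming purchased_product is never negative and each person's rows are sorted by date; the names purchased, person, prior_purchase and flag are made up for the illustration.

```r
# The question's two relevant columns as plain vectors.
purchased <- c(0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0)
person <- rep(c("Bill", "Jim", "Sarah", "Katie"), each = 4)

# cumsum(x) counts purchases up to and including the current row;
# subtracting x leaves only the purchases made in earlier rows.
prior_purchase <- function(x) as.numeric(cumsum(x) - x > 0)

# Apply the helper separately within each person.
flag <- ave(purchased, person, FUN = prior_purchase)
flag
```

Because the current row is subtracted out, a purchase never flags its own month, only the months that follow it.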