By group (in R)

Time: 2016-12-12 03:06:47

Tags: r

I have the following data:

> a = data.frame(date = rep(c("20160101", "20160201", "20160301", "20160401"), 4),
+                person = c(rep("Bill", 4), rep("Jim", 4), rep("Sarah", 4), rep("Katie", 4)),
+                purchased_product = c(0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0),
+                ever_purchased_previously = c(0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 1))
> a
       date person purchased_product ever_purchased_previously
1  20160101   Bill                 0                         0
2  20160201   Bill                 0                         0
3  20160301   Bill                 1                         0
4  20160401   Bill                 0                         1
5  20160101    Jim                 0                         0
6  20160201    Jim                 0                         0
7  20160301    Jim                 0                         0
8  20160401    Jim                 1                         0
9  20160101  Sarah                 1                         0
10 20160201  Sarah                 0                         1
11 20160301  Sarah                 1                         1
12 20160401  Sarah                 1                         1
13 20160101  Katie                 0                         0
14 20160201  Katie                 1                         0
15 20160301  Katie                 0                         1
16 20160401  Katie                 0                         1

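I would like to compute the ever_purchased_previously column from the purchased_product column, but it needs to be computed by group (in this case, by "person"). Also note that ever_purchased_previously equals 1 only in the months after a purchase (i.e., not in the same month). You can assume the data is sorted by date.

I have been trying to come up with a solution based on the minimum date where purchased_product = 1, and on the na.locf function from the zoo package, but have had no luck so far.

Any help is much appreciated, thanks.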

2 Answers:

Answer 0: (score: 1)

If the purchased_product column contains only 0s and 1s, cummax is a better choice than na.locf:

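# data.table way: shift() lags purchased_product by one row within each person,
# and cummax() keeps the flag at 1 once an earlier purchase has been seen
library(data.table)
setDT(a)
a[, ever := cummax(shift(purchased_product, fill = 0)), by = person]

# dplyr way: the same idea, with lag() in place of shift()
library(dplyr)
a %>%
  group_by(person) %>%
  mutate(ever = cummax(lag(purchased_product, default = 0)))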
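
For readers without either package, the same lag-then-cummax idea can be sketched in base R. This is only an illustration, not part of the original answer: the helper name prev_flag is hypothetical, and the data frame is rebuilt from the question with just the two columns needed.

```r
# Same sample data as in the question (only the columns needed here)
a <- data.frame(
  person = rep(c("Bill", "Jim", "Sarah", "Katie"), each = 4),
  purchased_product = c(0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0)
)

# Hypothetical helper: shift the vector down by one row (padding with 0),
# then take the running maximum so the flag stays at 1 after the first purchase
prev_flag <- function(x) cummax(c(0, head(x, -1)))

# One person's purchases, step by step (Sarah's rows)
x <- c(1, 0, 1, 1)
c(0, head(x, -1))   # lagged values: 0 1 0 1
prev_flag(x)        # running max:   0 1 1 1

# ave() applies the helper within each person and keeps the original row order
a$ever <- ave(a$purchased_product, a$person, FUN = prev_flag)
```

On the sample data, a$ever should match the ever_purchased_previously column shown in the question. Like the answers here, the sketch assumes the rows are already sorted by date within each person.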

Answer 1: (score: 0)

You don't even need to use the date column, since the rows are already sorted by date:

a$ever_purchased_previously <- 0  # recompute the flag from scratch
for (i in seq_len(nrow(a))) {
  # first row of this person's block (each person's rows are assumed contiguous)
  n <- min(which(a$person == a$person[i]))
  # count only purchases in strictly earlier rows, so the purchase month itself stays 0
  if (i > n && sum(a$purchased_product[n:(i - 1)]) > 0) a$ever_purchased_previously[i] <- 1
}

a
       date person purchased_product ever_purchased_previously
1  20160101   Bill                 0                         0
2  20160201   Bill                 0                         0
3  20160301   Bill                 1                         0
4  20160401   Bill                 0                         1
5  20160101    Jim                 0                         0
6  20160201    Jim                 0                         0
7  20160301    Jim                 0                         0
8  20160401    Jim                 1                         0
9  20160101  Sarah                 1                         0
10 20160201  Sarah                 0                         1
11 20160301  Sarah                 1                         1
12 20160401  Sarah                 1                         1
13 20160101  Katie                 0                         0
14 20160201  Katie                 1                         0
15 20160301  Katie                 0                         1
16 20160401  Katie                 0                         1