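I have a data frame with an ID column and a date column. Within each group, I want to compute the difference in days between one date and the next.
I tried the following with the dplyr package, but the result seems to be wrong.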
hist_trnx1 %>% group_by(card_id) %>% mutate(gap=round(c(NA,diff(purchase_date)), 1))
I would like to get the following result:
Card_ID date Diff
1: C_ID_4e6213e9bc 2017-06-25 15:33:07 NA
2: C_ID_4e6213e9bc 2017-07-15 12:10:45 20
3: C_ID_4e6213e9bc 2017-08-09 22:04:29 34
4: C_ID_4e6213e9bB 2017-03-10 10:06:26 NA #( Because of group change)
5: C_ID_4e6213e9bB 2017-04-10 01:14:19 30
6: C_ID_4e6213e9bD 2018-02-24 08:45:05 NA #( Because of group change )
7: C_ID_4e6213e9bD 2018-03-23 08:45:05 29
Data
structure(list(card_id = c("C_ID_4e6213e9bc", "C_ID_4e6213e9bc",
"C_ID_4e6213e9bc", "C_ID_4e6213e9bc", "C_ID_4e6213e9bc", "C_ID_4e6213e9bc"
), purchase_date = structure(c(1498404787, 1500120645, 1502316269,
1504346786, 1489108459, 1519461905), tzone = "UTC", class = c("POSIXct",
"POSIXt"))), .Names = c("card_id", "purchase_date"), class = c("data.table",
"data.frame"), row.names = c(NA, -6L))
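Answer 0 (score: 1)
I'm not sure this is the prettiest approach, and someone may come up with a cleaner solution, but this should work (part of the solution comes from "subtract value from previous row by group").
First, I import your data: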
df <- structure(list(card_id = c("C_ID_4e6213e9bc", "C_ID_4e6213e9bc", "C_ID_4e6213e9bB", "C_ID_4e6213e9bB",
                                 "C_ID_4e6213e9bD", "C_ID_4e6213e9bD"),
                     purchase_date = structure(c(1498404787, 1500120645, 1502316269, 1504346786, 1489108459, 1519461905),
                                               tzone = "UTC", class = c("POSIXct", "POSIXt"))),
                .Names = c("card_id", "purchase_date"), class = c("data.table", "data.frame"),
                row.names = c(NA, -6L))
Then it works when I run:
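library(dplyr)

df <- df %>%
  group_by(card_id) %>%
  arrange(purchase_date) %>%
  # difference to the previous row of the same card, in seconds (0 for the first row of each card)
  mutate(diff = as.numeric(purchase_date - lag(purchase_date, default = first(purchase_date)), units = "secs")) %>%
  # convert seconds to days
  mutate(diff = round(diff / 86400, digits = 2))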
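arrange makes sure the rows are in the order you want to subtract in, the lag function picks the previous row within each group, and the final division by 86400 converts seconds into days. Note that default = first(purchase_date) makes the first row of each card 0 rather than NA.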
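If you want NA on the first row of each card, as in the expected output in the question, one possible variant is to drop the default argument of lag and let difftime do the unit conversion. This is only a sketch: the sample data (three cards with two purchases each) and the column name gap_days are my own assumptions for illustration, not something from the question.

```r
library(dplyr)

# Assumed sample data: three cards with two purchases each,
# reusing the timestamps from the question.
df <- data.frame(
  card_id = c("C_ID_4e6213e9bc", "C_ID_4e6213e9bc",
              "C_ID_4e6213e9bB", "C_ID_4e6213e9bB",
              "C_ID_4e6213e9bD", "C_ID_4e6213e9bD"),
  purchase_date = as.POSIXct(
    c(1498404787, 1500120645, 1502316269, 1504346786, 1489108459, 1519461905),
    origin = "1970-01-01", tz = "UTC"
  )
)

result <- df %>%
  group_by(card_id) %>%
  # sort by date inside each card so lag() really is the previous purchase
  arrange(purchase_date, .by_group = TRUE) %>%
  mutate(
    # lag() without a default returns NA for the first row of each group;
    # units = "days" fixes the unit instead of letting R pick one
    gap_days = round(
      as.numeric(difftime(purchase_date, lag(purchase_date), units = "days")),
      2
    )
  ) %>%
  ungroup()

print(result)
```

Asking difftime for units = "days" explicitly avoids depending on the unit that R chooses automatically when two POSIXct values are subtracted, which can differ from one group to the next.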
Hope this helps =)