I have a data frame like the one below:
WC ASN TS Date Loss
7101 3 R 13-07-12 156.930422
7101 3 R 02-08-12 168.401876
7101 4 R 28-12-13 120.492081
7101 4 R 16-10-15 46.012085
7101 4 R 04-01-16 48.262409
7101 21 L 01-12-12 30.750564
7101 21 L 01-05-13 49.421243
7101 21 L 04-06-13 87.294821
7101 21 L 01-10-13 164.013138
What I am trying to do is use code like the following:
df %>%
select(WC, ASN, Date, Loss) %>%
group_by(WC) %>%
arrange(WC, ASN, Date, Loss) %>%
mutate(Days = Date - lag(Date))
to generate a new table like this:
WC ASN TS Date Loss Days Loss_A
7101 3 R 13-07-12 156.930422 0 156.930422
7101 3 R 02-08-12 168.401876 20 325.332298
7101 4 R 28-12-13 120.492081 0 120.492081
7101 4 R 16-10-15 46.012085 657 166.504166
7101 4 R 04-01-16 48.262409 80 214.766575
7101 21 L 01-12-12 30.750564 0 30.750564
7101 21 L 01-05-13 49.421243 151 80.171807
7101 21 L 04-06-13 87.294821 34 167.466628
7101 21 L 01-10-13 164.013138 119 331.479766
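How can I modify the dplyr code to produce the final table shown above? As far as I can tell, mutate() does not work here when combined with lag(). Is there a better way?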
Answer 0 (score: 1)
This works:
library(dplyr)

df %>%
  # start by making a date column that's a recognized date class so you can perform
  # operations on it.
  mutate(date = as.Date(Date, format = "%d-%m-%y")) %>%
  # then group by all of the columns you want to use to id groups
  group_by(WC, ASN, TS) %>%
  # then compute the time intervals between rows using ifelse to deal with 1st rows,
  # and compute the cumulative total loss within each group.
  mutate(Days = ifelse(is.na(lag(date)), 0, date - lag(date)),
         Loss_A = cumsum(Loss)) %>%
  # drop the date column we created if you don't need it
  select(-date)
Result:
Source: local data frame [9 x 7]
Groups: WC, ASN, TS [3]
WC ASN TS Date Loss Days Loss_A
<int> <int> <chr> <chr> <dbl> <dbl> <dbl>
1 7101 3 R 13-07-12 156.93042 0 156.93042
2 7101 3 R 02-08-12 168.40188 20 325.33230
3 7101 4 R 28-12-13 120.49208 0 120.49208
4 7101 4 R 16-10-15 46.01208 657 166.50417
5 7101 4 R 04-01-16 48.26241 80 214.76657
6 7101 21 L 01-12-12 30.75056 0 30.75056
7 7101 21 L 01-05-13 49.42124 151 80.17181
8 7101 21 L 04-06-13 87.29482 34 167.46663
9 7101 21 L 01-10-13 164.01314 119 331.47977
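For anyone who wants to reproduce this end to end, here is a minimal self-contained sketch. It rebuilds the sample data from the question (assuming Date is stored as a day-month-year character string, as the printed column type suggests) and swaps the ifelse() for coalesce(), which is a stylistic variation rather than anything the answer requires. The name result and the explicit arrange() and ungroup() steps are additions for illustration.

```r
library(dplyr)

# Sample data copied from the question; Date is a character column here
df <- data.frame(
  WC   = rep(7101L, 9),
  ASN  = c(3L, 3L, 4L, 4L, 4L, 21L, 21L, 21L, 21L),
  TS   = c("R", "R", "R", "R", "R", "L", "L", "L", "L"),
  Date = c("13-07-12", "02-08-12", "28-12-13", "16-10-15", "04-01-16",
           "01-12-12", "01-05-13", "04-06-13", "01-10-13"),
  Loss = c(156.930422, 168.401876, 120.492081, 46.012085, 48.262409,
           30.750564, 49.421243, 87.294821, 164.013138),
  stringsAsFactors = FALSE
)

result <- df %>%
  # parse the dd-mm-yy strings into a real Date column
  mutate(date = as.Date(Date, format = "%d-%m-%y")) %>%
  group_by(WC, ASN, TS) %>%
  # make sure rows are in chronological order inside each group
  arrange(date, .by_group = TRUE) %>%
  mutate(
    # lag() yields NA on the first row of each group; coalesce() turns that into 0
    Days   = coalesce(as.numeric(date - lag(date)), 0),
    # running total of Loss within the group
    Loss_A = cumsum(Loss)
  ) %>%
  ungroup() %>%
  select(-date)

print(result)
```

The key difference from the attempt in the question is that the lag is taken on a Date column rather than on the raw string, and that the grouping includes ASN and TS, so both lag() and cumsum() restart at each group.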