Question

I have a data frame that looks like this:

WC     ASN  TS  Date        Loss
7101    3   R   13-07-12    156.930422
7101    3   R   02-08-12    168.401876
7101    4   R   28-12-13    120.492081
7101    4   R   16-10-15    46.012085
7101    4   R   04-01-16    48.262409
7101    21  L   01-12-12    30.750564
7101    21  L   01-05-13    49.421243
7101    21  L   04-06-13    87.294821
7101    21  L   01-10-13    164.013138

What I am trying is code like the following:

df %>% 
  select(WC, ASN, Date, Loss) %>%
  group_by(WC) %>% 
  arrange(WC, ASN, Date, Loss) %>%
  mutate(Days = Date - lag(Date))

to produce a new table like this:

WC      ASN TS  Date        Loss        Days    Loss_A
7101    3   R   13-07-12    156.930422  0   156.930422
7101    3   R   02-08-12    168.401876  20  325.332298
7101    4   R   28-12-13    120.492081  0   120.492081
7101    4   R   16-10-15    46.012085   657 166.504166
7101    4   R   04-01-16    48.262409   80  214.766575
7101    21  L   01-12-12    30.750564   0   30.750564
7101    21  L   01-05-13    49.421243   151 80.171807
7101    21  L   04-06-13    87.294821   34  167.466628
7101    21  L   01-10-13    164.013138  119 331.479766

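Here,

for each WC, ASN and TS (an ordered combination such as 7101,3,13-07-2012), Days is 0 for the first row and after that should be recent_date - lagged_date,
and Loss_A is the cumulative sum (cumsum) of Loss, restarting whenever at least one of WC, ASN and TS (an ordered combination such as 7101,3,13-07-2012) changes.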

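How can I modify the dplyr code to get the final table shown above? As far as I can tell, mutate() does not work when combined with lag(). Is there a better approach?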

Answer 1

This works:

df %>%
  # start by making a date column that's a recognized date class so you can perform
  # operations on it.
  mutate(date = as.Date(Date, format = "%d-%m-%y")) %>%
  # then group by all of the columns you want to use to id groups
  group_by(WC, ASN, TS) %>%
  # then compute the time intervals between rows using ifelse to deal with 1st rows,
  # and compute the cumulative total loss within each group.
  mutate(Days = ifelse(is.na(lag(date)), 0, date - lag(date)),
         Loss_A = cumsum(Loss)) %>%
  # drop the date column we created if you don't need it
  select(-date)

Result:

Source: local data frame [9 x 7]
Groups: WC, ASN, TS [3]

     WC   ASN    TS     Date      Loss  Days    Loss_A
  <int> <int> <chr>    <chr>     <dbl> <dbl>     <dbl>
1  7101     3     R 13-07-12 156.93042     0 156.93042
2  7101     3     R 02-08-12 168.40188    20 325.33230
3  7101     4     R 28-12-13 120.49208     0 120.49208
4  7101     4     R 16-10-15  46.01208   657 166.50417
5  7101     4     R 04-01-16  48.26241    80 214.76657
6  7101    21     L 01-12-12  30.75056     0  30.75056
7  7101    21     L 01-05-13  49.42124   151  80.17181
8  7101    21     L 04-06-13  87.29482    34 167.46663
9  7101    21     L 01-10-13 164.01314   119 331.47977
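
For anyone who wants to reproduce the result, here is a minimal self-contained sketch of the same idea. It rebuilds the sample data from the question and, as a variation that is not part of the answer above, replaces the ifelse() call with lag(date, default = first(date)), so the first row of each group is compared with itself and yields 0. The names df and result are just illustrative, and the extra arrange() step is an assumption on my part that rows should be in date order within each group.

```r
library(dplyr)

# Sample data copied from the question
df <- data.frame(
  WC   = 7101L,
  ASN  = c(3L, 3L, 4L, 4L, 4L, 21L, 21L, 21L, 21L),
  TS   = c("R", "R", "R", "R", "R", "L", "L", "L", "L"),
  Date = c("13-07-12", "02-08-12", "28-12-13", "16-10-15", "04-01-16",
           "01-12-12", "01-05-13", "04-06-13", "01-10-13"),
  Loss = c(156.930422, 168.401876, 120.492081, 46.012085, 48.262409,
           30.750564, 49.421243, 87.294821, 164.013138),
  stringsAsFactors = FALSE
)

result <- df %>%
  # parse the dd-mm-yy strings into a real Date column
  mutate(date = as.Date(Date, format = "%d-%m-%y")) %>%
  # sort on the parsed date; sorting the raw strings would not be chronological
  arrange(WC, ASN, TS, date) %>%
  group_by(WC, ASN, TS) %>%
  # first row of each group is compared with itself, giving 0 days
  mutate(Days   = as.numeric(date - lag(date, default = first(date))),
         Loss_A = cumsum(Loss)) %>%
  ungroup() %>%
  select(-date)

print(result)
```

Wrapping the difference in as.numeric() makes the conversion from a difftime to a plain number of days explicit, which ifelse() otherwise does implicitly by dropping the class.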
