How do I create a grouping variable that marks consecutive voyages?

Time: 2019-04-25 19:50:36

Tags: r dplyr

I have a set of ship voyages that I need to group according to a specific condition.

Ship| From     |  To      | Departure_From  | Departure_To
   1| HAMBURG  | SETUBAL  | 16-09-2018 22:12| 08-10-2018 13:42
   1| SETUBAL  | NAPOLI   | 08-10-2018 13:42| 16-10-2018 00:18
   2| HAMBURG  | SETUBAL  | 14-10-2018 18:30| 07-11-2018 13:55
   2| SETUBAL  | HAMBURG  | 07-11-2018 13:55| 20-11-2018 13:16
   3| JEDDAH   | ALGECIRAS| 10-05-2018 21:46| 30-05-2018 17:20
   3| ALGECIRAS| TANGIER  | 30-05-2018 17:20| 31-05-2018 08:41
   3| TANGIER  | ALGECIRAS| 05-09-2018 21:34| 13-09-2018 22:22
   3| ALGECIRAS| TANGIER  | 13-09-2018 22:22| 15-09-2018 08:40
   4| FOS      | ALGECIRAS| 05-09-2018 11:02| 07-09-2018 20:18
   4| ALGECIRAS| Baltiysk | 07-09-2018 20:18| 15-09-2018 05:28
   4| Baltiysk | GDANSK   | 15-09-2018 05:28| 16-09-2018 14:34

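The Ship column holds the ship number, and From and To are port names. Departure_From is the departure time from the From port, and Departure_To is the departure time from the To port. I need to group this dataset as follows. Note that when two legs belong to one consecutive voyage, the Departure_To date of one entry equals the Departure_From date of the next entry, and the ports match as well. If they differ, the next entry starts a different voyage.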

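  1. Ship 1 departs Hamburg for Setubal, then on its next leg departs Setubal for Napoli. The Departure_To date of the first entry equals the Departure_From date of the next entry, and the port matches as well, so this is one consecutive voyage. I want to merge it into a single voyage from Hamburg (the first port) to Napoli (the last port), with Departure_From holding the departure date from Hamburg and Departure_To holding the departure date from Napoli.
  2. Ship 3 has two voyages. The first runs from Jeddah to Algeciras and then from Algeciras to Tangier (a consecutive voyage). The second runs from Tangier to Algeciras and then from Algeciras back to Tangier. So in this case there should be two groups: one from Jeddah to Tangier and a second from Tangier to Tangier.
  3. Ship 4 is more complicated: it departs Fos for Algeciras, then Algeciras for Baltiysk, and finally Baltiysk for Gdansk. Here the three legs should be merged into one voyage from Fos to Gdansk, because they are consecutive (each Departure_To date equals the Departure_From date of the next entry).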

I would like the final result to look like this.

Ship| From     |  To      | Departure_From  | Departure_To
   1| HAMBURG  | NAPOLI   | 16-09-2018 22:12| 16-10-2018 00:18
   2| HAMBURG  | HAMBURG  | 14-10-2018 18:30| 20-11-2018 13:16
   3| JEDDAH   | TANGIER  | 10-05-2018 21:46| 31-05-2018 08:41
   3| TANGIER  | TANGIER  | 05-09-2018 21:34| 15-09-2018 08:40
   4| FOS      | GDANSK   | 05-09-2018 11:02| 16-09-2018 14:34

Code to create the dataset above.

data.frame(Ship= c(1,1,2,2,3,3,3,3,4,4,4), 
           From=c("HAMBURG","SETUBAL","HAMBURG","SETUBAL","JEDDAH","ALGECIRAS","TANGIER","ALGECIRAS","FOS SUR MER","ALGECIRAS","Baltiysk"), 
           To= c("SETUBAL","NAPOLI","SETUBAL","HAMBURG","ALGECIRAS","TANGIER","ALGECIRAS","TANGIER","ALGECIRAS","Baltiysk","GDANSK"), 
           Departure_From= c("16-09-2018  22:12:00",
                "08-10-2018  13:42:00",
                "14-10-2018  18:30:00",
                "07-11-2018  13:55:00",
                "10-05-2018  21:46:00",
                "30-05-2018  17:20:00",
                "05-09-2018  21:34:00",
                "13-09-2018  22:22:00",
                "05-09-2018  11:02:00",
                "07-09-2018  20:18:00",
                "15-09-2018  05:28:00"), 
           Departure_To= c("08-10-2018  13:42:00",
               "16-10-2018  00:18:00",
               "07-11-2018  13:55:00",
               "20-11-2018  13:16:00",
               "30-05-2018  17:20:00",
               "31-05-2018  08:41:00",
               "13-09-2018  22:22:00",
               "15-09-2018  08:40:00",
               "07-09-2018  20:18:00",
               "15-09-2018  05:28:00",
               "16-09-2018  14:34:00"
))

Any help would be greatly appreciated. (I would prefer to do this in the tidyverse, since I am comfortable with it.)

1 answer:

Answer 0: (score: 3)

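The trick to creating a grouping ID is to use cumsum together with dplyr::lag (or lead), and to work out a condition that is TRUE only on the rows where a new group should start. Here a new voyage starts wherever a row's Departure_From differs from the previous row's Departure_To. The first row for each ship is automatically different, because we set default = "" in lag.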
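To see why the running sum of those flags produces a group ID, here is a minimal sketch on made-up values. The vectors `departure_from` and `departure_to` and the labels "t1", "t2", and so on are hypothetical stand-ins for one ship's timestamps, not part of the question's data.

```r
library(dplyr)

# Hypothetical toy data for a single ship: three legs, where the third
# leg does not start when the second one ended.
departure_from <- c("t1", "t2", "t9")
departure_to   <- c("t2", "t3", "t10")

# TRUE wherever a leg does NOT continue the previous one. The first leg
# is compared against the default "" and is therefore always TRUE.
new_trip <- departure_from != lag(departure_to, default = "")
new_trip
#> [1]  TRUE FALSE  TRUE

# The running total of the flags gives one ID per voyage.
trip_num <- cumsum(new_trip)
trip_num
#> [1] 1 1 2
```

Each TRUE bumps the counter by one, so every row inherits the number of the most recent voyage start.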

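Once each ship's legs have a trip number, it is easy to use summarise to take the first and last values for each voyage. Note that the data you provided names the city FOS SUR MER, not FOS as in your tables, so that is what appears in the output.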

library(tidyverse)
tbl <- tibble(
  Ship = c(1, 1, 2, 2, 3, 3, 3, 3, 4, 4, 4),
  From = c("HAMBURG", "SETUBAL", "HAMBURG", "SETUBAL", "JEDDAH", "ALGECIRAS",
           "TANGIER", "ALGECIRAS", "FOS SUR MER", "ALGECIRAS", "Baltiysk"),
  To = c("SETUBAL", "NAPOLI", "SETUBAL", "HAMBURG", "ALGECIRAS", "TANGIER",
         "ALGECIRAS", "TANGIER", "ALGECIRAS", "Baltiysk", "GDANSK"),
  Departure_From = c(
    "16-09-2018  22:12:00", "08-10-2018  13:42:00", "14-10-2018  18:30:00",
    "07-11-2018  13:55:00", "10-05-2018  21:46:00", "30-05-2018  17:20:00",
    "05-09-2018  21:34:00", "13-09-2018  22:22:00", "05-09-2018  11:02:00",
    "07-09-2018  20:18:00", "15-09-2018  05:28:00"
  ),
  Departure_To = c(
    "08-10-2018  13:42:00", "16-10-2018  00:18:00", "07-11-2018  13:55:00",
    "20-11-2018  13:16:00", "30-05-2018  17:20:00", "31-05-2018  08:41:00",
    "13-09-2018  22:22:00", "15-09-2018  08:40:00", "07-09-2018  20:18:00",
    "15-09-2018  05:28:00", "16-09-2018  14:34:00"
  )
)
tbl %>%
  group_by(Ship) %>%
  mutate(trip_num = cumsum(Departure_From != lag(Departure_To, default = ""))) %>%
  group_by(Ship, trip_num) %>%
  summarise(
    From = first(From),
    To = last(To),
    Departure_From = first(Departure_From),
    Departure_To = last(Departure_To)
  )
#> # A tibble: 5 x 6
#> # Groups:   Ship [4]
#>    Ship trip_num From        To      Departure_From      Departure_To      
#>   <dbl>    <int> <chr>       <chr>   <chr>               <chr>             
#> 1     1        1 HAMBURG     NAPOLI  16-09-2018  22:12:… 16-10-2018  00:18…
#> 2     2        1 HAMBURG     HAMBURG 14-10-2018  18:30:… 20-11-2018  13:16…
#> 3     3        1 JEDDAH      TANGIER 10-05-2018  21:46:… 31-05-2018  08:41…
#> 4     3        2 TANGIER     TANGIER 05-09-2018  21:34:… 15-09-2018  08:40…
#> 5     4        1 FOS SUR MER GDANSK  05-09-2018  11:02:… 16-09-2018  14:34…

Created on 2019-04-25 by the reprex package (v0.2.1)
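The question says that ports must match as well as timestamps, while the code above compares only timestamps. On the sample data the two rules agree, but if a stricter check is wanted, one possible variant is sketched below. It is an assumption about how to extend the answer, not part of it: the names `legs`, `new_trip`, and `voyages` are made up here, and the data is only ship 3's rows re-typed from the question so the snippet runs on its own.

```r
library(dplyr)

# Ship 3's rows from the question, re-typed so this sketch is self-contained.
legs <- data.frame(
  Ship = c(3, 3, 3, 3),
  From = c("JEDDAH", "ALGECIRAS", "TANGIER", "ALGECIRAS"),
  To   = c("ALGECIRAS", "TANGIER", "ALGECIRAS", "TANGIER"),
  Departure_From = c("10-05-2018  21:46:00", "30-05-2018  17:20:00",
                     "05-09-2018  21:34:00", "13-09-2018  22:22:00"),
  Departure_To   = c("30-05-2018  17:20:00", "31-05-2018  08:41:00",
                     "13-09-2018  22:22:00", "15-09-2018  08:40:00"),
  stringsAsFactors = FALSE
)

voyages <- legs %>%
  group_by(Ship) %>%
  mutate(
    # A leg starts a new voyage when either the timestamp or the port
    # fails to match the previous leg of the same ship.
    new_trip = Departure_From != lag(Departure_To, default = "") |
      From != lag(To, default = ""),
    trip_num = cumsum(new_trip)
  ) %>%
  group_by(Ship, trip_num) %>%
  summarise(
    From = first(From),
    To = last(To),
    Departure_From = first(Departure_From),
    Departure_To = last(Departure_To)
  ) %>%
  ungroup()

voyages
```

For these four rows the result has two voyages, JEDDAH to TANGIER and TANGIER to TANGIER, matching the expected output in the question. Both comparisons rely on exact string equality, so inconsistent spelling or spacing in port names or timestamps would split a voyage.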