I have some voyages that need to be grouped according to specific conditions.
Ship| From | To | Departure_From | Departure_To
1| HAMBURG | SETUBAL | 16-09-2018 22:12| 08-10-2018 13:42
1| SETUBAL | NAPOLI | 08-10-2018 13:42| 16-10-2018 00:18
2| HAMBURG | SETUBAL | 14-10-2018 18:30| 07-11-2018 13:55
2| SETUBAL | HAMBURG | 07-11-2018 13:55| 20-11-2018 13:16
3| JEDDAH | ALGECIRAS| 10-05-2018 21:46| 30-05-2018 17:20
3| ALGECIRAS| TANGIER | 30-05-2018 17:20| 31-05-2018 08:41
3| TANGIER | ALGECIRAS| 05-09-2018 21:34| 13-09-2018 22:22
3| ALGECIRAS| TANGIER | 13-09-2018 22:22| 15-09-2018 08:40
4| FOS | ALGECIRAS| 05-09-2018 11:02| 07-09-2018 20:18
4| ALGECIRAS| Baltiysk | 07-09-2018 20:18| 15-09-2018 05:28
4| Baltiysk | GDANSK | 15-09-2018 05:28| 16-09-2018 14:34
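The Ship column holds the ship number, and the From and To columns are port names. Departure_From is the departure from the From port, and Departure_To is the departure from the To port. I need to group this dataset as follows: if the voyage is continuous, the Departure_To date is the same as the next entry's Departure_From date, and the port is the same as well. If they differ, it is a different voyage.
I would like the final result to look like this.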
Ship| From | To | Departure_From | Departure_To
1| HAMBURG | NAPOLI | 16-09-2018 22:12| 16-10-2018 00:18
2| HAMBURG | HAMBURG | 14-10-2018 18:30| 20-11-2018 13:16
3| JEDDAH | TANGIER | 10-05-2018 21:46| 31-05-2018 08:41
3| TANGIER | TANGIER | 05-09-2018 21:34| 15-09-2018 08:40
4| FOS | GDANSK | 05-09-2018 11:02| 16-09-2018 14:34
Code to create the dataset above.
data.frame(Ship= c(1,1,2,2,3,3,3,3,4,4,4),
From=c("HAMBURG","SETUBAL","HAMBURG","SETUBAL","JEDDAH","ALGECIRAS","TANGIER","ALGECIRAS","FOS SUR MER","ALGECIRAS","Baltiysk"),
To= c("SETUBAL","NAPOLI","SETUBAL","HAMBURG","ALGECIRAS","TANGIER","ALGECIRAS","TANGIER","ALGECIRAS","Baltiysk","GDANSK"),
Departure_From= c("16-09-2018 22:12:00",
"08-10-2018 13:42:00",
"14-10-2018 18:30:00",
"07-11-2018 13:55:00",
"10-05-2018 21:46:00",
"30-05-2018 17:20:00",
"05-09-2018 21:34:00",
"13-09-2018 22:22:00",
"05-09-2018 11:02:00",
"07-09-2018 20:18:00",
"15-09-2018 05:28:00"),
Departure_To= c("08-10-2018 13:42:00",
"16-10-2018 00:18:00",
"07-11-2018 13:55:00",
"20-11-2018 13:16:00",
"30-05-2018 17:20:00",
"31-05-2018 08:41:00",
"13-09-2018 22:22:00",
"15-09-2018 08:40:00",
"07-09-2018 20:18:00",
"15-09-2018 05:28:00",
"16-09-2018 14:34:00"
))
Any help would be appreciated. (I would prefer to do this in the tidyverse, as I am comfortable with it.)
Answer 0 (score: 3)
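The trick to creating a grouping ID is to use cumsum together with dplyr::lag (or lead), and to work out a condition that is TRUE only on the rows where a new group should start. Here we want to flag a new journey whenever a row's Departure_From differs from the previous row's Departure_To. The first row for each ship is automatically different, because we set default = "" and no departure time equals the empty string.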
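As a small illustration of the flag-then-cumsum idea on its own, here is a minimal sketch using made-up placeholder values for a single ship (the vectors dep_from and dep_to and the labels "t0", "t1", and so on are hypothetical, not taken from the question):

```r
library(dplyr)

# Hypothetical legs for one ship: the third leg does not continue the second
dep_from <- c("t0", "t1", "t5")
dep_to   <- c("t1", "t2", "t6")

# Shift Departure_To down by one row; the first row is compared against ""
previous_to <- dplyr::lag(dep_to, default = "")

# TRUE wherever a leg does not start at the moment the previous leg ended
starts_new <- dep_from != previous_to   # TRUE FALSE TRUE

# The running total of the flags gives a journey number per leg
trip_num <- cumsum(starts_new)          # 1 1 2
```

Each TRUE bumps the counter by one, so all legs between two flags share the same journey number.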
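Once every leg has a journey number within its ship, it is easy to summarise each journey by taking its first and last values. Note that the data you provided calls the city FOS SUR MER, whereas the tables in the question show FOS.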
library(tidyverse)
tbl <- tibble(Ship = c(1, 1, 2, 2, 3, 3, 3, 3, 4, 4, 4), From = c("HAMBURG", "SETUBAL", "HAMBURG", "SETUBAL", "JEDDAH", "ALGECIRAS", "TANGIER", "ALGECIRAS", "FOS SUR MER", "ALGECIRAS", "Baltiysk"), To = c("SETUBAL", "NAPOLI", "SETUBAL", "HAMBURG", "ALGECIRAS", "TANGIER", "ALGECIRAS", "TANGIER", "ALGECIRAS", "Baltiysk", "GDANSK"), Departure_From = c("16-09-2018 22:12:00", "08-10-2018 13:42:00", "14-10-2018 18:30:00", "07-11-2018 13:55:00", "10-05-2018 21:46:00", "30-05-2018 17:20:00", "05-09-2018 21:34:00", "13-09-2018 22:22:00", "05-09-2018 11:02:00", "07-09-2018 20:18:00", "15-09-2018 05:28:00"), Departure_To = c("08-10-2018 13:42:00", "16-10-2018 00:18:00", "07-11-2018 13:55:00", "20-11-2018 13:16:00", "30-05-2018 17:20:00", "31-05-2018 08:41:00", "13-09-2018 22:22:00", "15-09-2018 08:40:00", "07-09-2018 20:18:00", "15-09-2018 05:28:00", "16-09-2018 14:34:00"))
tbl %>%
group_by(Ship) %>%
mutate(trip_num = cumsum(Departure_From != lag(Departure_To, default = ""))) %>%
group_by(Ship, trip_num) %>%
summarise(
From = first(From),
To = last(To),
Departure_From = first(Departure_From),
Departure_To = last(Departure_To)
)
#> # A tibble: 5 x 6
#> # Groups: Ship [4]
#> Ship trip_num From To Departure_From Departure_To
#> <dbl> <int> <chr> <chr> <chr> <chr>
#> 1 1 1 HAMBURG NAPOLI 16-09-2018 22:12:… 16-10-2018 00:18…
#> 2 2 1 HAMBURG HAMBURG 14-10-2018 18:30:… 20-11-2018 13:16…
#> 3 3 1 JEDDAH TANGIER 10-05-2018 21:46:… 31-05-2018 08:41…
#> 4 3 2 TANGIER TANGIER 05-09-2018 21:34:… 15-09-2018 08:40…
#> 5 4 1 FOS SUR MER GDANSK 05-09-2018 11:02:… 16-09-2018 14:34…
Created on 2019-04-25 by the reprex package (v0.2.1)
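The question also says that the ports should match between consecutive legs, which the comparison above does not check. The following is a sketch of a stricter variant, not part of the original answer: it assumes the timestamps are always in day-month-year order, parses them with lubridate::dmy_hms, sorts the legs, and starts a new journey when either the time or the port fails to line up. The names legs, journeys, and new_trip are illustrative, and only Ship 3 from the question is used to keep it short.

```r
library(dplyr)
library(lubridate)

# Ship 3 from the question: two journeys of two legs each
legs <- tibble(
  Ship = c(3, 3, 3, 3),
  From = c("JEDDAH", "ALGECIRAS", "TANGIER", "ALGECIRAS"),
  To   = c("ALGECIRAS", "TANGIER", "ALGECIRAS", "TANGIER"),
  Departure_From = c("10-05-2018 21:46:00", "30-05-2018 17:20:00",
                     "05-09-2018 21:34:00", "13-09-2018 22:22:00"),
  Departure_To   = c("30-05-2018 17:20:00", "31-05-2018 08:41:00",
                     "13-09-2018 22:22:00", "15-09-2018 08:40:00")
)

journeys <- legs %>%
  # Parse the strings so that sorting and comparison are chronological
  mutate(across(c(Departure_From, Departure_To), dmy_hms)) %>%
  arrange(Ship, Departure_From) %>%
  group_by(Ship) %>%
  mutate(
    # A new journey starts on the ship's first row, or when the leg does not
    # continue the previous one in time or in port
    new_trip = row_number() == 1 |
      Departure_From != lag(Departure_To) |
      From != lag(To),
    trip_num = cumsum(new_trip)
  ) %>%
  group_by(Ship, trip_num) %>%
  summarise(
    From = first(From),
    To = last(To),
    Departure_From = first(Departure_From),
    Departure_To = last(Departure_To),
    .groups = "drop"
  )
```

For this ship the port check alone would not split the legs, since the third leg leaves from TANGIER where the second one arrived; the gap in the timestamps is what separates the two journeys.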