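I have data on customers who bought airline tickets online, and I want to find out how many of them booked a return trip. In other words, for the same person and account, I want to match the origin city of one row against the destination city of the adjacent row, and vice versa. That would give me their round-trip records, from which I then want to calculate the length of each trip in days. I am trying to do this in R, but I cannot work out how to match the origin of one row against the destination of the adjacent row and vice versa.
I have sorted the data by account number to check by hand whether there are return trips, and there are many.
The data looks like this: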
Account number  origin city  Destination city  Date
1               London       chicago           7/22/2018
2               Milan        London            7/23/2018
2               London       Milan             7/28/2018
1               chicago      london            8/22/2018
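Answer 0 (score: 2)
Another option is to join the data to itself with the origin and destination fields swapped.
Edit: added `trip_num` to better handle repeat trips by the same person.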
library(dplyr)

# First, convert the Date field to Date type
df <- df %>%
  mutate(Date = lubridate::mdy(Date)) %>%
  # Upper-case the city names so that "london" matches "London"
  # (update with M-M's suggestion in comments)
  mutate_at(.vars = vars(origin_city, Destination_city), .funs = toupper) %>%
  # EDIT: adding trip_num to protect against extraneous joins for repeat trips
  group_by(Account_number, origin_city, Destination_city) %>%
  mutate(trip_num = row_number()) %>%
  ungroup()

# Join df to itself with origin and destination swapped
df2 <- df %>%
  left_join(df, by = c("Account_number", "trip_num",
                       "origin_city" = "Destination_city",
                       "Destination_city" = "origin_city")) %>%
  # Days between this leg (Date.x) and the matching opposite leg (Date.y)
  mutate(days = (Date.x - Date.y)/lubridate::ddays(1))
> df2
# A tibble: 6 x 7
  Account_number origin_city Destination_city Date.x     trip_num Date.y      days
           <int> <chr>       <chr>            <date>        <int> <date>     <dbl>
1              1 LONDON      CHICAGO          2018-07-22        1 2018-08-22   -31
2              2 MILAN       LONDON           2018-07-23        1 2018-07-28    -5
3              2 LONDON      MILAN            2018-07-28        1 2018-07-23     5
4              1 CHICAGO     LONDON           2018-08-22        1 2018-07-22    31
5              2 MILAN       LONDON           2018-08-23        2 2018-08-28    -5
6              2 LONDON      MILAN            2018-08-28        2 2018-08-23     5
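Each round trip appears twice in the output above, once from the outbound leg and once from the return leg, with `days` of opposite sign. To answer the original question (how many customers booked a return, and how long each trip was), one possible follow-up is to keep only the row where the return date is later. The sketch below is self-contained and rebuilds the sample data; the names `legs`, `round_trips` and `n_accounts_with_return` are made up for illustration, and it uses base `as.Date` instead of `lubridate::mdy`. It also assumes that the n-th outbound leg on a route pairs with the n-th return leg, which is what `trip_num` encodes.

```r
library(dplyr)

# Same sample data as in this answer, rebuilt so the sketch runs on its own
df <- read.table(
  header = TRUE,
  stringsAsFactors = FALSE,
  text = "Account_number origin_city Destination_city Date
1 London chicago 7/22/2018
2 Milan London 7/23/2018
2 London Milan 7/28/2018
1 chicago london 8/22/2018
2 Milan London 8/23/2018
2 London Milan 8/28/2018")

# Parse dates, normalise city case, and number repeat trips per route
legs <- df %>%
  mutate(Date = as.Date(Date, format = "%m/%d/%Y"),
         origin_city = toupper(origin_city),
         Destination_city = toupper(Destination_city)) %>%
  group_by(Account_number, origin_city, Destination_city) %>%
  mutate(trip_num = row_number()) %>%
  ungroup()

# Self-join with origin and destination swapped; inner_join drops one-way trips
round_trips <- legs %>%
  inner_join(legs,
             by = c("Account_number", "trip_num",
                    "origin_city" = "Destination_city",
                    "Destination_city" = "origin_city"),
             suffix = c("_out", "_back")) %>%
  mutate(days = as.numeric(Date_back - Date_out, units = "days")) %>%
  # Keep each round trip once: the row whose matched leg is the later one
  filter(days > 0) %>%
  select(Account_number, origin_city, Destination_city,
         Date_out, Date_back, days)

# Number of distinct accounts with at least one round trip
n_accounts_with_return <- n_distinct(round_trips$Account_number)
```

On the sample data this leaves three round trips (one for account 1, two for account 2) across two accounts. If legs can be booked out of order, the `trip_num` pairing would need to be replaced by something stricter, such as matching each outbound leg to the next later return.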
Data (with a repeat trip added for account 2):
df <- read.table(
  header = TRUE,
  stringsAsFactors = FALSE,
  text = "Account_number origin_city Destination_city Date
1 London chicago 7/22/2018
2 Milan London 7/23/2018
2 London Milan 7/28/2018
1 chicago london 8/22/2018
2 Milan London 8/23/2018
2 London Milan 8/28/2018")
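The repeat trip for account 2 is what makes `trip_num` necessary. As a rough illustration, the sketch below runs the swapped self-join on the same data with and without `trip_num` and compares the row counts. The names `legs`, `swap`, `without_trip_num` and `with_trip_num` are made up for this example, and `suppressWarnings()` is there only because recent dplyr versions warn about many-to-many joins.

```r
library(dplyr)

df <- read.table(
  header = TRUE,
  stringsAsFactors = FALSE,
  text = "Account_number origin_city Destination_city Date
1 London chicago 7/22/2018
2 Milan London 7/23/2018
2 London Milan 7/28/2018
1 chicago london 8/22/2018
2 Milan London 8/23/2018
2 London Milan 8/28/2018")

# Normalise city case so that "london" matches "London"
legs <- df %>%
  mutate(origin_city = toupper(origin_city),
         Destination_city = toupper(Destination_city))

# Join keys with origin and destination swapped
swap <- c("Account_number",
          "origin_city" = "Destination_city",
          "Destination_city" = "origin_city")

# Without trip_num, every Milan-London leg of account 2 matches
# every London-Milan leg of that account, not just its own return
without_trip_num <- suppressWarnings(inner_join(legs, legs, by = swap))

# With trip_num, the n-th leg on a route only matches the n-th opposite leg
numbered <- legs %>%
  group_by(Account_number, origin_city, Destination_city) %>%
  mutate(trip_num = row_number()) %>%
  ungroup()
with_trip_num <- inner_join(numbered, numbered, by = c(swap, "trip_num"))

nrow(without_trip_num)  # 10
nrow(with_trip_num)     # 6
```

The four extra rows pair a July leg with an August leg for account 2, which would show up as spurious trips of 31 or 36 days.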