I have a dataset in the following format:
CustomerId City ProductID Related_Products
1 A 102 100,102,103,104,105
1 A 105 102, 200, 302
2 B 234 100, 202
3 C 340 343, 432
4 C 400 401
and
ProductID City OfferID
102 A 1000
100 A 1001
401 C 1002
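I want to join these two tables so that if a ProductID from the second table appears in the Related_Products column of the first table, and the corresponding City matches, the customer is notified of the offer for that product.
Final output: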
CustomerId City ProductID Related_Products Offers
1 A 102 100,102,103,104,105 1000, 1001
1 A 105 102, 200, 302 NA
2 B 234 100, 202 NA
3 C 340 343, 432 NA
4 C 400 401 1002
Note: all numbers are IDs. Related_Products is a string column of concatenated IDs, but I can also store it as a list (instead of a comma-separated string):
CustomerId City ProductID Related_Products Offers
1 A 102 list(100,102,104,105,401) 1001,1000
1 A 105 list(102, 200, 302) NA
2 B 234 list(100, 202) NA
3 C 340 list(343, 432) NA
4 C 400 list(401) 1002
Answer 0 (score: 3)
Using separate_rows from tidyr, we can bring df1 into long format, left_join it with df2, and then collapse the data back into comma-separated values grouped by CustomerId, ProductID, and City.
library(dplyr)
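df1 %>%
  # one row per related product; convert = TRUE turns the IDs into integers
  tidyr::separate_rows(Related_Products, convert = TRUE) %>%
  # attach offers where both the city and the related product match
  left_join(df2, by = c("City" = "City", "Related_Products" = "ProductID")) %>%
  group_by(CustomerId, ProductID, City) %>%
  # collapse back to one row per customer/product
  summarise(Related_Products = toString(Related_Products),
            Offer = toString(na.omit(OfferID)))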
# CustomerId ProductID City Related_Products Offer
# <int> <int> <chr> <chr> <chr>
#1 1 102 A 100, 102, 103, 104, 105 1001, 1000
#2 1 105 A 102, 200, 302 1000
#3 2 234 B 100, 202 ""
#4 3 340 C 343, 432 ""
#5 4 400 C 401 1002
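The output above shows an empty string where no offer matches, while the question asks for NA. A minimal sketch of one way to get NA, assuming dplyr's na_if is acceptable here; the final ungroup/mutate step and the name `res` are additions for illustration and not part of the original answer:

```r
library(dplyr)

# Same sample data as in this answer
df1 <- data.frame(
  CustomerId = c(1L, 1L, 2L, 3L, 4L),
  City = c("A", "A", "B", "C", "C"),
  ProductID = c(102L, 105L, 234L, 340L, 400L),
  Related_Products = c("100,102,103,104,105", "102,200,302",
                       "100,202", "343,432", "401"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  ProductID = c(102L, 100L, 401L),
  City = c("A", "A", "C"),
  OfferID = 1000:1002,
  stringsAsFactors = FALSE
)

res <- df1 %>%
  tidyr::separate_rows(Related_Products, convert = TRUE) %>%
  left_join(df2, by = c("City" = "City", "Related_Products" = "ProductID")) %>%
  group_by(CustomerId, ProductID, City) %>%
  summarise(Related_Products = toString(Related_Products),
            Offer = toString(na.omit(OfferID))) %>%
  ungroup() %>%
  # added step (assumption): turn the empty strings into NA
  mutate(Offer = na_if(Offer, ""))

res$Offer
```

Note that row 2 still gets offer 1000, because product 102 is in its related products and the city is A.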
Data
df1 <- structure(list(CustomerId = c(1L, 1L, 2L, 3L, 4L), City = c("A",
"A", "B", "C", "C"), ProductID = c(102L, 105L, 234L, 340L, 400L
), Related_Products = c("100,102,103,104,105", "102,200,302",
"100,202", "343,432", "401")), class = "data.frame", row.names = c(NA,-5L))
df2 <- structure(list(ProductID = c(102L, 100L, 401L), City = c("A",
"A", "C"), OfferID = 1000:1002), class = "data.frame", row.names = c(NA, -3L))
Answer 1 (score: 2)
We can use regex_left_join from fuzzyjoin.
library(fuzzyjoin)
library(dplyr)
library(stringr)
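# ProductID and City from df2 are used as regex patterns against df1
regex_left_join(df1, df2, by = c("Related_Products" = "ProductID", "City")) %>%
  group_by(CustomerId, City = City.x,
           ProductID = ProductID.x, Related_Products) %>%
  # str_c() returns NA when there is no matching offer
  summarise(OfferID = str_c(OfferID, collapse = ","))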
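One caveat with a regex-based match: the ProductID is treated as a pattern, so a shorter ID can match inside a longer one. The base R sketch below illustrates this with a hypothetical ProductID of 10, which is not in the sample data; the word-boundary pattern is a suggestion, not something the original answer uses:

```r
related <- "100,102,103,104,105"

# Hypothetical ID 10 is not in the list, but it is a substring of "100"
loose_match  <- grepl("10", related, perl = TRUE)

# Anchoring the ID with word boundaries avoids the partial match
strict_miss  <- grepl("\\b10\\b", related, perl = TRUE)
strict_match <- grepl("\\b102\\b", related, perl = TRUE)

c(loose_match, strict_miss, strict_match)
```

With the sample data all IDs are three digits, so the issue does not arise there, but it may matter on real data with IDs of mixed length.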
Data
df1 <- structure(list(CustomerId = c(1L, 1L, 2L, 3L, 4L), City = c("A",
"A", "B", "C", "C"), ProductID = c(102L, 105L, 234L, 340L, 400L
), Related_Products = c("100,102,103,104,105", "102, 200, 302",
"100, 202", "343, 432", "401")), class = "data.frame", row.names = c(NA,
-5L))
df2 <- structure(list(ProductID = c(102L, 100L, 401L), City = c("A",
"A", "C"), OfferID = 1000:1002), class = "data.frame", row.names = c(NA,
-3L))
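The question mentions that Related_Products could also be stored as a list. Under that assumption, the lookup can be sketched in base R without reshaping or joining. The column names `Related_list` and `Offers` are hypothetical, and this is only an illustration, not a replacement for the approaches above:

```r
df1 <- data.frame(
  CustomerId = c(1L, 1L, 2L, 3L, 4L),
  City = c("A", "A", "B", "C", "C"),
  ProductID = c(102L, 105L, 234L, 340L, 400L),
  Related_Products = c("100,102,103,104,105", "102, 200, 302",
                       "100, 202", "343, 432", "401"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  ProductID = c(102L, 100L, 401L),
  City = c("A", "A", "C"),
  OfferID = 1000:1002,
  stringsAsFactors = FALSE
)

# Build the list column: one integer vector of related IDs per row
df1$Related_list <- lapply(strsplit(df1$Related_Products, ","),
                           function(x) as.integer(trimws(x)))

# For each row, collect offers whose city matches and whose product
# is among the related IDs; NA when nothing matches
df1$Offers <- mapply(function(ids, city) {
  hit <- df2$OfferID[df2$City == city & df2$ProductID %in% ids]
  if (length(hit) == 0) NA_character_ else toString(hit)
}, df1$Related_list, df1$City)

df1[, c("CustomerId", "City", "ProductID", "Offers")]
```

Because `%in%` compares whole integers, this avoids partial matches between IDs. Offers come back in the order they appear in df2.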