R

Time: 2019-12-04 04:38:34

Tags: r join dplyr

I have a dataset in the following format:

CustomerId   City    ProductID     Related_Products 

    1         A         102         100,102,103,104,105
    1         A         105         102, 200, 302
    2         B         234         100, 202
    3         C         340         343, 432
    4         C         400         401

ProductID     City      OfferID
  102          A          1000
  100          A          1001
  401          C          1002

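I want to join these two tables so that, if a ProductID from the second table appears in the Related_Products column of the first table and the corresponding City matches, the customer is notified of the offer for that product.

Final output: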

CustomerId   City    ProductID     Related_Products         Offers

    1         A         102         100,102,103,104,105     1000, 1001
    1         A         105         102, 200, 302            NA
    2         B         234         100, 202                 NA   
    3         C         340         343, 432                 NA
    4         C         400         401                      1002

Note: all numbers are IDs. Related_Products is a concatenated string column, but I can also store it as a list (instead of a comma-separated string):

CustomerId   City    ProductID     Related_Products                  Offers

    1         A         102         list(100,102,104,105,401)         1001,1000
    1         A         105         list(102, 200, 302)                NA
    2         B         234         list(100, 202)                     NA   
    3         C         340         list(343, 432)                     NA
    4         C         400         list(401)                         1002

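2 answers:

Answer 0 (score: 3)

Using separate_rows from tidyr, we can bring df1 into long format, left_join it with df2, and then collapse the data back into comma-separated values, grouped by CustomerId, ProductID and City.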

library(dplyr)

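df1 %>%
  # Split the comma-separated IDs into one row per related product
  tidyr::separate_rows(Related_Products, convert = TRUE) %>%
  # Attach an offer where both the city and the related product match
  left_join(df2, by = c("City" = "City", "Related_Products" = "ProductID")) %>%
  group_by(CustomerId, ProductID, City) %>%
  # Collapse back to one row per original record
  summarise(Related_Products = toString(Related_Products),
            Offer = toString(na.omit(OfferID)))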

#  CustomerId ProductID City  Related_Products        Offer     
#       <int>     <int> <chr> <chr>                   <chr>     
#1          1       102 A     100, 102, 103, 104, 105 1001, 1000
#2          1       105 A     102, 200, 302           1000      
#3          2       234 B     100, 202                ""        
#4          3       340 C     343, 432                ""        
#5          4       400 C     401                     1002 
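
The output above shows an empty string where no offer matched, whereas the question asks for NA. A minimal sketch of one way to get NA instead, assuming a small helper (collapse_or_na is a hypothetical name, not part of the original answer) is used in place of toString(na.omit(OfferID)):

```r
# Hypothetical helper (an assumption, not from the original answer):
# collapse the non-missing values into one string, or return NA when none remain.
collapse_or_na <- function(x) {
  x <- x[!is.na(x)]
  if (length(x) == 0) NA_character_ else toString(x)
}

collapse_or_na(c(1001L, 1000L))  # "1001, 1000"
collapse_or_na(c(NA, NA))        # NA
```

Inside summarise this would read Offer = collapse_or_na(OfferID).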

Data

df1 <- structure(list(CustomerId = c(1L, 1L, 2L, 3L, 4L), City = c("A", 
"A", "B", "C", "C"), ProductID = c(102L, 105L, 234L, 340L, 400L
), Related_Products = c("100,102,103,104,105", "102,200,302", 
"100,202", "343,432", "401")), class = "data.frame", row.names = c(NA,-5L))

df2 <- structure(list(ProductID = c(102L, 100L, 401L), City = c("A", 
"A", "C"), OfferID = 1000:1002), class = "data.frame", row.names = c(NA, -3L))

Answer 1 (score: 2)

We can use regex_left_join from fuzzyjoin.

library(fuzzyjoin)
library(dplyr)
library(stringr)
# Treat each ProductID as a pattern to look for inside Related_Products
regex_left_join(df1, df2, by = c("Related_Products" = "ProductID", "City")) %>%
  group_by(CustomerId, City = City.x,
           ProductID = ProductID.x, Related_Products) %>%
  summarise(OfferID = str_c(OfferID, collapse = ","))

Data

df1 <- structure(list(CustomerId = c(1L, 1L, 2L, 3L, 4L), City = c("A", 
        "A", "B", "C", "C"), ProductID = c(102L, 105L, 234L, 340L, 400L
        ), Related_Products = c("100,102,103,104,105", "102, 200, 302", 
        "100, 202", "343, 432", "401")), class = "data.frame", row.names = c(NA, 
        -5L))

df2 <- structure(list(ProductID = c(102L, 100L, 401L), City = c("A", 
        "A", "C"), OfferID = 1000:1002), class = "data.frame", row.names = c(NA, 
        -3L))
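
One caveat with the regex approach: a bare ID used as a pattern matches any substring, so 102 would also match a longer ID such as 1102. The sample data has no such collision, so this is a hypothetical case. A small base R illustration of the difference that word boundaries make:

```r
# Hypothetical strings: the second contains 1102, not 102.
related <- c("100,102,103", "1102, 200")

plain   <- grepl("102", related, perl = TRUE)         # substring match
bounded <- grepl("\\b102\\b", related, perl = TRUE)   # whole-ID match

plain    # TRUE TRUE
bounded  # TRUE FALSE
```

If collisions like this are possible in real data, one option would be to join on a pattern column built with word boundaries around each ProductID rather than on the raw ID.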