dplyr left join with case_when

Time: 2017-11-17 19:47:48

Tags: r dplyr left-join

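Given two tables, is it possible to join to the right table only under certain conditions?

In this case, if the right table has ProductHierarchyType == "LINE", I want to join on the left table's ProductLineID column. The same rule should then apply along the hierarchy CLASS > GROUP > SUBGROUP > LINE.

I tried creating additional columns in prodmap2, but that leaves me with extra columns, and I am not confident about how to handle the condition correctly.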

prodmap2 <- prodmap %>%
  mutate(ProdClass = case_when(ProductHierarchyType == "CLASS" ~ ProductHierarchyID)) %>%
  mutate(ProdGroup = case_when(ProductHierarchyType == "GROUP" ~ ProductHierarchyID)) %>%
  mutate(ProdSUBGroup = case_when(ProductHierarchyType == "SUBGROUP" ~ ProductHierarchyID)) %>%
  mutate(ProdLine = case_when(ProductHierarchyType == "LINE" ~ ProductHierarchyID))

Left table:

structure(list(TerritoryKey = c("800046", "800046", "800046", 
"800046", "800046", "800046"), Material = c("000-40", "003-01", 
"003-40", "004-00", "004-05", "005-40"), TotalSales = c(61.68, 
94.27, 48227.14, 422.88, 45.4, 3723.92), ProductClassID = c("0012", 
"0012", "0012", "0012", "0012", "0012"), ProductGroupID = c("00120001", 
"00120001", "00120001", "00120002", "00120002", "00120001"), 
    ProductSubGroupID = c("001200010002", "001200010002", "001200010002", 
    "001200020002", "001200020002", "001200010002"), ProductLineID = c("001200010002000001", 
    "001200010002000001", "001200010002000001", "001200020002000001", 
    "001200020002000001", "001200010002000001"), StartDate = c("1/1/2016", 
    "1/1/2016", "1/1/2016", "1/1/2016", "1/1/2016", "1/1/2016"
    ), EndDate = c("12/31/2099", "12/31/2099", "12/31/2099", 
    "12/31/2099", "12/31/2099", "12/31/2099")), .Names = c("TerritoryKey", 
"Material", "TotalSales", "ProductClassID", "ProductGroupID", 
"ProductSubGroupID", "ProductLineID", "StartDate", "EndDate"), row.names = c(NA, 
-6L), class = c("grouped_df", "tbl_df", "tbl", "data.frame"), vars = "TerritoryKey", drop = TRUE, indices = list(
    0:5), group_sizes = 6L, biggest_group_size = 6L, labels = structure(list(
    TerritoryKey = "800046"), row.names = c(NA, -1L), class = "data.frame", vars = "TerritoryKey", drop = TRUE, .Names = "TerritoryKey"))

Right table:

structure(list(CompProfileID = c("ALTC", "ALTC", "ALTC", "ALTC", 
"ALTC", "ALTC"), ProductBucketID = c("CORE", "CORE", "CORE", 
"CORE", "CORE", "CORE"), ProductHierarchyID = c("001200010001000001", 
"001200010001000003", "001200010001000009", "001200010002", "001200010003000001", 
"001200010004000004"), ProductHierarchyType = c("LINE", "LINE", 
"LINE", "SUBGROUP", "LINE", "LINE"), ExclusionFlag = c("N", "N", 
"N", "N", "N", "N"), StartDate = c("2017-01-01", "2017-01-01", 
"2017-01-01", "2017-01-01", "2017-01-01", "2017-01-01"), EndDate = c("2099-12-31", 
"2099-12-31", "2099-12-31", "2099-12-31", "2099-12-31", "2099-12-31"
), ExclusionType = c("", "", "", "", "", "")), .Names = c("CompProfileID", 
"ProductBucketID", "ProductHierarchyID", "ProductHierarchyType", 
"ExclusionFlag", "StartDate", "EndDate", "ExclusionType"), row.names = c(NA, 
6L), class = "data.frame")

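1 answer:

Answer 0 (score: 2)

Joining conditionally on multiple columns is hard. I suggest converting the data to "tidy data" form before joining. By that I mean collapsing the ID-related columns into a single pair of key and value columns.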

library(tidyr)
library(dplyr, warn.conflicts = FALSE)

left_table_tidy <- left_table %>%
  ungroup() %>% 
  tibble::rowid_to_column(var = "unique_ID") %>% 
  gather(key = "ID_type", value = "ID", matches("Product.*ID")) %>%
  mutate(ID_type = recode(ID_type,
                          ProductClassID = "CLASS",
                          ProductGroupID = "GROUP",
                          ProductSubGroupID = "SUBGROUP",
                          ProductLineID = "LINE"))
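
In tidyr 1.0.0 and later, gather() is superseded by pivot_longer(). The following is a minimal sketch of the same reshaping step with the newer function, assuming a recent tidyr. The data frame left_toy is a hypothetical stand-in with invented values; only its column names come from the question.

```r
library(dplyr, warn.conflicts = FALSE)
library(tidyr)

# Hypothetical stand-in for the left table (invented values, real column names).
left_toy <- tibble::tibble(
  Material          = c("000-40", "004-00"),
  ProductClassID    = c("0012", "0012"),
  ProductGroupID    = c("00120001", "00120002"),
  ProductSubGroupID = c("001200010002", "001200020002"),
  ProductLineID     = c("001200010002000001", "001200020002000001")
)

left_toy_tidy <- left_toy %>%
  # Keep a row identifier so the long rows can be traced back to their source row.
  tibble::rowid_to_column(var = "unique_ID") %>%
  # Collapse the four Product*ID columns into one key column and one value column.
  pivot_longer(cols = matches("Product.*ID"),
               names_to = "ID_type", values_to = "ID") %>%
  # Rename the keys so they match the values of ProductHierarchyType.
  mutate(ID_type = recode(ID_type,
                          ProductClassID    = "CLASS",
                          ProductGroupID    = "GROUP",
                          ProductSubGroupID = "SUBGROUP",
                          ProductLineID     = "LINE"))
```

Each original row becomes four long rows, one per hierarchy level, which is the shape the join below expects.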

Then you can join the data by ID and ID type.

table_joined <- inner_join(left_table_tidy,
                           right_table,
                           by = c("ID_type" = "ProductHierarchyType",
                                  "ID"      = "ProductHierarchyID"))

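As you might expect, this join can match more than one hierarchy type for a single original row. You therefore need to sort the rows in the order "CLASS > GROUP > SUBGROUP > LINE" and keep the first one to remove the duplicates.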

table_joined %>%
  group_by(unique_ID) %>%
  arrange(factor(ID_type, levels = c("CLASS", "GROUP", "SUBGROUP", "LINE"))) %>%
  slice(1L)
#> # A tibble: 4 x 14
#> # Groups:   unique_ID [4]
#>   unique_ID TerritoryKey Material TotalSales StartDate.x  EndDate.x
#>       <int>        <chr>    <chr>      <dbl>       <chr>      <chr>
#> 1         1       800046   000-40      61.68    1/1/2016 12/31/2099
#> 2         2       800046   003-01      94.27    1/1/2016 12/31/2099
#> 3         3       800046   003-40   48227.14    1/1/2016 12/31/2099
#> 4         6       800046   005-40    3723.92    1/1/2016 12/31/2099
#> # ... with 8 more variables: ID_type <chr>, ID <chr>, CompProfileID <chr>,
#> #   ProductBucketID <chr>, ExclusionFlag <chr>, StartDate.y <chr>,
#> #   EndDate.y <chr>, ExclusionType <chr>
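
Because inner_join() drops rows that match nothing (rows 4 and 5 in the output above), a true left join needs one more step. The sketch below assumes unmatched rows should be kept with NA in the right-hand columns, which the question implies but does not state. It picks the best match per row and then joins that back onto the original table. The names left_toy, right_toy, hierarchy, and best_match and all the values are hypothetical.

```r
library(dplyr, warn.conflicts = FALSE)
library(tidyr)

# Hypothetical toy data; column names follow the question, values are invented.
left_toy <- tibble::tibble(
  Material          = c("A", "B", "C"),
  ProductClassID    = c("0012", "0012", "0099"),
  ProductGroupID    = c("00120001", "00120002", "00990001"),
  ProductSubGroupID = c("001200010002", "001200020002", "009900010001"),
  ProductLineID     = c("001200010002000001", "001200020002000001",
                        "009900010001000001")
) %>%
  tibble::rowid_to_column(var = "unique_ID")

right_toy <- tibble::tibble(
  ProductHierarchyID   = c("001200010002", "001200010002000001", "00120002"),
  ProductHierarchyType = c("SUBGROUP", "LINE", "GROUP"),
  ProductBucketID      = c("CORE", "LINE_ONLY", "OTHER")
)

# Precedence used to break ties when one row matches at several levels.
hierarchy <- c("CLASS", "GROUP", "SUBGROUP", "LINE")

best_match <- left_toy %>%
  select(unique_ID, matches("Product.*ID")) %>%
  pivot_longer(cols = -unique_ID, names_to = "ID_type", values_to = "ID") %>%
  mutate(ID_type = recode(ID_type,
                          ProductClassID    = "CLASS",
                          ProductGroupID    = "GROUP",
                          ProductSubGroupID = "SUBGROUP",
                          ProductLineID     = "LINE")) %>%
  inner_join(right_toy,
             by = c("ID_type" = "ProductHierarchyType",
                    "ID"      = "ProductHierarchyID")) %>%
  group_by(unique_ID) %>%
  # Sort each row's matches by hierarchy level and keep the highest one.
  arrange(factor(ID_type, levels = hierarchy), .by_group = TRUE) %>%
  slice(1L) %>%
  ungroup()

# Join the single best match back, keeping every original row.
result <- left_join(left_toy, best_match, by = "unique_ID")
```

In this toy data, row A matches at both SUBGROUP and LINE and keeps the SUBGROUP match, row B matches at GROUP, and row C matches nothing, so its right-hand columns are NA.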