R - Find the lowest/highest level of a hierarchy within group_by

Asked: 2020-07-13 22:40:50

Tags: r group-by dplyr hierarchy

Suppose we have a table of food products:

product_id <- c(1, 1, 2, 2, 3, 3)
name <- c("Cheddar", "Cheddar", "Apple", "Apple", "Pizza", "Pizza")
category <- c("Dairy", "Cheese", "Food", "Fruit", "Food", NA)
products <- data.frame(product_id, name, category)


The categories are arranged in a ragged hierarchy:

level_1 <- c("Food", "Food", "Food", "Food", "Food")
level_2 <- c(NA, "Dairy", "Dairy", "Fruit", "Pizza")
level_3 <- c(NA, NA, "Cheese", NA, NA)
categories <- data.frame(level_1, level_2, level_3)
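To make the ragged structure concrete, here is a minimal base R sketch that derives, for each row of the hierarchy, the category the path ends in and how deep it sits. The name `category_lookup` is a hypothetical helper introduced for illustration, not something from the question.

```r
# Hierarchy from the question: one row per path, NA marks unused deeper levels
categories <- data.frame(
  level_1 = c("Food", "Food", "Food", "Food", "Food"),
  level_2 = c(NA, "Dairy", "Dairy", "Fruit", "Pizza"),
  level_3 = c(NA, NA, "Cheese", NA, NA),
  stringsAsFactors = FALSE
)

# Depth of each path = number of non-NA levels in its row
depth <- unname(rowSums(!is.na(categories)))

# The category a path ends in is the value at its deepest non-NA level
# (matrix indexing with one (row, column) pair per path)
leaf <- as.matrix(categories)[cbind(seq_len(nrow(categories)), depth)]

# Hypothetical lookup table: category name -> depth in the hierarchy
category_lookup <- data.frame(category = leaf, depth = depth,
                              stringsAsFactors = FALSE)
print(category_lookup)
```

With this data, "Food" sits at depth 1, "Dairy", "Fruit" and "Pizza" at depth 2, and "Cheese" at depth 3.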


My end goal is to remove duplicate products while keeping the lowest hierarchy level (i.e., the most detailed category).


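I don't necessarily need to keep the most detailed row, only its label. So we could also apply the most detailed category name to every row in the group and then choose which rows to drop. Keep in mind that the data may contain errors, though: a product might have one row with Pizza = Fruit and another with Pizza = Pizza. Such cases should be left alone (they need to be fixed by hand).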
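The "label every row, then drop duplicates" idea could look roughly like the base R sketch below. It assumes a hand-written depth lookup (`depth_of`, a hypothetical name) and does not try to detect the conflicting-category errors mentioned above; it simply trusts the deepest category in each group.

```r
products <- data.frame(
  product_id = c(1, 1, 2, 2, 3, 3),
  name = c("Cheddar", "Cheddar", "Apple", "Apple", "Pizza", "Pizza"),
  category = c("Dairy", "Cheese", "Food", "Fruit", "Food", NA),
  stringsAsFactors = FALSE
)

# Hypothetical lookup, written out by hand from the hierarchy in the question
depth_of <- c(Food = 1, Dairy = 2, Fruit = 2, Pizza = 2, Cheese = 3)

# Depth of each row's category; NA or unknown categories count as depth 0
row_depth <- unname(depth_of[products$category])
row_depth[is.na(row_depth)] <- 0

# For every row, the index of the deepest row within the same product
deepest_row <- ave(
  seq_len(nrow(products)), products$product_id,
  FUN = function(idx) idx[which.max(row_depth[idx])]
)

# Copy the most detailed label onto every row of the group
products$best_category <- products$category[deepest_row]

# Dropping the duplicates is then a plain unique() on the labelled columns
deduped <- unique(products[, c("product_id", "name", "best_category")])
print(deduped)
```

Here Cheddar ends up labelled "Cheese", Apple "Fruit", and Pizza "Food".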


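Edit: The answers so far are great, thank you for the help. They are missing just one thing:

My real data contains errors in the categories, so I want to ignore duplicates that sit in different parts of the hierarchy tree. Imagine another branch of this hierarchy: clothing > pants > jeans. Then, if I have these duplicate products: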

+---------+----------+
| Product | Category | 
+---------+----------+
|  Apple  |   Food   |
+---------+----------+
|  Apple  |   Jeans  |
+---------+----------+

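I don't want to keep "Jeans", even though it is the more specific category.

The only solution I can think of is this (and I don't know how to implement it in R):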

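  • Put the hierarchy levels on the products table, filled in according to each row's category
  • Group by product
  • Check whether all rows in the group match on level_1
  • If so, check level_2; if those match too, check level_3
  • At each stage, if the mismatch is due to an NA, we have a winner, and we apply the category that exists at that level
  • If the mismatch is due to different categories, leave the rows as they are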
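The steps above can be sketched in base R as follows. This is one possible reading of the procedure, not a definitive implementation: `paths`, `resolve_group`, `common_category` and `conflict` are hypothetical names, and the sketch assumes every category name appears only once in the hierarchy. The data includes the "Jeans" duplicate so that the conflict case is exercised.

```r
products <- data.frame(
  product_id = c(1, 1, 2, 2, 3, 3, 2),
  name = c("Cheddar", "Cheddar", "Apple", "Apple", "Pizza", "Pizza", "Apple"),
  category = c("Dairy", "Cheese", "Food", "Fruit", "Food", NA, "Jeans"),
  stringsAsFactors = FALSE
)
categories <- data.frame(
  level_1 = c("Food", "Food", "Food", "Food", "Food",
              "Clothing", "Clothing", "Clothing"),
  level_2 = c(NA, "Dairy", "Dairy", "Fruit", "Pizza", NA, "Pants", "Pants"),
  level_3 = c(NA, NA, "Cheese", NA, NA, NA, NA, "Jeans"),
  stringsAsFactors = FALSE
)

# Step 1: key each hierarchy path by the category it ends in
paths <- categories
paths$category <- ifelse(!is.na(paths$level_3), paths$level_3,
                  ifelse(!is.na(paths$level_2), paths$level_2, paths$level_1))

# Step 2: put the levels on the products table
# (NA or unknown categories simply get NA in every level)
joined <- merge(products, paths, by = "category", all.x = TRUE)

# Steps 3-6: walk the levels top-down for the rows of one product
resolve_group <- function(g) {
  best <- NA_character_
  conflict <- FALSE
  for (lvl in c("level_1", "level_2", "level_3")) {
    vals <- unique(g[[lvl]][!is.na(g[[lvl]])])
    if (length(vals) == 0) break   # no row goes this deep: keep current winner
    if (length(vals) > 1) {        # rows disagree: flag for manual review
      conflict <- TRUE
      break
    }
    best <- vals                   # all non-NA rows agree at this level
  }
  data.frame(common_category = best, conflict = conflict,
             stringsAsFactors = FALSE)
}

# Apply the walk to each product and stack the one-row results
resolved <- do.call(rbind, lapply(split(joined, joined$product_id), function(g) {
  cbind(g[1, c("product_id", "name")], resolve_group(g))
}))
rownames(resolved) <- NULL
print(resolved)
```

On this data, Cheddar resolves to "Cheese" and Pizza to "Food", while Apple is flagged with `conflict = TRUE` and an NA `common_category`, because its rows disagree at level_1 (Food vs. Clothing). When rows agree on the upper levels but diverge lower down, `common_category` holds the deepest level they still share, so the `conflict` flag is what marks a product as needing a manual fix.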

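Alternatively, the solution could be a new column holding the "highest-level common category", if that is an easier way to think about it.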


Edit #2 - new dataset

product_id <- c(1, 1, 2, 2, 3, 3, 2)
name <- c("Cheddar", "Cheddar", "Apple", "Apple", "Pizza", "Pizza", "Apple")
category <- c("Dairy", "Cheese", "Food", "Fruit", "Food", NA, "Jeans")
products <- data.frame(product_id, name, category)


level_1 <- c("Food", "Food", "Food", "Food", "Food","Clothing", "Clothing", "Clothing")
level_2 <- c(NA, "Dairy", "Dairy", "Fruit", "Pizza", NA, "Pants", "Pants")
level_3 <- c(NA, NA, "Cheese", NA, NA, NA, NA, "Jeans")
categories <- data.frame(level_1, level_2, level_3)


Goal: the original post showed two alternative target tables as images, which are not reproduced here.

2 Answers:

Answer 0: (score: 1)

We can reshape the hierarchy to "long" format and arrange it so that the most detailed level comes first:

library(dplyr)
library(tidyr)
# One row per (level, category) pair, most detailed level first
newdat <- categories %>%
  pivot_longer(everything(), names_to = 'level',
               values_drop_na = TRUE) %>%
  distinct() %>%
  arrange(factor(level, levels = rev(names(categories))))

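Then, after grouping the second dataset, 'products', by 'product_id' and 'name', use it with match and slice to keep the row whose category comes first in that ordering: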

products %>%
   group_by(product_id, name) %>% 
   slice(na.omit(match(newdat$value, category))[1])
# A tibble: 3 x 3
# Groups:   product_id, name [3]
#  product_id name    category
#       <dbl> <chr>   <chr>   
#1          1 Cheddar Cheese  
#2          2 Apple   Fruit   
#3          3 Pizza   Food    

Answer 1: (score: 1)

Another dplyr / tidyr option could be

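library(dplyr)
library(tidyr)
products %>%
  # level_1, level_2 and level_3 are the vectors defined in the question
  mutate(level = case_when(category %in% level_1 ~ 1,
                           category %in% level_2 ~ 2,
                           category %in% level_3 ~ 3
                           )) %>%
  group_by(product_id) %>%
  drop_na() %>%      # drops rows containing any NA, e.g. an NA category
  slice_max(level)   # keeps the deepest category per product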

which returns

# A tibble: 3 x 4
# Groups:   product_id [3]
  product_id name    category level
       <dbl> <chr>   <chr>    <dbl>
1          1 Cheddar Cheese       3
2          2 Apple   Fruit        2
3          3 Pizza   Food         1