Suppose we have a table of food products:
product_id <- c(1, 1, 2, 2, 3, 3)
name <- c("Cheddar", "Cheddar", "Apple", "Apple", "Pizza", "Pizza")
category <- c("Dairy", "Cheese", "Food", "Fruit", "Food", NA)
products <- data.frame(product_id, name, category)
The categories are organised in a ragged hierarchy:
level_1 <- c("Food", "Food", "Food", "Food", "Food")
level_2 <- c(NA, "Dairy", "Dairy", "Fruit", "Pizza")
level_3 <- c(NA, NA, "Cheese", NA, NA)
categories <- data.frame(level_1, level_2, level_3)
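My end goal is to remove duplicate products and keep the lowest level of the hierarchy (i.e. the most detailed category).
I don't necessarily need to keep the most detailed row, only its label. So we could also apply the most detailed category name to every row in the group and then choose which rows to remove. Keep in mind that there may be errors, though: we might have one row with Pizza = Fruit and another with Pizza = Pizza, which should be ignored (this will need a manual fix).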
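The "apply the most detailed label to every row in the group" idea can be sketched as follows. This is only a sketch under two assumptions: each label appears at exactly one level of `categories`, and `depth_lookup` and `best_category` are names made up for this illustration.

```r
library(dplyr)

products <- data.frame(
  product_id = c(1, 1, 2, 2, 3, 3),
  name       = c("Cheddar", "Cheddar", "Apple", "Apple", "Pizza", "Pizza"),
  category   = c("Dairy", "Cheese", "Food", "Fruit", "Food", NA)
)
categories <- data.frame(
  level_1 = c("Food", "Food", "Food", "Food", "Food"),
  level_2 = c(NA, "Dairy", "Dairy", "Fruit", "Pizza"),
  level_3 = c(NA, NA, "Cheese", NA, NA)
)

# Named vector mapping each label to its depth (1 = level_1, 3 = level_3).
# Assumption: a label never appears at more than one level.
depth_lookup <- integer(0)
for (i in seq_along(categories)) {
  labels <- unique(categories[[i]][!is.na(categories[[i]])])
  depth_lookup[labels] <- i
}

labelled <- products %>%
  mutate(depth = unname(depth_lookup[category])) %>%   # NA for NA/unknown labels
  group_by(product_id, name) %>%
  # which.max() skips NA depths; the trailing [1] yields NA if nothing matched
  mutate(best_category = category[which.max(depth)][1]) %>%
  ungroup()

# Every row of a product now carries the same label, so duplicates collapse
deduped <- distinct(labelled, product_id, name, best_category)
```

Because the label is copied to all rows first, it no longer matters which of the duplicate rows is kept afterwards.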
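Edit: The answers so far are very good, thank you for your help. They are missing only one thing:
In my real data there are errors in the categories, so I want to ignore duplicates that sit in different parts of the hierarchy tree. Imagine another part of this hierarchy, clothing > pants > jeans. Then, if I have these product duplicates: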
+---------+----------+
| Product | Category |
+---------+----------+
| Apple | Food |
+---------+----------+
| Apple | Jeans |
+---------+----------+
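I do not want to keep "Jeans", even though it is the more specific category.
The only solution I can think of is this (and I don't know how to implement it in R):
Alternatively, the solution could be a new column holding the "top-level common category", if that is an easier way to think about it.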
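One way to picture the extra column is a lookup from every label to the top-level category of its branch. The sketch below assumes the `level_1` value of a row in `categories` is the root of every label in that row; the Clothing branch and the names `root_lookup`, `root` and `apples` are illustrative assumptions.

```r
library(dplyr)
library(tidyr)

# Hierarchy with an illustrative Clothing > Pants > Jeans branch added
categories <- data.frame(
  level_1 = c("Food", "Food", "Food", "Food", "Food",
              "Clothing", "Clothing", "Clothing"),
  level_2 = c(NA, "Dairy", "Dairy", "Fruit", "Pizza", NA, "Pants", "Pants"),
  level_3 = c(NA, NA, "Cheese", NA, NA, NA, NA, "Jeans")
)

# One row per label, paired with the level_1 value of its branch
root_lookup <- categories %>%
  mutate(root = level_1) %>%
  pivot_longer(starts_with("level_"), names_to = "level",
               values_to = "category", values_drop_na = TRUE) %>%
  distinct(category, root)

apples <- data.frame(
  name     = c("Apple", "Apple"),
  category = c("Food", "Jeans")
)

# Attach the root to each product row
apples_with_root <- left_join(apples, root_lookup, by = "category")

# TRUE when a product's labels come from more than one branch
has_conflict <- n_distinct(apples_with_root$root) > 1
```

Here the two Apple rows end up with different roots (Food and Clothing), which is the signal that one of the labels is wrong rather than merely less specific.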
Edit #2: new dataset
product_id <- c(1, 1, 2, 2, 3, 3, 2)
name <- c("Cheddar", "Cheddar", "Apple", "Apple", "Pizza", "Pizza", "Apple")
category <- c("Dairy", "Cheese", "Food", "Fruit", "Food", NA, "Jeans")
products <- data.frame(product_id, name, category)
level_1 <- c("Food", "Food", "Food", "Food", "Food","Clothing", "Clothing", "Clothing")
level_2 <- c(NA, "Dairy", "Dairy", "Fruit", "Pizza", NA, "Pants", "Pants")
level_3 <- c(NA, NA, "Cheese", NA, NA, NA, NA, "Jeans")
categories <- data.frame(level_1, level_2, level_3)
Goal:
OR
Answer 0 (score: 1)
We can reshape the categories to "long" format and arrange the rows in the correct order:
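library(dplyr)
library(tidyr)
# Stack the level columns into a single column of labels, dropping NA cells
newdat <- categories %>%
  pivot_longer(everything(), names_to = 'level',
               values_drop_na = TRUE) %>%
  distinct() %>%
  # Order labels from the deepest level (level_3) to the shallowest (level_1)
  arrange(factor(level, levels = rev(names(categories))))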
After grouping the second dataset, 'products', by 'product_id' and 'name', we use this ordering with match and slice to keep one row per product:
products %>%
  group_by(product_id, name) %>%
  slice(na.omit(match(newdat$value, category))[1])
# A tibble: 3 x 3
# Groups: product_id, name [3]
# product_id name category
# <dbl> <chr> <chr>
#1 1 Cheddar Cheese
#2 2 Apple Fruit
#3 3 Pizza Food
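To see why that slice index lands on the most specific label, it may help to run the match step on a single product in base R. The vectors below are typed in by hand for illustration: `priority` stands in for `newdat$value` on the original data, and `category` holds the two Apple labels.

```r
# Labels ordered from most specific to least specific
priority <- c("Cheese", "Dairy", "Fruit", "Pizza", "Food")

# The labels attached to one product (Apple)
category <- c("Food", "Fruit")

# For each priority label, its position in `category` (NA if absent)
hits <- match(priority, category)

# The first non-NA position belongs to the most specific label present
first_hit <- na.omit(hits)[1]

chosen <- category[first_hit]
```

`hits` is NA for every label the product does not carry, so the first surviving position is the row that `slice()` keeps.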
Answer 1 (score: 1)
Another dplyr / tidyr option could be
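library(dplyr)
library(tidyr)
# level_1, level_2 and level_3 are the vectors defined in the question
products %>%
  mutate(level = case_when(category %in% level_1 ~ 1,
                           category %in% level_2 ~ 2,
                           category %in% level_3 ~ 3
  )) %>%
  group_by(product_id) %>%
  drop_na() %>%   # removes rows whose category is NA
  slice_max(level)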
which returns
# A tibble: 3 x 4
# Groups: product_id [3]
product_id name category level
<dbl> <chr> <chr> <dbl>
1 1 Cheddar Cheese 3
2 2 Apple Fruit 2
3 3 Pizza Food 1
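On the Edit #2 data, picking the maximum level alone would select Jeans for Apple, because Jeans sits at level 3. A possible extension is sketched below. It assumes that, within a product, the branch covering the most rows is the correct one, and it does not resolve ties between branches. The names `lookup`, `root`, `n_root` and `conflict` are introduced here for illustration.

```r
library(dplyr)
library(tidyr)

# Data from Edit #2 of the question
products <- data.frame(
  product_id = c(1, 1, 2, 2, 3, 3, 2),
  name = c("Cheddar", "Cheddar", "Apple", "Apple", "Pizza", "Pizza", "Apple"),
  category = c("Dairy", "Cheese", "Food", "Fruit", "Food", NA, "Jeans")
)
categories <- data.frame(
  level_1 = c("Food", "Food", "Food", "Food", "Food",
              "Clothing", "Clothing", "Clothing"),
  level_2 = c(NA, "Dairy", "Dairy", "Fruit", "Pizza", NA, "Pants", "Pants"),
  level_3 = c(NA, NA, "Cheese", NA, NA, NA, NA, "Jeans")
)

# One row per label with its branch root and its depth in the hierarchy
lookup <- categories %>%
  mutate(root = level_1) %>%
  pivot_longer(starts_with("level_"), names_to = "level",
               values_to = "category", values_drop_na = TRUE) %>%
  distinct(category, root, level) %>%
  mutate(depth = as.integer(sub("level_", "", level)))

result <- products %>%
  inner_join(lookup, by = "category") %>%      # drops NA and unknown labels
  group_by(product_id, name, root) %>%
  mutate(n_root = n()) %>%                     # rows per branch within a product
  group_by(product_id, name) %>%
  mutate(conflict = n_distinct(root) > 1) %>%  # flag products spanning branches
  filter(n_root == max(n_root)) %>%            # keep the majority branch
  slice_max(depth, n = 1) %>%                  # then the most specific label
  ungroup() %>%
  select(product_id, name, category, root, conflict)
```

The `conflict` column is kept so that products whose labels span several branches can still be reviewed by hand, which matters most when two branches are tied.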