I have two datasets that I am working with. The first is:
data_1 <- tribble(
~shop_name, ~sub_category,
"A", "Blu-ray, DVDs, CD",
"B", "Sneakers, Make-up, Blu-ray",
"C", "Camera, Optic, DVDs",
"D", "Flower, Notebooks, Make-up",
)
The second is:
data_2 <- tribble(
~sub_category, ~main_category,
"Blu-ray", "Electronic",
"DVDs", "Electronic",
"CD", "Electronic",
"Sneakers", "Fashion",
"Make-up", "Fashion",
"Camera", "Electronic",
"Optic", "Health",
"Flower", "Home",
)
Now I want to perform a left join to add the main categories to data_1. The final data should look like this:
merged_data <- tribble(
~shop_name, ~sub_category, ~main_category,
"A", "Blu-ray, DVDs, CD", "Electronic, Electronic, Electronic",
"B", "Sneakers, Make-up, Blu-ray", "Fashion, Fashion, Electronic",
"C", "Camera, Optic", "Electronic, Health",
"D", "Flower", "Home"
)
My code looks like this:
data3 <- left_join(data_1, data_2, by = "sub_category")
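But for some reason, main_category comes back as NA. Can anyone help me? Thanks in advance.
Answer 0 (score: 0)
You basically need to split the sub-category strings in data_1 into separate rows and then do the join, i.e.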
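library(dplyr)
library(tidyr)

data_1 %>%
  separate_rows(sub_category, sep = ', ') %>%  # one row per sub-category
  left_join(data_2, by = 'sub_category') %>%   # look up the main category
  group_by(shop_name) %>%
  summarise_all(toString)                      # collapse back to one row per shop (funs() is deprecated)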
which gives,
# A tibble: 4 x 3
  shop_name sub_category               main_category
  <chr>     <chr>                      <chr>
1 A         Blu-ray, DVDs, CD          Electronic, Electronic, Electronic
2 B         Sneakers, Make-up, Blu-ray Fashion, Fashion, Electronic
3 C         Camera, Optic, DVDs        Electronic, Health, Electronic
4 D         Flower, Notebooks, Make-up Home, NA, Fashion
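As a side note on why the original left_join returned NA: a join compares the key column value by value, so a combined string such as "Blu-ray, DVDs, CD" never equals any single entry in data_2. The following is a minimal base R sketch of that effect, not part of the answer above; `shops` and `lookup` are illustrative stand-ins for data_1 and data_2, reduced to a few rows.

```r
# Illustrative stand-ins for data_1 and data_2 (hypothetical names, reduced data)
shops <- data.frame(
  shop_name = c("A", "D"),
  sub_category = c("Blu-ray, DVDs, CD", "Flower, Notebooks, Make-up"),
  stringsAsFactors = FALSE
)
lookup <- data.frame(
  sub_category = c("Blu-ray", "DVDs", "CD", "Make-up", "Flower"),
  main_category = c("Electronic", "Electronic", "Electronic", "Fashion", "Home"),
  stringsAsFactors = FALSE
)

# Whole-string comparison: no combined string appears in the lookup table,
# so every position is NA (this mirrors what the unsplit join does)
whole_match <- match(shops$sub_category, lookup$sub_category)

# Splitting first yields one key per sub-category, which can be matched
parts <- strsplit(shops$sub_category, ", ", fixed = TRUE)
per_part <- lapply(parts, function(p) lookup$main_category[match(p, lookup$sub_category)])

# Collapse the matched categories back into one string per shop
joined <- vapply(per_part, toString, character(1))
```

Sub-categories that are missing from the lookup table (here "Notebooks") still produce NA after splitting, which is the expected behaviour of a left join.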
If you have more columns, then summarise_all needs to be replaced with summarise_at(vars(contains('category')), toString), so that only the category columns are collapsed.
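To illustrate the case with additional columns, here is a small sketch under stated assumptions: the `city` column and the `shops` and `lookup` tables are hypothetical and added only for illustration, and `across()` is the newer dplyr idiom (dplyr 1.0.0 or later) rather than the call used in the answer. Grouping by the extra column keeps it in the output while only the category columns are collapsed.

```r
library(dplyr)
library(tidyr)

# Hypothetical data with an extra column `city` (not in the original question)
shops <- tribble(
  ~shop_name, ~city,     ~sub_category,
  "A",        "Berlin",  "Blu-ray, DVDs",
  "B",        "Hamburg", "Sneakers, Blu-ray"
)
lookup <- tribble(
  ~sub_category, ~main_category,
  "Blu-ray",     "Electronic",
  "DVDs",        "Electronic",
  "Sneakers",    "Fashion"
)

collapsed <- shops %>%
  separate_rows(sub_category, sep = ", ") %>%   # one row per sub-category
  left_join(lookup, by = "sub_category") %>%    # add main_category
  group_by(shop_name, city) %>%                 # keep the extra column as a grouping key
  summarise(across(contains("category"), toString), .groups = "drop")
```

Without restricting the summary to the category columns, `city` would also be pasted together once per sub-category, which is usually not what is wanted.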
Answer 1 (score: 0)
Below are two data.table solutions, for the record:
Code
You can match each string in sub_category of data_1 directly against the corresponding main_category in data_2:
require(data.table); setDT(data_1); setDT(data_2)
data_1[, main_category := sapply(sub_category, function(x) {
  parts = unlist(strsplit(x, ', '))
  # match() returns NA for sub-categories missing from data_2 (e.g. "Notebooks")
  idx = match(parts, data_2$sub_category)
  paste(data_2$main_category[idx], collapse = ', ')
}, USE.NAMES = FALSE)]
You can also convert data_1 to long format and then join it with data_2 on sub_category:
data_1 = data_1[, .(sub_category = unlist(strsplit(sub_category, ', '))), keyby = shop_name] # data_1 to long format
dt_final = merge(data_1, data_2, by = 'sub_category', all.x = TRUE) # Left join of data_1 and data_2 on sub_category (result is sorted by sub_category)
dt_final = dt_final[, lapply(.SD, function(x) paste(x, collapse = ', ')), keyby = shop_name] # Collapse back to one row per shop
Result