Matching values in a list between two datasets

Date: 2019-10-30 09:01:07

Tags: r dplyr vlookup

I am working with two datasets. The first one is:

data_1 <- tribble(
  ~shop_name, ~sub_category,
  "A",        "Blu-ray, DVDs, CD",
  "B",        "Sneakers, Make-up, Blu-ray",         
  "C",        "Camera, Optic, DVDs",
  "D",        "Flower, Notebooks, Make-up", 
)

The second one is:

data_2 <- tribble(
  ~sub_category, ~main_category,
  "Blu-ray",      "Electronic",
  "DVDs",         "Electronic",        
  "CD",           "Electronic",
  "Sneakers",     "Fashion",
  "Make-up",      "Fashion", 
  "Camera",       "Electronic",
  "Optic",        "Health", 
  "Flower",       "Home",
)

Now I want to perform a left join to add the main category to data_1. The final data should look like this:

merged_data <- tribble(
  ~shop_name, ~sub_category,                 ~main_category,
  "A",        "Blu-ray, DVDs, CD",            "Electronic,  Electronic,  Electronic",
  "B",        "Sneakers, Make-up, Blu-ray",   "Fashion,  Fashion,  Electronic",      
  "C",        "Camera, Optic",                "Electronic, Health",
  "D",        "Flower",                       "Home"
)

My code looks like this:

data3 <- left_join(data_1, data_2, by = "sub_category")

But somehow main_category comes back as NA. Can anyone help me? Thanks in advance.

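2 answers:

Answer 0 (score: 0)

You basically need to split the sub-category strings in data_1 into one row per sub-category and then do the join, i.e.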

data_1 %>% 
 separate_rows(sub_category, sep = ', ') %>% 
 left_join(data_2, by = 'sub_category') %>% 
 group_by(shop_name) %>% 
 summarise_all(toString)

which gives,

# A tibble: 4 x 3
  shop_name sub_category               main_category                     
  <chr>     <chr>                      <chr>                             
1 A         Blu-ray, DVDs, CD          Electronic, Electronic, Electronic
2 B         Sneakers, Make-up, Blu-ray Fashion, Fashion, Electronic      
3 C         Camera, Optic, DVDs        Electronic, Health, Electronic    
4 D         Flower, Notebooks, Make-up Home, NA, Fashion

If you have more columns, summarise_all needs to be replaced with summarise_at(vars(contains('category')), toString)
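To make the cause of the NA values concrete, here is a minimal base R sketch. It is not part of the original answer: the two data frames are a cut-down copy of the question's data, and `direct`, `pieces`, and `lookup` are illustrative names. It shows that the combined string is compared as a whole and therefore never equals a key in data_2, whereas each piece does once the string is split.

```r
# Cut-down copy of the question's data as plain data frames (no packages needed)
data_1 <- data.frame(
  shop_name    = c("A", "D"),
  sub_category = c("Blu-ray, DVDs, CD", "Flower, Notebooks, Make-up"),
  stringsAsFactors = FALSE
)
data_2 <- data.frame(
  sub_category  = c("Blu-ray", "DVDs", "CD", "Make-up", "Flower"),
  main_category = c("Electronic", "Electronic", "Electronic", "Fashion", "Home"),
  stringsAsFactors = FALSE
)

# A join compares whole key values, so the combined string matches nothing
direct <- match(data_1$sub_category, data_2$sub_category)
print(direct)

# After splitting, every piece is a key of its own and can be looked up
pieces <- strsplit(data_1$sub_category, ", ", fixed = TRUE)
lookup <- function(keys) data_2$main_category[match(keys, data_2$sub_category)]
data_1$main_category <- vapply(pieces, function(k) toString(lookup(k)), character(1))
print(data_1)
```

Sub-categories that are missing from data_2, such as "Notebooks", show up as NA inside the collapsed string, just as in the output above.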

Answer 1 (score: 0)

Below are two data.table solutions, for the record:

Code

You can match each string in the sub_category column of data_1 directly to the corresponding main_category in data_2:

require(data.table); setDT(data_1); setDT(data_2)
data_1[, main_category := sapply(sub_category, function(x){
  str = unlist(strsplit(x, ', '))          # split one shop's string into its sub-categories
  idx = match(str, data_2$sub_category)    # row of each sub-category in data_2, NA if absent
  paste(data_2$main_category[idx], collapse = ', ')
})]

You can also convert data_1 to long format and then join it with data_2 on sub_category:

data_1 = data_1[, .(sub_category = unlist(strsplit(sub_category, ', '))), keyby = shop_name] # data_1 to long format
dt_final = merge(data_1, data_2, by = 'sub_category', all = TRUE) # Join data_1 and data_2 on sub_category
dt_final = dt_final[, lapply(.SD, function(x) paste(x, collapse = ', ')), keyby = shop_name] # collapse back to one row per shop