Assign/join values from column "C" of one data frame to a second data frame by matching pairs of values in "A" and "B" (R, dplyr)

Asked: 2018-12-05 13:09:48

Tags: r dataframe dplyr

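I have two data frames.

  • The first (df1) is a frequency data frame of co-occurring strings, names1 and names2, together with their frequency.
  • The second (df2) contains two columns, names1 and names2, holding one or more of these pairs. Sometimes the order within a pair is reversed.

I want to take the frequency from the first data frame, df1:
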
df1 <- tibble(names1 = c('architecture', 'assessment', 'build'), 
              names2 = c('build', 'data', 'data'),
              frequency = c(36,13,720))

# A tibble: 3 x 3
  names1       names2 frequency
  <chr>        <chr>      <dbl>
1 architecture build         36
2 assessment   data          13
3 build        data         720

and assign it to a new column in the second data frame, df2:

df2 <- tibble(names1 = c('architecture', 'build', 'assessment','assessment', 'business'), 
              names2 = c('build','architecture', 'data', 'data', 'strategy'))

  names1       names2        
  <chr>        <chr>         
1 architecture build         
2 build        architecture  
3 assessment   data          
4 assessment   data   
5 business     strategy         

The desired result is:

  names1       names2        frequency
  <chr>        <chr>         <dbl>
1 architecture build         36
2 build        architecture  36
3 assessment   data          13
4 assessment   data          13
5 business     strategy      0

Note: sometimes df1$names1 == df2$names1 && df1$names2 == df2$names2, and sometimes the pair is reversed, i.e. df1$names1 == df2$names2 && df1$names2 == df2$names1. Both cases should match:

1 architecture build         36
2 build        architecture  36

Note: I want to keep the rows that have no match, with a frequency of 0:

5 business     strategy      0

2 Answers:

Answer 0: (score: 2)

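The problem here is that the order of the name columns matters for the join, so you have to update both datasets and apply a consistent order to each pair before joining.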
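As a minimal sketch of the problem, using the question's data: a plain left_join on both columns only matches pairs that appear in the same order, so the reversed row comes back as NA along with the genuinely unmatched one. The variable name naive is only illustrative.

```r
library(dplyr)

df1 <- tibble(names1 = c("architecture", "assessment", "build"),
              names2 = c("build", "data", "data"),
              frequency = c(36, 13, 720))

df2 <- tibble(names1 = c("architecture", "build", "assessment", "assessment", "business"),
              names2 = c("build", "architecture", "data", "data", "strategy"))

# Order-sensitive join: (build, architecture) does not match (architecture, build)
naive <- left_join(df2, df1, by = c("names1", "names2"))

naive$frequency
# 36 NA 13 13 NA
```

Row 2 is the reversed pair and row 5 has no counterpart in df1 at all, yet both end up as NA, which is why the pairs need a consistent order first.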

Here is a dplyr solution:

library(dplyr)

df1 <- tibble(names1 = c('architecture', 'assessment', 'build'), 
              names2 = c('build', 'data', 'data'),
              frequency = c(36,13,720))

df2 <- tibble(names1 = c('architecture', 'build', 'assessment','assessment', 'business'), 
              names2 = c('build','architecture', 'data', 'data', 'strategy'))

# update df1
df1 = df1 %>% 
  rowwise() %>% 
  mutate(names = paste0(sort(c(names1, names2)), collapse = "_")) %>% 
  select(names, frequency)

# update df2
df2 = df2 %>% 
  rowwise() %>% 
  mutate(names = paste0(sort(c(names1, names2)), collapse = "_"))

# join datasets and update columns
left_join(df2, df1, by="names") %>%
  mutate(frequency = coalesce(frequency, 0)) %>%
  select(-names) %>%
  ungroup()

#   names1       names2       frequency
#   <chr>        <chr>            <dbl>
# 1 architecture build               36
# 2 build        architecture        36
# 3 assessment   data                13
# 4 assessment   data                13
# 5 business     strategy             0
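A possible variant of the same idea, offered as a sketch rather than as part of the original answer: base R's pmin() and pmax() also work element-wise on character vectors, so the pair can be put into a consistent order without rowwise(). The helper columns key_lo and key_hi and the names df1_keyed and result are hypothetical, chosen only for this illustration.

```r
library(dplyr)

df1 <- tibble(names1 = c("architecture", "assessment", "build"),
              names2 = c("build", "data", "data"),
              frequency = c(36, 13, 720))

df2 <- tibble(names1 = c("architecture", "build", "assessment", "assessment", "business"),
              names2 = c("build", "architecture", "data", "data", "strategy"))

# Hypothetical key columns: key_lo / key_hi hold each pair in sorted order
df1_keyed <- df1 %>%
  transmute(key_lo = pmin(names1, names2),
            key_hi = pmax(names1, names2),
            frequency)

result <- df2 %>%
  mutate(key_lo = pmin(names1, names2),
         key_hi = pmax(names1, names2)) %>%
  left_join(df1_keyed, by = c("key_lo", "key_hi")) %>%  # order-insensitive match
  mutate(frequency = coalesce(frequency, 0)) %>%        # unmatched rows get 0
  select(names1, names2, frequency)                     # drop the helper keys

result$frequency
# 36 36 13 13 0
```

Joining on two key columns instead of one pasted string also sidesteps accidental collisions when a name itself contains the separator character.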

Answer 1: (score: 2)

A solution that chains two dplyr::left_join calls, one for each ordering of the pair:

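library(dplyr)  # left_join() and coalesce() both come from dplyr

# First join matches pairs in the same order; second join matches reversed pairs
left_join(df2, df1, by = c("names1", "names2")) %>%
  left_join(df1, by = c(names1 = "names2", names2 = "names1")) %>%
  # keep the first non-missing frequency, defaulting to 0 for unmatched rows
  mutate(frequency = coalesce(frequency.x, frequency.y, 0)) %>%
  select(-frequency.x, -frequency.y)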

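This solution preserves the order of the columns in df2. The mutate and select lines are needed because the two left_join calls add new columns (frequency.x and frequency.y). These have to be merged back into a single frequency column (replacing NA with 0) and then dropped.

Result: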

# A tibble: 5 x 3
  names1       names2       frequency
  <chr>        <chr>            <dbl>
1 architecture build               36
2 build        architecture        36
3 assessment   data                13
4 assessment   data                13
5 business     strategy             0
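To make the role of the mutate and select lines in this second answer more concrete, the sketch below stops after the two joins and inspects the intermediate columns. It reuses the question's data; the variable name joined is only illustrative.

```r
library(dplyr)

df1 <- tibble(names1 = c("architecture", "assessment", "build"),
              names2 = c("build", "data", "data"),
              frequency = c(36, 13, 720))

df2 <- tibble(names1 = c("architecture", "build", "assessment", "assessment", "business"),
              names2 = c("build", "architecture", "data", "data", "strategy"))

# Stop after the two joins, before the columns are merged
joined <- left_join(df2, df1, by = c("names1", "names2")) %>%
  left_join(df1, by = c(names1 = "names2", names2 = "names1"))

names(joined)
# "names1" "names2" "frequency.x" "frequency.y"

joined$frequency.x  # same-order matches
# 36 NA 13 13 NA

joined$frequency.y  # reversed-order matches
# NA 36 NA NA NA
```

coalesce(frequency.x, frequency.y, 0) then takes, row by row, the first value that is not NA, which yields 36, 36, 13, 13, 0.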