Question

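I am trying to merge (VLOOKUP-style) dataset 2 into dataset 1 by unique_id. Dataset 2 contains many duplicate rows for the same unique_id and its associated information. Only one column in dataset 2 matters: amount_due. I would like to add the amount_due column from dataset 2 below to dataset 1, matched on the correct unique_id.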

Dataset 1

    unique_id  df1  df2  df3     df4
    1234       1    h    8/4/18  no
    2341       2    nl   8/5/18  yes
    3412       3    sg   8/3/18  no
    4213       4    hi   7/3/18  yes

Dataset 2

    unique_id  df1  df2  df3     df4  amount_due  df5
    1234       1    h    8/4/18  no   $100        mcd
    1234       1    h    8/4/18  no   $100        mcd
    1234       1    h    8/4/18  no   $100        mcd
    2341       2    nl   8/5/18  yes  $1          hsn
    3412       3    sg   8/3/18  no   $200        bcbs
    3412       3    sg   8/3/18  no   $200        bcbs
    4213       4    hi   7/3/18  yes  $2.22       r
    4213       4    hi   7/3/18  yes  $2.22       r

The desired output is as follows

    unique_id  df1  df2  df3     df4  amount_due
    1234       1    h    8/4/18  no   $100
    2341       2    nl   8/5/18  yes  $1
    3412       3    sg   8/3/18  no   $200
    4213       4    hi   7/3/18  yes  $2.22

Answer 1

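With dplyr, we can select only the columns of interest from df2, reduce them to distinct rows, and then join the result to df1. The type of join does not matter here, because every unique_id appears in both data sets.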

    library(dplyr)
    df2 %>%
        select(unique_id, amount_due) %>%
        distinct() %>%
        right_join(df1, by = 'unique_id')

      unique_id amount_due df1 df2    df3 df4
    1      1234       $100   1   h 8/4/18  no
    2      2341         $1   2  nl 8/5/18 yes
    3      3412       $200   3  sg 8/3/18  no
    4      4213      $2.22   4  hi 7/3/18 yes
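
A self-contained sketch of the same idea, started from df1 with left_join so that amount_due lands in the last column, as in the desired output. The data frames are rebuilt from the tables in the question; the column types (amount_due kept as character) and the names lookup and result are assumptions for illustration, and the df1 to df4 columns of dataset 2 are left out because they are not selected anyway.

```r
library(dplyr)

# Dataset 1, rebuilt from the question (column types are assumptions)
df1 <- data.frame(
  unique_id = c(1234, 2341, 3412, 4213),
  df1 = 1:4,
  df2 = c("h", "nl", "sg", "hi"),
  df3 = c("8/4/18", "8/5/18", "8/3/18", "7/3/18"),
  df4 = c("no", "yes", "no", "yes"),
  stringsAsFactors = FALSE
)

# Dataset 2, reduced to the key, the wanted column and df5
df2 <- data.frame(
  unique_id = c(1234, 1234, 1234, 2341, 3412, 3412, 4213, 4213),
  amount_due = c("$100", "$100", "$100", "$1", "$200", "$200", "$2.22", "$2.22"),
  df5 = c("mcd", "mcd", "mcd", "hsn", "bcbs", "bcbs", "r", "r"),
  stringsAsFactors = FALSE
)

# Build a lookup table with one row per (unique_id, amount_due) pair
lookup <- df2 %>%
  select(unique_id, amount_due) %>%
  distinct()

# Keep every row of df1 and append amount_due as the last column
result <- left_join(df1, lookup, by = "unique_id")
result
```

If a unique_id ever carried two different amount_due values in dataset 2, distinct() would keep both and the join would return more than one row for that id, so it may be worth checking nrow(result) against nrow(df1).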

Answer 2

Using base R

    > merge(df1, unique(df2)[, c("unique_id", "amount_due")], by="unique_id")
      unique_id df1 df2    df3 df4 amount_due
    1      1234   1   h 8/4/18  no       $100
    2      2341   2  nl 8/5/18 yes         $1
    3      3412   3  sg 8/3/18  no       $200
    4      4213   4  hi 7/3/18 yes      $2.22
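
Another base R option, not part of the answer above but arguably the closest analogue to an exact-match VLOOKUP, is match(). It returns the position of the first matching row in dataset 2 for each unique_id, so no de-duplication step is needed. This is a minimal sketch; the data frames are rebuilt from the question (dataset 2 reduced to the two relevant columns), and the names idx and result are assumptions for illustration.

```r
# Dataset 1, rebuilt from the question (column types are assumptions)
df1 <- data.frame(
  unique_id = c(1234, 2341, 3412, 4213),
  df1 = 1:4,
  df2 = c("h", "nl", "sg", "hi"),
  df3 = c("8/4/18", "8/5/18", "8/3/18", "7/3/18"),
  df4 = c("no", "yes", "no", "yes"),
  stringsAsFactors = FALSE
)

# Dataset 2, reduced to the key and the wanted column
df2 <- data.frame(
  unique_id = c(1234, 1234, 1234, 2341, 3412, 3412, 4213, 4213),
  amount_due = c("$100", "$100", "$100", "$1", "$200", "$200", "$2.22", "$2.22"),
  stringsAsFactors = FALSE
)

# For each id in df1, find the row index of its first occurrence in df2
idx <- match(df1$unique_id, df2$unique_id)

# Copy df1 and append the looked-up values; ids without a match would get NA
result <- df1
result$amount_due <- df2$amount_due[idx]
result
```

Because match() only ever looks at the first occurrence, this silently ignores any later rows with a different amount_due for the same id, which mirrors how VLOOKUP behaves.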

Equivalent of VLOOKUP in R
