Question

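I have two datasets:

DS1 - contains a list of subjects, with columns for name, ID number, and employment status.

DS2 - contains the same list of subject names and ID numbers, but some of the ID numbers are missing from this second dataset.
Finally, it contains a third column for education level.

I would like to merge the education column into the first dataset. I have done this with the merge function, matching on ID number, but because some ID numbers are missing from the second dataset, I want to merge the remaining education levels by name as a second option. Is there a way to do this using dplyr / tidyverse?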

Answer 1

There are two ways to do this. Pick whichever you prefer.

First option:

# Left join twice, once by ID and once by Name. EducationLevel is renamed in each
# join so that no duplicated '.x' / '.y' columns are created.
# coalesce() then keeps the ID-based match and falls back to the Name-based match.
library(dplyr)

finalDf <- DS1 %>%
  dplyr::left_join(DS2 %>%
                     dplyr::select(ID, EducationLevel1 = EducationLevel), by = "ID") %>%
  dplyr::left_join(DS2 %>%
                     dplyr::select(Name, EducationLevel2 = EducationLevel), by = "Name") %>%
  dplyr::mutate(FinalEducationLevel = dplyr::coalesce(EducationLevel1, EducationLevel2))
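To see how the two-join approach behaves, here is a minimal sketch on made-up data. The column names `ID`, `Name`, and `EducationLevel` come from the answer's code; `EmploymentStatus` and all the values are assumptions for illustration only.

```r
library(dplyr)

# Hypothetical toy data: DS1 has every ID, DS2 is missing the IDs for Bob and Dev,
# and Eve does not appear in DS2 at all.
DS1 <- data.frame(
  ID = c(1, 2, 3, 4, 5),
  Name = c("Ann", "Bob", "Cara", "Dev", "Eve"),
  EmploymentStatus = c("Employed", "Unemployed", "Employed", "Retired", "Employed"),
  stringsAsFactors = FALSE
)
DS2 <- data.frame(
  ID = c(1, NA, 3, NA),
  Name = c("Ann", "Bob", "Cara", "Dev"),
  EducationLevel = c("BSc", "MSc", "PhD", "High school"),
  stringsAsFactors = FALSE
)

finalDf <- DS1 %>%
  # First attempt: match on ID
  left_join(DS2 %>% select(ID, EducationLevel1 = EducationLevel), by = "ID") %>%
  # Second attempt: match on Name
  left_join(DS2 %>% select(Name, EducationLevel2 = EducationLevel), by = "Name") %>%
  # Prefer the ID match, fall back to the Name match
  mutate(FinalEducationLevel = coalesce(EducationLevel1, EducationLevel2))

print(finalDf)
```

Subjects found by neither ID nor name (Eve here) keep `NA`. If names are not unique in DS2, the join by `Name` can duplicate rows, so it may be worth checking the row count against `nrow(DS1)` afterwards.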

Second option:

# First, find the records whose ID is present in the second dataset
# (assumes dplyr is attached so that %>% is available)

commonIds <- DS1 %>%
  dplyr::inner_join(DS2 %>%
                      dplyr::select(ID, EducationLevel), by = "ID")

# Next, take the records whose ID was not present in DS2 and match them by Name

idsNotPresent <- DS1 %>%
  dplyr::filter(!ID %in% commonIds$ID) %>%
  dplyr::left_join(DS2 %>%
                     dplyr::select(Name, EducationLevel), by = "Name")

# Bind these two data frames to get the final data frame

finalDf <- dplyr::bind_rows(commonIds, idsNotPresent)
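The split-and-bind approach can be tried on a small made-up example as well. This is only a sketch: `EmploymentStatus` and the values are assumed, and the final `arrange(ID)` is an optional addition, since `bind_rows()` places the ID-matched rows before the name-matched ones.

```r
library(dplyr)

# Hypothetical toy data: DS2 is missing the ID for Bob
DS1 <- data.frame(
  ID = c(1, 2, 3),
  Name = c("Ann", "Bob", "Cara"),
  EmploymentStatus = c("Employed", "Unemployed", "Employed"),
  stringsAsFactors = FALSE
)
DS2 <- data.frame(
  ID = c(1, NA, 3),
  Name = c("Ann", "Bob", "Cara"),
  EducationLevel = c("BSc", "MSc", "PhD"),
  stringsAsFactors = FALSE
)

# Rows of DS1 whose ID exists in DS2
commonIds <- DS1 %>%
  inner_join(DS2 %>% select(ID, EducationLevel), by = "ID")

# Remaining rows of DS1, matched by Name instead
idsNotPresent <- DS1 %>%
  filter(!ID %in% commonIds$ID) %>%
  left_join(DS2 %>% select(Name, EducationLevel), by = "Name")

# Combine both parts and restore the original ID order
finalDf <- bind_rows(commonIds, idsNotPresent) %>%
  arrange(ID)

print(finalDf)
```

Because both parts carry a single `EducationLevel` column, no extra step is needed to merge two candidate columns, which is the main practical difference from the first option.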

Let me know if this works.

Answer 2

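The second option in makeshift-programmer's answer worked for me. Thank you very much. I had to adjust it for my actual datasets, but the basic structure worked very well and was easy to adapt.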

R: Adding a column from one dataset to another based on matching multiple columns
