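How to match two columns of one dataset against a single column of another dataset

Asked: 2019-02-27 06:14:42

Tags: r

I have two datasets, shown below.

full.name is one column, and "first of full name" holds the first word of full.name. The country column in df1 is unreliable, so I want to match df1's two columns (full.name and first of full name) against df2's name column. If either of the two df1 columns matches df2's name, the matched name and its corrected country should appear in the corresponding output columns. If neither full.name nor first of full name matches a name in df2, the row should still show its full.name and first of full name values, with NA in name and corrected country.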

df1:

full.name       first of full name  country
karachi east    karachi             pakistan
phu my          phu                 england
phu my          phu                 india
delhi           delhi               china
west australia  west                england
west australia  west                australia
abu dhabai      abu                 xyz
south africa    south               africa

df2:

name            corrected.country
karachi         pakistan 
phu my          england
delhi           India
west australia  australia
abu             dubai

The output I want is:

full.name       first of full name  country    name            corrected country
karachi east    karachi             pakistan   karachi         pakistan
phu my          phu                 england    phu my          england
phu my          phu                 india      phu my          england
delhi           delhi               china      delhi           India
west australia  west                england    west australia  australia
west australia  west                australia  west australia  australia
abu dhabai      abu                 xyz        abu             dubai
south africa    south               africa     NA              NA

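In other words: match df1's full.name and first of full name against df2's name, and whenever either of them matches, add the matching name and its corrected country as columns in the output.

I know I have made this sound a bit complicated, but I really want to solve this problem. Please help.

2 answers:

Answer 0 (score: 0)

This should work as long as there are no duplicates in your data frames.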

library(dplyr)

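# Join on each key in turn, recording which df2 name matched, then combine
by_full  <- mutate(inner_join(df1, df2, by = c("full.name" = "name")), name = full.name)
by_first <- mutate(inner_join(df1, df2, by = c("first.of.full.name" = "name")), name = first.of.full.name)

df3 <- dplyr::union(by_full, by_first) %>%
  select(1, 2, 3, 5, 4) # just reordering the columns
df3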


       full.name first.of.full.name   country           name corrected.country
1         phu my                phu   england         phu my           england
2         phu my                phu     india         phu my           england
3          delhi              delhi     china          delhi             India
4 west australia               west   england west australia         australia
5 west australia               west australia west australia         australia
6   karachi east            karachi  pakistan        karachi          pakistan
7     abu dhabai                abu       xyz            abu             dubai

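When you simply join two data.frames, the two columns being joined on collapse into one, so I had to find a workaround to keep the name column in the result.

When copying my code, pay attention to the column names, although they should come out the same in R.

Update:

To include names that are not in df2: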
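To make the point about the join key concrete, here is a minimal sketch on a two-row toy version of the data. It also shows the `keep = TRUE` argument of `inner_join()`, which retains the right-hand key column; this assumes dplyr 1.0.0 or later, which is newer than this answer, so treat it as an optional alternative to the `mutate()` workaround rather than what the answer used.

```r
library(dplyr)

# Toy data (illustrative subset of the question's data frames)
df1 <- data.frame(full.name = c("phu my", "delhi"),
                  first.of.full.name = c("phu", "delhi"),
                  country = c("india", "china"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(name = c("phu my", "delhi"),
                  corrected.country = c("england", "India"),
                  stringsAsFactors = FALSE)

# Default behaviour: the key from df2 ("name") is dropped after the join
dropped <- inner_join(df1, df2, by = c("full.name" = "name"))
names(dropped)

# Assumption: dplyr >= 1.0.0, where keep = TRUE preserves both key columns
kept <- inner_join(df1, df2, by = c("full.name" = "name"), keep = TRUE)
names(kept)
```

With `keep = TRUE` the `name` column survives the join, so no extra `mutate()` step is needed to recreate it.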

> df1_2
       full.name first.of.full.name   country
1   karachi east            karachi  pakistan
2         phu my                phu   england
3         phu my                phu     india
4          delhi              delhi     china
5 west australia               west   england
6 west australia               west australia
7     abu dhabai                abu       xyz
8      Stuttgart          Stuttgart   germany

bind_rows(df3, df1_2[rowSums(sapply(1:2, function(x) df1_2[,x] %in% df2$name)) == 0,])

       full.name first.of.full.name   country           name corrected.country
1         phu my                phu   england         phu my           england
2         phu my                phu     india         phu my           england
3          delhi              delhi     china          delhi             India
4 west australia               west   england west australia         australia
5 west australia               west australia west australia         australia
6   karachi east            karachi  pakistan        karachi          pakistan
7     abu dhabai                abu       xyz            abu             dubai
8      Stuttgart          Stuttgart   germany           <NA>              <NA>

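df1_2 is your df1 with a new row, and df3 is the result from above.

Answer 1 (score: 0)

First I recreate your datasets. You don't need to run this part, since you already have your own data, but I include it here for anyone else who wants to reproduce the solution.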
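As a possible alternative to the join, union and bind_rows sequence, the same lookup can be sketched in base R with `match()`. This is a different technique from the one above, not a restatement of it: it tries full.name first, falls back to first.of.full.name, and leaves NA where neither matches. It keeps df1's row order and duplicate rows, and it assumes the values in df2$name are unique (`match()` returns only the first hit). The data below recreates the question's tables, including the unmatched "south africa" row.

```r
df1 <- data.frame(
  full.name = c("karachi east", "phu my", "phu my", "delhi",
                "west australia", "west australia", "abu dhabai", "south africa"),
  first.of.full.name = c("karachi", "phu", "phu", "delhi",
                         "west", "west", "abu", "south"),
  country = c("pakistan", "england", "india", "china",
              "england", "australia", "xyz", "africa"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  name = c("karachi", "phu my", "delhi", "west australia", "abu"),
  corrected.country = c("pakistan", "england", "India", "australia", "dubai"),
  stringsAsFactors = FALSE
)

# Row position in df2 for each key; NA where there is no match
idx_full  <- match(df1$full.name, df2$name)
idx_first <- match(df1$first.of.full.name, df2$name)

# Prefer the full.name match, fall back to the first-word match
idx <- ifelse(is.na(idx_full), idx_first, idx_full)

# Indexing df2 with NA yields a row of NAs, which is what unmatched rows need
result <- cbind(df1, df2[idx, , drop = FALSE])
rownames(result) <- NULL
result
```

Because nothing is deduplicated here, this version does not depend on df1 being free of repeated rows.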

df1 <- data.frame(stringsAsFactors=FALSE,
            full.name = c("karachi east", "phu my", "phu my", "delhi",
                          "west australia", "west australia", "abu dhabai"),
   first.of.full.name = c("karachi", "phu", "phu", "delhi", "west", "west",
                          "abu"),
              country = c("pakistan", "england", "india", "china", "england",
                          "australia", "xyz"))
df2 <- data.frame(stringsAsFactors=FALSE,
                name = c("karachi", "phu my", "delhi", "west australia", "abu"),
   corrected.country = c("pakistan", "england", "India", "australia", "dubai")
)

Now load the dplyr package. You can use inner_join to match each "key" variable (i.e. full.name and first.of.full.name) against df2, and then use union() to combine the two sets of results.

library(dplyr)

df3 <- union(inner_join(df1, df2, by = c("first.of.full.name" = "name")) , 
      inner_join(df1, df2, by = c("full.name" = "name")))

df3
#>        full.name first.of.full.name   country corrected.country
#> 1   karachi east            karachi  pakistan          pakistan
#> 2          delhi              delhi     china             India
#> 3     abu dhabai                abu       xyz             dubai
#> 4         phu my                phu   england           england
#> 5         phu my                phu     india           england
#> 6 west australia               west   england         australia
#> 7 west australia               west australia         australia
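One property of `union()` is worth keeping in mind here: on data frames it returns distinct rows, which is why the "delhi" row (matched by both keys) appears only once above. The flip side is that genuinely repeated rows in the input are collapsed too. A small sketch with made-up data frames `a` and `b` (hypothetical, not from the question), contrasted with `union_all()`, which keeps every row:

```r
library(dplyr)

# Hypothetical inputs: `a` contains a repeated row, and 2 appears in both
a <- data.frame(x = c(1, 1, 2))
b <- data.frame(x = c(2, 3))

u  <- dplyr::union(a, b)      # distinct rows only
ua <- dplyr::union_all(a, b)  # all rows, duplicates kept

nrow(u)
nrow(ua)
```

If df1 may legitimately contain identical rows that must all be preserved, `union()` alone will not do that.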

If you break the combined union() call into separate steps, it looks like this:

library(dplyr)

df3 <- inner_join(df1, df2, by = c("first.of.full.name" = "name"))
df3
#>      full.name first.of.full.name  country corrected.country
#> 1 karachi east            karachi pakistan          pakistan
#> 2        delhi              delhi    china             India
#> 3   abu dhabai                abu      xyz             dubai

df4 <- inner_join(df1, df2, by = c("full.name" = "name"))
df4
#>        full.name first.of.full.name   country corrected.country
#> 1         phu my                phu   england           england
#> 2         phu my                phu     india           england
#> 3          delhi              delhi     china             India
#> 4 west australia               west   england         australia
#> 5 west australia               west australia         australia

df5 <- union(df3, df4)
df5
#>        full.name first.of.full.name   country corrected.country
#> 1   karachi east            karachi  pakistan          pakistan
#> 2          delhi              delhi     china             India
#> 3     abu dhabai                abu       xyz             dubai
#> 4         phu my                phu   england           england
#> 5         phu my                phu     india           england
#> 6 west australia               west   england         australia
#> 7 west australia               west australia         australia

Created on 2019-02-27 by the reprex package (v0.2.0).