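I have two datasets, shown below. In df1, full.name is a column and first of full name is the first word of full.name; the country column in df1 is incorrect. I therefore want to match two columns of df1 (full.name and first of full name) against the name column of df2. If either of the two df1 columns matches df2's name, the corrected country value should be printed in the corresponding column. If neither full.name nor first of full name matches df2's name, the row should still be printed with its full.name and first of full name values, and with NA in name and corrected country.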
df1:
full.name       first of full name  country
karachi east    karachi             pakistan
phu my          phu                 england
phu my          phu                 india
delhi           delhi               china
west australia  west                england
west australia  west                australia
abu dhabai      abu                 xyz
south africa    south               africa
and
df2:
name            corrected.country
karachi         pakistan
phu my          england
delhi           India
west australia  australia
abu             dubai
I would like the output to be
full.name       first of full name  country    name            corrected country
karachi east    karachi             pakistan   karachi         pakistan
phu my          phu                 england    phu my          england
phu my          phu                 india      phu my          england
delhi           delhi               china      delhi           India
west australia  west                england    west australia  australia
west australia  west                australia  west australia  australia
abu dhabai      abu                 xyz        abu             dubai
south africa    south               africa     NA              NA
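To restate: I want to match df1's full.name and first of full name against df2's name. If either of those df1 columns matches df2's name column, the output should contain the name column and the corrected country column for that row.
I know I have made this sound a bit complicated, but I really want to solve this problem. Please help.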
Answer 0: (score: 0)
This should work as long as there are no duplicates in your data frames.
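library(dplyr)
# Join once on each key column, copying the matched key into `name`,
# then combine both results. Stored as df3 because the update below reuses it.
df3 <- mutate(inner_join(df1, df2, by = c("full.name" = "name")), name = full.name) %>%
  dplyr::union(., mutate(inner_join(df1, df2, by = c("first.of.full.name" = "name")), name = first.of.full.name)) %>%
  select(1, 2, 3, 5, 4) # just ordering the columns
df3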
       full.name first.of.full.name   country           name corrected.country
1         phu my                phu   england         phu my           england
2         phu my                phu     india         phu my           england
3          delhi              delhi     china          delhi             India
4 west australia               west   england west australia         australia
5 west australia               west australia west australia         australia
6   karachi east            karachi  pakistan        karachi          pakistan
7     abu dhabai                abu       xyz            abu             dubai
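The caveat about duplicates can be checked before running the joins. The sketch below is only one reading of that caveat, and the small data frames are made up for illustration: union() drops rows that are identical in every column, so fully duplicated rows in df1 would collapse into one, and a repeated value in df2$name would produce several output rows per match.

```r
# Toy data, invented for this illustration only
df1 <- data.frame(
  full.name          = c("phu my", "phu my", "delhi"),
  first.of.full.name = c("phu", "phu", "delhi"),
  country            = c("england", "india", "china"),
  stringsAsFactors   = FALSE
)
df2 <- data.frame(
  name              = c("phu my", "delhi"),
  corrected.country = c("england", "India"),
  stringsAsFactors  = FALSE
)

# anyDuplicated() returns 0 when nothing is duplicated,
# otherwise the index of the first duplicated row or element.
dup_rows_df1 <- anyDuplicated(df1)       # rows identical in every column
dup_keys_df2 <- anyDuplicated(df2$name)  # repeated lookup keys

dup_rows_df1
dup_keys_df2
```

Here the two "phu my" rows differ in country, so they do not count as duplicates.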
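When you simply merge two data.frames, the two columns being merged become one, so I had to find a workaround to keep the name column in the result.
When copying my code, pay attention to the column names. They should be the same in R, though.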
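The point about the two key columns collapsing into one can be seen in a small sketch. The two-row data frames are made up for illustration; the behaviour shown is that inner_join() keeps the key under df1's column name and drops df2's name column, which is why mutate() is used to add it back.

```r
library(dplyr)

# Toy data, invented for this illustration only
df1 <- data.frame(
  full.name        = c("delhi", "phu my"),
  country          = c("china", "india"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  name              = c("delhi", "phu my"),
  corrected.country = c("India", "england"),
  stringsAsFactors  = FALSE
)

joined <- inner_join(df1, df2, by = c("full.name" = "name"))
names(joined)  # there is no "name" column after the join

# Workaround: copy the key that matched into a new `name` column
with_name <- mutate(joined, name = full.name)
names(with_name)
```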
Update:
To include names that are not in df2:
> df1_2
       full.name first.of.full.name   country
1   karachi east            karachi  pakistan
2         phu my                phu   england
3         phu my                phu     india
4          delhi              delhi     china
5 west australia               west   england
6 west australia               west australia
7     abu dhabai                abu       xyz
8      Stuttgart          Stuttgart   germany
bind_rows(df3, df1_2[rowSums(sapply(1:2, function(x) df1_2[,x] %in% df2$name)) == 0,])
       full.name first.of.full.name   country           name corrected.country
1         phu my                phu   england         phu my           england
2         phu my                phu     india         phu my           england
3          delhi              delhi     china          delhi             India
4 west australia               west   england west australia         australia
5 west australia               west australia west australia         australia
6   karachi east            karachi  pakistan        karachi          pakistan
7     abu dhabai                abu       xyz            abu             dubai
8      Stuttgart          Stuttgart   germany           <NA>              <NA>
df1_2 is your df1 with a new row, and df3 is the result from above.
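The row filter in that one-liner is dense, so here it is unpacked as a sketch on a three-row example. The data and the intermediate names hits, unmatched, unmatched2 and matched are made up for illustration. A row counts as unmatched when neither of its first two columns occurs in df2$name, and bind_rows() then fills the columns it lacks with NA.

```r
library(dplyr)

# Toy data, invented for this illustration only
df1_2 <- data.frame(
  full.name          = c("delhi", "karachi east", "Stuttgart"),
  first.of.full.name = c("delhi", "karachi", "Stuttgart"),
  country            = c("china", "pakistan", "germany"),
  stringsAsFactors   = FALSE
)
df2 <- data.frame(
  name              = c("karachi", "delhi"),
  corrected.country = c("pakistan", "India"),
  stringsAsFactors  = FALSE
)

# One logical column per key column: TRUE where the value occurs in df2$name
hits <- sapply(1:2, function(x) df1_2[, x] %in% df2$name)

# Keep the rows where neither key column matched
unmatched <- df1_2[rowSums(hits) == 0, ]

# The same filter spelled out with column names
unmatched2 <- filter(
  df1_2,
  !(full.name %in% df2$name | first.of.full.name %in% df2$name)
)

# A stand-in for the already matched rows (df3 in the answer)
matched <- data.frame(
  full.name = "delhi", first.of.full.name = "delhi", country = "china",
  name = "delhi", corrected.country = "India",
  stringsAsFactors = FALSE
)

# Columns missing from `unmatched` (name, corrected.country) become NA
result <- bind_rows(matched, unmatched)
result
```

One thing to be aware of: with a single-row data frame, sapply() returns a plain vector rather than a matrix and rowSums() would fail, so the filter() form may be the safer of the two.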
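Answer 1: (score: 0)
I start by recreating your datasets. You do not need to run this part, since you already have your own data, but I include it here for anyone else who wants to reproduce the solution.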
df1 <- data.frame(stringsAsFactors = FALSE,
  full.name = c("karachi east", "phu my", "phu my", "delhi",
                "west australia", "west australia", "abu dhabai"),
  first.of.full.name = c("karachi", "phu", "phu", "delhi", "west", "west",
                         "abu"),
  country = c("pakistan", "england", "india", "china", "england",
              "australia", "xyz"))
df2 <- data.frame(stringsAsFactors = FALSE,
  name = c("karachi", "phu my", "delhi", "west australia", "abu"),
  corrected.country = c("pakistan", "england", "India", "australia", "dubai")
)
Now load the dplyr package. You can use inner_join() to match each "key" variable (that is, full.name and first.of.full.name) against df2, and then use union() to combine the two sets of results.
library(dplyr)
df3 <- union(inner_join(df1, df2, by = c("first.of.full.name" = "name")) ,
inner_join(df1, df2, by = c("full.name" = "name")))
df3
#>        full.name first.of.full.name   country corrected.country
#> 1   karachi east            karachi  pakistan          pakistan
#> 2          delhi              delhi     china             India
#> 3     abu dhabai                abu       xyz             dubai
#> 4         phu my                phu   england           england
#> 5         phu my                phu     india           england
#> 6 west australia               west   england         australia
#> 7 west australia               west australia         australia
Broken down into separate steps, this is:
library(dplyr)
df3 <- inner_join(df1, df2, by = c("first.of.full.name" = "name"))
df3
#>      full.name first.of.full.name  country corrected.country
#> 1 karachi east            karachi pakistan          pakistan
#> 2        delhi              delhi    china             India
#> 3   abu dhabai                abu      xyz             dubai
df4 <- inner_join(df1, df2, by = c("full.name" = "name"))
df4
#>        full.name first.of.full.name   country corrected.country
#> 1         phu my                phu   england           england
#> 2         phu my                phu     india           england
#> 3          delhi              delhi     china             India
#> 4 west australia               west   england         australia
#> 5 west australia               west australia         australia
df5 <- union(df3, df4)
df5
#>        full.name first.of.full.name   country corrected.country
#> 1   karachi east            karachi  pakistan          pakistan
#> 2          delhi              delhi     china             India
#> 3     abu dhabai                abu       xyz             dubai
#> 4         phu my                phu   england           england
#> 5         phu my                phu     india           england
#> 6 west australia               west   england         australia
#> 7 west australia               west australia         australia
Created on 2019-02-27 by the reprex package (v0.2.0).
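The join-and-union result above has no name column and no row for names missing from df2, both of which the question asks for. The following is a minimal sketch of one alternative that produces them in a single pass and keeps df1's row order. It is not the method used in either answer, the shortened data is made up for illustration, and it assumes that df2$name has no repeated values and that a match on full.name should take priority over a match on first.of.full.name.

```r
library(dplyr)

# Shortened toy data based on the question, for illustration only
df1 <- data.frame(
  full.name          = c("karachi east", "phu my", "delhi", "south africa"),
  first.of.full.name = c("karachi", "phu", "delhi", "south"),
  country            = c("pakistan", "india", "china", "africa"),
  stringsAsFactors   = FALSE
)
df2 <- data.frame(
  name              = c("karachi", "phu my", "delhi"),
  corrected.country = c("pakistan", "england", "India"),
  stringsAsFactors  = FALSE
)

result <- df1 %>%
  mutate(
    # Pick whichever key occurs in df2$name; full.name wins if both do
    name = case_when(
      full.name %in% df2$name          ~ full.name,
      first.of.full.name %in% df2$name ~ first.of.full.name,
      TRUE                             ~ NA_character_
    ),
    # Look the chosen key up in df2; match() gives NA when there is no key
    corrected.country = df2$corrected.country[match(name, df2$name)]
  )

result
```

Rows with no match, such as "south africa" here, stay in the result with NA in both name and corrected.country.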