Merging data frames with partial matches in R

Time: 2020-02-09 14:48:56

Tags: r

Let's say I have a data frame df1 with the following variables:

Continent   Country
1   Europe  Russia
2   Asia    Myanmar (Burma)
3   africa  Benin
4   africa  Botswana
5   africa  Burkina

and a data frame df2 with the following variables:

Continent   Country
1   Europe  Russian Federation
2   Asia    Myanmar
3   africa  Benin,new
4   africa  Botswana
5   africa  Burkina

How can I merge the two data frames on Country using partial matching?

2 answers:

Answer 0: (score: 2)

You can merge on the first five characters of Country. You will need the stringr package installed.

Reproduce the data:

a <- data.frame(Continent = c("Europe", "Asia", "africa", "africa", "africa"), Country = c("Russia", "Myanmar (Burma)", "Benin", "Botswana", "Burkina"))
b <- data.frame(Continent = c("Europe", "Asia", "africa", "africa", "africa"), Country = c("Russian Federation", "Myanmar", "Benin,new", "Botswana", "Burkina"))

Create a key variable holding the first five letters of each country name, in lowercase:

 a$key <- stringr::str_extract(tolower(a$Country), "\\b[a-z]{0,5}")
 b$key <- stringr::str_extract(tolower(b$Country), "\\b[a-z]{0,5}")
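To see which keys this step produces, here is a minimal base-R sketch. The helper `make_key` is hypothetical (it is not part of the answer) and only approximates the stringr call for names that start with a letter. It also illustrates a caveat of prefix keys: distinct countries can share the same first five letters.

```r
# Hypothetical helper: base-R approximation of the key built above
# (the leading run of up to five lowercase letters).
make_key <- function(x) sub("^([a-z]{0,5}).*$", "\\1", tolower(x))

countries <- c("Russia", "Russian Federation", "Myanmar (Burma)", "Benin,new")
keys <- make_key(countries)
print(keys)

# Caveat: two different countries collapse to the same key.
collision <- make_key(c("Niger", "Nigeria"))
print(collision)
```

If your real data contains such collisions, a longer prefix or a manual lookup table for the affected names may be safer.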

Then merge on the new key (you may want to rename the columns before merging):

  merge( a , b , by="key")
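As an alternative to renaming columns by hand, the `suffixes` argument of base R's `merge` can label the overlapping columns. The following is a small sketch on a two-row subset of the data; the suffix names `.df1` and `.df2` are an arbitrary choice for illustration.

```r
# Two-row subset of the example data, with the key column already built.
a <- data.frame(Continent = c("Europe", "Asia"),
                Country = c("Russia", "Myanmar (Burma)"),
                key = c("russi", "myanm"),
                stringsAsFactors = FALSE)
b <- data.frame(Continent = c("Europe", "Asia"),
                Country = c("Russian Federation", "Myanmar"),
                key = c("russi", "myanm"),
                stringsAsFactors = FALSE)

# Overlapping non-key columns get the given suffixes instead of .x / .y.
merged <- merge(a, b, by = "key", suffixes = c(".df1", ".df2"))
print(names(merged))
print(merged)
```

By default `merge` keeps only rows whose key appears in both data frames; passing `all = TRUE` would also keep the unmatched rows, filled with NA.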

Answer 1: (score: 0)

It would help to know what the final/desired data frame should look like.

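You could consider the fuzzyjoin package for merging these two data frames. One approach is to use str_detect to check whether one Country string is contained in the other.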

library(tidyverse)
library(fuzzyjoin)

mf <- function(a, b) str_detect(a, b) | str_detect(b, a)

fuzzy_semi_join(df1, df2, by = "Country", match_fun = mf)

  Continent         Country
1    Europe          Russia
2      Asia Myanmar (Burma)
3    Africa           Benin
4    Africa        Botswana
5    Africa         Burkina

An inner join shows how the rows were matched (keeping both Country columns for comparison):

fuzzy_inner_join(df1, df2, by = "Country", match_fun = mf)

  Continent.x       Country.x Continent.y          Country.y
1      Europe          Russia      Europe Russian Federation
2        Asia Myanmar (Burma)        Asia            Myanmar
3      Africa           Benin      Africa          Benin,new
4      Africa        Botswana      Africa           Botswana
5      Africa         Burkina      Africa            Burkina
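To make the matching rule explicit, here is a rough base-R sketch of the same "one name contains the other" test, without fuzzyjoin. The helper `contains_either` is hypothetical, and unlike the `str_detect` version above it treats the names as literal strings (`fixed = TRUE`), so characters such as parentheses are not interpreted as regular expression syntax. It is shown on three rows only.

```r
df1 <- data.frame(Country = c("Russia", "Myanmar (Burma)", "Benin"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(Country = c("Russian Federation", "Myanmar", "Benin,new"),
                  stringsAsFactors = FALSE)

# Hypothetical helper: TRUE when either string contains the other literally.
contains_either <- function(x, y) {
  grepl(y, x, fixed = TRUE) | grepl(x, y, fixed = TRUE)
}

# Test every pair of rows, then keep the pairs that match.
pairs <- expand.grid(i = seq_len(nrow(df1)), j = seq_len(nrow(df2)))
hit <- mapply(contains_either,
              df1$Country[pairs$i], df2$Country[pairs$j],
              USE.NAMES = FALSE)
matched <- data.frame(Country.x = df1$Country[pairs$i[hit]],
                      Country.y = df2$Country[pairs$j[hit]],
                      stringsAsFactors = FALSE)
print(matched)
```

Comparing all pairs grows with the product of the two row counts, so this is only a sketch for small tables.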