Let's say I have a data frame df1 with the following variables:
Continent Country
1 Europe Russia
2 Asia Myanmar (Burma)
3 africa Benin
4 africa Botswana
5 africa Burkina
and a data frame df2 with the following variables:
Continent Country
1 Europe Russian Federation
2 Asia Myanmar
3 africa Benin,new
4 africa Botswana
5 africa Burkina
How can I merge the two data frames on Country using partial matching?
Answer 0 (score: 2)
You can merge on the first five characters of each country name. You will need to install the stringr package.
Reproduce the data:
a <- data.frame( Continent=c("Europe","Asia","africa","africa","africa"), Country=c("Russia","Myanmar (Burma)","Benin","Botswana","Burkina"))
b <- data.frame( Continent=c("Europe","Asia","africa","africa","africa"), Country=c("Russian Federation","Myanmar","Benin,new","Botswana","Burkina"))
Create a key variable holding up to the first five letters of each name, in lower case:
a$key <- stringr::str_extract(tolower(a$Country), "\\b[a-z]{0,5}")
b$key <- stringr::str_extract(tolower(b$Country), "\\b[a-z]{0,5}")
Then merge on the new key (you may want to rename the columns before merging):
merge( a , b , by="key")
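The same idea can be sketched in base R, without stringr. This is only an illustration under a few assumptions: `prefix_key` is a hypothetical helper name, and it uses `substr()` rather than the regular expression above, so it keeps the first five characters of any kind instead of the first run of letters. The `suffixes` argument of `merge()` labels the overlapping Continent and Country columns, which covers the renaming mentioned above.

```r
# Rebuild the two example data frames from this answer
a <- data.frame(
  Continent = c("Europe", "Asia", "africa", "africa", "africa"),
  Country   = c("Russia", "Myanmar (Burma)", "Benin", "Botswana", "Burkina"),
  stringsAsFactors = FALSE
)
b <- data.frame(
  Continent = c("Europe", "Asia", "africa", "africa", "africa"),
  Country   = c("Russian Federation", "Myanmar", "Benin,new", "Botswana", "Burkina"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: lower-case the name and keep its first n characters
prefix_key <- function(x, n = 5) substr(tolower(x), 1, n)

a$key <- prefix_key(a$Country)
b$key <- prefix_key(b$Country)

# suffixes marks which data frame each overlapping column came from
merged <- merge(a, b, by = "key", suffixes = c(".a", ".b"))

# A non-zero value here would mean two countries share the same prefix,
# in which case merge() would pair rows that do not belong together
anyDuplicated(a$key)
```

Checking for duplicated keys before merging is worth the extra line, because a short prefix cannot tell apart names that begin the same way.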
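Answer 1 (score: 0)
It would help to know what the final/desired data frame should look like.
You could consider using the fuzzyjoin package to merge these two data frames. One approach is to use str_detect and check whether one Country string is contained in the other.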
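library(tidyverse)
library(fuzzyjoin)
# TRUE when either Country string contains the other; fixed() treats names
# such as "Myanmar (Burma)" as literal text rather than as a regex
mf <- function(a, b) str_detect(a, fixed(b)) | str_detect(b, fixed(a))
# keep the df1 rows that have at least one match in df2
fuzzy_semi_join(df1, df2, by = "Country", match_fun = mf)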
Continent Country
1 Europe Russia
2 Asia Myanmar (Burma)
3 Africa Benin
4 Africa Botswana
5 Africa Burkina
An inner join shows how the rows are matched (keeping both Country columns for comparison):
fuzzy_inner_join(df1, df2, by = "Country", match_fun = mf)
Continent.x Country.x Continent.y Country.y
1 Europe Russia Europe Russian Federation
2 Asia Myanmar (Burma) Asia Myanmar
3 Africa Benin Africa Benin,new
4 Africa Botswana Africa Botswana
5 Africa Burkina Africa Burkina
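For readers who want to see what the containment test does without installing fuzzyjoin, here is a rough base-R sketch. It is not how fuzzyjoin works internally; `contains_either` is a hypothetical helper, and the data frames are rebuilt from the question with only the Country column. Every df1 country is compared with every df2 country, and a pair counts as a match when either name literally contains the other.

```r
# Country columns from the question's df1 and df2
df1 <- data.frame(
  Country = c("Russia", "Myanmar (Burma)", "Benin", "Botswana", "Burkina"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  Country = c("Russian Federation", "Myanmar", "Benin,new", "Botswana", "Burkina"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: TRUE when either string contains the other.
# fixed = TRUE compares literal text, so "(" and ")" are not regex syntax.
contains_either <- function(x, y) {
  grepl(y, x, fixed = TRUE) || grepl(x, y, fixed = TRUE)
}

# Logical matrix: rows index df1 countries, columns index df2 countries
hits <- outer(df1$Country, df2$Country, Vectorize(contains_either))

# Row/column positions of every matching pair
idx <- which(hits, arr.ind = TRUE)

matched <- data.frame(
  Country.x = df1$Country[idx[, "row"]],
  Country.y = df2$Country[idx[, "col"]],
  stringsAsFactors = FALSE
)
matched
```

Comparing all pairs grows with the product of the two row counts, so this is practical for short country lists like these rather than for large tables.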