Merging values from another data frame based on partial string matching when the matching columns are not in the same order

Time: 2019-06-29 08:25:47

Tags: r dictionary match

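I want to merge a column from df2 into df1 by matching df1$District_name against df2$Districts. However, the character values in df1$District_name and df2$Districts are in a different order, and df1 and df2 have different lengths. The values also do not match exactly. df1 has more rows than df2, so the extra district names should get a value of zero.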

 df1=data.frame(State_name=c("Maharashtra","Andhra Pradesh","Bihar","Bihar","West Bengal","Gujarat","Gujarat","Assam"),
           District_name=c("Nashik","Chittoor","Madhepura","Kishanganj","Howrah","Gandhinagar","Ahmadabad","Sivasagar"),
           Value1=c(5,3,6,4,4,3,2,4))

df2=data.frame(Districts=c("Nashik","Chitoor","Kishanganj","Madhepur","Sibhasagar","Ahmadabad"),
           FinanceIndex=c(0.20975,0.12187,0.37155,0.66128,0.10918,0.54730))


# df1
      State_name District_name Value1
1    Maharashtra        Nashik      5
2 Andhra Pradesh      Chittoor      3
3          Bihar     Madhepura      6
4          Bihar    Kishanganj      4
5    West Bengal        Howrah      4
6        Gujarat   Gandhinagar      3
7        Gujarat     Ahmadabad      2
8          Assam     Sivasagar      4

# df2
      Districts FinanceIndex
1        Nashik      0.20975
2       Chitoor      0.12187
3    Kishanganj      0.37155
4      Madhepur      0.66128
5    Sibhasagar      0.10918
6     Ahmadabad      0.54730

I used the match function, but because of the spelling differences, most of the values end up as zero.

index<-match(df1$District_name, df2$Districts)
df1$finindex=df2$FinanceIndex[index]
df1$finindex[is.na(df1$finindex)]=0

For string matching, I found this function, which compares words that sound alike:

library(RecordLinkage)
soundex('Nellore')==soundex('Vellore')
#FALSE

The output should be:

# df1
     State_name District_name Value1 finindex
1    Maharashtra        Nashik      5  0.20975
2 Andhra Pradesh      Chittoor      3  0.12187
3          Bihar     Madhepura      6  0.66128
4          Bihar    Kishanganj      4  0.37155
5    West Bengal        Howrah      4  0.00000
6        Gujarat   Gandhinagar      3  0.00000
7        Gujarat     Ahmadabad      2  0.54730
8          Assam     Sivasagar      4  0.10918

Can these two functions be used together to solve the problem? Or is there another way to solve it?

1 answer:

Answer 0 (score: 1)

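One option is a partial match based on string distance, using stringdist_left_join from fuzzyjoin. Rows of df1 without a close enough match get NA rather than 0: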

library(dplyr)      # provides select()
library(fuzzyjoin)
stringdist_left_join(df1, df2, by = c("District_name" = "Districts")) %>%
    select(-Districts)
#      State_name District_name Value1 FinanceIndex
#1    Maharashtra        Nashik      5      0.20975
#2 Andhra Pradesh      Chittoor      3      0.12187
#3          Bihar     Madhepura      6      0.66128
#4          Bihar    Kishanganj      4      0.37155
#5    West Bengal        Howrah      4           NA
#6        Gujarat   Gandhinagar      3           NA
#7        Gujarat     Ahmadabad      2      0.54730
#8          Assam     Sivasagar      4      0.10918
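
If the extra packages are not available, roughly the same idea can be sketched in base R with adist(), which computes edit distances between strings. This is only a sketch: the threshold max_dist = 2 is an assumption chosen to fit the example data, and the names d, nearest and nearest_dist are made up for illustration. It also fills unmatched districts with 0, as the question asks, instead of NA.

```r
# Example data, copied from the question
df1 <- data.frame(
  State_name = c("Maharashtra", "Andhra Pradesh", "Bihar", "Bihar",
                 "West Bengal", "Gujarat", "Gujarat", "Assam"),
  District_name = c("Nashik", "Chittoor", "Madhepura", "Kishanganj",
                    "Howrah", "Gandhinagar", "Ahmadabad", "Sivasagar"),
  Value1 = c(5, 3, 6, 4, 4, 3, 2, 4),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  Districts = c("Nashik", "Chitoor", "Kishanganj", "Madhepur",
                "Sibhasagar", "Ahmadabad"),
  FinanceIndex = c(0.20975, 0.12187, 0.37155, 0.66128, 0.10918, 0.54730),
  stringsAsFactors = FALSE
)

# Assumed threshold: accept a match only if at most 2 edits apart
max_dist <- 2

# Edit-distance matrix: one row per df1 name, one column per df2 name
d <- adist(df1$District_name, df2$Districts)

# Index and distance of the closest df2 name for each df1 name
nearest <- apply(d, 1, which.min)
nearest_dist <- d[cbind(seq_len(nrow(d)), nearest)]

# Use the nearest FinanceIndex when it is close enough, otherwise 0
df1$finindex <- ifelse(nearest_dist <= max_dist,
                       df2$FinanceIndex[nearest], 0)
df1
```

A threshold that is too large would start pairing unrelated districts, so it is worth checking the accepted pairs by eye on real data. With the fuzzyjoin result above, the remaining NA values can be set to zero in the same way the question does it, using is.na().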