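I want to merge a column into df1 by matching df1$District_name against df2$Districts. However, the character values in df1$District_name and df2$Districts are in a different order, the two data frames have different lengths, and the values do not match exactly because of spelling differences. df1 has more rows than df2, so the extra district names in df1 should get a value of zero.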
df1=data.frame(State_name=c("Maharashtra","Andhra Pradesh","Bihar","Bihar","West Bengal","Gujarat","Gujarat","Assam"),
District_name=c("Nashik","Chittoor","Madhepura","Kishanganj","Howrah","Gandhinagar","Ahmadabad","Sivasagar"),
Value1=c(5,3,6,4,4,3,2,4))
df2=data.frame(Districts=c("Nashik","Chitoor","Kishanganj","Madhepur","Sibhasagar","Ahmadabad"),
FinanceIndex=c(0.20975,0.12187,0.37155,0.66128,0.10918,0.54730))
# df1
State_name District_name Value1
1 Maharashtra Nashik 5
2 Andhra Pradesh Chittoor 3
3 Bihar Madhepura 6
4 Bihar Kishanganj 4
5 West Bengal Howrah 4
6 Gujarat Gandhinagar 3
7 Gujarat Ahmadabad 2
8 Assam Sivasagar 4
# df2
Districts FinanceIndex
1 Nashik 0.20975
2 Chitoor 0.12187
3 Kishanganj 0.37155
4 Madhepur 0.66128
5 Sibhasagar 0.10918
6 Ahmadabad 0.54730
I used the match function, but because of the spelling differences most of the values ended up as zero.
index<-match(df1$District_name, df2$Districts)
df1$finindex=df2$FinanceIndex[index]
df1$finindex[is.na(df1$finindex)]=0
For string matching, I found this function, which compares words that sound alike:
library(RecordLinkage)
soundex('Nellore')==soundex('Vellore')
#FALSE
The expected output is:
# df1
State_name District_name Value1 finindex
1 Maharashtra Nashik 5 0.20975
2 Andhra Pradesh Chittoor 3 0.12187
3 Bihar Madhepura 6 0.66128
4 Bihar Kishanganj 4 0.37155
5 West Bengal Howrah 4 0.00000
6 Gujarat Gandhinagar 3 0.00000
7 Gujarat Ahmadabad 2 0.54730
8 Assam Sivasagar 4 0.10918
Can these two functions be used together to solve the problem, or is there another way to do it?
Answer 0 (score: 1)
One option is a partial match based on string distance (stringdist), using the fuzzyjoin package:
library(fuzzyjoin)
library(dplyr)  # provides %>% and select()
stringdist_left_join(df1, df2, by = c("District_name" = "Districts")) %>%
  select(-Districts)
# State_name District_name Value1 FinanceIndex
#1 Maharashtra Nashik 5 0.20975
#2 Andhra Pradesh Chittoor 3 0.12187
#3 Bihar Madhepura 6 0.66128
#4 Bihar Kishanganj 4 0.37155
#5 West Bengal Howrah 4 NA
#6 Gujarat Gandhinagar 3 NA
#7 Gujarat Ahmadabad 2 0.54730
#8 Assam Sivasagar 4 0.10918
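The join above leaves NA for the unmatched districts, while the question asks for zeros in a column named finindex. As a rough sketch of the same idea without extra packages, the following uses base R's adist() to compute edit distances and keeps the closest df2 entry only when it is within a threshold. The names d, best, best_dist and max_dist are illustrative, and the cutoff of 2 is an assumption that happens to suit this sample data; it may need tuning for real district lists.

```r
# Sample data from the question
df1 <- data.frame(
  State_name = c("Maharashtra", "Andhra Pradesh", "Bihar", "Bihar",
                 "West Bengal", "Gujarat", "Gujarat", "Assam"),
  District_name = c("Nashik", "Chittoor", "Madhepura", "Kishanganj",
                    "Howrah", "Gandhinagar", "Ahmadabad", "Sivasagar"),
  Value1 = c(5, 3, 6, 4, 4, 3, 2, 4)
)
df2 <- data.frame(
  Districts = c("Nashik", "Chitoor", "Kishanganj", "Madhepur",
                "Sibhasagar", "Ahmadabad"),
  FinanceIndex = c(0.20975, 0.12187, 0.37155, 0.66128, 0.10918, 0.54730)
)

# Edit-distance matrix: one row per df1 district, one column per df2 district
d <- adist(df1$District_name, df2$Districts)

# Index of the closest df2 entry for each df1 row, and the distance to it
best <- apply(d, 1, which.min)
best_dist <- d[cbind(seq_len(nrow(d)), best)]

# Assumed threshold: accept a match only if at most 2 edits apart
max_dist <- 2

# Take the matched FinanceIndex, or 0 when nothing is close enough
df1$finindex <- ifelse(best_dist <= max_dist, df2$FinanceIndex[best], 0)
df1
```

A tighter threshold reduces the risk of pairing two genuinely different districts with similar names; a looser one tolerates heavier misspellings. If you stay with the fuzzyjoin result instead, replacing the NA values with 0 afterwards gives the same shape.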