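I have an input list of university names that contains typos and inconsistencies. I need to match them against an official list of university names so that I can link my data together.
I know fuzzy matching/joining is the way to go, but I'm a bit lost on the right approach. Any help would be greatly appreciated.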
d<-data.frame(name=c("University of New Yorkk", "The University of South
Carolina", "Syracuuse University", "University of South Texas",
"The University of No Carolina"), score = c(1,3,6,10,4))
y<-data.frame(name=c("University of South Texas", "The University of North
Carolina", "University of South Carolina", "Syracuse
University","University of New York"), distance = c(100, 400, 200, 20, 70))
I'd like the output to pair each input name with its closest official name:
matched<-data.frame(name=c("University of New Yorkk", "The University of South Carolina",
"Syracuuse University","University of South Texas","The University of No Carolina"),
correctmatch = c("University of New York", "University of South Carolina",
"Syracuse University","University of South Texas", "The University of North Carolina"))
Answer 0 (score: 1)
I use adist() for this kind of thing, along with a small wrapper function called closest_match() that compares a value against a set of "good/allowed" values.
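library(magrittr) # for the %>% pipe

# Return the element(s) of good_values with the smallest edit distance to
# bad_value. Ties return more than one name.
closest_match <- function(bad_value, good_values) {
  distances <- adist(bad_value, good_values, ignore.case = TRUE) %>%
    as.numeric() %>%
    setNames(good_values)

  distances[distances == min(distances)] %>%
    names()
}

# Look up the closest official name for every input name
sapply(d$name, function(x) closest_match(x, y$name)) %>%
  setNames(d$name)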
 University of New Yorkk The University of South\n Carolina     Syracuuse University
"University of New York"     "University of South Carolina" "University of New York"
  University of South Texas  The University of No Carolina
"University of South Texas" "University of South Carolina"
adist() uses the Levenshtein distance to measure how similar two strings are.
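To make the distance concrete, here is a small illustration of what adist() returns. The "kitten"/"sitting" pair is my own example and not from the question. The last two calls hint at a caveat: raw edit distance can rank a wrong candidate closer than the right one when the names differ by a prefix such as "The".

```r
# adist() lives in the utils package, which is attached by default.
# It returns a matrix, so [1, 1] pulls out the single distance.

# Two substitutions (k -> s, e -> i) and one insertion (g)
d_kitten <- adist("kitten", "sitting")[1, 1]                             # 3

# One extra "u" to delete
one_typo <- adist("Syracuuse University", "Syracuse University")[1, 1]   # 1

# Case differences cost nothing with ignore.case = TRUE
case_only <- adist("NEW YORK", "new york", ignore.case = TRUE)[1, 1]     # 0

# Caveat: "South" -> "North" is only 2 substitutions, while dropping "The "
# costs 4 deletions, so the wrong candidate is nearer.
to_north <- adist("The University of South Carolina",
                  "The University of North Carolina")[1, 1]              # 2
to_south <- adist("The University of South Carolina",
                  "University of South Carolina")[1, 1]                  # 4
```

Because of that caveat, it is worth eyeballing matches with a non-zero distance rather than trusting them blindly.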
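The `\n` in the first row of the output above suggests that the wrapped string literals in the question carry embedded line breaks, which would inflate the edit distances. As a rough sketch under stated assumptions, one could normalize the names before comparing them and then use the best match to pull `distance` from `y` next to `score` from `d`. The helper `normalize_name()`, the choice to drop a leading "The", and the data retyped with each name on a single line are all my assumptions for illustration, not something the question or answer specifies.

```r
# Question's data, retyped with every name on one line (assumption)
d <- data.frame(
  name = c("University of New Yorkk", "The University of South Carolina",
           "Syracuuse University", "University of South Texas",
           "The University of No Carolina"),
  score = c(1, 3, 6, 10, 4),
  stringsAsFactors = FALSE
)
y <- data.frame(
  name = c("University of South Texas", "The University of North Carolina",
           "University of South Carolina", "Syracuse University",
           "University of New York"),
  distance = c(100, 400, 200, 20, 70),
  stringsAsFactors = FALSE
)

# Hypothetical helper: lower-case, collapse runs of whitespace (including
# line breaks) into one space, and drop a leading "the ".
normalize_name <- function(x) {
  x <- tolower(trimws(gsub("[[:space:]]+", " ", x)))
  sub("^the ", "", x)
}

# Rows correspond to input names, columns to official names
dist_matrix <- adist(normalize_name(d$name), normalize_name(y$name))

# Column index of the nearest official name per row (first minimum on ties)
best <- apply(dist_matrix, 1, which.min)

linked <- data.frame(
  name         = d$name,
  correctmatch = y$name[best],
  edit_dist    = dist_matrix[cbind(seq_along(best), best)],
  score        = d$score,
  distance     = y$distance[best],
  stringsAsFactors = FALSE
)
linked
```

Keeping `edit_dist` in the result makes it easy to review or filter out weak matches, for example anything above a threshold you pick for your data.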