Question

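I have an input:

"I am travelling on my own. I just bought a round-the-world ticket to Singapore, Darwin, Perth, Adelaide, Melbourne, Brisbane, Gold Cost, Sydney Opra, Christchurch, gold coast richland, Auckland, australia and fji. It is a 10 month trip. I will be going on my own. I am not scared, but my friends and family seem to be against the idea. I have explained that it is safe, that I will probably meet people on the way, and that hostels are not as bad as they make out. For at least 1/3 of my trip I will be staying with friends and family. I am excited, but people's pessimistic views are making me doubt the safety. I am from the UK, so it is a long way from home, and they are scared that I will get into trouble. I have never been to the US"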

I have a list of places with up to 5000 rows, e.g. London, Singapore, Sydney, Auckland, Fiji, Gold Coast, Sydney Opera House, Australia, UK, US, ...

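The problem: extract the places mentioned in the input by matching it against the places list. The input contains spelling mistakes, so the closest match has to be found. The solution also needs to be optimized.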

Output: Singapore|Darwin|Perth|Adelaide|Melbourne|Brisbane|Gold Coast|Sydney Opera House|Christchurch|Auckland|Australia|Fiji|UK|US

Methods tried

library(RecordLinkage)  # levenshteinSim()
library(stringdist)
library(cba)            # sdists()

# Lower-case the input and replace punctuation with spaces
input <- tolower(gsub("[[:punct:]]", " ", input))
Places <- read.delim("\\Data\\Places_List.csv", row.names = NULL, header = TRUE, sep = ",")
Places <- as.matrix(Places)

################## Different methods tried ##########################
# Attempt 1: return the entries with the highest Levenshtein similarity
ClosestMatch2 <- function(string, stringVector) {
  distance <- levenshteinSim(string, stringVector)
  stringVector[distance == max(distance)]
}
ClosestMatch2(input, Places)
############### The above one doesn't work ##################

# Attempt 2: pre-filter with agrep, then keep the candidates with the smallest distance
ClosestMatch <- function(string, StringVector) {
  matches <- agrep(string, StringVector, value = TRUE)
  distance <- sdists(string, matches, method = "", weight = c(1, 0, 2))
  matches <- data.frame(matches, as.numeric(distance))
  matches <- subset(matches, distance == min(distance))
  as.character(matches$matches)
}
ClosestMatch(input, Places)
######## This works, but does not give proper results ###########

# Attempt 3: approximate grep of the input against the places list
k <- as.matrix(sapply(input, agrep, Places))
###### This doesn't work either

agrep, pmatch and str_detect (which won't work for spelling mistakes) don't work for bigger data sets.

Answer 1

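ClosestMatch2 works. In addition to it, add a character-count difference check and a partial substring match to catch the spelling mistakes.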

See also: R relevant matches between 2 huge data sets, even with spelling mistakes
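The idea in this answer could be sketched roughly as follows. This is only a minimal illustration, not the answerer's actual code: the function name `match_places`, its parameters and its thresholds are all assumptions made up for the example, and it uses base R's `adist()` instead of `levenshteinSim()` so that it runs without extra packages. It splits the input into phrases of one to three words, compares each phrase with every place name by edit distance, and discards pairs whose character counts differ too much. Partial substring matching is not shown, so a phrase like "Sydney Opra" would not be matched to "Sydney Opera House" by this sketch.

```r
# Hypothetical helper (not from the question or the answer): returns the
# entries of `places` that occur in `text`, tolerating small typos.
match_places <- function(text, places, max_rel_dist = 0.25,
                         max_len_diff = 2, max_words = 3) {
  # Normalise the text the same way the question does.
  clean <- tolower(gsub("[[:punct:]]", " ", text))
  words <- strsplit(trimws(clean), "\\s+")[[1]]

  # Candidate phrases: every run of 1..max_words consecutive words.
  cands <- character(0)
  for (n in seq_len(min(max_words, length(words)))) {
    starts <- seq_len(length(words) - n + 1)
    cands <- c(cands, vapply(starts, function(i) {
      paste(words[i:(i + n - 1)], collapse = " ")
    }, character(1)))
  }
  cands <- unique(cands)

  keys  <- tolower(places)
  len_c <- nchar(cands)
  len_k <- nchar(keys)

  # Levenshtein distance between every candidate and every place (base R).
  d <- adist(cands, keys)

  # Scale by the longer string, so short names such as "uk" need an exact match.
  rel <- d / outer(len_c, len_k, pmax)

  # Character-count filter: ignore pairs whose lengths differ too much.
  rel[abs(outer(len_c, len_k, "-")) > max_len_diff] <- Inf

  # Keep each place whose best candidate phrase is close enough.
  best <- apply(rel, 2, min)
  places[best <= max_rel_dist]
}

# Toy data standing in for the 5000-row list.
places <- c("London", "Singapore", "Gold Coast", "Fiji", "UK", "US")
text   <- "I bought a ticket to Singapore, Gold Cost and fji. I am from the UK."
match_places(text, places)
# expected: Singapore, Gold Coast, Fiji, UK
```

Scaling the distance by the longer string is one possible design choice: it lets "fji" match "Fiji" while stopping two-letter names such as "US" from matching unrelated short words. The thresholds would likely need tuning against the real list.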
