I am trying to find the n best matches for some text strings by Levenshtein distance (adist in R). The following example should clarify:
name <- c("holiday inn", "geico", "zgf", "morton phillips")
address <- c("400 lafayette pl tupelo ms", "227 geico plaza chevy chase md",
"811 quincy st washington dc", "1911 1st st rockville md")
source1 <- data.frame(name, address)
name <- c("williams sonoma", "mamas bbq", "davis polk", "hop a long diner","joes crag shack", "mike lowry place", "holiday inn", "zummer")
name2 <- c(NA, NA, NA, NA, NA, NA, "hi express", "zummer gunsul frasca")
address <- c("2 reads way new castle de", "248 w 4th st newark de",
"1100 21st st nw washington dc", "1804 w 5th st wilmington de",
"1208 kenwood parkway holdridge nb", "4203 ocean drive miami fl",
"400 lafayette pl tupelo ms", "811 quincy st washington dc")
source2 <- data.frame(name, name2, address)
The following computes the edit distances for the names and the addresses.
dist.mat.nm <- adist(source1$name, source2$name, partial = TRUE, ignore.case = TRUE)
dist.mat.ad <- adist(source1$address, source2$address, partial = TRUE, ignore.case = TRUE)
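To make the layout of these matrices concrete, here is a small sketch on made-up vectors (x and y are illustrative, not the data above). Rows of the adist result correspond to the first argument and columns to the second, so each row can be searched for its closest match.

```r
# Illustrative vectors, not the data from the post
x <- c("holiday inn", "geico")
y <- c("mamas bbq", "holiday inn", "geico plaza")

d <- adist(x, y, partial = TRUE, ignore.case = TRUE)

dim(d)                             # one row per element of x, one column per element of y
closest <- apply(d, 1, which.min)  # position in y of the closest match for each row
closest
```

With partial = TRUE, an element of x that occurs as a substring of an element of y has distance 0, which is why "geico" is closest to "geico plaza" here.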
The following returns the five best matches.
imat <- apply(dist.mat.nm, 1, order)[1:5, ]
top.nm <- data.frame(name = source1$name)
tmp <- apply(imat, 1, function(i) source2$name[i])
colnames(tmp) <- paste("top", 1:5, sep = ".")
top.nm <- cbind(top.nm, tmp)
imat <- apply(dist.mat.ad, 1, order)[1:5, ]
top.ad <- data.frame(address = source1$address)
tmp <- apply(imat, 1, function(i) source2$address[i])
colnames(tmp) <- paste("top", 1:5, sep = ".")
top.ad <- cbind(top.ad, tmp)
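What I would like to do is:
1. Find the row index in source2 of each match. (I have tried which and grepl, but I am open to suggestions if something else would be better.)
2. Return the corresponding adist values.
For each top.ad and top.nm column, the desired result is a corresponding index.match column and a distance column containing the adist value.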
For example, the row indices for top.name.1 would be c(7, 6, 4, 1).
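A minimal sketch of one way to get both pieces, the row index in source2 and the adist value, is to keep the result of order() instead of discarding it and then use it to look up the distances. The helper name top_matches and the column prefixes are assumptions for illustration, and the code assumes the second vector has more than one element.

```r
# Hypothetical helper (not from the original post): for each string in x,
# return the positions in y of the n closest matches and their distances.
top_matches <- function(x, y, n = 5) {
  d <- adist(x, y, partial = TRUE, ignore.case = TRUE)
  n <- min(n, length(y))
  # order() each row; t() so that rows correspond to x and columns to rank
  idx <- t(apply(d, 1, order))[, seq_len(n), drop = FALSE]
  # matrix indexing: pick d[row, matched column] for every cell of idx
  dst <- matrix(d[cbind(as.vector(row(idx)), as.vector(idx))],
                nrow = nrow(idx))
  colnames(idx) <- paste("index.match", seq_len(n), sep = ".")
  colnames(dst) <- paste("distance", seq_len(n), sep = ".")
  data.frame(name = x, idx, dst, stringsAsFactors = FALSE)
}

# Name vectors copied from the example data
nm1 <- c("holiday inn", "geico", "zgf", "morton phillips")
nm2 <- c("williams sonoma", "mamas bbq", "davis polk", "hop a long diner",
         "joes crag shack", "mike lowry place", "holiday inn", "zummer")

res <- top_matches(nm1, nm2, n = 5)
res
```

The matched strings themselves can then be recovered with nm2[res$index.match.1], and the same function can be applied to the address columns.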
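Any help would be greatly appreciated. Thanks.
Update: I found that the following code gives the row index for the first match, but I would like to be able to use vectors for x and y:
find.index <- function(x, y) return(which(grepl(paste(x, collapse = "|"), y, fixed = FALSE)))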
vec <- find.index(source1$name, source2$name)
How can I return the whole vector?
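One possible approach, sketched under the assumption that literal substring matching is what is wanted: run the search once per pattern instead of pasting all patterns into a single alternation, so that each element of x gets its own set of indices. The helper name find.index.each and the small vectors are hypothetical.

```r
# Hypothetical variant of find.index: one result per element of x
find.index.each <- function(x, y) {
  # one grep() call per pattern; fixed = TRUE treats the pattern literally
  hits <- lapply(x, function(p) grep(p, y, fixed = TRUE))
  names(hits) <- x
  hits  # a list; integer(0) where a pattern has no literal match in y
}

# Illustrative vectors, not the full data from the post
x <- c("holiday inn", "geico", "zgf")
y <- c("mamas bbq", "holiday inn", "geico plaza")

hits <- find.index.each(x, y)
hits
```

Note that this only finds literal matches. For approximate matches, the indices would have to come from the adist ordering rather than from grepl.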