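I have some code that I would like to vectorize, but I'm not sure how. The following code provides some sample data consisting of names and addresses.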
name <- c("holiday inn", "geico", "zgf", "morton phillips")
address <- c("400 lafayette pl tupelo ms", "227 geico plaza chevy chase md",
"811 quincy st washington dc", "1911 1st st rockville md")
source1 <- data.frame(name, address)
name <- c("williams sonoma", "mamas bbq", "davis polk", "hop a long diner",
"joes crag shack", "mike lowry place", "holiday inn", "zummer")
name2 <- c(NA, NA, NA, NA, NA, NA, "hi express", "zummer gunsul frasca")
address <- c("2 reads way new castle de", "248 w 4th st newark de",
"1100 21st st nw washington dc", "1804 w 5th st wilmington de",
"1208 kenwood parkway holdridge nb", "4203 ocean drive miami fl",
"400 lafayette pl tupelo ms", "811 quincy st washington dc")
source2 <- data.frame(name, name2, address)
This block computes the Levenshtein distance between two text columns with R's native adist function, then applies the min function.
dist.name<- adist(source1$name,source2$name, partial = TRUE, ignore.case = TRUE)
dist.address <- adist(source1$address, source2$address, partial = TRUE, ignore.case = TRUE)
min.name<-apply(dist.name, 2, min)
min.address <- apply(dist.address, 2, min)
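I would like to do the following:
1. Match source1$name against source2$name.
2. If step 1 yields NA, match on the Levenshtein distance between source1$address and source2$address instead.
I tried a for loop, which works for step 1 but not for step 2. This is the code I used to try to combine the two: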
match.s1.s2<-NULL
for(i in 1:nrow(dist.name)){
for(j in 1:nrow(dist.address)){
if(is.na(match(min.name[i], dist.name[i, ]))) {
s2.i <- match(min.address[j], dist.address[j,])
s1.i <- i
match.s1.s2 <- match.s1.s2<-rbind(data.frame(s2.i=s2.i,s1.i=s1.i,s2name=source2[s2.i,]$name,
s1name=source1[s1.i,]$name, adist=min.name[j],
s1.i.address = source1[s1.i,]$address,
s2.i.address = source2[s2.i,]$address),match.s1.s2)
} else {
s2.i<-match(min.name[i],dist.name[i,])
s1.i<-i
match.s1.s2<-rbind(data.frame(s2.i=s2.i,s1.i=s1.i,s2name=source2[s2.i,]$name, s1name=source1[s1.i,]$name,
adist=min.name[i], s1.i.address = source1[s1.i,]$address,
s2.i.address = source2[s2.i,]$address),match.s1.s2)
}
}
}
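My problem is that it is slow and the resulting data frame is far too large. The final data frame, match.s1.s2, should have the same number of rows as source1. Any suggestions or help would be greatly appreciated. Thanks.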
Answer 0 (score: 1)
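Using normalized scores (between 0 and 1) would be more efficient, because name and address scores are then on the same scale. That way you can use the vectorized ifelse to replace only the NA entries with the corresponding address scores. With non-normalized scores, you have to replace the whole row. Try this approach: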
dist.mat.nm <- adist(source1$name, source2$name, partial = TRUE, ignore.case = TRUE)
dist.mat.ad <- adist(source1$address, source2$address, partial = TRUE, ignore.case = TRUE)
#If you use non-normalized distances
dist.mat <- dist.mat.nm
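for (i in seq_len(nrow(dist.mat))) {
  # is.na() on a row returns a vector; anyNA() collapses it to a single TRUE/FALSE for if()
  if (anyNA(dist.mat[i, ])) dist.mat[i, ] <- dist.mat.ad[i, ]
}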
#If you use normalized distances
dist.mat <- ifelse(is.na(dist.mat.nm), dist.mat.ad, dist.mat.nm)
which.match <- function(x, nm) return(nm[which(x == min(x))[1]])
matches <- apply(dist.mat, 1, which.match, nm = source2$name)
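The question asks for a data frame with one row per row of source1. A minimal sketch of how the combined distance matrix could be turned into that result without growing a data frame in a loop is shown here. It assumes that "step 1 yields NA" means the name distances for a row are NA, as in the code above; na.rows and the column layout of match.s1.s2 are illustrative choices, not part of the original answer.

```r
# Example data copied from the question (name2 omitted; it is not used here)
source1 <- data.frame(
  name = c("holiday inn", "geico", "zgf", "morton phillips"),
  address = c("400 lafayette pl tupelo ms", "227 geico plaza chevy chase md",
              "811 quincy st washington dc", "1911 1st st rockville md"),
  stringsAsFactors = FALSE
)
source2 <- data.frame(
  name = c("williams sonoma", "mamas bbq", "davis polk", "hop a long diner",
           "joes crag shack", "mike lowry place", "holiday inn", "zummer"),
  address = c("2 reads way new castle de", "248 w 4th st newark de",
              "1100 21st st nw washington dc", "1804 w 5th st wilmington de",
              "1208 kenwood parkway holdridge nb", "4203 ocean drive miami fl",
              "400 lafayette pl tupelo ms", "811 quincy st washington dc"),
  stringsAsFactors = FALSE
)

dist.mat.nm <- adist(source1$name, source2$name, partial = TRUE, ignore.case = TRUE)
dist.mat.ad <- adist(source1$address, source2$address, partial = TRUE, ignore.case = TRUE)

# Row-wise fallback without a loop: rows whose name distances contain NA
# are replaced wholesale by the address distances (assumed interpretation).
dist.mat <- dist.mat.nm
na.rows <- rowSums(is.na(dist.mat.nm)) > 0
dist.mat[na.rows, ] <- dist.mat.ad[na.rows, ]

# Best source2 row for every source1 row (first minimum wins on ties).
s1.i <- seq_len(nrow(source1))
s2.i <- apply(dist.mat, 1, which.min)

# Build the result in one call; cbind(s1.i, s2.i) picks one cell per row.
match.s1.s2 <- data.frame(
  s1.i = s1.i,
  s2.i = s2.i,
  s1name = source1$name,
  s2name = source2$name[s2.i],
  adist = dist.mat[cbind(s1.i, s2.i)],
  s1.i.address = source1$address,
  s2.i.address = source2$address[s2.i],
  stringsAsFactors = FALSE
)
```

With the sample data no name is NA, so the fallback is never triggered and every row is matched by name. The sketch also assumes each row keeps at least one non-NA distance after the fallback; otherwise which.min returns an empty result for that row.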
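This should improve performance and solve your problem. If you are willing to switch to a normalized distance (instead of Levenshtein), I would recommend Jaro-Winkler.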
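To illustrate why normalized scores allow the element-wise ifelse, here is a hedged sketch that uses a normalized Levenshtein distance rather than Jaro-Winkler, since base R does not ship a Jaro-Winkler function. The helper norm.adist is hypothetical: it divides the full (non-partial) adist result by the length of the longer string, which keeps every score between 0 and 1. One source1 name is set to NA purely to exercise the fallback; that change is not part of the question's data.

```r
# Example data from the question, with one name blanked out on purpose
# (hypothetical modification, only to trigger the address fallback).
s1.name <- c("holiday inn", NA, "zgf", "morton phillips")
s1.address <- c("400 lafayette pl tupelo ms", "227 geico plaza chevy chase md",
                "811 quincy st washington dc", "1911 1st st rockville md")
s2.name <- c("williams sonoma", "mamas bbq", "davis polk", "hop a long diner",
             "joes crag shack", "mike lowry place", "holiday inn", "zummer")
s2.address <- c("2 reads way new castle de", "248 w 4th st newark de",
                "1100 21st st nw washington dc", "1804 w 5th st wilmington de",
                "1208 kenwood parkway holdridge nb", "4203 ocean drive miami fl",
                "400 lafayette pl tupelo ms", "811 quincy st washington dc")

# Hypothetical helper: Levenshtein distance divided by the longer string's
# length, so 0 means identical and 1 means nothing in common.
# Assumes no pair consists of two empty strings (that would divide by zero).
norm.adist <- function(a, b) {
  d <- adist(a, b, ignore.case = TRUE)
  d / outer(nchar(a), nchar(b), pmax)
}

norm.nm <- norm.adist(s1.name, s2.name)
norm.ad <- norm.adist(s1.address, s2.address)

# Both matrices share the 0-1 scale, so NA cells can be filled one by one.
norm.mat <- ifelse(is.na(norm.nm), norm.ad, norm.nm)

# Index of the best source2 candidate for each source1 row.
best <- apply(norm.mat, 1, which.min)
```

Because the name and address scores are comparable, only the missing cells are replaced and the rest of the name matrix is left untouched, which is the point the answer makes about ifelse.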