Vectorizing text mining across multiple columns

Time: 2018-01-08 16:31:09

Tags: r for-loop vectorization tm levenshtein-distance

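I have some code that I would like to vectorize, but I am not sure how. The following code provides some sample data consisting of names and addresses.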

name <- c("holiday inn", "geico", "zgf", "morton phillips")
address <- c("400 lafayette pl tupelo ms", "227 geico plaza chevy chase md", 
         "811 quincy st washington dc", "1911 1st st rockville md")

source1 <- data.frame(name, address)

name <- c("williams sonoma", "mamas bbq", "davis polk", "hop a long diner",
      "joes crag shack", "mike lowry place", "holiday inn", "zummer")

name2 <- c(NA, NA, NA, NA, NA, NA, "hi express", "zummer gunsul frasca")
address <- c("2 reads way new castle de", "248 w 4th st newark de",
         "1100 21st st nw washington dc", "1804 w 5th st wilmington de",
         "1208 kenwood parkway holdridge nb", "4203 ocean drive miami fl",
         "400 lafayette pl tupelo ms", "811 quincy st washington dc")
source2 <- data.frame(name, name2, address) 

This block uses R's native adist function to compute the Levenshtein distance between two text columns, then applies the min function.

dist.name<- adist(source1$name,source2$name, partial = TRUE, ignore.case = TRUE)
dist.address <- adist(source1$address, source2$address, partial = TRUE, ignore.case = TRUE)

min.name<-apply(dist.name, 2, min)
min.address <- apply(dist.address, 2, min)
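The orientation of the distance matrix matters for the steps that follow. adist(x, y) returns one row per element of x and one column per element of y, so apply(..., 2, min) gives one minimum per source2 entry, while apply(..., 1, min) gives one per source1 entry. A minimal sketch with toy vectors (the names a, b, d, row.min, col.min and best.b are illustrative only and not part of the original code):

```r
a <- c("holiday inn", "geico")
b <- c("mamas bbq", "holiday inn", "zummer")

# Rows follow a, columns follow b.
d <- adist(a, b, ignore.case = TRUE)
dim(d)

row.min <- apply(d, 1, min)        # best distance for each element of a
col.min <- apply(d, 2, min)        # best distance for each element of b
best.b  <- apply(d, 1, which.min)  # column index of the closest b for each a
b[best.b]
```

If the goal is one match per source1 row, the row-wise margin (1) is probably the one to use.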

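I would like to do the following:

  1. Match source1$name to source2$name based on the lowest Levenshtein distance.
  2. If the result of 1 is NA, match on source1$address and source2$address instead, again using Levenshtein distance. I tried a for loop, which works for 1 but not for 2. This is the code I used to try to combine the two: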

    match.s1.s2<-NULL  
    for(i in 1:nrow(dist.name)){
      for(j in 1:nrow(dist.address)){
    if(is.na(match(min.name[i], dist.name[i, ]))) {
    s2.i <- match(min.address[j], dist.address[j,])
    s1.i <- i
    match.s1.s2 <- match.s1.s2<-rbind(data.frame(s2.i=s2.i,s1.i=s1.i,s2name=source2[s2.i,]$name, 
                                             s1name=source1[s1.i,]$name, adist=min.name[j], 
                                             s1.i.address = source1[s1.i,]$address,
                                             s2.i.address = source2[s2.i,]$address),match.s1.s2)
    
    } else {
      s2.i<-match(min.name[i],dist.name[i,])
      s1.i<-i
      match.s1.s2<-rbind(data.frame(s2.i=s2.i,s1.i=s1.i,s2name=source2[s2.i,]$name, s1name=source1[s1.i,]$name, 
                                adist=min.name[i], s1.i.address = source1[s1.i,]$address,
                                s2.i.address = source2[s2.i,]$address),match.s1.s2)
        }
    
      }
    
    }
    
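My problem is that this is very slow and the resulting data frame ends up far too large. The final result, the data frame match.s1.s2, should have the same number of rows as source1. Any suggestions or help would be greatly appreciated. Thanks.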
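One way to picture the desired result, one output row per source1 row with an address fallback, is the following sketch. It is not the asker's code: s1 and s2 are small stand-ins for source1 and source2, the NA name is introduced on purpose to trigger the fallback, and "the result of 1 is NA" is interpreted here as "every name distance in that row is NA", which is an assumption.

```r
# Toy stand-ins for source1 and source2 (hypothetical data).
s1 <- data.frame(name = c("holiday inn", NA, "zgf"),
                 address = c("400 lafayette pl tupelo ms",
                             "227 geico plaza chevy chase md",
                             "811 quincy st washington dc"),
                 stringsAsFactors = FALSE)
s2 <- data.frame(name = c("mamas bbq", "holiday inn", "zummer", "geico"),
                 address = c("248 w 4th st newark de",
                             "400 lafayette pl tupelo ms",
                             "811 quincy st washington dc",
                             "227 geico plaza chevy chase md"),
                 stringsAsFactors = FALSE)

d.name <- adist(s1$name, s2$name, ignore.case = TRUE)
d.addr <- adist(s1$address, s2$address, ignore.case = TRUE)

# Rows whose name distances are all NA fall back to the address distances.
use.addr <- rowSums(!is.na(d.name)) == 0
d <- d.name
d[use.addr, ] <- d.addr[use.addr, ]

# One best column per row, so the result has nrow(s1) rows.
s2.i <- apply(d, 1, which.min)
rows <- seq_len(nrow(s1))
result <- data.frame(s1.i = rows, s2.i = s2.i,
                     s1name = s1$name, s2name = s2$name[s2.i],
                     adist = d[cbind(rows, s2.i)],
                     matched.on = ifelse(use.addr, "address", "name"),
                     stringsAsFactors = FALSE)
result
```

Because each source1 row selects exactly one column, no nested loop over both distance matrices is needed, which is what made the original result grow.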

1 Answer:

Answer 0 (score: 1)

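Using normalized scores (between 0 and 1) would be more efficient. That way you can use the vectorized ifelse to replace only the NA entries with the corresponding address scores. With non-normalized scores, name and address distances are not on the same scale, so you have to replace the whole row. Try this approach: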

dist.mat.nm <- adist(source1$name, source2$name, partial = TRUE, ignore.case = TRUE)
dist.mat.ad <- adist(source1$address, source2$address, partial = TRUE, ignore.case = TRUE)

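# Option A: non-normalized distances. Replace a whole row of name
# distances with the address distances whenever the row contains an NA.
dist.mat <- dist.mat.nm
for (i in seq_len(nrow(dist.mat))) {
  if (anyNA(dist.mat[i, ])) dist.mat[i, ] <- dist.mat.ad[i, ]
}

# Option B: normalized distances. Replace only the NA cells.
# (This overwrites the result of option A; run one or the other.)
dist.mat <- ifelse(is.na(dist.mat.nm), dist.mat.ad, dist.mat.nm)

# Return the source2 name with the smallest distance (the first one on ties).
which.match <- function(x, nm) nm[which.min(x)]

# One match per source1 row; as.character() guards against factor columns.
matches <- apply(dist.mat, 1, which.match, nm = as.character(source2$name))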
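The code above still calls adist, which returns raw edit counts rather than scores between 0 and 1. One possible way to normalize, sketched below, is to divide each distance by the length of the longer string. The helper norm.adist and the toy vectors are hypothetical and not part of the answer, and this is only one of several reasonable normalizations (it also drops partial = TRUE and does not handle empty strings).

```r
# Hypothetical helper: Levenshtein distance scaled to [0, 1]
# by the length of the longer of the two strings.
norm.adist <- function(x, y) {
  d <- adist(x, y, ignore.case = TRUE)
  len <- outer(nchar(x), nchar(y), pmax)
  d / len
}

# Toy data; the NA name forces the address fallback for row 2.
nm1 <- c("holiday inn", NA)
nm2 <- c("holiday inn", "geico")
ad1 <- c("400 lafayette pl tupelo ms", "227 geico plaza chevy chase md")
ad2 <- c("400 lafayette pl tupelo ms", "227 geico plaza chevy chase md")

n.nm <- norm.adist(nm1, nm2)
n.ad <- norm.adist(ad1, ad2)

# Cell-wise replacement is meaningful here because both matrices share a scale.
n.mat <- ifelse(is.na(n.nm), n.ad, n.nm)
n.mat
```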

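This should improve performance and solve your problem. If you are willing to switch to a normalized distance (instead of Levenshtein), I would recommend Jaro-Winkler.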
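The answer does not name a package for Jaro-Winkler. As an assumption, the sketch below uses the stringdist package, whose stringdistmatrix function accepts method = "jw"; setting p above 0 (0.1 is a common choice) turns the Jaro distance into Jaro-Winkler. The vectors a and b are toy data, and the block does nothing if stringdist is not installed.

```r
if (requireNamespace("stringdist", quietly = TRUE)) {
  a <- c("holiday inn", "zgf")
  b <- c("holiday inn", "zummer gunsul frasca", "geico")

  # Jaro-Winkler distances in [0, 1]; rows follow a, columns follow b.
  jw <- stringdist::stringdistmatrix(a, b, method = "jw", p = 0.1)

  # Closest element of b for each element of a.
  best <- apply(jw, 1, which.min)
  print(data.frame(a = a, match = b[best],
                   jw = jw[cbind(seq_along(a), best)]))
}
```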