Comparing string vectors and quantifying the difference

Asked: 2015-06-26 08:30:05

Tags: regex r string

The idea is to compare two vectors of strings, e.g.:

select
img1.image_blob as blobA,
img2.image_blob as blobB,
img3.image_blob as blobC
from price_history ph
left join imagedata img1 on img1.id = ph.image_name_homePage
left join imagedata img2 on img2.id = ph.image_name_loginPage
left join imagedata img3 on img3.id = ph.image_name_footer

and to come up with a way to give them some measure of precision. Basically, add a column c holding a numeric value.

My line of thinking:

We have:

df <- data.frame(a = c("New York 001", "Orlando 002", "Boston 003", "Chicago 004", "Atlanta 005"),
                 b = c("NEW YORK  001", "Orlando", "Boston (003)", "Chicago 005", "005 Atlanta"))

First things first: strip the whitespace, ignore case, and remove all the special characters. For reference, here is the frame we start from:

> df
             a             b
1 New York 001 NEW YORK  001
2  Orlando 002       Orlando
3   Boston 003  Boston (003)
4  Chicago 004   Chicago 005
5  Atlanta 005   005 Atlanta

And the cleanup itself:

df$a <- gsub("[[:space:]]|[[:punct:]]", "", toupper(df$a))
df$b <- gsub("[[:space:]]|[[:punct:]]", "", toupper(df$b))
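
After the cleanup the frame should look like this (these are also the strings the answers below work with):

> df
           a          b
1 NEWYORK001 NEWYORK001
2 ORLANDO002    ORLANDO
3  BOSTON003  BOSTON003
4 CHICAGO004 CHICAGO005
5 ATLANTA005 005ATLANTA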

So now we're at the heart of the problem.

The first row would be a 100% match. The second has 7 matching characters out of at most 10, so 70%. The third is now a 100% match as well. The fourth is a 90% match. The fifth is the tricky one: the human mind tells me they match and only the order is off, but that is not how a computer works. In practice it could be measured as a 70% match, since a run of 7 consecutive characters repeats in both strings.
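
To make that intuition concrete, here is a minimal sketch (my addition, run on the cleaned frame) of the naive position-by-position measure; it reproduces rows 1-4 but gives row 5 only 0.2, which is exactly why something smarter is needed:

# Share of positions at which the two strings agree, relative to the
# longer string (assumes df has already been cleaned as above)
positional_match <- function(x, y) {
  cx <- strsplit(x, "")[[1]]
  cy <- strsplit(y, "")[[1]]
  n <- min(length(cx), length(cy))
  sum(cx[seq_len(n)] == cy[seq_len(n)]) / max(length(cx), length(cy))
}
mapply(positional_match, df$a, df$b)
# 1.0 0.7 1.0 0.9 0.2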

So the question is:

How do I get a quantitative measure of a string comparison?

Maybe there is a better way to do this altogether; I have no experience comparing sets of partially matching strings, and this particular quantifiable measure is just my intuitive way of approaching it. I would not be surprised at all if R already has a library/function that does all of this in a better way.

3 Answers:

Answer 0 (score: 7)

A more correct answer with Rcpp:

library(Rcpp)

cppFunction('NumericVector commonChars(CharacterVector x, CharacterVector y) {
  int len = x.size();
  NumericVector out(len);
  double percentage;

  int count = 0, k = 0;
  std::string compared;
  std::string source;

  for (int i = 0; i < len; ++i) {
    source = x[i];
    compared = y[i];
    count = 0;
    k = 0;

    for (int j = 0; j < (int) compared.length(); j++) {
      // Stop early once source is exhausted: source[j] is read below.
      if (j >= (int) source.length()) break;

      // Same character at the same position: count it and move on.
      if (source[j] == compared[j]) { count++; continue; }

      // Otherwise scan forward through compared for this character of
      // source; k never resets, so matches must appear in order.
      while (k < (int) compared.length()) {
        if (source[j] == compared[k]) { count++; break; }
        k++;
      }
    }
    percentage = (count + 0.0) / (source.length() + 0.0);
    out[i] = percentage;
  }
  return out;
}')

Giving:

> commonChars(df$a,df$b)
[1] 1.0 0.7 1.0 0.9 0.7

I didn't benchmark it against the other answers, nor on a large data frame.
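
For anyone who wants to, here is a minimal benchmarking sketch (my addition; it assumes the microbenchmark package and uses the adist()-based similarity from the answer below):

library(microbenchmark)
microbenchmark(
  rcpp  = commonChars(df$a, df$b),
  adist = 1 - mapply(function(a, b) adist(a, b), df$a, df$b) /
              pmax(nchar(df$a), nchar(df$b)),
  times = 100
)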


Not really what you're wishing for, but here's an idea (I'll try to improve it):

library(stringr)  # str_length() below comes from stringr

# Turn each string of a into a regex of optional single characters,
# e.g. "NEWYORK001" -> "(N)?(E)?(W)?...(1)?"
df$r <- gsub("(\\w)", "(\\1)?", df$a)

for (i in 1:length(df$a)) {
   df$percentage[i] <- ( as.integer( 
                           attr( 
                             regexpr( df$r[i], df$b[i]), 
                             "match.length" 
                           ) 
                       ) / str_length(df$a[i]) * 100) 
}

Output:

           a          b                                        r percentage
1 NEWYORK001 NEWYORK001 (N)?(E)?(W)?(Y)?(O)?(R)?(K)?(0)?(0)?(1)?        100
2 ORLANDO002    ORLANDO (O)?(R)?(L)?(A)?(N)?(D)?(O)?(0)?(0)?(2)?         70
3  BOSTON003  BOSTON003     (B)?(O)?(S)?(T)?(O)?(N)?(0)?(0)?(3)?        100
4 CHICAGO004 CHICAGO005 (C)?(H)?(I)?(C)?(A)?(G)?(O)?(0)?(0)?(4)?         90
5 ATLANTA005 005ATLANTA (A)?(T)?(L)?(A)?(N)?(T)?(A)?(0)?(0)?(5)?         30

Drawbacks:

  • There's a for loop
  • ATLANTA005 returns 30% because the regex matches characters in order only, so just the leading 005 of 005ATLANTA gets counted.

I'll see if I can find a way to build a better regexp.
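
In the meantime, the first drawback can be addressed; here is a sketch (my addition, same logic, just vectorized with mapply()):

df$percentage <- mapply(function(r, b, a) {
  as.integer(attr(regexpr(r, b), "match.length")) / str_length(a) * 100
}, df$r, df$b, df$a)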

Answer 1 (score: 6)

I've arrived at a fairly easy answer to my own question. And it is the Levenshtein distance, or adist() in R.

Long story short:

df$c <- 1 - diag(adist(df$a, df$b, fixed = F)) / apply(cbind(nchar(df$a), nchar(df$b)), 1, max)

This does the trick.

> df
           a          b   c
1 NEWYORK001 NEWYORK001 1.0
2 ORLANDO002    ORLANDO 0.7
3  BOSTON003  BOSTON003 1.0
4 CHICAGO004 CHICAGO005 0.9
5 ATLANTA005 005ATLANTA 0.7
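
To spell the normalization out on row 2 (a worked example, my addition):

adist("ORLANDO002", "ORLANDO")  # 3: the three trailing characters are deleted
1 - 3 / max(nchar("ORLANDO002"), nchar("ORLANDO"))  # 1 - 3/10 = 0.7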

Update:

Running the function on one of my data sets returned a cute result (one that made my inner nerd chuckle a bit):

Error: cannot allocate vector of size 1650.7 Gb

So I guess it's another apply() loop for adist(): building the full distance matrix just to take its diagonal is... well, fairly inefficient.

df$c <- 1 - apply(cbind(df$a, df$b), 1, function(x) adist(x[1], x[2], fixed = F)) /
            apply(cbind(nchar(df$a), nchar(df$b)), 1, max)

This modification yields very satisfying results.

Answer 2 (score: 4)

Using the stringdist package to compute the Damerau-Levenshtein distance:

#data
df <- read.table(text="
a          b
1 NEWYORK001 NEWYORK001
2 ORLANDO002    ORLANDO
3  BOSTON003  BOSTON003
4 CHICAGO004 CHICAGO005
5 ATLANTA005 005ATLANTA",stringsAsFactors = FALSE)


library(stringdist)
cbind(df, levenshteinDist = stringsim(df$a, df$b))
#            a          b levenshteinDist
# 1 NEWYORK001 NEWYORK001             1.0
# 2 ORLANDO002    ORLANDO             0.7
# 3  BOSTON003  BOSTON003             1.0
# 4 CHICAGO004 CHICAGO005             0.9
# 5 ATLANTA005 005ATLANTA             0.4
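
Note that row 5 scores 0.4 here versus 0.7 from adist(..., fixed = F) above. With the default method the similarity is 1 minus the edit distance over the longer string, which works out as (my arithmetic, assuming that normalization):

stringdist("ATLANTA005", "005ATLANTA")  # 6: insert "005" up front, delete it at the end
1 - 6 / 10                              # 0.4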

Edit:
There are many algorithms that quantify string similarity; you have to test them on your data and pick a suitable one. Here is code to try them all:

#let's try all methods! 
do.call(rbind,
        lapply(c("osa", "lv", "dl", "hamming", "lcs", "qgram",
                 "cosine", "jaccard", "jw", "soundex"),
               function(i)
                   cbind(df, Method=i, Dist=stringsim(df$a, df$b,method = i))
               ))
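
If the block above is assigned to a variable, say res (my addition, a hypothetical name), a cross-tabulation lines the methods up side by side:

# res <- do.call(rbind, lapply(...))   # the pipeline above, assigned
xtabs(Dist ~ a + Method, data = res)   # rows by method, one similarity per cell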