The idea is to compare two string vectors, e.g.:
select
  img1.image_blob as blobA,
  img2.image_blob as blobB,
  img3.image_blob as blobC
from price_history ph
left join imagedata img1 on img1.id = ph.image_name_homePage
left join imagedata img2 on img2.id = ph.image_name_loginPage
left join imagedata img3 on img3.id = ph.image_name_footer
and to come up with a way to give the pairs some measure of accuracy; basically, to add a column c holding a numeric value.
My line of thinking:
We have:
df <- data.frame(a = c("New York 001", "Orlando 002", "Boston 003", "Chicago 004", "Atlanta 005"),
b = c("NEW YORK 001", "Orlando", "Boston (003)", "Chicago 005", "005 Atlanta"))
First things first: strip whitespace, ignore case, and remove all special characters, and there we are. The data starts out as:
> df
             a            b
1 New York 001 NEW YORK 001
2  Orlando 002      Orlando
3   Boston 003 Boston (003)
4  Chicago 004  Chicago 005
5  Atlanta 005  005 Atlanta
And the cleanup:
df$a <- gsub("[[:space:]]|[[:punct:]]", "", toupper(df$a))
df$b <- gsub("[[:space:]]|[[:punct:]]", "", toupper(df$b))
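After the cleanup the frame looks like this (reconstructed output; the same cleaned values appear in the answers' tables below):

```r
> df
           a          b
1 NEWYORK001 NEWYORK001
2 ORLANDO002    ORLANDO
3  BOSTON003  BOSTON003
4 CHICAGO004 CHICAGO005
5 ATLANTA005 005ATLANTA
```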
So now we're at the heart of the problem.
The first row would be a 100% match. The second row has 7 matching characters out of at most 10, so 70%. The third is now a 100% match. The fourth is a 90% match. The fifth one is tricky: the human mind says they match, but the order is a problem. That's not how a computer works, though. It could actually be scored as a 70% match, since a run of 7 consecutive characters occurs in both strings.
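For what it's worth, the naive positional version of this metric can be sketched in a few lines of base R (my own sketch of the logic above, not code from the question; the helper name naive_match is made up):

```r
# Count positions where the characters agree, divided by the longer
# string's length. Purely positional, so it breaks on reordered strings.
naive_match <- function(a, b) {
  ca <- strsplit(a, "")[[1]]
  cb <- strsplit(b, "")[[1]]
  n  <- min(length(ca), length(cb))
  sum(ca[seq_len(n)] == cb[seq_len(n)]) / max(length(ca), length(cb))
}
naive_match("CHICAGO004", "CHICAGO005")  # 0.9
naive_match("ATLANTA005", "005ATLANTA")  # 0.2 -- order breaks it, as noted
```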
So the question is:
How can string comparison be measured quantitatively?
Maybe there is a better way to do this altogether, since I have never had to compare sets of partially matching strings, and this particular quantifiable measure is just my intuitive way of going about it. I wouldn't be surprised at all if R already has a library/function that does all of this in a better way.
Answer 0 (score: 7)
A more correct answer with Rcpp:
library(Rcpp)
cppFunction('NumericVector commonChars(CharacterVector x, CharacterVector y) {
  int len = x.size();
  NumericVector out(len);
  double percentage;
  int count = 0, k = 0;
  std::string compared;
  std::string source;

  for (int i = 0; i < len; ++i) {
    source = x[i];
    compared = y[i];
    count = 0;
    k = 0;
    // characters matching in the same position count directly; otherwise
    // scan ahead in compared for source[j], with k persisting across j
    for (int j = 0; j < compared.length(); j++) {
      if (source[j] == compared[j]) { count++; continue; }
      while (k < source.length()) {
        if (source[j] == compared[k]) { count++; break; }
        k++;
      }
    }
    percentage = (count + 0.0) / (source.length() + 0.0);
    out[i] = percentage;
  }
  return out;
}')
Giving:
> commonChars(df$a,df$b)
[1] 1.0 0.7 1.0 0.9 0.7
I didn't benchmark it against the other answers or on a large data frame.
Not really what you're after, but here's an idea (I'll try to improve it):
library(stringr)
df$r <- gsub("(\\w)", "(\\1)?", df$a)
for (i in 1:length(df$a)) {
  df$percentage[i] <- ( as.integer(
                          attr(
                            regexpr(df$r[i], df$b[i]),
                            "match.length"
                          )
                        ) / str_length(df$a[i]) * 100)
}
Output:
           a          b                                        r percentage
1 NEWYORK001 NEWYORK001 (N)?(E)?(W)?(Y)?(O)?(R)?(K)?(0)?(0)?(1)?        100
2 ORLANDO002    ORLANDO (O)?(R)?(L)?(A)?(N)?(D)?(O)?(0)?(0)?(2)?         70
3  BOSTON003  BOSTON003     (B)?(O)?(S)?(T)?(O)?(N)?(0)?(0)?(3)?        100
4 CHICAGO004 CHICAGO005 (C)?(H)?(I)?(C)?(A)?(G)?(O)?(0)?(0)?(4)?         90
5 ATLANTA005 005ATLANTA (A)?(T)?(L)?(A)?(N)?(T)?(A)?(0)?(0)?(5)?         30
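To illustrate what the generated pattern does (a worked example of my own, not part of the answer): each word character becomes an optional group, and regexpr reports how many characters the greedy match consumed.

```r
r <- gsub("(\\w)", "(\\1)?", "CHICAGO004")
r  # "(C)?(H)?(I)?(C)?(A)?(G)?(O)?(0)?(0)?(4)?"
# Against "CHICAGO005" the first nine groups match and (4)? matches empty:
attr(regexpr(r, "CHICAGO005"), "match.length")  # 9, hence 9/10 = 90%
```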
Drawbacks: ATLANTA005 returns 30% because the 005 only matches in order. I'll see if I can find a way to build a better regexp.
Answer 1 (score: 6)
I've arrived at a fairly easy answer to my own question, and it is the Levenshtein distance, or adist() in R.
Long story short:
df$c <- 1 - diag(adist(df$a, df$b, fixed = F)) / apply(cbind(nchar(df$a), nchar(df$b)), 1, max)
This does the trick.
> df
           a          b   c
1 NEWYORK001 NEWYORK001 1.0
2 ORLANDO002    ORLANDO 0.7
3  BOSTON003  BOSTON003 1.0
4 CHICAGO004 CHICAGO005 0.9
5 ATLANTA005 005ATLANTA 0.7
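To make the normalization concrete (a worked example of my own): for ORLANDO002 vs ORLANDO the edit distance is 3 (delete "002"), the longer string has 10 characters, so the similarity is 1 - 3/10 = 0.7.

```r
d <- adist("ORLANDO002", "ORLANDO")[1, 1]           # 3 edits: drop "002"
1 - d / max(nchar("ORLANDO002"), nchar("ORLANDO"))  # 0.7
```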
Update:
Running the function on one of my data sets returns a cute result (that made my inner nerd chuckle a bit):
Error: cannot allocate vector of size 1650.7 Gb
So, I guess it needs another apply() loop for adist(); taking the diagonal of the whole matrix is, well, fairly inefficient.
df$c <- 1 - apply(cbind(df$a, df$b),1, function(x) adist(x[1], x[2], fixed = F)) / apply(cbind(nchar(df$a), nchar(df$b)), 1, max)
This modification yields very satisfying results.
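An equivalent, slightly tidier formulation (my own variant, not from the original answer) uses mapply for the element-wise distances and pmax for the lengths:

```r
# Pairwise adist calls, so memory stays proportional to the row count
df$c <- 1 - mapply(function(x, y) adist(x, y, fixed = FALSE), df$a, df$b) /
            pmax(nchar(df$a), nchar(df$b))
```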
Answer 2 (score: 4)
Using the stringdist package and the Damerau-Levenshtein distance:
#data
df <- read.table(text="
a b
1 NEWYORK001 NEWYORK001
2 ORLANDO002 ORLANDO
3 BOSTON003 BOSTON003
4 CHICAGO004 CHICAGO005
5 ATLANTA005 005ATLANTA",stringsAsFactors = FALSE)
library(stringdist)
cbind(df, levenshteinDist = stringsim(df$a, df$b))
#            a          b levenshteinDist
# 1 NEWYORK001 NEWYORK001             1.0
# 2 ORLANDO002    ORLANDO             0.7
# 3  BOSTON003  BOSTON003             1.0
# 4 CHICAGO004 CHICAGO005             0.9
# 5 ATLANTA005 005ATLANTA             0.4
EDIT
There are many algorithms that quantify string similarity; you need to test them on your data and pick a suitable one. Here is code to test them all:
# let's try all methods!
do.call(rbind,
        lapply(c("osa", "lv", "dl", "hamming", "lcs", "qgram",
                 "cosine", "jaccard", "jw", "soundex"),
               function(i)
                 cbind(df, Method = i, Dist = stringsim(df$a, df$b, method = i))
        ))