SQL模糊连接 - MSSQL

时间:2016-08-31 13:35:42

标签: sql tsql fuzzy-search fuzzy-logic fuzzy-comparison

我有两组数据。现有客户和潜在客户。

我的主要目标是弄清楚任何潜在客户是否已经是现有客户。但是,跨数据集的客户的命名约定是不一致的。

现有客户

Customer /  ID
Ed's Barbershop /   1002
GroceryTown /   1003
Candy Place /   1004
Handy Man / 1005

可能的客户

Customer
Eds Barbershop
Grocery Town
Candy Place
Handee Man
Beauty Salon
The Apple Farm
Igloo Ice Cream
Ride-a-Long Bikes

我想写下面的某种类型的选择语句来实现我的目标:

SELECT a.Customer, b.ID
FROM PotentialCustomers a LEFT JOIN
     ExistingCustomers B
     ON a.Customer = b.Customer

结果看起来像:

Customer /  ID
Eds Barbershop  / 1002
Grocery Town    / 1003
Candy Place / 1004
Handee Man  / 1005
Beauty Salon /  NULL
The Apple Farm /    NULL
Igloo Ice Cream / NULL
Ride-a-Long Bikes / NULL

我对Levenshtein Distance和Double Metaphone的概念非常熟悉,但我不知道如何在这里应用它。

理想情况下,我希望SELECT语句的JOIN部分读取类似:LEFT JOIN ExistingCustomers as B WHERE a.Customer LIKE b.Customer但我知道语法不正确。

欢迎任何建议。谢谢!

4 个答案:

答案 0 :(得分:3)

尝试在SQL中执行此操作将是一个持续的挑战,并且您不可能获胜。您可以通过删除非az或0-9个字符或尝试SoundexMetaphone匹配或Levenshtein Distance之类的内容来远行,但总会有另一个在你所有的替换,野性梳理,拼音或平淡的捏造中你没有获得的边缘情况。

如果您确实找到了能够为您准确工作的内容,那么您将遇到性能问题。

简而言之,您最好的希望是沿着SQLCLR路线走下去,在路上学习很多C#或者根本不打算干什么,只需在源头清理数据或创建一个' clean&#查找表39;随着新变种的出现,需要不断维护的名称。

答案 1 :(得分:3)

Here is how this could be done using Levenshtein Distance:

Create this function:(Execute this first)

CREATE FUNCTION ufn_levenshtein(@s1 nvarchar(3999), @s2 nvarchar(3999))
RETURNS int
AS
BEGIN
 DECLARE @s1_len int, @s2_len int
 DECLARE @i int, @j int, @s1_char nchar, @c int, @c_temp int
 DECLARE @cv0 varbinary(8000), @cv1 varbinary(8000)

 SELECT
  @s1_len = LEN(@s1),
  @s2_len = LEN(@s2),
  @cv1 = 0x0000,
  @j = 1, @i = 1, @c = 0

 WHILE @j <= @s2_len
  SELECT @cv1 = @cv1 + CAST(@j AS binary(2)), @j = @j + 1

 WHILE @i <= @s1_len
 BEGIN
  SELECT
   @s1_char = SUBSTRING(@s1, @i, 1),
   @c = @i,
   @cv0 = CAST(@i AS binary(2)),
   @j = 1

  WHILE @j <= @s2_len
  BEGIN
   SET @c = @c + 1
   SET @c_temp = CAST(SUBSTRING(@cv1, @j+@j-1, 2) AS int) +
    CASE WHEN @s1_char = SUBSTRING(@s2, @j, 1) THEN 0 ELSE 1 END
   IF @c > @c_temp SET @c = @c_temp
   SET @c_temp = CAST(SUBSTRING(@cv1, @j+@j+1, 2) AS int)+1
   IF @c > @c_temp SET @c = @c_temp
   SELECT @cv0 = @cv0 + CAST(@c AS binary(2)), @j = @j + 1
 END

 SELECT @cv1 = @cv0, @i = @i + 1
 END

 RETURN @c
END

(Function developped by Joseph Gama)

And then simply use this query to get matches

SELECT A.Customer,
       b.ID,
       b.Customer
FROM #POTENTIALCUSTOMERS a
     LEFT JOIN #ExistingCustomers b ON dbo.ufn_levenshtein(REPLACE(A.Customer, ' ', ''), REPLACE(B.Customer, ' ', '')) < 5;

Complete Script after you create that function:

IF OBJECT_ID('tempdb..#ExistingCustomers') IS NOT NULL
    DROP TABLE #ExistingCustomers;

CREATE TABLE #ExistingCustomers
(Customer VARCHAR(255),
 ID       INT
);

INSERT INTO #ExistingCustomers
VALUES
('Ed''s Barbershop',
 1002
);

INSERT INTO #ExistingCustomers
VALUES
('GroceryTown',
 1003
);

INSERT INTO #ExistingCustomers
VALUES
('Candy Place',
 1004
);

INSERT INTO #ExistingCustomers
VALUES
('Handy Man',
 1005
);

IF OBJECT_ID('tempdb..#POTENTIALCUSTOMERS') IS NOT NULL
    DROP TABLE #POTENTIALCUSTOMERS;

CREATE TABLE #POTENTIALCUSTOMERS(Customer VARCHAR(255));

INSERT INTO #POTENTIALCUSTOMERS
VALUES('Eds Barbershop');

INSERT INTO #POTENTIALCUSTOMERS
VALUES('Grocery Town');

INSERT INTO #POTENTIALCUSTOMERS
VALUES('Candy Place');

INSERT INTO #POTENTIALCUSTOMERS
VALUES('Handee Man');

INSERT INTO #POTENTIALCUSTOMERS
VALUES('Beauty Salon');

INSERT INTO #POTENTIALCUSTOMERS
VALUES('The Apple Farm');

INSERT INTO #POTENTIALCUSTOMERS
VALUES('Igloo Ice Cream');

INSERT INTO #POTENTIALCUSTOMERS
VALUES('Ride-a-Long Bikes');

SELECT A.Customer,
       b.ID,
       b.Customer
FROM #POTENTIALCUSTOMERS a
     LEFT JOIN #ExistingCustomers b ON dbo.ufn_levenshtein(REPLACE(A.Customer, ' ', ''), REPLACE(B.Customer, ' ', '')) < 5;

Here you can find a T-SQL example at http://www.kodyaz.com/articles/fuzzy-string-matching-using-levenshtein-distance-sql-server.aspx

答案 2 :(得分:2)

一种方法是在比较列的两侧使用REPLACE函数的帮助。

SELECT a.Customer, b.ID
FROM PotentialCustomers a 
  LEFT JOIN ExistingCustomers B
     ON (LTRIM(RTRIM(REPLACE(REPLACE(REPLACE(a.Customer,' ',''),'-',''),'''',''))) = LTRIM(RTRIM(REPLACE(REPLACE(REPLACE(b.Customer,' ',''),'-',''),'''','')))) 
        OR (a.Customer LIKE '%'+b.Customer+'%') 
        OR (b.Customer LIKE '%'+a.Customer+'%') 

答案 3 :(得分:0)

您需要多于1个字段才能完成此任务。你有城市,州,邮编,地址等的东西吗?然后,您可以创建一个多部分键,并将这些字段连接起来。您可能希望将某些字符截断为前5个字符或其他内容,但您获得的误报越多,您获得的误报就越多。

我已经完成了这项工作并创建了几个密钥,每个密钥的限制性较小。然后匹配尝试每个键并在找到匹配项时指定匹配等级。