Comparing names using Levenshtein distance

Time: 2018-04-24 21:02:09

Tags: sql-server compare levenshtein-distance soundex

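In my application I need to identify a person by searching for their last name and first name. One requirement is that misspellings are tolerated to a certain degree.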

My attempts to identify a person given a first name and last name are:

  1. a SQL query using SOUNDEX
  2. a SQL query using the Levenshtein distance (LD), computed with this LD function

    The screenshot contains some test records and the results of my SQL query, including the SOUNDEX value of each column and the LD.

    compare records using levenshtein distance

    My current query looks like this:

    -- Pre-filter the candidates by SOUNDEX, then compute the Levenshtein distance
    -- between each remaining full name and the search name (upper bound 8).
    SELECT t2.*
         , t1.Firstname + ' ' + t1.Lastname AS SourceName
         , 'Torsten Mueller' AS TargetName
         , dbo.FUNC_LEVENSHTEIN(t1.Firstname + ' ' + t1.Lastname
                              , 'Torsten Mueller', 8) AS LEVENSHTEIN_Distance
    FROM #TestSoundex t1
    LEFT JOIN #TestSoundex t2 ON t1.Id = t2.Id
    WHERE t1.Soundex_Firstname = SOUNDEX('Torsten')
      AND t1.Soundex_Lastname = SOUNDEX('Mueller')
    

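    As you can see, I first filter the results by SOUNDEX and then compute the Levenshtein distance for the remaining records. In the sample below, the Levenshtein distance ranges from 0 (both strings are equal) to 3.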

    SourceName       | TargetName      | Levenshtein Distance 
    Thorsten Müller  | Torsten Mueller |  3 
    Torsten Müller   | Torsten Mueller |  2
    Thorsten Mueller | Torsten Mueller |  1
    Torsten Mueller  | Torsten Mueller |  0
    

    This talk from a Stanford professor explains how the distance is calculated:

    I N T E * N TION 
    | | | | | | | 
    * E X E C U TION
    d s s   i s
    

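    Every deletion d and every insertion i adds 1 point, and every substitution s adds 2 points. With the LD function I use, the example above returns 5 points, but the distance between Thorsten Müller and Torsten Mueller comes back as only 3 points instead of 4. As far as I can tell, the 3 points come from: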

    +1 point to delete h, 
    +1 point instead of 2 to substitute ü 
    +1 point to insert e
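The scoring above can be checked with a small script. The following is a minimal sketch in Python rather than T-SQL, so that it runs standalone; the function name `levenshtein` and the parameter `sub_cost` are assumptions made for illustration and are not part of the LD function used in this question. It compares both scoring schemes on the two examples, treating `ü` as a single character that differs from `u`.

```python
def levenshtein(s: str, t: str, sub_cost: int = 1) -> int:
    """Edit distance with insert/delete cost 1 and a configurable substitution cost."""
    prev = list(range(len(t) + 1))  # distances from "" to each prefix of t
    for i, sc in enumerate(s, start=1):
        cur = [i]  # distance from s[:i] to ""
        for j, tc in enumerate(t, start=1):
            cur.append(min(
                prev[j] + 1,                                  # delete sc
                cur[j - 1] + 1,                               # insert tc
                prev[j - 1] + (0 if sc == tc else sub_cost),  # match or substitute
            ))
        prev = cur
    return prev[-1]


for s, t in [("intention", "execution"), ("Thorsten Müller", "Torsten Mueller")]:
    # First number: substitution costs 1; second number: substitution costs 2.
    print(s, "|", t, "|", levenshtein(s, t, 1), levenshtein(s, t, 2))
# intention | execution | 5 8
# Thorsten Müller | Torsten Mueller | 3 4
```

If this sketch is right, the values 5 and 3 are what a unit-cost substitution produces, while the scoring from the talk (substitution costs 2) would give 8 and 4. That would suggest the umlaut is not treated specially; every substitution is simply counted as 1.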
    

    So I added some more samples:

    Samples with Umlaut

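    Question

    My impression is that neither SOUNDEX nor LD is sufficient to uniquely identify a person's record for a given firstname and lastname while also allowing for possible misspellings.

    • Can you explain how this LD function handles the umlauts ü, ö, ä, so that I can better understand the calculation?
    • What maximum distance would you suggest for accepting two given strings s and t as a correct match of first name and last name? Should it be based on the length of both strings, for example numberOrCharacters(s+t)/2 = max, as in edit_distance_within?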

    Source code

    This is the function from the linked answer that I am using. I only changed the function name from edit_distance_within to FUNC_LEVENSHTEIN.

    SET QUOTED_IDENTIFIER ON
    GO
    SET ANSI_NULLS ON
    GO
    -- Returns the Levenshtein distance between @s and @t, where every insertion,
    -- deletion and substitution costs 1, or -1 if the distance exceeds @d.
    CREATE FUNCTION FUNC_LEVENSHTEIN(@s nvarchar(4000), @t nvarchar(4000), @d int)
    RETURNS int
    AS
    BEGIN
      DECLARE @sl int, @tl int, @i int, @j int, @sc nchar, @c int, @c1 int,
              @cv0 nvarchar(4000), @cv1 nvarchar(4000), @cmin int
      SELECT @sl = LEN(@s), @tl = LEN(@t), @cv1 = '', @j = 1, @i = 1, @c = 0
      -- @cv1 holds the previous row of the distance matrix, one value per character
      WHILE @j <= @tl
        SELECT @cv1 = @cv1 + NCHAR(@j), @j = @j + 1
      WHILE @i <= @sl
      BEGIN
        SELECT @sc = SUBSTRING(@s, @i, 1), @c1 = @i, @c = @i, @cv0 = '', @j = 1, @cmin = 4000
        WHILE @j <= @tl
        BEGIN
          SET @c = @c + 1                                   -- insertion
          SET @c1 = @c1 - CASE WHEN @sc = SUBSTRING(@t, @j, 1) THEN 1 ELSE 0 END  -- match or substitution
          IF @c > @c1 SET @c = @c1
          SET @c1 = UNICODE(SUBSTRING(@cv1, @j, 1)) + 1     -- deletion
          IF @c > @c1 SET @c = @c1
          IF @c < @cmin SET @cmin = @c
          SELECT @cv0 = @cv0 + NCHAR(@c), @j = @j + 1
        END
        IF @cmin > @d BREAK                                 -- the bound can no longer be met
        SELECT @cv1 = @cv0, @i = @i + 1
      END
      RETURN CASE WHEN @cmin <= @d AND @c <= @d THEN @c ELSE -1 END
    END
    GO
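To make the T-SQL function easier to follow step by step, here is a rough line-by-line port to Python. It is only a sketch under stated assumptions: the name `func_levenshtein` and the list-based rows are choices of this port (the original packs each row into an nvarchar via NCHAR/UNICODE), and SQL Server compares characters according to the collation in effect and `LEN` ignores trailing spaces, neither of which plain Python models. In the port, `ü` is one character that is not equal to `u`, so it is counted as a single substitution with cost 1.

```python
def func_levenshtein(s: str, t: str, d: int) -> int:
    """Hypothetical Python port: unit-cost Levenshtein distance, or -1 if it exceeds d."""
    prev = list(range(1, len(t) + 1))  # @cv1: matrix row for the empty prefix of s
    c = 0
    cmin = None                        # @cmin stays NULL if the outer loop never runs
    for i, sc in enumerate(s, start=1):
        c1 = i        # diagonal value plus 1
        c = i         # value to the left (column 0 of this row)
        cmin = 4000
        cur = []      # @cv0: the row being built
        for j, tc in enumerate(t):
            c += 1                      # insertion: left + 1
            c1 -= 1 if sc == tc else 0  # diagonal: +0 on a match, +1 on a substitution
            c = min(c, c1)
            c1 = prev[j] + 1            # deletion: above + 1
            c = min(c, c1)
            cmin = min(cmin, c)
            cur.append(c)
        if cmin > d:                    # every cell in this row already exceeds the bound
            break
        prev = cur
    if cmin is not None and cmin <= d and c <= d:
        return c
    return -1


print(func_levenshtein("Thorsten Müller", "Torsten Mueller", 8))  # 3
print(func_levenshtein("Intention", "Execution", 8))              # 5
print(func_levenshtein("Thorsten Müller", "Torsten Mueller", 2))  # -1
```

The third call shows the role of the bound: once the distance cannot stay within `d`, the function reports -1 instead of the actual distance.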

    CREATE TABLE #TestLevenshteinDistance(
        Id int  IDENTITY(1,1) NOT NULL,
        SourceName nvarchar(100) NULL,  
        Soundex_SourceName varchar(4) NULL,    
        Targetname nvarchar(100) NULL, 
        Soundex_TargetName varchar(4) NULL
        );
    
    INSERT INTO #TestLevenshteinDistance 
        (    SourceName,          
             Soundex_SourceName,
             Targetname,
             Soundex_TargetName) 
    VALUES 
       ('Intention',SOUNDEX('Intention'), 'Execution', SOUNDEX('Execution')),    
       ('Karsten' , SOUNDEX('Karsten'), 'Torsten', SOUNDEX('Torsten')); 
    
    
    SELECT t1.*
            , dbo.FUNC_LEVENSHTEIN(t1.SourceName, t1.Targetname, 8) as LEVENSHTEIN_Distance
            FROM #TestLevenshteinDistance t1
    

    Source used to test the function above


0 answers:

No answers