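In my application I need to identify a person by searching for their last name and first name. One requirement is to accept misspellings to a certain degree.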
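My approach to identifying a person from a given first and last name is to filter by Soundex first and then compute the Levenshtein distance (LD).
The screenshot contains some test records and the result of my SQL query, which includes the Soundex value of each column and the LD.
My current query looks like this: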
SELECT t2.*
, t1.Firstname + ' ' + t1.Lastname as SourceName
, 'Torsten Mueller' as TargetName
, dbo.FUNC_LEVENSHTEIN(t1.Firstname +' '+ t1.Lastname
, 'Torsten Mueller', 8) as LEVENSHTEIN_Distance
FROM #TestSoundex t1
LEFT JOIN #TestSoundex t2 ON t1.Id = t2.Id
WHERE t1.Soundex_Firstname = SOUNDEX('Torsten')
AND t1.Soundex_Lastname = SOUNDEX('Mueller')
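As you can see, I first filter the results via Soundex and then calculate the Levenshtein distance for the remaining records. In the sample below, the Levenshtein distance ranges from 0 (both strings are equal) to 3.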
SourceName | TargetName | Levenshtein Distance
Thorsten Müller | Torsten Mueller | 3
Torsten Müller | Torsten Mueller | 2
Thorsten Mueller | Torsten Mueller | 1
Torsten Mueller | Torsten Mueller | 0
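The Soundex prefilter described above can be sketched in Python to see which spellings end up with the same code. This is a simplified American Soundex written for illustration; it is an assumption that it agrees with SQL Server's SOUNDEX for these names, and the Karsten record is a hypothetical extra row that is not in my test data.

```python
# Letter-to-digit table of classic American Soundex.
CODES = {
    letter: digit
    for digit, letters in {
        "1": "BFPV", "2": "CGJKQSXZ", "3": "DT",
        "4": "L", "5": "MN", "6": "R",
    }.items()
    for letter in letters
}


def soundex(name: str) -> str:
    """Simplified Soundex; may differ from SQL Server's SOUNDEX in edge cases."""
    name = name.upper()
    if not name:
        return ""
    result = name[0]
    last = CODES.get(name[0], "")
    for ch in name[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with equal codes
        code = CODES.get(ch, "")  # vowels and unmapped characters (e.g. Ü) give ""
        if code and code != last:
            result += code
        last = code
        if len(result) == 4:
            break
    return result.ljust(4, "0")  # pad short codes with zeros


target = ("Torsten", "Mueller")
candidates = [
    ("Thorsten", "Müller"), ("Torsten", "Müller"),
    ("Thorsten", "Mueller"), ("Torsten", "Mueller"),
    ("Karsten", "Mueller"),  # hypothetical record, only here to show a rejection
]
# Keep only candidates whose first-name and last-name codes both match the target.
matches = [
    c for c in candidates
    if soundex(c[0]) == soundex(target[0]) and soundex(c[1]) == soundex(target[1])
]
print(matches)
```

In this sketch all four Thorsten/Torsten and Müller/Mueller variants share the codes T623 and M460, so the prefilter keeps them, while the Karsten row is dropped because its first-name code starts with K.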
This talk from a Stanford professor explains how the distance is calculated:
I N T E * N TION
| | | | | | |
* E X E C U TION
d s s   i s
Each deletion (d) or insertion (i) adds 1 point; each substitution (s) adds 2 points.
The LD function I use returns 5 points for the example above, but for Thorsten Müller and Torsten Mueller it returns a distance of only 3 instead of 4 points.
I count:
+1 point to delete h,
+1 point instead of 2 to substitute ü
+1 point to insert e
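To make the two scoring schemes easier to compare, here is a minimal Python sketch of the edit distance with a configurable substitution cost. It is not the T-SQL function from this post; the name edit_distance and the parameter sub_cost are made up for illustration, and insertions and deletions are assumed to always cost 1.

```python
def edit_distance(s: str, t: str, sub_cost: int = 1) -> int:
    """Edit distance with insert/delete cost 1 and a configurable substitution cost."""
    prev = list(range(len(t) + 1))  # distances from "" to every prefix of t
    for i, sc in enumerate(s, start=1):
        curr = [i]  # distance from s[:i] to ""
        for j, tc in enumerate(t, start=1):
            curr.append(min(
                prev[j] + 1,      # delete sc
                curr[j - 1] + 1,  # insert tc
                prev[j - 1] + (0 if sc == tc else sub_cost),  # match or substitute
            ))
        prev = curr
    return prev[-1]


pairs = [
    ("Intention", "Execution"),
    ("Thorsten Müller", "Torsten Mueller"),
    ("Karsten", "Torsten"),
]
for s, t in pairs:
    unit = edit_distance(s, t)               # every operation costs 1
    weighted = edit_distance(s, t, sub_cost=2)  # substitutions cost 2
    print(f"{s} -> {t}: unit cost = {unit}, substitution cost 2 = {weighted}")
```

With unit costs this sketch gives 5 for Intention/Execution and 3 for Thorsten Müller/Torsten Mueller; with a substitution cost of 2 it gives 8 and 4. The comparison here is case-sensitive and compares exact characters, which may differ from what a given SQL Server collation does.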
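So I added a few samples.
My impression is that neither Soundex nor LD is sufficient to uniquely identify a person record for a given firstname and lastname when possible misspellings have to be taken into account.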
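Can someone explain how substitutions involving ü, ö, ä are scored, so that I can better understand the calculation?
What would you suggest as the maximum distance for a correct match of first name and last name, given strings s and t? Should it be based on the length of both strings, for example numberOrCharacters(s+t)/2 = max?
This is the function I use, taken from the linked answer. I only changed the function name from edit_distance_within to FUNC_LEVENSHTEIN.
SET QUOTED_IDENTIFIER ON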
GO
SET ANSI_NULLS ON
GO
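-- Returns the Levenshtein distance between @s and @t, or -1 if it exceeds @d.
-- Insertions, deletions and substitutions each cost 1.
CREATE FUNCTION FUNC_LEVENSHTEIN(@s nvarchar(4000), @t nvarchar(4000), @d int)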
RETURNS int
AS
BEGIN
DECLARE @sl int, @tl int, @i int, @j int, @sc nchar, @c int, @c1 int,
@cv0 nvarchar(4000), @cv1 nvarchar(4000), @cmin int
SELECT @sl = LEN(@s), @tl = LEN(@t), @cv1 = '', @j = 1, @i = 1, @c = 0
WHILE @j <= @tl
SELECT @cv1 = @cv1 + NCHAR(@j), @j = @j + 1
WHILE @i <= @sl
BEGIN
SELECT @sc = SUBSTRING(@s, @i, 1), @c1 = @i, @c = @i, @cv0 = '', @j = 1, @cmin = 4000
WHILE @j <= @tl
BEGIN
SET @c = @c + 1
SET @c1 = @c1 - CASE WHEN @sc = SUBSTRING(@t, @j, 1) THEN 1 ELSE 0 END
IF @c > @c1 SET @c = @c1
SET @c1 = UNICODE(SUBSTRING(@cv1, @j, 1)) + 1
IF @c > @c1 SET @c = @c1
IF @c < @cmin SET @cmin = @c
SELECT @cv0 = @cv0 + NCHAR(@c), @j = @j + 1
END
IF @cmin > @d BREAK
SELECT @cv1 = @cv0, @i = @i + 1
END
RETURN CASE WHEN @cmin <= @d AND @c <= @d THEN @c ELSE -1 END
END
GO
CREATE TABLE #TestLevenshteinDistance(
Id int IDENTITY(1,1) NOT NULL,
SourceName nvarchar(100) NULL,
Soundex_SourceName varchar(4) NULL,
Targetname nvarchar(100) NULL,
Soundex_TargetName varchar(4) NULL,
);
INSERT INTO #TestLevenshteinDistance
( SourceName,
Soundex_SourceName,
Targetname,
Soundex_TargetName)
VALUES
('Intention',SOUNDEX('Intention'), 'Execution', SOUNDEX('Execution')),
('Karsten' , SOUNDEX('Karsten'), 'Torsten', SOUNDEX('Torsten'));
SELECT t1.*
, dbo.FUNC_LEVENSHTEIN(t1.SourceName, t1.Targetname, 8) as LEVENSHTEIN_Distance
FROM #TestLevenshteinDistance t1
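To put numbers on the length-based maximum I asked about, here is a small Python sketch. It assumes that numberOrCharacters(s+t)/2 means half of the combined length of both strings; the helper name max_distance is made up, and the distances are the ones from my result table above.

```python
def max_distance(s: str, t: str) -> int:
    """Length-based cap: half of the combined length of both strings (assumed reading)."""
    return (len(s) + len(t)) // 2


target = "Torsten Mueller"
# Distances as reported in the result table of this post.
observed = {
    "Thorsten Müller": 3,
    "Torsten Müller": 2,
    "Thorsten Mueller": 1,
    "Torsten Mueller": 0,
}
for source, distance in observed.items():
    cap = max_distance(source, target)
    print(f"{source}: distance = {distance}, cap = {cap}, accepted = {distance <= cap}")
```

For these names the cap comes out at 14 or 15, far above the observed distances of 0 to 3, so a cap like this would accept every one of the sample records. A value computed this way could be passed as the third argument of FUNC_LEVENSHTEIN instead of the fixed 8 used in my query.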