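Is it possible to search for similar words in SQL Server 2008?
If the user enters: Ayrton Sena
with only one 'n',
it should also return the Ayrton Senna
row, which contains two n's ('nn').
I think the same approach would apply to spell-checking words.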
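Answer 0 (score: 2)
Because "Sena" is not an inflectional form of "Senna", it is hard to solve this with a full-text index alone.
I suggest combining full-text search with a string similarity measure to decide whether two strings should be considered "equal".
So, if you search for several words and want to allow one of them to be misspelled, use something like the following: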
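select *
from myTable t
-- FREETEXTTABLE is assumed here: "FullTextTable" in the original is not a built-in function.
-- The full-text table functions return the matching row's key in a column named [KEY].
join FREETEXTTABLE(myTable, TextField, 'Ayrton Senna') f
on f.[KEY] = t.PK
where dbo.MyExternalStringSimilarity('Ayrton Senna', t.TextField) > 0.9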
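Now all you need is a string similarity function. You can use the "Similarity" functionality from Microsoft Data Quality Services or write your own.
Look up Jaro-Winkler, Levenshtein, the Dice coefficient and so on. These are good algorithms for comparing string similarity.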
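As a rough illustration of the "write your own" route, here is a minimal T-SQL sketch of a similarity score derived from the Levenshtein edit distance. The name dbo.LevenshteinSimilarity, the VARCHAR(100) limits and the normalisation (1 minus distance divided by the longer length) are assumptions made for this example, not part of the original answer. It keeps one row of the distance matrix at a time in table variables, which is simple to follow but slow; a CLR function would be the better choice for large tables.

```sql
-- Hypothetical helper: similarity in [0, 1] based on Levenshtein edit distance.
CREATE FUNCTION dbo.LevenshteinSimilarity
(
    @s VARCHAR(100),
    @t VARCHAR(100)
)
RETURNS FLOAT
AS
BEGIN
    DECLARE @ls INT, @lt INT, @i INT, @j INT, @cost INT, @best INT;

    SELECT @s = COALESCE(@s, ''), @t = COALESCE(@t, '');
    -- DATALENGTH counts trailing spaces, which LEN would ignore (VARCHAR: 1 byte per character)
    SELECT @ls = DATALENGTH(@s), @lt = DATALENGTH(@t);

    IF @ls = 0 AND @lt = 0 RETURN 1.0;
    IF @ls = 0 OR @lt = 0 RETURN 0.0;

    -- Previous and current row of the dynamic-programming matrix
    DECLARE @prev TABLE (j INT PRIMARY KEY, d INT NOT NULL);
    DECLARE @curr TABLE (j INT PRIMARY KEY, d INT NOT NULL);

    -- Row 0: turning an empty prefix of @s into the first j characters of @t costs j insertions
    SET @j = 0;
    WHILE @j <= @lt
    BEGIN
        INSERT INTO @prev (j, d) VALUES (@j, @j);
        SET @j = @j + 1;
    END

    SET @i = 1;
    WHILE @i <= @ls
    BEGIN
        DELETE FROM @curr;
        INSERT INTO @curr (j, d) VALUES (0, @i);
        SET @j = 1;
        WHILE @j <= @lt
        BEGIN
            -- Characters are compared using the database's default collation
            SET @cost = CASE WHEN SUBSTRING(@s, @i, 1) = SUBSTRING(@t, @j, 1) THEN 0 ELSE 1 END;
            SELECT @best = MIN(v)
            FROM (
                SELECT d + 1 AS v FROM @prev WHERE j = @j      -- deletion
                UNION ALL
                SELECT d + 1 FROM @curr WHERE j = @j - 1       -- insertion
                UNION ALL
                SELECT d + @cost FROM @prev WHERE j = @j - 1   -- substitution or match
            ) AS candidates;
            INSERT INTO @curr (j, d) VALUES (@j, @best);
            SET @j = @j + 1;
        END
        -- The current row becomes the previous row for the next character of @s
        DELETE FROM @prev;
        INSERT INTO @prev (j, d) SELECT j, d FROM @curr;
        SET @i = @i + 1;
    END

    SELECT @best = d FROM @prev WHERE j = @lt;
    RETURN 1.0 - CAST(@best AS FLOAT) / CASE WHEN @ls > @lt THEN @ls ELSE @lt END;
END
GO
-- One inserted 'n' gives distance 1 over a longest length of 12: 1 - 1/12, about 0.917
SELECT dbo.LevenshteinSimilarity('Ayrton Sena', 'Ayrton Senna') AS Similarity;
```

With this normalisation the example pair scores just above the 0.9 threshold used in the query, but the threshold would need tuning for shorter strings, where a single edit costs proportionally more.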
Of course, you could also scan the whole table with
select *
from myTable t
where dbo.MyExternalStringSimilarity('Ayrton Senna', t.TextField)>0.9
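but that can take a very long time to finish.
Edit: that said, we currently use the first approach to find all similar spellings of a name. It works great.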
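Answer 1 (score: 1)
I have been working on a similar problem and stumbled on the "metaphone" algorithm, created in 1990. It is essentially a more accurate version of Soundex and can be used to identify words that sound alike. It is available as a built-in function in some programming languages. We have been using a SQL Server equivalent function by 'Phil Factor' with some success. It was reverse-engineered from PHP's built-in metaphone function.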
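For comparison, SQL Server already ships the older Soundex algorithm as the built-in SOUNDEX and DIFFERENCE functions. A quick check on the surname from the question, as a small sketch:

```sql
SELECT SOUNDEX('Sena')             AS SenaCode,    -- S500
       SOUNDEX('Senna')            AS SennaCode,   -- S500
       DIFFERENCE('Sena', 'Senna') AS Similarity;  -- 4 (scale 0 to 4, 4 = most similar)
```

Soundex keeps only the first letter plus three digits, so it also tends to group words that do not sound very alike, which is the weakness Metaphone tries to address.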
I have pasted a reformatted version of the Metaphone function below to make the code easier to read.
IF OBJECT_ID('Utils.Metaphone','FN') IS NOT NULL --drop any existing metaphone function
DROP FUNCTION Utils.Metaphone
GO
CREATE FUNCTION Utils.Metaphone
(
@String VARCHAR(30)
)
RETURNS VARCHAR(10)
AS
BEGIN
DECLARE @New BIT
,@ii INT
,@Metaphone VARCHAR(28)
,@Len INT
,@Where INT;
DECLARE @This CHAR
,@Next CHAR
,@Following CHAR
,@Previous CHAR
,@Silent BIT;
SELECT @String = UPPER(LTRIM(COALESCE(@String, ''))); --trim and upper case
SELECT @Where = PATINDEX ('%[^A-Z]%', @String COLLATE Latin1_General_CI_AI )
WHILE @Where > 0 --strip out all non-alphabetic characters!
BEGIN
SELECT @String = STUFF(@string, @Where, 1, '')
SELECT @Where = PATINDEX ('%[^A-Z]%',@String COLLATE Latin1_General_CI_AI )
END
IF (LEN(@String) < 2) RETURN @String
--do the start of string stuff first.
--If the word begins with 'KN', 'GN', 'PN', 'AE', 'WR', drop the first letter.
-- "Aebersold", "Gnagy", "Knuth", "Pniewski", "Wright"
IF SUBSTRING(@String, 1, 2) IN ( 'KN', 'GN', 'PN', 'AE', 'WR' )
SELECT @String = STUFF(@String, 1, 1, '');
-- Beginning of word: "x" change to "s" as in "Deng Xiaopeng"
IF SUBSTRING(@String, 1, 1) = 'X'
SELECT @String = STUFF(@String, 1, 1, 'S');
-- Beginning of word: "wh-" change to "w" as in "Whatsoever"
IF @String LIKE 'WH%'
SELECT @String = STUFF(@String, 1, 2, 'W'); -- replace both characters; a length of 1 would leave the 'H' in place
-- Set up for While loop
SELECT @Len = LEN(@String), @Metaphone = '' -- Initialize the main variable
,@New = 1 -- this variable only used next 10 lines!!!
,@ii = 1; --Position counter
--
WHILE((LEN(@Metaphone) <= 8) AND (@ii <= @Len))
BEGIN --SET up the 'pointers' for this loop-around }
SELECT @Previous = CASE WHEN @ii > 1 THEN SUBSTRING(@String, @ii - 1, 1) ELSE '' END
-- originally a nul terminated string }
,@This = SUBSTRING(@String, @ii, 1)
,@Next = CASE WHEN @ii < @Len THEN SUBSTRING(@String, @ii + 1, 1) ELSE '' END
,@Following = CASE WHEN((@ii + 1) < @Len) THEN SUBSTRING(@String, @ii + 2, 1) ELSE '' END
-- 'CC' inside word
/* Drop duplicate adjacent letters, except for C.*/
IF @This=@Previous AND @This<> 'C'
BEGIN
--we do nothing
SELECT @New=0
END
/*Drop all vowels unless it is the beginning.*/
ELSE IF @This IN ( 'A', 'E', 'I', 'O', 'U' )
BEGIN
IF @ii = 1 --vowel at the beginning
SELECT @Metaphone = @This;
END;
/* B -> B unless at the end of word after "m", as in "dumb", "Comb" */
ELSE IF @This = 'B' AND NOT ((@ii = @Len) AND (@Previous = 'M'))
BEGIN
SELECT @Metaphone = @Metaphone + 'B';
END;
-- -mb is silent
/*'C' transforms to 'X' if followed by 'IA' or 'H' (unless in latter case, it is part of '-SCH-',
in which case it transforms to 'K'). 'C' transforms to 'S' if followed by 'I', 'E', or 'Y'.
Otherwise, 'C' transforms to 'K'.*/
ELSE IF @This = 'C'
BEGIN -- -sce, i, y = silent
IF NOT (@Previous= 'S') AND (@Next IN ( 'H', 'E', 'I', 'Y' )) --front vowel set
BEGIN
IF (@Next = 'I') AND (@Following = 'A')
SELECT @Metaphone = @Metaphone + 'X'; -- -cia-
ELSE IF(@Next IN ( 'E', 'I', 'Y' ))
SELECT @Metaphone = @Metaphone + 'S'; -- -ce, i, y = 'S' }
ELSE IF(@Next = 'H') AND (@Previous = 'S')
SELECT @Metaphone = @Metaphone + 'K'; -- -sch- = 'K' }
ELSE IF(@Next = 'H')
BEGIN
IF(@ii = 1) AND ((@ii + 2) <= @Len) AND NOT(@Following IN ( 'A', 'E', 'I', 'O', 'U' ))
SELECT @Metaphone = @Metaphone + 'K';
ELSE
SELECT @Metaphone = @Metaphone + 'X';
END
END
ELSE
SELECT @Metaphone = @Metaphone +CASE WHEN @Previous= 'S' THEN '' ELSE 'K' END;
-- Else silent
END; -- Case C }
/*'D' transforms to 'J' if followed by 'GE', 'GY', or 'GI'. Otherwise, 'D'
transforms to 'T'.*/
ELSE IF @This = 'D'
BEGIN
SELECT @Metaphone = @Metaphone
+ CASE WHEN(@Next = 'G') AND (@Following IN ( 'E', 'I', 'Y' )) --front vowel set
THEN 'J'
ELSE 'T'
END;
END;
ELSE IF @This = 'G'
/*Drop 'G' if followed by 'H' and 'H' is not at the end or before a vowel. Drop 'G'
if followed by 'N' or 'NED' and is at the end.
'G' transforms to 'J' if before 'I', 'E', or 'Y', and it is not in 'GG'.
Otherwise, 'G' transforms to 'K'.*/
BEGIN
SELECT @Silent = CASE WHEN (@Next = 'H')
AND (@Following IN ('A','E','I','O','U'))
AND (@ii > 1)
AND (
((@ii+1) = @Len)
OR
(
(@Next = 'n')
AND (@Following = 'E')
AND SUBSTRING(@String,@ii+3,1) = 'D'
)
AND ((@ii+3) = @Len)
)
-- Terminal -gned
AND (@Previous = 'i')
AND (@Next = 'n')
THEN 1
-- if not start and near -end or -gned.)
WHEN (@ii > 1)
AND (@Previous = 'D')-- gnuw
AND (@Next IN ('E','I','Y')) --front vowel set
THEN 1 -- -dge, i, y
ELSE 0
END
IF NOT(@Silent=1)
SELECT @Metaphone = @Metaphone
+ CASE WHEN (@Next IN ('E','I','Y')) --front vowel set
THEN 'J'
ELSE 'K'
END
END
/*Drop 'H' if after vowel and not before a vowel.
or the second char of "-ch-", "-sh-", "-ph-", "-th-", "-gh-"*/
ELSE IF @This = 'H'
BEGIN
IF NOT( (@ii= @Len) OR (@Previous IN ( 'C', 'S', 'T', 'G' )))
AND (@Next IN ( 'A', 'E', 'I', 'O', 'U' ) )
SELECT @Metaphone = @Metaphone + 'H';
-- else silent (vowel follows) }
END;
ELSE IF @This IN --some get no substitution
( 'F', 'J', 'L', 'M', 'N', 'R' )
BEGIN
SELECT @Metaphone = @Metaphone + @This;
END;
/*'CK' transforms to 'K'.*/
ELSE IF @This = 'K'
BEGIN
IF (@Previous <> 'C')
SELECT @Metaphone = @Metaphone + 'K';
END;
/*'PH' transforms to 'F'.*/
ELSE IF @This = 'P'
BEGIN
IF(@Next = 'H')
SELECT @Metaphone = @Metaphone + 'F', @ii = @ii + 1;
-- Skip the 'H'
ELSE
SELECT @Metaphone = @Metaphone + 'P';
END;
/*'Q' transforms to 'K'.*/
ELSE IF @This = 'Q'
BEGIN
SELECT @Metaphone = @Metaphone + 'K';
END;
/*'S' transforms to 'X' if followed by 'H', 'IO', or 'IA'.*/
ELSE IF @This = 'S'
BEGIN
SELECT @Metaphone = @Metaphone
+ CASE WHEN (@Next = 'H')
OR
(
(@ii> 1)
AND (@Next = 'i')
AND (@Following IN ( 'O', 'A' ) )
)
THEN 'X'
ELSE 'S'
END;
END;
/*'T' transforms to 'X' if followed by 'IA' or 'IO'. 'TH' transforms
to '0'. Drop 'T' if followed by 'CH'.*/
ELSE IF @This = 'T'
BEGIN
SELECT @Metaphone = @Metaphone
+ CASE WHEN (@ii = 1)
AND (@Next = 'H')
AND (@Following = 'O')
THEN 'T' -- Initial Tho- }
WHEN (@ii > 1)
AND (@Next = 'i')
AND (@Following IN ( 'O', 'A' ))
THEN 'X'
WHEN (@Next = 'H')
THEN '0'
WHEN NOT((@Next = 'C')
AND (@Following = 'H'))
THEN 'T'
ELSE ''
END;
-- -tch = silent }
END;
/*'V' transforms to 'F'.*/
ELSE IF @This = 'V'
BEGIN
SELECT @Metaphone = @Metaphone + 'F';
END;
/*'WH' transforms to 'W' if at the beginning. Drop 'W' if not followed by a vowel.*/
/*Drop 'Y' if not followed by a vowel.*/
ELSE IF @This IN ( 'W', 'Y' )
BEGIN
IF @Next IN ( 'A', 'E', 'I', 'O', 'U' )
SELECT @Metaphone = @Metaphone + @This;
--else silent
END;
/*'X' transforms to 'S' if at the beginning. Otherwise, 'X' transforms to 'KS'.*/
ELSE IF @This = 'X'
BEGIN
SELECT @Metaphone = @Metaphone + 'KS';
END;
/*'Z' transforms to 'S'.*/
ELSE IF @This = 'Z'
BEGIN
SELECT @Metaphone = @Metaphone + 'S';
END;
ELSE
RETURN 'error with '''+ @This+ '''';
-- end
SELECT @ii = @ii + 1;
END; -- While
RETURN @Metaphone
END
All of the following tests produce the same result.
SELECT Utils.Metaphone('Aryton Sena')
,Utils.Metaphone('Aryton Senna')
,Utils.Metaphone('Ayrton Senna')
,Utils.Metaphone('Ayrton Sena')
,Utils.Metaphone('Aryten Sena')
,Utils.Metaphone('Aryten Senna')
,Utils.Metaphone('Ayrten Senna')
,Utils.Metaphone('Ayrten Sena');
Result:
ARTNSN
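To apply the function to the search in the question, one option is to compare Metaphone keys instead of raw text. The sketch below reuses the myTable and TextField names from the first answer; the NameKey column and the index name are hypothetical. Storing the key in an ordinary indexed column avoids calling the function for every row at query time.

```sql
-- Hypothetical column holding the precomputed Metaphone key
ALTER TABLE myTable ADD NameKey VARCHAR(10) NULL;
GO
-- Compute the key once for the existing rows, then index it
UPDATE myTable SET NameKey = Utils.Metaphone(TextField);
CREATE INDEX IX_myTable_NameKey ON myTable (NameKey);
GO
DECLARE @Search VARCHAR(30);
SET @Search = 'Ayrton Sena';

-- Rows whose key equals the key of the search term, e.g. 'Ayrton Senna'
SELECT t.*
FROM myTable t
WHERE t.NameKey = Utils.Metaphone(@Search);
```

The key would have to be refreshed whenever TextField changes, for example in a trigger. Note also that the function as written reads at most 30 characters and returns at most 10, so long values that share a beginning will collide on the same key.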
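Answer 2 (score: 0)
Take a look at Full Text Search. It allows a variety of searches, including different word forms. You can configure the word forms yourself or use the out-of-the-box thesaurus.
Quote (emphasis mine):
Full-text queries perform linguistic searches against text data in full-text indexes by operating on words and phrases based on rules of a particular language such as English or Japanese. Full-text queries can include simple words and phrases or multiple forms of a word or phrase.
For the thesaurus, see this answer.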
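As a sketch of what such queries can look like, the statements below use the CONTAINS and FREETEXT predicates against the myTable / TextField names borrowed from the first answer. They assume a full-text index already exists on TextField, and the thesaurus variant only helps if the thesaurus file for the column's language has been edited to map "Sena" to "Senna", which is an assumption here.

```sql
-- Inflectional forms: matches drive, drives, drove, driving, ...
SELECT t.*
FROM myTable t
WHERE CONTAINS(t.TextField, 'FORMSOF(INFLECTIONAL, drive)');

-- Thesaurus expansion: depends on entries configured in the thesaurus file
SELECT t.*
FROM myTable t
WHERE CONTAINS(t.TextField, 'FORMSOF(THESAURUS, Sena)');

-- FREETEXT matches on meaning rather than exact wording and applies
-- inflectional and thesaurus expansion to each word automatically
SELECT t.*
FROM myTable t
WHERE FREETEXT(t.TextField, 'Ayrton Sena');
```

None of these treats "Sena" as a misspelling of "Senna" on its own; without a thesaurus entry, a misspelled name still needs one of the similarity or phonetic approaches from the other answers.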
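Answer 3 (score: 0)
Spell checkers typically use a dictionary to look up words. If your word exactly matches a word in the dictionary, it is spelled correctly. If not, the closest match is found and offered as a replacement. Some spell checkers also use alternative spellings or common misspellings, but that does not fundamentally change how they work.
Jaro-Winkler is a distance measure: it measures the "distance" between two words, based on how many characters they share and how many transpositions it takes to get from the first word to the second. Jaro is commonly used for matching people's names, because that is what it is good at. It can be used for more general matching too, but you need to watch out for abbreviations and the like, as these can confuse it.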
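For reference, the usual textbook definition can be written as follows. Here $m$ is the number of matching characters (equal characters no further apart than $\lfloor \max(|s_1|,|s_2|)/2 \rfloor - 1$ positions), $t$ is half the number of matched characters that appear in a different order, $\ell$ is the length of the common prefix (capped at 4) and $p$ is a scaling factor, commonly 0.1. The symbols are the conventional ones, not taken from the answer above.

```latex
\[
\mathrm{sim}_j =
\begin{cases}
0 & \text{if } m = 0,\\[4pt]
\dfrac{1}{3}\left(\dfrac{m}{|s_1|} + \dfrac{m}{|s_2|} + \dfrac{m - t}{m}\right) & \text{otherwise,}
\end{cases}
\qquad
\mathrm{sim}_w = \mathrm{sim}_j + \ell\,p\,\bigl(1 - \mathrm{sim}_j\bigr)
\]

% Worked example: s_1 = SENA, s_2 = SENNA
% m = 4, t = 0, |s_1| = 4, |s_2| = 5, common prefix SEN so \ell = 3, p = 0.1
\[
\mathrm{sim}_j = \frac{1}{3}\left(\frac{4}{4} + \frac{4}{5} + \frac{4}{4}\right) \approx 0.933,
\qquad
\mathrm{sim}_w \approx 0.933 + 3 \cdot 0.1 \cdot (1 - 0.933) \approx 0.953
\]
```

Both values are similarities, where 1 means identical; the corresponding distance is one minus the similarity.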
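Performance should not be a problem. I usually implement the Jaro-Winkler algorithm in a .NET application, because writing it as a SQL UDF is tricky. Could you use an external CLR stored procedure as well? This performs well when matching thousands of records. If you might be matching millions of names, performance may become more of a concern.
Here is an example of how you might approach this: http://isolvable.blogspot.co.uk/2011/05/jaro-winkler-fast-fuzzy-linkage.html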