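I'm planning to write a T-SQL function that takes two input strings and returns their word similarity as a percentage, for example:
SELECT [dbo].[FN_CalcSimilarWords]('Golden horses hotel','Hotel Golden Horses')
Returns:
3/3
or
SELECT [dbo].[FN_CalcSimilarWords]('Golden horses','Golden horses Malaysia')
Returns:
2/3
I'm thinking of parsing the strings into words with this split function, then looping over the words and comparing them. Are there any other ideas that would perform better?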
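Answer 0 (score: 2)
With this solution I assume you want duplicates removed. Swapping the first and second parameters has no effect on the result.
It returns a single number rather than a fraction such as 2/3, because a scalar function can return only one value (otherwise it has to return a table). I assume you want a value between 0 and 1, so that 2/3 = 0.67, or 67% if multiplied by 100.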
CREATE FUNCTION f_functionx
(
    @str1 varchar(2000),
    @str2 varchar(2000)
)
RETURNS decimal(5,2)
AS
BEGIN
    DECLARE @returnvalue decimal(5,2)

    -- Split @str1 on spaces by rewriting it as XML: <t>word</t><t>word</t>...
    -- The column is as wide as the input so a long word cannot overflow it.
    DECLARE @list1 table(value varchar(2000))
    INSERT @list1
    SELECT t.c.value('.', 'VARCHAR(2000)')
    FROM (
        SELECT x = CAST('<t>' +
            REPLACE(@str1, ' ', '</t><t>') + '</t>' AS XML)
    ) a
    CROSS APPLY x.nodes('/t') t(c)

    -- Same split for @str2
    DECLARE @list2 table(value varchar(2000))
    INSERT @list2
    SELECT t.c.value('.', 'VARCHAR(2000)')
    FROM (
        SELECT x = CAST('<t>' +
            REPLACE(@str2, ' ', '</t><t>') + '</t>' AS XML)
    ) a
    CROSS APPLY x.nodes('/t') t(c)

    ;WITH isect AS
    (
        -- INTERSECT returns distinct words present in both lists
        SELECT count(*) matches FROM
        (
            SELECT value FROM @list1
            INTERSECT
            SELECT value FROM @list2
        ) x
    ), total AS
    (
        -- Denominator: the larger of the two distinct word counts
        SELECT max(cnt) cnt
        FROM
        (
            SELECT count(distinct value) cnt FROM @list1
            UNION ALL
            SELECT count(distinct value) FROM @list2
        ) x
    )
    SELECT
        @returnvalue = cast(isect.matches as decimal(9,2)) / total.cnt
    FROM total
    CROSS JOIN isect

    RETURN @returnvalue
END
GO
You can call the function like this:
SELECT dbo.f_functionx('Golden horses', 'Golden horses')
SELECT dbo.f_functionx('Golden horses', 'Golden horses XX')
Returns:
1.00
0.67
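The XML-based split above fails to cast when a word contains characters such as & or <, and a multi-statement scalar function is invoked once per row. As a possible alternative, here is a minimal sketch of an inline table-valued version. It assumes SQL Server 2016 or later (database compatibility level 130 or higher) so that the built-in STRING_SPLIT is available; the name f_word_similarity and the column name similarity are hypothetical and not part of the answer above.

```sql
-- Hypothetical inline table-valued variant; names are illustrative only.
-- Assumes SQL Server 2016+ (compatibility level 130 or higher) for STRING_SPLIT.
CREATE FUNCTION dbo.f_word_similarity
(
    @str1 varchar(2000),
    @str2 varchar(2000)
)
RETURNS TABLE
AS
RETURN
    WITH list1 AS
    (
        -- DISTINCT removes duplicate words; the filter drops the empty
        -- tokens that consecutive spaces produce
        SELECT DISTINCT value FROM STRING_SPLIT(@str1, ' ') WHERE value <> ''
    ),
    list2 AS
    (
        SELECT DISTINCT value FROM STRING_SPLIT(@str2, ' ') WHERE value <> ''
    ),
    counts AS
    (
        SELECT
            (SELECT COUNT(*) FROM list1) AS cnt1,
            (SELECT COUNT(*) FROM list2) AS cnt2,
            (SELECT COUNT(*)
             FROM list1
             JOIN list2 ON list1.value = list2.value) AS matches
    )
    SELECT CAST(
               CAST(matches AS decimal(9,2))
               -- larger distinct word count; NULLIF avoids division by zero
               / NULLIF(CASE WHEN cnt1 > cnt2 THEN cnt1 ELSE cnt2 END, 0)
           AS decimal(5,2)) AS similarity
    FROM counts;
GO

-- Usage: an inline function is queried like a table
SELECT similarity
FROM dbo.f_word_similarity('Golden horses', 'Golden horses XX');
```

Whether this actually performs better than the scalar function depends on the data and the server version, so it is worth measuring both. As with the original, whether 'Horses' matches 'horses' depends on the collation in effect.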
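Answer 1 (score: 1)
If you want to do this in SQL, here is the approach I would take.
Use a split routine to create two temp tables, #Words1 and #Words2.
Now join the tables and get the count, i.e.
select count(*)
from #Words1 w1
join #Words2 w2 on w1.word=w2.word
This lets SQL do the work the way it is optimized to do it. Here is how to get the counts from both tables in one query:
-- full outer join keeps unmatched words from both sides,
-- so FromW1 and FromW2 each count every distinct word in their table
select count(distinct case when w2.word is not null then w1.word end) as Matches,
count(distinct w1.word) as FromW1,
count(distinct w2.word) as FromW2
from #Words1 w1
full outer join #Words2 w2 on w1.word=w2.word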
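The answer leaves the split routine open. One way to fill the two temp tables is sketched below, assuming SQL Server 2016 or later (compatibility level 130 or higher) for the built-in STRING_SPLIT; the variables @str1 and @str2 are hypothetical sample inputs. Note that temp tables cannot be used inside a user-defined function, so this shape suits a script or a stored procedure.

```sql
-- Hypothetical sample inputs
DECLARE @str1 varchar(2000) = 'Golden horses';
DECLARE @str2 varchar(2000) = 'Golden horses Malaysia';

-- Start clean if the script is re-run in the same session
IF OBJECT_ID('tempdb..#Words1') IS NOT NULL DROP TABLE #Words1;
IF OBJECT_ID('tempdb..#Words2') IS NOT NULL DROP TABLE #Words2;

-- Assumes SQL Server 2016+ for STRING_SPLIT; empty tokens from
-- consecutive spaces are filtered out
SELECT value AS word
INTO #Words1
FROM STRING_SPLIT(@str1, ' ')
WHERE value <> '';

SELECT value AS word
INTO #Words2
FROM STRING_SPLIT(@str2, ' ')
WHERE value <> '';

-- Distinct words that occur in both strings
SELECT COUNT(DISTINCT w1.word) AS Matches
FROM #Words1 w1
JOIN #Words2 w2 ON w1.word = w2.word;
```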
Answer 2 (score: 1)
Original answer: SQL Fiddle
I saw this technique on the PTR Blog.
Edit
Modified to address the issue raised in @t-clausen.dk's comment:
MS SQL Server 2012 Schema Setup:
CREATE TABLE StringTable
(
Id INT IDentity,
String varchar(max)
)
INSERT INTO StringTable
VALUES ('xx xx Golden horses Malaysia'),
('xx xx xx xx xx')
Query 1:
WITH StringsCTE
AS
(
    -- Anchor: take the first word of each string and keep the remainder
    SELECT ID, String AS StringValue,
        CASE CHARINDEX(' ', String)
            WHEN 0 THEN String
            ELSE LEFT(String, CHARINDEX(' ', String) - 1)
        END AS Word,
        1 AS Position,
        CASE CHARINDEX(' ', String)
            WHEN 0 THEN ''
            ELSE RIGHT(String, LEN(String) - CHARINDEX(' ', String))
        END AS RestOfLine
    FROM StringTable
    UNION ALL
    -- Recursive step: peel the next word off RestOfLine
    SELECT Id, S.StringValue,
        CASE CHARINDEX(' ', RestOfLine)
            WHEN 0 THEN RestOfLine
            ELSE LEFT(RestOfLine, CHARINDEX(' ', RestOfLine) - 1)
        END,
        Position + 1,
        CASE CHARINDEX(' ', RestOfLine)
            WHEN 0 THEN ''
            ELSE RIGHT(RestOfLine, LEN(RestOfLine) - CHARINDEX(' ', RestOfLine))
        END
    FROM StringsCTE S
    WHERE S.RestOfLine != ''
),
WordsPerString
AS
(
    SELECT S.Id, COUNT(S.Word) AS NumberOfWords
    FROM StringsCTE S
    GROUP BY S.Id
)
SELECT COUNT(*) AS Matches, (SELECT MAX(NumberOfWords) FROM WordsPerString) AS Total
FROM StringsCTE S1
INNER JOIN StringsCTE S2
    ON S1.Word = S2.Word AND S1.Id <> S2.Id
WHERE S1.Id = 1 AND
    NOT EXISTS -- Not already matched
    (SELECT * FROM StringsCTE S3 WHERE S3.Word = S2.Word AND S3.Id <> S1.ID AND S3.Position < S2.Position)
Results:
| MATCHES | TOTAL |
|---------|-------|
| 2 | 5 |
Answer 3 (score: 0)
If you have no restrictions on deploying CLR assemblies, you could try this route and compare the performance.
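Answer 4 (score: 0)
If you are not concerned with the exact number of common items, you can use SQL Server's Full-Text Search feature. CONTAINSTABLE and FREETEXTTABLE both return a RANK column. See their documentation for details.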
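To illustrate the ranking idea, here is a minimal sketch using FREETEXTTABLE. It assumes a hypothetical table dbo.Hotels(HotelId, Name) that already has a full-text index on Name with HotelId as the full-text key; none of these names come from the question. The rank is a relative relevance score for ordering rows, not a count of shared words.

```sql
-- Hypothetical table dbo.Hotels(HotelId, Name); requires an existing
-- full-text index on Name whose key column is HotelId.
SELECT h.HotelId, h.Name, ft.[RANK]
FROM FREETEXTTABLE(dbo.Hotels, Name, 'Golden horses hotel') AS ft
JOIN dbo.Hotels AS h
    ON h.HotelId = ft.[KEY]   -- [KEY] holds the full-text key of the matched row
ORDER BY ft.[RANK] DESC;      -- higher rank means a closer match
```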