Count the number of similar words between two strings in SQL

Time: 2014-09-10 10:01:36

Tags: sql sql-server string algorithm tsql

I'm planning to write a T-SQL function that takes two input strings and outputs the percentage of similar words, for example:

SELECT [dbo].[FN_CalcSimilarWords]('Golden horses hotel','Hotel Golden Horses')

Returns:

3/3

SELECT [dbo].[FN_CalcSimilarWords]('Golden horses','Golden horses Malaysia')

Returns:

2/3

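I'm thinking of parsing the strings into words with this split function and then looping over and comparing them. Are there any other ideas that would perform better?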

5 answers:

Answer 0: (score: 2)

With this solution I assume you want duplicate words removed. Swapping the first and second parameters has no effect on the result.

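It returns a single value, not a percentage, because a function can only return one value or a table. I assume you want a value between 0 and 1, so that 2/3 = 0.67, or 67% if multiplied by 100.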

CREATE function f_functionx
(
  @str1 varchar(2000),
  @str2 varchar(2000)
)
returns decimal(5,2)
as
BEGIN
DECLARE @returnvalue decimal(5,2)
DECLARE @list1 table(value varchar(2000)) -- same width as the split output below, so long words are not rejected
INSERT @list1
SELECT t.c.value('.', 'VARCHAR(2000)')
FROM (
    SELECT x = CAST('<t>' + 
        REPLACE(@str1, ' ', '</t><t>') + '</t>' AS XML)
) a
CROSS APPLY x.nodes('/t') t(c)

DECLARE @list2 table(value varchar(2000)) -- same width as the split output below, so long words are not rejected
INSERT @list2
SELECT t.c.value('.', 'VARCHAR(2000)')
FROM (
    SELECT x = CAST('<t>' + 
        REPLACE(@str2, ' ', '</t><t>') + '</t>' AS XML)
) a
CROSS APPLY x.nodes('/t') t(c)


;WITH isect as
(
  SELECT count(*) match FROM
  (
    SELECT value FROM @list1
    INTERSECT
    SELECT value FROM @list2
  ) x
), total as
(
  SELECT max(cnt) cnt
  FROM
  (
    SELECT count(distinct value) cnt FROM @list1
    UNION ALL
    SELECT count(distinct value) FROM @list2
  ) x
)
SELECT 
  @returnvalue = cast(isect.match as decimal(9,2)) / total.cnt 
FROM total
CROSS JOIN isect

RETURN @returnvalue
END

GO

You can call the function like this:

SELECT dbo.f_functionx('Golden horses', 'Golden horses')
SELECT dbo.f_functionx('Golden horses', 'Golden horses XX')

Returns:

1
0.67
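
As a side note on the splitting step: the XML cast used above raises an error when a string contains characters such as & or <. A minimal sketch of the same distinct-word ratio using STRING_SPLIT follows. It assumes SQL Server 2016 or later (compatibility level 130), which is newer than this question, and the variables @str1 and @str2 are only sample inputs.

```sql
-- Sketch only: assumes SQL Server 2016+ so that STRING_SPLIT is available.
DECLARE @str1 varchar(2000) = 'Golden horses';
DECLARE @str2 varchar(2000) = 'Golden horses XX';

WITH w1 AS (
    -- distinct, non-empty words of the first string
    SELECT DISTINCT value FROM STRING_SPLIT(@str1, ' ') WHERE value <> ''
), w2 AS (
    -- distinct, non-empty words of the second string
    SELECT DISTINCT value FROM STRING_SPLIT(@str2, ' ') WHERE value <> ''
)
SELECT CAST((SELECT COUNT(*)
             FROM (SELECT value FROM w1
                   INTERSECT
                   SELECT value FROM w2) AS i) AS decimal(9,2))
       -- divide by the larger distinct word count; NULLIF avoids division by zero
       / NULLIF((SELECT MAX(cnt)
                 FROM (SELECT COUNT(*) AS cnt FROM w1
                       UNION ALL
                       SELECT COUNT(*) FROM w2) AS c), 0) AS similarity;
```

Whether 'Hotel' and 'hotel' count as the same word depends on the collation in effect, just as in the function above.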

Answer 1: (score: 1)

If you want to do this in SQL, here is one approach I would take.

Use a split routine to create two temporary tables, called Words1 and Words2.

Now join the tables and get the count, i.e.

select count(*) 
from #Words1 w1 
join #Words2 w2 on w1.word=w2.word

Let SQL Server do this in the way it is optimized for.
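
The split-then-join steps above could be sketched as follows. This is only an illustration: it assumes SQL Server 2016 or later for STRING_SPLIT (any other split routine would do), and the variables @s1 and @s2 are hypothetical sample inputs.

```sql
-- Sketch only: STRING_SPLIT stands in for whichever split routine you use.
DECLARE @s1 varchar(2000) = 'Golden horses hotel';
DECLARE @s2 varchar(2000) = 'Hotel Golden Horses';

-- one row per distinct word in each string
SELECT DISTINCT value AS word INTO #Words1
FROM STRING_SPLIT(@s1, ' ') WHERE value <> '';

SELECT DISTINCT value AS word INTO #Words2
FROM STRING_SPLIT(@s2, ' ') WHERE value <> '';

-- set-based comparison instead of a loop
SELECT COUNT(*) AS Matches
FROM #Words1 w1
JOIN #Words2 w2 ON w1.word = w2.word;

DROP TABLE #Words1, #Words2;
```

Because both temp tables hold distinct words, the join cannot count a repeated word twice.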

Here is how to get the counts from both tables:

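-- full outer join keeps unmatched words from both sides, so each count is complete
select count(distinct case when w1.word = w2.word then w1.word end) as Matches,
       count(distinct w1.word) as FromW1,
       count(distinct w2.word) as FromW2
    from #Words1 w1 
    full outer join #Words2 w2 on w1.word=w2.word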

Answer 2: (score: 1)

Original answer: SQL Fiddle

I saw this technique on the PTR Blog.

Edit

Edited to address the issue raised in @t-clausen.dk's comment:

SQL Fiddle

MS SQL Server 2012 Schema Setup

CREATE TABLE StringTable 
(
    Id INT IDentity,
    String varchar(max)
)

INSERT INTO StringTable
VALUES ('xx xx Golden horses Malaysia'),
        ('xx xx xx xx xx')

Query 1

WITH StringsCTE 
AS
(
    SELECT ID,String As StringValue, 
            CASE CHARINDEX(' ', String)
                WHEN 0 THEN String
                ELSE LEFT(String, CHARINDEX(' ',String) -1)
            END AS Word,
            1 as Position,
            CASE CHARINDEX(' ',String)
                WHEN 0 THEN ''
                ELSE RIGHT(String, LEN(String) - CHARINDEX(' ',String))
            END AS RestOfLine
    FROM StringTable
    UNION ALL

    SELECT Id,S.StringValue, 
            CASE CHARINDEX(' ',RestOfLine)
                WHEN 0 THEN RestOfLine
                ELSE LEFT(RestOfLine, CHARINDEX(' ',RestOfLine) -1)
            END, 
            Position + 1, 
            CASE CHARINDEX(' ',RestOfLine)
                WHEN 0 THEN ''
                ELSE RIGHT(RestOfLine, LEN(RestOfLine) - CHARINDEX(' ',RestOfLine))
            END
    FROM StringsCTE S
    WHERE s.RestOfLine != ''
),
WordsPerString
As
(
    SELECT S.Id, COUNT(s.Word) As NumberOfWords
    FROM StringsCTE S
    GROUP BY S.Id
)
SELECT COUNT(*) As Matches, (SELECT MAX(NumberOfWords) FROM WordsPerString) as Total
FROM StringsCTE S1
INNER JOIN StringsCTE S2
    ON S1.Word = S2.Word AND S1.Id <> S2.Id
WHERE S1.Id = 1 AND 
    NOT EXISTS -- Not already matched
  (SELECT * FROM StringsCTE S3 WHERE S3.Word = S2.Word AND S3.Id <> S1.ID AND S3.Position < S2.Position)

Results

| MATCHES | TOTAL |
|---------|-------|
|       2 |     5 |

Answer 3: (score: 0)

If you have no restrictions on deploying CLR assemblies, you can try that route and compare the performance.

Answer 4: (score: 0)

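If you are not concerned about the exact number of common terms, you can use SQL Server's Full-Text Search feature. The CONTAINSTABLE and FREETEXTTABLE functions both return a rank. For details, see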

Full Text Ranking
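
A rough sketch of what that could look like with FREETEXTTABLE. The table dbo.Hotels, its columns, and the catalog name HotelCatalog are hypothetical, and it assumes Full-Text Search is installed on the instance. The RANK value is a relative relevance score, not a count of common words.

```sql
-- Hypothetical table; a full-text index needs a unique, non-nullable key index.
CREATE TABLE dbo.Hotels
(
    Id   int NOT NULL CONSTRAINT PK_Hotels PRIMARY KEY,
    Name varchar(200)
);

INSERT INTO dbo.Hotels (Id, Name)
VALUES (1, 'Hotel Golden Horses'),
       (2, 'Golden horses Malaysia');

CREATE FULLTEXT CATALOG HotelCatalog;
CREATE FULLTEXT INDEX ON dbo.Hotels (Name)
    KEY INDEX PK_Hotels ON HotelCatalog;
GO

-- Index population is asynchronous, so this may return no rows right after creation.
SELECT h.Id, h.Name, ft.[RANK]
FROM FREETEXTTABLE(dbo.Hotels, Name, 'Golden horses hotel') AS ft
JOIN dbo.Hotels h ON h.Id = ft.[KEY]
ORDER BY ft.[RANK] DESC;
```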