SQL Server 2012 T-SQL: count the number of words between elements of two sets

Time: 2014-11-03 18:50:52

Tags: sql sql-server text pattern-matching

I have two sets of elements; let's say these are:

  • Set 1: "nuclear", "fission", "dirty", and
  • Set 2: "device", "explosive"

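In my database I have a text column (Description) that contains a sentence or two. I want to find any records where Description contains an element from Set 1 and an element from Set 2, with the two elements separated by four or fewer words. For simplicity, counting (spaces - 1) will give the number of words between the two elements.

I would prefer a solution that does not require installing CLR functions for regular expressions. If this can instead be done with a user-defined table function, deployment becomes simpler.

Does this sound possible?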

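3 answers:

Answer 0: (score: 0)

I can't say anything about performance, but this can be done with CROSS APPLY and two temporary tables (table variables in the code below).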

-- Initialize the two word sets (varchar(50) is an assumed length; size it to your longest word)
DECLARE @set1 TABLE (wordFromSet varchar(50))
DECLARE @set2 TABLE (wordFromSet varchar(50))

INSERT INTO @set1 SELECT 'nuclear' UNION SELECT 'fission' UNION SELECT 'dirty'
INSERT INTO @set2 SELECT 'device' UNION SELECT 'explosive'

SELECT *
FROM MyTable m
CROSS APPLY
(
    SELECT wordFromSet
    ,LEN(SUBSTRING(m.Description, 1, CHARINDEX(wordFromSet, m.Description))) - LEN(REPLACE(SUBSTRING(m.Description, 1, CHARINDEX(wordFromSet, m.Description)),' ', '')) AS WordPosition
    FROM @set1
    WHERE m.Description LIKE '%' + wordFromSet + '%'
) w1
CROSS APPLY
(
    SELECT wordFromSet
    ,LEN(SUBSTRING(m.Description, 1, CHARINDEX(wordFromSet, m.Description))) - LEN(REPLACE(SUBSTRING(m.Description, 1, CHARINDEX(wordFromSet, m.Description)),' ', '')) AS WordPosition
    FROM @set2
    WHERE m.Description LIKE '%' + wordFromSet + '%'
) w2
WHERE ABS(w2.WordPosition - w1.WordPosition) <= 5 -- threshold: a position difference of 5 allows up to four words in between

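Essentially, it returns only the rows in MyTable that contain at least one word from each of the two sets. For those rows it computes each word's position as the difference in length between the substring that ends at the word's position and the same substring with the spaces removed.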
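
The position arithmetic can be checked in isolation. The following is a minimal sketch with a made-up sentence; the variable names (@Description, @word, @prefix) are hypothetical and exist only for this illustration.

```sql
-- Hypothetical sample sentence and search word, for illustration only
DECLARE @Description varchar(100) = 'a dirty little explosive device';
DECLARE @word varchar(50) = 'explosive';

-- Prefix of the sentence up to and including the first character of the word
DECLARE @prefix varchar(100) = SUBSTRING(@Description, 1, CHARINDEX(@word, @Description));

-- Number of spaces in the prefix = zero-based position of the word in the sentence
SELECT @prefix AS Prefix,                                        -- 'a dirty little e'
       LEN(@prefix) - LEN(REPLACE(@prefix, ' ', '')) AS WordPosition;  -- 3
```

Because the prefix always ends on a non-space character, the fact that LEN ignores trailing spaces does not distort the count. Repeated spaces or punctuation glued to words would, so the text may need cleaning first.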

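Answer 1: (score: 0)

It is possible, but I don't think it will perform well on millions of rows. I have a solution that, on our server, processes about 10,000 rows in 2 seconds and 100,000 rows in about 20 seconds. It also requires the well-known DelimitedSplit8K SQL table function from SQLServerCentral: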

DECLARE @set1 VARCHAR(MAX) = 'nuclear, fission, dirty';
DECLARE @set2 VARCHAR(MAX) = 'device, explosive';

WITH GetDistances AS 
(
    SELECT 
    DubID = ROW_NUMBER() OVER (PARTITION BY ID ORDER BY ID)
    , Distance = dbo.[cf_ValueSetDistance](s.Description, @set1, @set2)
    , s.ID
    ,s.Description 
    FROM #sentences s
    JOIN dbo.cf_DelimitedSplit8K(@set1, ',') s1 ON s.Description LIKE '%' + RTRIM(LTRIM(s1.Item)) + '%'
    JOIN dbo.cf_DelimitedSplit8K(@set2, ',') s2 ON s.Description LIKE '%' + RTRIM(LTRIM(s2.Item)) + '%'
) SELECT Distance, ID, Description FROM GetDistances WHERE DubID = 1 AND Distance BETWEEN 1 AND 4;
--10 000 rows: 2sec
--100 000 rows: 20sec


Test data generator

--DROP TABLE #sentences
CREATE TABLE #sentences
(
    ID INT IDENTITY(1,1) PRIMARY KEY
    , Description VARCHAR(100)
);

GO
--CREATE 10000 random sentences that are 100 chars long
SET NOCOUNT ON;
WHILE((SELECT COUNT(*) FROM #sentences) < 10000)
BEGIN 
    DECLARE @randomWord VARCHAR(100) = '';
    SELECT TOP 100 @randomWord = @randomWord + ' ' + Item FROM  dbo.cf_DelimitedSplit8K('nuclear fission dirty device explosive On the other hand, we denounce with righteous indignation and dislike men who are so beguiled and demoralized by the charms of pleasure of the moment, so blinded by desire, that they cannot foresee the pain and trouble that are bound to ensue; and equal blame belongs to those who fail in their duty through weakness of will, which is the same as saying through shrinking from toil and pain. These cases are perfectly simple and easy to distinguish. In a free hour, when our power of choice is untrammelled and when nothing prevents our being able to do what we like best, every pleasure is to be welcomed and every pain avoided. But in certain circumstances and owing to the claims of duty or the obligations of business it will frequently occur that pleasures have to be repudiated and annoyances accepted. The wise man therefore always holds in these matters to this principle of selection: he rejects pleasures to secure other greater pleasures, or else he endures pains to avoid worse pains', ' ') ORDER BY NEWID();

    INSERT INTO #sentences
    SELECT @randomWord
END

SET NOCOUNT OFF;

Function 1 - cf_ValueSetDistance

CREATE FUNCTION [dbo].[cf_ValueSetDistance]
(
    @value VARCHAR(MAX)
    , @compareSet1 VARCHAR(MAX)
    , @compareSet2 VARCHAR(MAX)
) 
RETURNS INT
AS
BEGIN

SET @value = REPLACE(REPLACE(REPLACE(@value, '.', ''), ',', ''), '?', '');
DECLARE @distance INT;

DECLARE @sentence TABLE( WordIndex INT, Word VARCHAR(MAX) );
DECLARE @set1 TABLE(Word VARCHAR(MAX) );
DECLARE @set2 TABLE(Word VARCHAR(MAX) );

INSERT INTO @sentence
SELECT ItemNumber, RTRIM(LTRIM(Item)) FROM dbo.cf_DelimitedSplit8K(@value, ' ')

INSERT INTO @set1
SELECT RTRIM(LTRIM(Item)) FROM dbo.cf_DelimitedSplit8K(@compareSet1, ',')

IF(EXISTS(SELECT 1 FROM @sentence s JOIN @set1 s1 ON s.Word = s1.Word))
BEGIN

    INSERT INTO @set2
    SELECT RTRIM(LTRIM(Item)) FROM dbo.cf_DelimitedSplit8K(@compareSet2, ',');

    IF(EXISTS(SELECT 1 FROM @sentence s JOIN @set2 s2 ON s.Word = s2.Word))
    BEGIN

        WITH Set1 AS (
            SELECT s.WordIndex, s.Word FROM @sentence s
            JOIN @set1 s1 ON s1.Word = s.Word
        ), Set2 AS
        (
            SELECT s.WordIndex, s.Word FROM @sentence s
            JOIN @set2 s2 ON s2.Word = s.Word
        )

        SELECT @distance = MIN(ABS(s2.WordIndex - s1.WordIndex)) FROM Set1 s1, Set2 s2  
    END

END

RETURN @distance;

END

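Function 2 - DelimitedSplit8K (don't even bother trying to understand this code; it is a very fast function for splitting a string into a table, written by several talented people):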

CREATE FUNCTION [dbo].[cf_DelimitedSplit8K]
        (@pString VARCHAR(8000), @pDelimiter CHAR(1))
RETURNS TABLE WITH SCHEMABINDING AS
 RETURN
--===== "Inline" CTE Driven "Tally Table" produces values from 1 up to 10,000...
     -- enough to cover VARCHAR(8000)
  WITH E1(N) AS (
                 SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL 
                 SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL 
                 SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1
                ),                          --10E+1 or 10 rows
       E2(N) AS (SELECT 1 FROM E1 a, E1 b), --10E+2 or 100 rows
       E4(N) AS (SELECT 1 FROM E2 a, E2 b), --10E+4 or 10,000 rows max
 cteTally(N) AS (--==== This provides the "base" CTE and limits the number of rows right up front
                     -- for both a performance gain and prevention of accidental "overruns"
                 SELECT TOP (ISNULL(DATALENGTH(@pString),0)) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM E4
                ),
cteStart(N1) AS (--==== This returns N+1 (starting position of each "element" just once for each delimiter)
                 SELECT 1 UNION ALL
                 SELECT t.N+1 FROM cteTally t WHERE SUBSTRING(@pString,t.N,1) = @pDelimiter
                ),
cteLen(N1,L1) AS(--==== Return start and length (for use in substring)
                 SELECT s.N1,
                        ISNULL(NULLIF(CHARINDEX(@pDelimiter,@pString,s.N1),0)-s.N1,8000)
                   FROM cteStart s
                )
--===== Do the actual split. The ISNULL/NULLIF combo handles the length for the final element when no delimiter is found.
 SELECT ItemNumber = ROW_NUMBER() OVER(ORDER BY l.N1),
        Item       = SUBSTRING(@pString, l.N1, l.L1)
   FROM cteLen l;

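Answer 2: (score: 0)

I'm adding a new answer even though my old one was accepted and I can see that you chose to go with "FULLTEXT INDEX".

I looked at the answer given by @Louis and I think the use of "CROSS APPLY" is clever. His answer outperformed mine. The only problem is that his code compares only from the first instance of a word. That made me want to try combining his answer with the split function I used (DelimitedSplit8K from SQLServerCentral).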
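
The first-instance limitation comes from CHARINDEX, which returns only the first occurrence of a word. A minimal sketch with a made-up sentence (the variable names are hypothetical) shows the effect:

```sql
-- Hypothetical sentence in which 'dirty' occurs twice; only the second one is next to 'device'
DECLARE @Description varchar(200) = 'dirty hands were not what made this a dirty device';

-- Same prefix technique as in the CROSS APPLY answer
DECLARE @p1 varchar(200) = SUBSTRING(@Description, 1, CHARINDEX('dirty', @Description));
DECLARE @p2 varchar(200) = SUBSTRING(@Description, 1, CHARINDEX('device', @Description));

SELECT LEN(@p1) - LEN(REPLACE(@p1, ' ', '')) AS DirtyPosition,   -- 0 (first 'dirty' only)
       LEN(@p2) - LEN(REPLACE(@p2, ' ', '')) AS DevicePosition;  -- 9
```

The computed distance is 9, so the row would be rejected even though "dirty device" appears as adjacent words. Splitting the sentence into one row per word avoids this, because every occurrence gets its own position.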

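Combining the two brings a significant performance gain. I tested it on 1 million rows and the result was almost instant:

  • My old answer: 5 minutes
  • @Louis's answer: 2 minutes
  • New answer: 3 seconds

It will not match "FULLTEXT INDEX" performance-wise, but it at least supports the word-search combination you specified in a reasonably efficient way.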

DECLARE @set1 TABLE (Word VARCHAR(50))
DECLARE @set2 TABLE (Word VARCHAR(50))

INSERT INTO @set1 SELECT 'nuclear' UNION SELECT 'fission' UNION SELECT 'dirty'
INSERT INTO @set2 SELECT 'device' UNION SELECT 'explosive'

SELECT * FROM #sentences s
CROSS APPLY
(
    SELECT * FROM @set1 s1
    JOIN dbo.cf_DelimitedSplit8K(s.Description, ' ') split ON split.Item = s1.Word
) s1
CROSS APPLY
(
    SELECT * FROM @set2 s2
    JOIN dbo.cf_DelimitedSplit8K(s.Description, ' ') split ON split.Item = s2.Word
) s2
WHERE ABS(s1.ItemNumber - s2.ItemNumber) <= 4;
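
For trying the idea without installing the split function, here is a self-contained sketch under stated assumptions. It swaps DelimitedSplit8K for a simple tally built from sys.all_objects, uses two made-up sentences, and assumes words are separated by single spaces. The names @sentences, Numbers, Words, and WordIndex are hypothetical.

```sql
-- Hypothetical sample data, for illustration only
DECLARE @sentences TABLE (ID int, Description varchar(100));
INSERT INTO @sentences VALUES
    (1, 'a dirty little explosive device'),
    (2, 'nuclear power is clean while this old rusty broken device is not');

DECLARE @set1 TABLE (Word varchar(50));
DECLARE @set2 TABLE (Word varchar(50));
INSERT INTO @set1 VALUES ('nuclear'), ('fission'), ('dirty');
INSERT INTO @set2 VALUES ('device'), ('explosive');

WITH Numbers(N) AS (
    -- Small tally: 1..100 is enough for varchar(100)
    SELECT TOP (100) ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
    FROM sys.all_objects
),
Words AS (
    -- A word starts at N when N = 1 or the preceding character is a space
    SELECT s.ID,
           WordIndex = ROW_NUMBER() OVER (PARTITION BY s.ID ORDER BY n.N),
           Word = SUBSTRING(s.Description, n.N,
                            CHARINDEX(' ', s.Description + ' ', n.N) - n.N)
    FROM @sentences s
    JOIN Numbers n ON n.N <= LEN(s.Description)
                  AND (n.N = 1 OR SUBSTRING(s.Description, n.N - 1, 1) = ' ')
)
-- Pair every Set 1 occurrence with every Set 2 occurrence in the same sentence
SELECT w1.ID, w1.Word AS Set1Word, w2.Word AS Set2Word,
       ABS(w1.WordIndex - w2.WordIndex) AS Distance
FROM Words w1
JOIN @set1 s1 ON s1.Word = w1.Word
JOIN Words w2 ON w2.ID = w1.ID
JOIN @set2 s2 ON s2.Word = w2.Word
WHERE ABS(w1.WordIndex - w2.WordIndex) <= 4;
```

With this data, sentence 1 should produce two matches ("dirty" with "explosive" and with "device"), while sentence 2 should produce none because "nuclear" and "device" are nine positions apart. For real workloads the DelimitedSplit8K version is the one that was measured in this thread; this tally is only a stand-in.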

See my old answer for the code of the dbo.cf_DelimitedSplit8K function.