Improving a brute-force similarity query in T-SQL

Asked: 2016-02-03 16:23:26

Tags: tsql optimization nlp similarity

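I need to assign parts to groups. Each group has various text strings that describe its area of responsibility. The parts to be checked are likewise identified by their own text strings. Currently I run a brute-force similarity check of every part against every group and return the best similarity score. This works, of course, but it is slow.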

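I would appreciate any different way of looking at this. The text strings in both tables are just words, and I don't see a way to organize them before the check that would minimize the number of times I have to run the similarity code.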

Here is an example:

IF OBJECT_ID('tempdb..#Parts') IS NOT NULL
    DROP TABLE #Parts

IF OBJECT_ID('tempdb..#DetailsTable') IS NOT NULL
    DROP TABLE #DetailsTable

CREATE TABLE #Parts(
    [Id] [int] IDENTITY(1,1) NOT NULL,
    [PG] [varchar](50) NULL,
    [ML] [varchar](50) NULL,
    [Description] [varchar](80) NULL
)
GO

CREATE TABLE #DetailsTable(
    [Id] [int] IDENTITY(1,1) NOT NULL,
    [Description] [varchar](80) NULL
)
GO

INSERT INTO #Parts(PG, ML, Description)
VALUES
('PA','001','Suspension-Leveling Sensor'),
('PA','001','Control Arm Bumper'),
('PB','002','Active Suspension'),
('PB','002','Air Suspension Ride Height Sensor'),
('PB','002','Suspension Control Arm and Ball Joint Assembly'),
('PC','003','Air Suspension Line Repair Kit'),
('PC','003','Electronic Air Suspension Compressor');
INSERT INTO #DetailsTable(Description)
VALUES
('ABSORBER-SUSPENSION'),
('STRUT-FRONTSUSPENSION'),
('STRUT-SUSPENSION'),
('ABSBR KIT-SUSPENSION'),
('ABSORBER-SUSPENSION'),
('AIR SUSPENSION STRUT'),
('BUSHING-SUSPENSION'),
('C/MEMBER-FRONT SUSPENSION'),
('KNUCKLE-SUSPENSION'),
('BALL-JOINT'),
('CONTROL-BUMPER')
;

DECLARE @iRow INT, @iRowRR INT, @count INT, @countRR INT;
-- Description variables are sized to match the varchar(80) columns so long values are not silently truncated
DECLARE @tempStringML varchar(50), @tempStringPG varchar(50), @tempStringDescription varchar(80), @hiTempStringML varchar(50),
    @hiTempStringPG varchar(50), @hiTempDescription varchar(80),
    @Details varchar(500), @hiDetails varchar(500);
DECLARE @Jaccard FLOAT, @hiJaccard FLOAT;

SET @iRow = 1;
SET @iRowRR = 1;
SET @countRR = (SELECT count(#DetailsTable.Id) FROM #DetailsTable);
SET @count = (SELECT count(#Parts.Id) FROM #Parts);


WHILE @iRowRR <= @countRR
BEGIN
    SET @Details = (SELECT #DetailsTable.Description FROM #DetailsTable WHERE #DetailsTable.Id = @iRowRR);
    SET @iRow = 1;
    SET @hiJaccard = 0
    WHILE @iRow <= @count
        BEGIN
            /*establish loop structure*/
            SET @tempStringML = (SELECT #Parts.ML FROM #Parts WHERE #Parts.Id = @iRow);
            SET @tempStringDescription = (SELECT #Parts.Description FROM #Parts WHERE #Parts.Id = @iRow);
            SET @tempStringPG = (SELECT #Parts.PG FROM #Parts WHERE #Parts.Id = @iRow);
            SET @Jaccard = mdsdb.mdq.similarity(@Details,@tempStringDescription, 1, 0.85,0)
            IF(@Jaccard > @hiJaccard)
            BEGIN
                SET @hiJaccard = @Jaccard
                SET @hiTempDescription = @tempStringDescription
                SET @hiTempStringML = @tempStringML
                SET @hiTempStringPG = @tempStringPG
                SET @hiDetails = @Details
            END
            SET @iRow = @iRow + 1
        END;
    PRINT @hiTempStringPG + ' ' + @hiTempStringML + ' ' + @hiTempDescription + ' ' + @hiDetails + ' ' + CONVERT(varchar, @hiJaccard);
    SET @iRowRR = @iRowRR + 1;
END
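
For comparison, the same best-match search can be written as a single set-based statement. This is only a sketch. It reuses the `#Parts` and `#DetailsTable` tables and the `mdsdb.mdq.similarity` call exactly as they appear in the example above; the CTE and column alias names are made up for illustration. It still scores every pair, so it does not reduce the number of similarity calls. It only removes the row-by-row lookups, and whether that helps in practice has not been measured here.

```sql
-- Sketch: score every (detail, part) pair once, then keep the best part per detail.
-- Assumes #Parts, #DetailsTable and mdsdb.mdq.similarity from the example above.
WITH Scored AS (
    SELECT d.Id          AS DetailId,
           d.Description AS Details,
           p.Id          AS PartId,
           p.PG,
           p.ML,
           p.Description AS PartDescription,
           mdsdb.mdq.similarity(d.Description, p.Description, 1, 0.85, 0) AS Jaccard
    FROM #DetailsTable AS d
    CROSS JOIN #Parts AS p
),
Ranked AS (
    SELECT *,
           -- Ties go to the lowest part Id, matching the strict ">" test in the loop
           ROW_NUMBER() OVER (PARTITION BY DetailId
                              ORDER BY Jaccard DESC, PartId) AS rn
    FROM Scored
    WHERE Jaccard > 0   -- the loop also ignores pairs that score 0
)
SELECT PG, ML, PartDescription, Details, Jaccard
FROM Ranked
WHERE rn = 1
ORDER BY DetailId;
```

One difference from the loop: a detail row with no part scoring above 0 is simply left out of the result, whereas the loop prints whatever was left in the variables from the previous iteration.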

Update 2016-02-20:

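My processing window is under 60 hours (15,000 strings, going up to 70,000 strings), and the approach above did not fit in it. I went in a different direction: I added start/stop variables that are set from outside the script, turned all the working tables into temp tables, and wrote the results to a common output table. I also moved some extra work outside the similarity loop to lighten the load, and I run multiple instances from a sqlcmd script.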

At 7 instances the speedup was 6.4x, and at that point I became CPU-bound before running out of memory.

@echo off
echo Do you want to delete RegResultsTeam?
set /p INPUT=""
cls
echo %INPUT%
If /I "%INPUT%"=="y" goto yes
rem Any answer other than "y" skips the drop, so a typo cannot fall through and delete the table
goto no
:yes
sqlcmd -d Similarity -Q "if exists (select [name] from sys.tables where  [name] = 'RegResultsTeam') DROP table RegResultsTeam"
:no
START sqlcmd  -d Similarity -i .\03-distrib-par.sql -v cycleStart=00001 cycleEnd=10000
START sqlcmd  -d Similarity -i .\03-distrib-par.sql -v cycleStart=10001 cycleEnd=20000
START sqlcmd  -d Similarity -i .\03-distrib-par.sql -v cycleStart=20001 cycleEnd=30000
START sqlcmd  -d Similarity -i .\03-distrib-par.sql -v cycleStart=30001 cycleEnd=40000
START sqlcmd  -d Similarity -i .\03-distrib-par.sql -v cycleStart=40001 cycleEnd=50000
START sqlcmd  -d Similarity -i .\03-distrib-par.sql -v cycleStart=50001 cycleEnd=60000
START sqlcmd  -d Similarity -i .\03-distrib-par.sql -v cycleStart=60001 cycleEnd=69632
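
The contents of 03-distrib-par.sql are not shown. A minimal sketch of how such a worker script could use the two sqlcmd scripting variables follows. Everything other than `$(cycleStart)`, `$(cycleEnd)`, `RegResultsTeam` and the `mdsdb.mdq.similarity` call is an assumption: `dbo.Details` and `dbo.Parts` stand in for permanent source tables (temp tables from another session would not be visible to a worker), and the column list of `RegResultsTeam` is hypothetical. The sketch also assumes the output table already exists when the workers start.

```sql
-- Hypothetical worker script in the style of 03-distrib-par.sql (the real one is not shown).
-- sqlcmd substitutes $(cycleStart) and $(cycleEnd) from the -v arguments before the batch is sent.
-- dbo.Details, dbo.Parts and the RegResultsTeam columns are placeholder names (assumptions).
INSERT INTO dbo.RegResultsTeam (DetailId, PartId, Jaccard)
SELECT r.DetailId, r.PartId, r.Jaccard
FROM (
    SELECT d.Id AS DetailId,
           p.Id AS PartId,
           s.Jaccard,
           ROW_NUMBER() OVER (PARTITION BY d.Id
                              ORDER BY s.Jaccard DESC, p.Id) AS rn
    FROM dbo.Details AS d
    CROSS JOIN dbo.Parts AS p
    CROSS APPLY (SELECT mdsdb.mdq.similarity(d.Description, p.Description, 1, 0.85, 0) AS Jaccard) AS s
    -- Each instance only handles its own slice of the detail rows
    WHERE d.Id BETWEEN $(cycleStart) AND $(cycleEnd)
) AS r
WHERE r.rn = 1;
```

Because the ranges passed by the launcher do not overlap, each instance writes a disjoint set of detail rows to the common output table.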

I'd appreciate any other suggestions, and thank you for the responses.

Pat

1 answer:

Answer 0 (score: 0)

Try the following alternative, which is faster, although you will still need to compare each matched text: SQL - Similarity between two strings of varying length
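
Separately from the linked approach, one way to cut the number of comparisons is to score only pairs that share at least one word. This is a minimal sketch, not a tested solution. It assumes SQL Server 2016 or later (compatibility level 130) for `STRING_SPLIT`, treats hyphens as word separators, and uses the `#Parts` and `#DetailsTable` tables from the question; the CTE names are illustrative. The filter is lossy: a pair such as 'STRUT-FRONTSUSPENSION' and 'Active Suspension' shares no whole word and would never be scored, and a word that appears in almost every row (here, SUSPENSION) filters out very little.

```sql
-- Sketch: only run the similarity function on pairs that share at least one word.
-- Assumes SQL Server 2016+ (STRING_SPLIT) and the temp tables from the question.
WITH DetailTokens AS (
    SELECT DISTINCT d.Id AS DetailId, UPPER(s.value) AS Token
    FROM #DetailsTable AS d
    CROSS APPLY STRING_SPLIT(REPLACE(d.Description, '-', ' '), ' ') AS s
    WHERE s.value <> ''
),
PartTokens AS (
    SELECT DISTINCT p.Id AS PartId, UPPER(s.value) AS Token
    FROM #Parts AS p
    CROSS APPLY STRING_SPLIT(REPLACE(p.Description, '-', ' '), ' ') AS s
    WHERE s.value <> ''
),
Candidates AS (
    -- One row per (detail, part) pair with at least one word in common
    SELECT DISTINCT dt.DetailId, pt.PartId
    FROM DetailTokens AS dt
    JOIN PartTokens AS pt ON pt.Token = dt.Token
)
SELECT d.Description AS Details,
       p.PG,
       p.ML,
       p.Description AS PartDescription,
       mdsdb.mdq.similarity(d.Description, p.Description, 1, 0.85, 0) AS Jaccard
FROM Candidates AS c
JOIN #DetailsTable AS d ON d.Id = c.DetailId
JOIN #Parts AS p ON p.Id = c.PartId;
```

Excluding very common words from the token tables before the join would make the filter more selective, at the cost of missing more pairs.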