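I need to assign parts to groups. Each group has a variety of text strings that describe its area of responsibility. The parts to be checked are likewise identified by their own text strings. Currently I run a brute-force similarity check of every part against every group and return the best similarity score. This works, of course, but it is slow.
I would appreciate any different way of looking at this. The text strings in both tables are just words, and I don't see a way to organize them before the check so as to minimize the number of times I have to run them through the similarity code.
Here is an example: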
-- Drop the temp tables left over from a previous run
IF OBJECT_ID('tempdb..#Parts') IS NOT NULL
    DROP TABLE #Parts;
IF OBJECT_ID('tempdb..#DetailsTable') IS NOT NULL
    DROP TABLE #DetailsTable;
CREATE TABLE #Parts(
    [Id] [int] IDENTITY(1,1) NOT NULL,
    [PG] [varchar](50) NULL,
    [ML] [varchar](50) NULL,
    [Description] [varchar](80) NULL
);
GO
CREATE TABLE #DetailsTable(
    [Id] [int] IDENTITY(1,1) NOT NULL,
    [Description] [varchar](80) NULL
);
GO
INSERT INTO #Parts(PG, ML, Description)
VALUES
('PA','001','Suspension-Leveling Sensor'),
('PA','001','Control Arm Bumper'),
('PB','002','Active Suspension'),
('PB','002','Air Suspension Ride Height Sensor'),
('PB','002','Suspension Control Arm and Ball Joint Assembly'),
('PC','003','Air Suspension Line Repair Kit'),
('PC','003','Electronic Air Suspension Compressor');
INSERT INTO #DetailsTable(Description)
VALUES
('ABSORBER-SUSPENSION'),
('STRUT-FRONTSUSPENSION'),
('STRUT-SUSPENSION'),
('ABSBR KIT-SUSPENSION'),
('ABSORBER-SUSPENSION'),
('AIR SUSPENSION STRUT'),
('BUSHING-SUSPENSION'),
('C/MEMBER-FRONT SUSPENSION'),
('KNUCKLE-SUSPENSION'),
('BALL-JOINT'),
('CONTROL-BUMPER')
;
DECLARE @iRow INT, @iRowRR INT, @count INT, @countRR INT;
-- The Description columns are varchar(80), so the working variables are sized to match
DECLARE @tempStringML varchar(50), @tempStringPG varchar(50), @tempStringDescription varchar(80),
        @hiTempStringML varchar(50), @hiTempStringPG varchar(50), @hiTempDescription varchar(80),
        @Details varchar(500), @hiDetails varchar(500);
DECLARE @Jaccard FLOAT, @hiJaccard FLOAT;
SET @iRow = 1;
SET @iRowRR = 1;
SET @countRR = (SELECT COUNT(#DetailsTable.Id) FROM #DetailsTable);
SET @count = (SELECT COUNT(#Parts.Id) FROM #Parts);
/* Outer loop: one pass per detail string (relies on Ids running 1..N without gaps) */
WHILE @iRowRR <= @countRR
BEGIN
    SET @Details = (SELECT #DetailsTable.Description FROM #DetailsTable WHERE #DetailsTable.Id = @iRowRR);
    SET @iRow = 1;
    /* Reset the best match so a detail with no match does not report the previous detail's result */
    SET @hiJaccard = 0;
    SET @hiTempDescription = NULL;
    SET @hiTempStringML = NULL;
    SET @hiTempStringPG = NULL;
    SET @hiDetails = @Details;
    /* Inner loop: compare the current detail string against every part */
    WHILE @iRow <= @count
    BEGIN
        SET @tempStringML = (SELECT #Parts.ML FROM #Parts WHERE #Parts.Id = @iRow);
        SET @tempStringDescription = (SELECT #Parts.Description FROM #Parts WHERE #Parts.Id = @iRow);
        SET @tempStringPG = (SELECT #Parts.PG FROM #Parts WHERE #Parts.Id = @iRow);
        SET @Jaccard = mdsdb.mdq.similarity(@Details, @tempStringDescription, 1, 0.85, 0);
        /* Keep the part with the highest score seen so far */
        IF (@Jaccard > @hiJaccard)
        BEGIN
            SET @hiJaccard = @Jaccard;
            SET @hiTempDescription = @tempStringDescription;
            SET @hiTempStringML = @tempStringML;
            SET @hiTempStringPG = @tempStringPG;
        END
        SET @iRow = @iRow + 1;
    END;
    /* ISNULL keeps the line printable when no part scored above 0 */
    PRINT ISNULL(@hiTempStringPG + ' ' + @hiTempStringML + ' ' + @hiTempDescription, '(no match)')
        + ' ' + @hiDetails + ' ' + CONVERT(varchar(30), @hiJaccard);
    SET @iRowRR = @iRowRR + 1;
END
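For comparison, the same brute-force check can be written as one set-based query instead of two nested WHILE loops. This is only a sketch against the temp tables from the example, and it assumes that mdsdb.mdq.Similarity can be called once per row inside a query in the same way it is called with variables in the loop. It still scores every detail string against every part, so it does not reduce the number of similarity calls; it only removes the row-by-row variable handling.

```sql
-- Sketch: best-matching part per detail string, set-based.
-- Assumes #Parts and #DetailsTable exist as created in the example.
SELECT  d.Id          AS DetailsId,
        d.Description AS Details,
        best.PG,
        best.ML,
        best.Description AS PartDescription,
        best.Score
FROM    #DetailsTable AS d
OUTER APPLY (
    -- Score every part against the current detail string and keep the top one.
    -- Ties go to the lowest part Id, like the "first strictly greater score wins" loop.
    SELECT TOP (1)
           p.PG,
           p.ML,
           p.Description,
           mdsdb.mdq.Similarity(d.Description, p.Description, 1, 0.85, 0) AS Score
    FROM   #Parts AS p
    ORDER BY Score DESC, p.Id
) AS best
ORDER BY d.Id;
```

Unlike the loop, this version returns the top-scoring part even when its score is 0, so rows with a Score of 0 should be read as "no match".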
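Update 20160220:
My processing window is under 60 hours (15,000 strings growing to 70,000 strings), and this was not working. I went in a different direction: I added start/stop values as external variables, made all working tables temp tables, and wrote the results to a common output table. I also did some additional work outside the similarity loop to lighten the load, and I run multiple instances from a sqlcmd script.
With 7 instances I reached a 6.4x speedup; at that point I ran out of memory and became CPU bound.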
@echo off
echo Do you want to delete RegResultsTeam?
set /p INPUT=""
cls
echo %INPUT%
rem Drop the results table only on an explicit "y"; any other answer keeps it
If /I "%INPUT%"=="y" goto yes
goto no
:yes
sqlcmd -d Similarity -Q "if exists (select [name] from sys.tables where [name] = 'RegResultsTeam') DROP table RegResultsTeam"
:no
START sqlcmd -d Similarity -i .\03-distrib-par.sql -v cycleStart=00001 cycleEnd=10000
START sqlcmd -d Similarity -i .\03-distrib-par.sql -v cycleStart=10001 cycleEnd=20000
START sqlcmd -d Similarity -i .\03-distrib-par.sql -v cycleStart=20001 cycleEnd=30000
START sqlcmd -d Similarity -i .\03-distrib-par.sql -v cycleStart=30001 cycleEnd=40000
START sqlcmd -d Similarity -i .\03-distrib-par.sql -v cycleStart=40001 cycleEnd=50000
START sqlcmd -d Similarity -i .\03-distrib-par.sql -v cycleStart=50001 cycleEnd=60000
START sqlcmd -d Similarity -i .\03-distrib-par.sql -v cycleStart=60001 cycleEnd=69632
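The contents of 03-distrib-par.sql are not shown here, so the following is only a guess at its general shape, to illustrate how the cycleStart and cycleEnd values passed with -v reach the script as sqlcmd scripting variables. The table names dbo.Details and dbo.Parts and the column list of RegResultsTeam are hypothetical, and the sketch assumes that RegResultsTeam has already been created before the instances start. It also uses a set-based query where the real script may well loop.

```sql
-- Hypothetical sketch of a worker script such as 03-distrib-par.sql.
-- sqlcmd replaces $(cycleStart) and $(cycleEnd) with the values given via -v
-- before the batch is sent to the server (00001 is read as the integer 1).
SET NOCOUNT ON;

DECLARE @cycleStart INT, @cycleEnd INT;
SET @cycleStart = $(cycleStart);
SET @cycleEnd   = $(cycleEnd);

-- Each instance handles only its own Id range and appends to the shared
-- output table. dbo.Details, dbo.Parts and the column names are assumptions.
INSERT INTO dbo.RegResultsTeam (DetailsId, Details, PG, ML, PartDescription, Score)
SELECT  d.Id,
        d.Description,
        best.PG,
        best.ML,
        best.Description,
        best.Score
FROM    dbo.Details AS d
OUTER APPLY (
    SELECT TOP (1)
           p.PG,
           p.ML,
           p.Description,
           mdsdb.mdq.Similarity(d.Description, p.Description, 1, 0.85, 0) AS Score
    FROM   dbo.Parts AS p
    ORDER BY Score DESC, p.Id
) AS best
WHERE   d.Id BETWEEN @cycleStart AND @cycleEnd;
```

Because the ranges passed on the command lines do not overlap, the instances never write results for the same detail row.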
Other suggestions are appreciated, and thank you for your replies.