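I have a query in which I iterate over one table and, for each entry, iterate over another table and compute some results. I use a cursor to iterate over the table. The query takes a very long time to complete, always more than 3 minutes. If I do something similar in C#, where the tables are arrays or dictionaries, it takes less than a second. What am I doing wrong, and how can I make it more efficient?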
DELETE FROM [QueryScores]
GO
INSERT INTO [QueryScores] (Id)
SELECT Id FROM [Documents]
DECLARE @Id NVARCHAR(50)
DECLARE myCursor CURSOR LOCAL FAST_FORWARD FOR
    SELECT [Id] FROM [QueryScores]
OPEN myCursor
FETCH NEXT FROM myCursor INTO @Id
WHILE @@FETCH_STATUS = 0
BEGIN
    DECLARE @Score FLOAT = 0.0
    DECLARE @CounterMax INT = (SELECT COUNT(*) FROM [Query])
    DECLARE @Counter INT = 0
    PRINT 'Document: ' + CAST(@Id AS VARCHAR)
    PRINT 'Score: ' + CAST(@Score AS VARCHAR)
    WHILE @Counter < @CounterMax
    BEGIN
        DECLARE @StemId INT = (SELECT [Query].[StemId] FROM [Query] WHERE [Query].[Id] = @Counter)
        DECLARE @Weight FLOAT = (SELECT [tfidf].[Weight] FROM [TfidfWeights] AS [tfidf] WHERE [tfidf].[StemId] = @StemId AND [tfidf].[DocumentId] = @Id)
        PRINT 'WEIGHT: ' + CAST(@Weight AS VARCHAR)
        IF(@Weight > 0.0)
        BEGIN
            DECLARE @QWeight FLOAT = (SELECT [Query].[Weight] FROM [Query] WHERE [Query].[StemId] = @StemId)
            SET @Score = @Score + (@QWeight * @Weight)
            PRINT 'Score: ' + CAST(@Score AS VARCHAR)
        END
        SET @Counter = @Counter + 1
    END
    UPDATE [QueryScores] SET Score = @Score WHERE Id = @Id
    FETCH NEXT FROM myCursor INTO @Id
END
CLOSE myCursor
DEALLOCATE myCursor
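The logic is this: I have a list of documents and a question/query. I iterate over each document and, in a nested loop, over the query terms/words to check whether the document contains them. If it does, I multiply the precomputed weights and add the product to the document's score.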
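Answer 0 (score: 7)
The problem is that you're trying to use a set-based language to iterate over things as if it were a procedural language. SQL requires a different mindset. You should almost never be thinking in terms of loops in SQL.
From what I can gather from your code, this should do what you're trying to do in all of those loops, but in a single set-based statement, which is what SQL is good at.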
INSERT INTO QueryScores (id, score)
SELECT
    D.id,
    SUM(CASE WHEN W.[Weight] > 0 THEN W.[Weight] * Q.[Weight] ELSE NULL END)
FROM
    Documents D
    CROSS JOIN Query Q
    LEFT OUTER JOIN TfidfWeights W ON W.StemId = Q.StemId AND W.DocumentId = D.id
GROUP BY
    D.id
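Of course, without a description of your requirements or sample data with expected output, I don't know whether this is actually what you're looking for, but it's my best guess given your code.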
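One way to check a set-based query like this against expected output is a throwaway script on made-up data. The sketch below is only an illustration: the table variables and the three sample documents are hypothetical, the column types are guessed from the question's code, and the COALESCE is an addition so that a document with no matching terms gets 0 (as in the cursor version, where @Score starts at 0.0) instead of NULL.

```sql
-- Hypothetical sample data; names mirror the question's tables, types are assumed.
DECLARE @Documents TABLE (Id NVARCHAR(50) PRIMARY KEY);
DECLARE @Query TABLE (Id INT PRIMARY KEY, StemId INT, [Weight] FLOAT);
DECLARE @TfidfWeights TABLE (StemId INT, DocumentId NVARCHAR(50), [Weight] FLOAT);

INSERT INTO @Documents (Id) VALUES (N'doc1'), (N'doc2'), (N'doc3');
INSERT INTO @Query (Id, StemId, [Weight]) VALUES (0, 10, 2.0), (1, 20, 1.0);
INSERT INTO @TfidfWeights (StemId, DocumentId, [Weight])
VALUES (10, N'doc1', 0.5), (20, N'doc1', 0.25), (10, N'doc2', 1.5);

-- Every document is paired with every query term; the LEFT JOIN keeps
-- pairs that have no tf-idf weight, and those contribute nothing to the SUM.
SELECT
    D.Id,
    COALESCE(SUM(CASE WHEN W.[Weight] > 0 THEN W.[Weight] * Q.[Weight] END), 0.0) AS Score
FROM
    @Documents D
    CROSS JOIN @Query Q
    LEFT OUTER JOIN @TfidfWeights W ON W.StemId = Q.StemId AND W.DocumentId = D.Id
GROUP BY
    D.Id
ORDER BY
    D.Id;
-- doc1: 0.5 * 2.0 + 0.25 * 1.0 = 1.25
-- doc2: 1.5 * 2.0 = 3
-- doc3: no matching terms, so 0
```

Note that if the Query table is empty, the CROSS JOIN produces no rows at all, so no documents are returned; the cursor version would still have scored every document as 0.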
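Answer 1 (score: 1)
The query I came up with is very similar to Tom H's.
There are a lot of unknowns about the problem the OP's code is trying to solve. Is there a particular reason the code only checks rows in the Query table whose Id values are between 0 and one less than the number of rows in the table? Or is the intent really just to get all rows of Query?
Here is my version: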
INSERT INTO QueryScores (Id, Score)
SELECT d.Id
     , SUM(CASE WHEN w.Weight > 0 THEN w.Weight * q.Weight ELSE NULL END) AS Score
  FROM [Documents] d
 CROSS
  JOIN [Query] q
  LEFT
  JOIN [TfidfWeights] w
    ON w.StemId = q.StemId
   AND w.DocumentId = d.Id
 GROUP BY d.Id
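Processing RBAR (row by agonizing row) is almost always slower than processing as a set. SQL is designed to operate on sets of data. There is overhead for each individual SQL statement, and for each context switch between the procedural code and the SQL engine. Sure, there may be room to improve the performance of individual parts of the procedure, but the big gain comes from operating on the whole set in a single SQL statement.
If for some reason you do need to process the documents one at a time with a cursor, then get rid of the inner loop, the individual selects, and all those PRINTs, and just use a single query to get the score for each document.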
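-- Declarations repeated from the question so the snippet runs on its own
DECLARE @Id NVARCHAR(50)
DECLARE myCursor CURSOR LOCAL FAST_FORWARD FOR
    SELECT [Id] FROM [QueryScores]
OPEN myCursor
FETCH NEXT FROM myCursor INTO @Id
WHILE @@FETCH_STATUS = 0
BEGIN
    -- One query per document replaces the inner WHILE loop
    UPDATE [QueryScores]
       SET Score
         = ( SELECT SUM( CASE WHEN w.Weight > 0
                              THEN w.Weight * q.Weight
                              ELSE NULL END
                       )
               FROM [Query] q
               JOIN [TfidfWeights] w
                 ON w.StemId = q.StemId
              WHERE w.DocumentId = @Id
           )
     WHERE Id = @Id
    FETCH NEXT FROM myCursor INTO @Id
END
CLOSE myCursor
DEALLOCATE myCursor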
Answer 2 (score: 1)
You may not even need the Documents table:
INSERT INTO QueryScores (id, score)
SELECT W.DocumentId as [id]
     , SUM(W.[Weight] * Q.[Weight]) as [score] -- product of the two weights, as in the question's loop
  FROM Query Q
  JOIN TfidfWeights W
    ON W.StemId = Q.StemId
   AND W.[Weight] > 0
 GROUP BY W.DocumentId
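One difference worth noting: because this uses an inner join, a document with no positive-weight match for any query term produces no row at all, whereas the question's cursor version leaves such a document with a score of 0. If QueryScores has already been filled with one row per document, as the question's setup does, a possible variant is to update those rows from the aggregated result. This is a sketch under that assumption, not a tested drop-in replacement:

```sql
-- Assumes QueryScores already holds one row per document (Id populated from Documents).
UPDATE QS
   SET Score = COALESCE(S.Score, 0.0)   -- unmatched documents fall back to 0
  FROM QueryScores AS QS
  LEFT JOIN (
        -- Per-document sum of query weight times tf-idf weight
        SELECT W.DocumentId
             , SUM(W.[Weight] * Q.[Weight]) AS Score
          FROM Query AS Q
          JOIN TfidfWeights AS W
            ON W.StemId = Q.StemId
           AND W.[Weight] > 0
         GROUP BY W.DocumentId
       ) AS S
    ON S.DocumentId = QS.Id;
```

The derived table touches only rows that actually have a weight, and the outer LEFT JOIN makes sure every existing QueryScores row still receives a value.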