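I have a very large table with more than 10 million records. I want to find duplicates based on some fields that match and some fields that do not match.
The query I am currently using is: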
SELECT DISTINCT MainTable.[lineitemid]
FROM [dbo].[lineitem] MainTable
INNER JOIN [dbo].[lineitem] AS ChildTable
ON ChildTable.invoicedate = MainTable.invoicedate
AND LEFT(ChildTable.vendorname, 4) = LEFT(MainTable.vendorname, 4)
AND ChildTable.invoiceid <> MainTable.invoiceid AND -- Invoice ID column not matching
ChildTable.documentcurrencyamount = MainTable.documentcurrencyamount
WHERE ChildTable.lineitemid <> MainTable.lineitemid AND -- LineItemId is PK
MainTable.projectid = 1125 AND ChildTable.projectid = 1125 -- Duplicates should be identified with specific ProjectId
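This query works fine when the ProjectId has fewer than 100,000 records. When the ProjectId has more than 1 million records, tempdb grows to as much as 100 GB while the query runs and causes disk space problems. The query never finishes.
Please help me optimize the query.
Added the following after getting an answer to the query above:
Thank you very much, @Gordon-Linoff. The query you suggested runs much faster. VendorName comes from a different table. Can I include an inner join as shown below?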
SELECT li1.[LineItemId]
FROM [dbo].[LineItem] li1
INNER JOIN VendorMaster vm1 ON li1.VendorNumber = vm1.VendorNumber
AND vm1.CompanyCode = li1.CompanyCode
WHERE EXISTS (SELECT 1
FROM [dbo].[LineItem] as li2
INNER JOIN VendorMaster vm2 on li2.VendorNumber = vm2.VendorNumber
AND vm2.CompanyCode = li2.CompanyCode
WHERE li2.InvoiceDate = li1.InvoiceDate and
LEFT(vm2.VendorName, 4) = LEFT(vm1.VendorName, 4) and -- VendorName comes from VendorMaster
li2.InvoiceId <> li1.InvoiceId and -- Invoice ID column not matching
li2.DocumentCurrencyAmount = li1.DocumentCurrencyAmount and
li2.LineItemId <> li1.LineItemId and
li2.ProjectId = li1.ProjectId and
li2.VendorNumber = li1.VendorNumber)
AND li1.ProjectId = 1125
Is this an efficient approach?
Answer 0: (score: 2)
A cheaper way to run this query is to use exists and dispense with distinct:
SELECT li.[LineItemId]
FROM [dbo].[LineItem] li
WHERE EXISTS (SELECT 1
FROM [dbo].[LineItem] as li2
WHERE li2.InvoiceDate = li.InvoiceDate and
LEFT(li2.VendorName, 4) = LEFT(li.VendorName, 4) and
li2.InvoiceId <> li.InvoiceId and -- Invoice ID column not matching
li2.DocumentCurrencyAmount = li.DocumentCurrencyAmount and
li2.LineItemId <> li.LineItemId and
li2.ProjectId = li.ProjectId
) AND
li.ProjectId = 1125;
For performance, an index on LineItem(ProjectId, InvoiceDate, DocumentCurrencyAmount, VendorName, InvoiceId, LineItemId) would help. You can speed the query up further by declaring LEFT(LineItem.VendorName, 4) as a computed column and adding it to the index before VendorName.
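The index and computed column suggested above could look roughly like the following. This is only a sketch under some assumptions: VendorName is stored on dbo.LineItem (as in the original query) with a data type that is allowed as an index key, and the names VendorPrefix and IX_LineItem_DuplicateCheck are made up for illustration. The GO lines are batch separators understood by SSMS and sqlcmd.

```sql
-- Hypothetical computed column holding the first four characters of VendorName.
-- LEFT is deterministic, so the column can be persisted and indexed.
ALTER TABLE dbo.LineItem
    ADD VendorPrefix AS LEFT(VendorName, 4) PERSISTED;
GO

-- Hypothetical index name. Equality columns come first, with VendorPrefix
-- placed before VendorName as the answer suggests.
CREATE NONCLUSTERED INDEX IX_LineItem_DuplicateCheck
    ON dbo.LineItem (ProjectId, InvoiceDate, DocumentCurrencyAmount,
                     VendorPrefix, VendorName, InvoiceId, LineItemId);
GO

-- The EXISTS query, comparing the computed column instead of calling LEFT().
SELECT li.LineItemId
FROM dbo.LineItem AS li
WHERE li.ProjectId = 1125
  AND EXISTS (SELECT 1
              FROM dbo.LineItem AS li2
              WHERE li2.ProjectId = li.ProjectId
                AND li2.InvoiceDate = li.InvoiceDate
                AND li2.DocumentCurrencyAmount = li.DocumentCurrencyAmount
                AND li2.VendorPrefix = li.VendorPrefix
                AND li2.InvoiceId <> li.InvoiceId    -- Invoice ID column not matching
                AND li2.LineItemId <> li.LineItemId);
```

Comparing the computed column directly should let the optimizer seek on the four equality columns rather than evaluate LEFT() for every candidate row, though it is worth checking the actual execution plan on your data. If VendorName really lives in VendorMaster, the computed column would have to be defined on that table instead, and this sketch would not apply as written.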