我正在努力通过在包含AP记录的表上自我加入来查找AP系统中可能重复的付款/记录。一般来说,我有一个适用于相当大的数据集的模式。当我的匹配标准变宽并且每条记录的匹配数量过高时(每个记录有时20-100个匹配),就会出现问题。结果数据集可能很大,并且尝试获取distinct
值的时间太长。
在尝试查找供应商,发票号,日期和发票金额的匹配项时运行正常的查询:
SELECT DISTINCT DENSE_RANK() OVER(ORDER BY ap.GtInvoiceNumber,
ap.InvoiceDate, ap.InvoiceGrossAmount) PotentialDuplicate
,ap.Id EntryId
, ap.TransactionId
, ap.VendorNumber
, ap.InvoiceNumber
, ap.GtInvoiceNumber
, ap.InvoiceDate
, ap.PurchaseOrderNumber
, ap.InvoiceGrossAmount
, ap.InvoiceNetAmount
, ap.CheckNumber
, ap.CheckDate
FROM ApEntries ap
INNER JOIN ApEntries sap ON
ap.GtInvoiceNumber = sap.GtInvoiceNumber
AND ap.InvoiceDate = sap.InvoiceDate
AND sap.InvoiceGrossAmount = ap.InvoiceGrossAmount
AND ap.VendorNumber = sap.VendorNumber
AND ap.Id <> sap.Id
WHERE ap.ClientId = @ClientId
此查询在几秒钟内运行,因为符合条件的行数非常少,因此执行distinct
很简单。
在尝试查找供应商,日期和发票金额匹配时出现问题的查询(不包括匹配条件中的发票编号):
SELECT DISTINCT DENSE_RANK() OVER(ORDER BY ap.VendorNumber, ap.InvoiceDate, ap.InvoiceGrossAmount) PotentialDuplicate
,ap.Id EntryId
, ap.TransactionId
, ap.VendorNumber
, ap.InvoiceNumber
, ap.GtInvoiceNumber
, ap.InvoiceDate
, ap.PurchaseOrderNumber
, ap.InvoiceGrossAmount
, ap.InvoiceNetAmount
, ap.CheckNumber
, ap.CheckDate
FROM ApEntries ap
INNER JOIN ApEntries sap ON
ap.VendorNumber = sap.VendorNumber
AND ap.InvoiceDate = sap.InvoiceDate
AND ap.InvoiceGrossAmount = sap.InvoiceGrossAmount
AND (ap.Id <> sap.Id AND ap.TransactionId <> sap.TransactionId)
WHERE ap.ClientId = @ClientId
此查询大约需要3-4分钟才能运行,因为条件太宽。在SQL中有更好的解决方案吗?我认识到每场比赛将双向进行,我会得到每场比赛的记录,但也会记录匹配组内的每场比赛,因此一组5将产生30行,一组20将产生在420行,依此类推。我怎样才能提高效率呢?我想避免使用游标,但我认识到这可能是我唯一的选择。
感谢您花时间阅读本文。
表大约是760k条记录(所有这些记录都在这些查询中查询)。
正如您所看到的,我正在从具有760k记录的表中返回13,824,170行。这是由于某些群体很大(例如252个成员将导致63,756个记录)。
相关指数:
CREATE NONCLUSTERED INDEX [IX_ApEntries_DistinctIndex] ON [dbo].[ApEntries]
(
[ClientId] ASC
)
INCLUDE ( [Id],
[VendorNumber],
[InvoiceNumber],
[GtInvoiceNumber],
[InvoiceDate],
[PurchaseOrderNumber],
[InvoiceGrossAmount],
[InvoiceNetAmount],
[CheckDate],
[CheckNumber],
[TransactionId]) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON)
GO
CREATE NONCLUSTERED INDEX [IX_ApEntries_DupRep7] ON [dbo].[ApEntries]
(
[VendorNumber] ASC,
[InvoiceDate] ASC,
[InvoiceGrossAmount] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON)
GO
CREATE UNIQUE NONCLUSTERED INDEX [IX_ApEntries_Xid] ON [dbo].[ApEntries]
(
[ClientId] ASC,
[TransactionId] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, IGNORE_DUP_KEY = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON)
GO
答案 0 :(得分:1)
为什么不试试ROW_NUMBER()
SELECT *
FROM (
SELECT ap.Id EntryId
, ap.TransactionId
, ap.VendorNumber
, ap.InvoiceNumber
, ap.GtInvoiceNumber
, ap.InvoiceDate
, ap.PurchaseOrderNumber
, ap.InvoiceGrossAmount
, ap.InvoiceNetAmount
, ap.CheckNumber
, ap.CheckDate
, row_number() over
(partition by VendorNumber, InvoiceDate, InvoiceGrossAmount
order by TransactionId) rn
WHERE ap.ClientId = @ClientId
) T
WHERE rn > 1;