背景

Question

背景

我正在努力通过在包含AP记录的表上自我加入来查找AP系统中可能重复的付款/记录。一般来说，我有一个适用于相当大的数据集的模式。当我的匹配标准变宽并且每条记录的匹配数量过高时（每个记录有时20-100个匹配），就会出现问题。结果数据集可能很大，并且尝试获取distinct值的时间太长。

当前解决方案

在尝试查找供应商，发票号，日期和发票金额的匹配项时运行正常的查询：

SELECT DISTINCT DENSE_RANK()  OVER(ORDER BY ap.GtInvoiceNumber,     
ap.InvoiceDate, ap.InvoiceGrossAmount) PotentialDuplicate
,ap.Id EntryId
, ap.TransactionId
, ap.VendorNumber
, ap.InvoiceNumber
, ap.GtInvoiceNumber
, ap.InvoiceDate
, ap.PurchaseOrderNumber
, ap.InvoiceGrossAmount
, ap.InvoiceNetAmount
, ap.CheckNumber
, ap.CheckDate
FROM ApEntries ap 
INNER JOIN ApEntries sap ON 
ap.GtInvoiceNumber = sap.GtInvoiceNumber 
AND ap.InvoiceDate = sap.InvoiceDate 
AND sap.InvoiceGrossAmount = ap.InvoiceGrossAmount 
AND ap.VendorNumber =  sap.VendorNumber
AND ap.Id <> sap.Id 
WHERE ap.ClientId = @ClientId

此查询在几秒钟内运行，因为符合条件的行数非常少，因此执行distinct很简单。

在尝试查找供应商，日期和发票金额匹配时出现问题的查询（不包括匹配条件中的发票编号）：

SELECT DISTINCT DENSE_RANK()  OVER(ORDER BY ap.VendorNumber, ap.InvoiceDate, ap.InvoiceGrossAmount) PotentialDuplicate
,ap.Id EntryId
, ap.TransactionId
, ap.VendorNumber
, ap.InvoiceNumber
, ap.GtInvoiceNumber
, ap.InvoiceDate
, ap.PurchaseOrderNumber
, ap.InvoiceGrossAmount
, ap.InvoiceNetAmount
, ap.CheckNumber
, ap.CheckDate
FROM ApEntries ap 
INNER JOIN ApEntries sap ON 
ap.VendorNumber = sap.VendorNumber 
AND ap.InvoiceDate = sap.InvoiceDate 
AND ap.InvoiceGrossAmount =  sap.InvoiceGrossAmount
AND (ap.Id <> sap.Id AND ap.TransactionId <> sap.TransactionId)
WHERE ap.ClientId = @ClientId

此查询大约需要3-4分钟才能运行，因为条件太宽。在SQL中有更好的解决方案吗？我认识到每场比赛将双向进行，我会得到每场比赛的记录，但也会记录匹配组内的每场比赛，因此一组5将产生30行，一组20将产生在420行，依此类推。我怎样才能提高效率呢？我想避免使用游标，但我认识到这可能是我唯一的选择。

感谢您花时间阅读本文。

其他信息

表大约是760k条记录（所有这些记录都在这些查询中查询）。

查询错误查询的统计信息

正如您所看到的，我正在从具有760k记录的表中返回13,824,170行。这是由于某些群体很大（例如252个成员将导致63,756个记录）。

相关指数：

CREATE NONCLUSTERED INDEX [IX_ApEntries_DistinctIndex] ON [dbo].[ApEntries]
(
    [ClientId] ASC
)
INCLUDE (   [Id],
    [VendorNumber],
    [InvoiceNumber],
    [GtInvoiceNumber],
    [InvoiceDate],
    [PurchaseOrderNumber],
    [InvoiceGrossAmount],
    [InvoiceNetAmount],
    [CheckDate],
    [CheckNumber],
    [TransactionId]) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON)
GO

CREATE NONCLUSTERED INDEX [IX_ApEntries_DupRep7] ON [dbo].[ApEntries]
(
    [VendorNumber] ASC,
    [InvoiceDate] ASC,
    [InvoiceGrossAmount] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON)
GO

CREATE UNIQUE NONCLUSTERED INDEX [IX_ApEntries_Xid] ON [dbo].[ApEntries]
(
    [ClientId] ASC,
    [TransactionId] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, IGNORE_DUP_KEY = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON)
GO

Answer 1

为什么不试试ROW_NUMBER()

SELECT *
FROM (
        SELECT ap.Id EntryId
             , ap.TransactionId
             , ap.VendorNumber
             , ap.InvoiceNumber
             , ap.GtInvoiceNumber
             , ap.InvoiceDate
             , ap.PurchaseOrderNumber
             , ap.InvoiceGrossAmount
             , ap.InvoiceNetAmount
             , ap.CheckNumber
             , ap.CheckDate
             , row_number() over 
               (partition by VendorNumber, InvoiceDate, InvoiceGrossAmount  
                order by TransactionId) rn
    WHERE ap.ClientId = @ClientId
   ) T
 WHERE rn > 1;

各种标准的自我加入表的有效方式

背景

当前解决方案

其他信息

1 个答案: