各种标准的自我加入表的有效方式

时间:2016-05-18 18:28:32

标签: sql sql-server

背景

我正在努力通过在包含AP记录的表上自我加入来查找AP系统中可能重复的付款/记录。一般来说,我有一个适用于相当大的数据集的模式。当我的匹配标准变宽并且每条记录的匹配数量过高时(每个记录有时20-100个匹配),就会出现问题。结果数据集可能很大,并且尝试获取distinct值的时间太长。

当前解决方案

在尝试查找供应商,发票号,日期和发票金额的匹配项时运行正常的查询:

SELECT DISTINCT DENSE_RANK()  OVER(ORDER BY ap.GtInvoiceNumber,     
ap.InvoiceDate, ap.InvoiceGrossAmount) PotentialDuplicate
,ap.Id EntryId
, ap.TransactionId
, ap.VendorNumber
, ap.InvoiceNumber
, ap.GtInvoiceNumber
, ap.InvoiceDate
, ap.PurchaseOrderNumber
, ap.InvoiceGrossAmount
, ap.InvoiceNetAmount
, ap.CheckNumber
, ap.CheckDate
FROM ApEntries ap 
INNER JOIN ApEntries sap ON 
ap.GtInvoiceNumber = sap.GtInvoiceNumber 
AND ap.InvoiceDate = sap.InvoiceDate 
AND sap.InvoiceGrossAmount = ap.InvoiceGrossAmount 
AND ap.VendorNumber =  sap.VendorNumber
AND ap.Id <> sap.Id 
WHERE ap.ClientId = @ClientId

此查询在几秒钟内运行,因为符合条件的行数非常少,因此执行distinct很简单。

在尝试查找供应商,日期和发票金额匹配时出现问题的查询(不包括匹配条件中的发票编号):

SELECT DISTINCT DENSE_RANK()  OVER(ORDER BY ap.VendorNumber, ap.InvoiceDate, ap.InvoiceGrossAmount) PotentialDuplicate
,ap.Id EntryId
, ap.TransactionId
, ap.VendorNumber
, ap.InvoiceNumber
, ap.GtInvoiceNumber
, ap.InvoiceDate
, ap.PurchaseOrderNumber
, ap.InvoiceGrossAmount
, ap.InvoiceNetAmount
, ap.CheckNumber
, ap.CheckDate
FROM ApEntries ap 
INNER JOIN ApEntries sap ON 
ap.VendorNumber = sap.VendorNumber 
AND ap.InvoiceDate = sap.InvoiceDate 
AND ap.InvoiceGrossAmount =  sap.InvoiceGrossAmount
AND (ap.Id <> sap.Id AND ap.TransactionId <> sap.TransactionId)
WHERE ap.ClientId = @ClientId

此查询大约需要3-4分钟才能运行,因为条件太宽。在SQL中有更好的解决方案吗?我认识到每场比赛将双向进行,我会得到每场比赛的记录,但也会记录匹配组内的每场比赛,因此一组5将产生30行,一组20将产生在420行,依此类推。我怎样才能提高效率呢?我想避免使用游标,但我认识到这可能是我唯一的选择。

感谢您花时间阅读本文。

其他信息

表大约是760k条记录(所有这些记录都在这些查询中查询)。

查询错误查询的统计信息 enter image description here

正如您所看到的,我正在从具有760k记录的表中返回13,824,170行。这是由于某些群体很大(例如252个成员将导致63,756个记录)。

相关指数:

CREATE NONCLUSTERED INDEX [IX_ApEntries_DistinctIndex] ON [dbo].[ApEntries]
(
    [ClientId] ASC
)
INCLUDE (   [Id],
    [VendorNumber],
    [InvoiceNumber],
    [GtInvoiceNumber],
    [InvoiceDate],
    [PurchaseOrderNumber],
    [InvoiceGrossAmount],
    [InvoiceNetAmount],
    [CheckDate],
    [CheckNumber],
    [TransactionId]) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON)
GO

CREATE NONCLUSTERED INDEX [IX_ApEntries_DupRep7] ON [dbo].[ApEntries]
(
    [VendorNumber] ASC,
    [InvoiceDate] ASC,
    [InvoiceGrossAmount] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON)
GO

CREATE UNIQUE NONCLUSTERED INDEX [IX_ApEntries_Xid] ON [dbo].[ApEntries]
(
    [ClientId] ASC,
    [TransactionId] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, IGNORE_DUP_KEY = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON)
GO

1 个答案:

答案 0 :(得分:1)

为什么不试试ROW_NUMBER()

SELECT *
FROM (
        SELECT ap.Id EntryId
             , ap.TransactionId
             , ap.VendorNumber
             , ap.InvoiceNumber
             , ap.GtInvoiceNumber
             , ap.InvoiceDate
             , ap.PurchaseOrderNumber
             , ap.InvoiceGrossAmount
             , ap.InvoiceNetAmount
             , ap.CheckNumber
             , ap.CheckDate
             , row_number() over 
               (partition by VendorNumber, InvoiceDate, InvoiceGrossAmount  
                order by TransactionId) rn
    WHERE ap.ClientId = @ClientId
   ) T
 WHERE rn > 1;