首页 文章

各种标准的自我加入表的有效方式

提问于
浏览
2

背景

我正在努力通过在包含AP记录的表上自行加入来在AP系统中找到可能重复的付款/记录 . 一般来说,我有一个适用于相当大的数据集的模式 . 当我的匹配标准变宽并且每条记录的匹配数量过高时(每个记录有时20-100个匹配),就会出现问题 . 结果数据集可能很大,并且试图获取 distinct 值需要太长时间 .

当前解决方案

尝试查找供应商,发票号,日期和发票金额的匹配项时运行正常的查询:

SELECT DISTINCT DENSE_RANK()  OVER(ORDER BY ap.GtInvoiceNumber,     
ap.InvoiceDate, ap.InvoiceGrossAmount) PotentialDuplicate
,ap.Id EntryId
, ap.TransactionId
, ap.VendorNumber
, ap.InvoiceNumber
, ap.GtInvoiceNumber
, ap.InvoiceDate
, ap.PurchaseOrderNumber
, ap.InvoiceGrossAmount
, ap.InvoiceNetAmount
, ap.CheckNumber
, ap.CheckDate
FROM ApEntries ap 
INNER JOIN ApEntries sap ON 
ap.GtInvoiceNumber = sap.GtInvoiceNumber 
AND ap.InvoiceDate = sap.InvoiceDate 
AND sap.InvoiceGrossAmount = ap.InvoiceGrossAmount 
AND ap.VendorNumber =  sap.VendorNumber
AND ap.Id <> sap.Id 
WHERE ap.ClientId = @ClientId

此查询在几秒钟内运行,因为符合条件的行数非常低,因此执行 distinct 很简单 .

尝试查找供应商,日期和发票金额匹配时出现问题的查询(不包括匹配条件中的发票编号):

SELECT DISTINCT DENSE_RANK()  OVER(ORDER BY ap.VendorNumber, ap.InvoiceDate, ap.InvoiceGrossAmount) PotentialDuplicate
,ap.Id EntryId
, ap.TransactionId
, ap.VendorNumber
, ap.InvoiceNumber
, ap.GtInvoiceNumber
, ap.InvoiceDate
, ap.PurchaseOrderNumber
, ap.InvoiceGrossAmount
, ap.InvoiceNetAmount
, ap.CheckNumber
, ap.CheckDate
FROM ApEntries ap 
INNER JOIN ApEntries sap ON 
ap.VendorNumber = sap.VendorNumber 
AND ap.InvoiceDate = sap.InvoiceDate 
AND ap.InvoiceGrossAmount =  sap.InvoiceGrossAmount
AND (ap.Id <> sap.Id AND ap.TransactionId <> sap.TransactionId)
WHERE ap.ClientId = @ClientId

此查询大约需要3-4分钟才能运行,因为条件太宽 . 在SQL中有更好的解决方案吗?我认识到每场比赛将双向进行,我会得到每场比赛的记录,但也会记录匹配组内的每场比赛,因此一组5将产生30行,一组20将产生在420行,依此类推 . 我怎样才能提高效率呢?我想避免使用游标,但我认识到这可能是我唯一的选择 .

感谢您抽时间阅读 .

其他信息

表大约是760k条记录(所有这些记录都在这些查询中查询) .

查询错误查询的统计信息
enter image description here

正如你所看到的,我从一个有760k记录的表中返回了13,824,170行 . 这是由于某些群体很大(例如252个成员将导致63,756个记录) .

相关指数:

CREATE NONCLUSTERED INDEX [IX_ApEntries_DistinctIndex] ON [dbo].[ApEntries]
(
    [ClientId] ASC
)
INCLUDE (   [Id],
    [VendorNumber],
    [InvoiceNumber],
    [GtInvoiceNumber],
    [InvoiceDate],
    [PurchaseOrderNumber],
    [InvoiceGrossAmount],
    [InvoiceNetAmount],
    [CheckDate],
    [CheckNumber],
    [TransactionId]) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON)
GO

CREATE NONCLUSTERED INDEX [IX_ApEntries_DupRep7] ON [dbo].[ApEntries]
(
    [VendorNumber] ASC,
    [InvoiceDate] ASC,
    [InvoiceGrossAmount] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON)
GO

CREATE UNIQUE NONCLUSTERED INDEX [IX_ApEntries_Xid] ON [dbo].[ApEntries]
(
    [ClientId] ASC,
    [TransactionId] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, IGNORE_DUP_KEY = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON)
GO

1 回答

  • 1

    为什么不试试 ROW_NUMBER()

    SELECT *
    FROM (
            SELECT ap.Id EntryId
                 , ap.TransactionId
                 , ap.VendorNumber
                 , ap.InvoiceNumber
                 , ap.GtInvoiceNumber
                 , ap.InvoiceDate
                 , ap.PurchaseOrderNumber
                 , ap.InvoiceGrossAmount
                 , ap.InvoiceNetAmount
                 , ap.CheckNumber
                 , ap.CheckDate
                 , row_number() over 
                   (partition by VendorNumber, InvoiceDate, InvoiceGrossAmount  
                    order by TransactionId) rn
        WHERE ap.ClientId = @ClientId
       ) T
     WHERE rn > 1;
    

相关问题