How to remove duplicates from a stored procedure without using DISTINCT

Asked: 2019-01-09 17:20:01

Tags: sql sql-server tsql sql-server-2012

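I have written a stored procedure whose result contains duplicates. I tried ROW_NUMBER, but it did not work. DISTINCT removes the duplicates, but then it does not return the large number of records I need (about 700,000). Is there another way to remove the duplicates, using RANK or GROUP BY?

I have already used DISTINCT, and it does not retrieve enough records. I have not managed to get GROUP BY to work.

I also tried ROW_NUMBER, but that did not work either (you can see it commented out in the code).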

CREATE PROCEDURE [report].[get_foodDetails] 
    @foodgroup_id INT, 
    @shop_id INT = 0, 
    @product_id INT = 0, 
    @maxrows INT = 600, 
    @expiry INT = 1, 
    @productactive INT = 1, 
    @expiryPeriod DATETIME = '9999-12-31 23:59:59' 
AS 
BEGIN 
    IF (@expiryPeriod >= '9999-12-31') 
    BEGIN 
        SET @expiryPeriod = GETDATE() 
    END 

    SELECT  
        -- dp.RowNumber 
        ISNULL([FoodType], '') AS [Foodtype],
        ISNULL([FoodColour], '') AS [FoodColour],
        ISNULL([FoodBarcode], '') AS [FoodBarcode],
        ISNULL([FoodArticleNum], 0) AS [FoodArticleNum],
        ISNULL([FoodShelfLife], '9999-12-31') AS [FoodShelfLIFe]
    INTO 
        #devfood 
    FROM 
        report.[GetOrderList] (@foodgroup_id, @product_id, @productactive, @expiry, @expiryPeriod, @shop_id, @maxrows ) dp 
    INNER JOIN 
        food_group fg ON fg.food_group_id = it.item_FK_item_group_id 

    SELECT TOP(@maxrows) * 
    FROM #devfood 
    ORDER BY [device_packet_created_date]  
 END 

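This retrieves about 700,000 records. It works as things stand, although the result contains duplicates. With DISTINCT, only about 20,000 records are retrieved (but with no duplicates).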

2 answers:

Answer 0: (score: 0)

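The sample code below comes from a presentation I use to demonstrate CTEs. It is a common mechanism for removing duplicates, and it is very fast. In this case the duplicates are deleted directly from the table. If that is not your goal, you can use a temp table or a preceding chained CTE instead. Note that what matters is the set of columns you partition by. In this example, if you partitioned by [name] alone (the CTE's alias for the [flower] column), you would not see both the red rose and the white rose.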

-------------------------------------------------
if object_id(N'[flower].[order]', N'U') is not null
  drop table [flower].[order];

go

create table [flower].[order]
  (
     [id]       int identity(1, 1) not null constraint [flower.order.id.clustered_primary_key] primary key clustered
     , [flower] nvarchar(128)
     , [color]  nvarchar(128)
     , [count]  int
  );

go

insert into [flower].[order]
            ([flower]
             , [color]
             , [count])
values      (N'rose',N'red',5),
            (N'rose',N'red',3),
            (N'rose',N'white',2),
            (N'rose',N'red',1),
            (N'rose',N'red',9),
            (N'marigold',N'yellow',2),
            (N'marigold',N'yellow',9),
            (N'marigold',N'yellow',4),
            (N'chamomile',N'amber',9),
            (N'chamomile',N'amber',4),
            (N'lily',N'white',12);

go

select [flower]
       , [color]
from   [flower].[order];

go

--
-- delete duplicates, keeping one row per ([flower], [color])
-------------------------------------------------
with [duplicate_finder]([name], [color], [sequence])
     as (select [flower]
                , [color]
                , row_number()
                    over (
                      partition by [flower], [color]
                      order by [flower] desc) as [sequence]
         from   [flower].[order])
delete from [duplicate_finder]
where  [sequence] > 1;

--
-- no duplicates
-------------------------------------------------
select [flower]
       , [color]
from   [flower].[order]; 
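
If deleting from the table is not what you want, the same numbering can be used in a plain SELECT. The following is only a minimal sketch of that idea, not code from the presentation: the table variable @order and its sample rows are hypothetical, chosen so that nothing permanent is created or removed, and ordering by [id] is an assumption added so that the surviving row in each group is predictable.

```sql
-- Hypothetical sample data in a table variable; no permanent table is touched.
declare @order table
  (
     [id]       int identity(1, 1) primary key
     , [flower] nvarchar(128)
     , [color]  nvarchar(128)
     , [count]  int
  );

insert into @order
            ([flower]
             , [color]
             , [count])
values      (N'rose',N'red',5),
            (N'rose',N'red',3),
            (N'rose',N'white',2),
            (N'marigold',N'yellow',2),
            (N'marigold',N'yellow',9),
            (N'lily',N'white',12);

-- Number the rows inside each ([flower], [color]) group. Ordering by [id]
-- means the row with the lowest [id] in each group gets [sequence] = 1.
with [numbered]
     as (select [id]
                , [flower]
                , [color]
                , [count]
                , row_number()
                    over (
                      partition by [flower], [color]
                      order by [id]) as [sequence]
         from   @order)
-- Select instead of delete: the source rows are left untouched.
select [id]
       , [flower]
       , [color]
       , [count]
from   [numbered]
where  [sequence] = 1
order  by [id];
```

With this sample data the query returns four rows: rose/red, rose/white, marigold/yellow and lily/white. The result could be written to a temp table with SELECT ... INTO if it needs to be reused.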

Answer 1: (score: 0)

I know you said you tried ROW_NUMBER, but did you try it in either of the following two ways?

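First, a CTE. The CTE here is just your existing query with a ROW_NUMBER window function added. For each repeated occurrence of a record, RowNumber increases by one. For the next group of unique records, RowNumber resets to 1.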

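After the pull, keep only the records with RowNumber = 1. I use this approach all the time to delete duplicates from an underlying record set, but it also works well for simply identifying them.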

WITH NoDupes AS
  (
    SELECT
      ROW_NUMBER() OVER (PARTITION BY
                           ISNULL(FoodType, '')
                          ,ISNULL(FoodColour, '')
                          ,ISNULL(FoodBarcode, '')
                          ,ISNULL(FoodArticleNum, 0)
                          ,ISNULL(FoodShelfLife, '9999-12-31')
                         ORDER BY
                           (
                             SELECT
                               0
                           )
                        ) AS RowNumber
     ,ISNULL(FoodType, '') AS Foodtype
     ,ISNULL(FoodColour, '') AS FoodColour
     ,ISNULL(FoodBarcode, '') AS FoodBarcode
     ,ISNULL(FoodArticleNum, 0) AS FoodArticleNum
     ,ISNULL(FoodShelfLife, '9999-12-31') AS FoodShelfLIFe
    FROM
      report.GetOrderList(@foodgroup_id, @product_id, @productactive, @expiry, @expiryPeriod, @shop_id, @maxrows) AS dp
    INNER JOIN
      food_group AS fg
        ON
        fg.food_group_id = it.item_FK_item_group_id
  )
SELECT
  nd.Foodtype
 ,nd.FoodColour
 ,nd.FoodBarcode
 ,nd.FoodArticleNum
 ,nd.FoodShelfLIFe
INTO
  #devfood
FROM
  NoDupes AS nd
WHERE
  nd.RowNumber = 1;
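
To see how the numbering behaves before any filtering, here is a small self-contained sketch. The table variable @food_demo and its rows are made up for illustration and only use two of the columns from the question; it should be run as its own batch.

```sql
-- Hypothetical sample rows; two columns are enough to show the numbering.
DECLARE @food_demo TABLE (FoodType NVARCHAR(50), FoodColour NVARCHAR(50));

INSERT INTO @food_demo (FoodType, FoodColour)
VALUES (N'apple', N'red'),
       (N'apple', N'red'),
       (N'apple', N'green'),
       (N'pear',  N'green'),
       (N'pear',  N'green'),
       (N'pear',  N'green');

-- RowNumber counts 1, 2, 3, ... within each (FoodType, FoodColour) group
-- and restarts at 1 whenever the group changes.
SELECT
  FoodType
 ,FoodColour
 ,ROW_NUMBER() OVER (PARTITION BY FoodType, FoodColour
                     ORDER BY (SELECT 0)) AS RowNumber
FROM
  @food_demo
ORDER BY
  FoodType
 ,FoodColour
 ,RowNumber;
```

With these rows, apple/green gets RowNumber 1, apple/red gets 1 and 2, and pear/green gets 1, 2 and 3. Filtering on RowNumber = 1 therefore leaves exactly one row per group. Because the window is ordered by (SELECT 0), which row of a group ends up as number 1 is arbitrary; that is harmless when the rows in a group are identical in every selected column.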

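You can also try SELECT TOP (1) WITH TIES, ordering the record set by the same ROW_NUMBER function. The TOP (1) WITH TIES part is functionally equivalent to the CTE: it returns only the first record of each group of duplicates.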

SELECT
  TOP (1) WITH TIES
  ISNULL(FoodType, '') AS Foodtype
 ,ISNULL(FoodColour, '') AS FoodColour
 ,ISNULL(FoodBarcode, '') AS FoodBarcode
 ,ISNULL(FoodArticleNum, 0) AS FoodArticleNum
 ,ISNULL(FoodShelfLife, '9999-12-31') AS FoodShelfLIFe
INTO
  #devfood
FROM
  report.GetOrderList(@foodgroup_id, @product_id, @productactive, @expiry, @expiryPeriod, @shop_id, @maxrows) AS dp
INNER JOIN
  food_group AS fg
    ON
    fg.food_group_id = it.item_FK_item_group_id
ORDER BY
  ROW_NUMBER() OVER (PARTITION BY
                       ISNULL(FoodType, '')
                      ,ISNULL(FoodColour, '')
                      ,ISNULL(FoodBarcode, '')
                      ,ISNULL(FoodArticleNum, 0)
                      ,ISNULL(FoodShelfLife, '9999-12-31')
                     ORDER BY
                       (
                         SELECT
                           0
                       )
                    );
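
The reason TOP (1) WITH TIES works here is that the first row of every group has a ROW_NUMBER of 1, so all of those rows tie for first place in the ORDER BY and are all returned. A minimal sketch on made-up data (the table variable @food_ties and its rows are hypothetical; run it as its own batch):

```sql
-- Hypothetical sample rows: three distinct (FoodType, FoodColour) pairs.
DECLARE @food_ties TABLE (FoodType NVARCHAR(50), FoodColour NVARCHAR(50));

INSERT INTO @food_ties (FoodType, FoodColour)
VALUES (N'apple', N'red'),
       (N'apple', N'red'),
       (N'apple', N'green'),
       (N'pear',  N'green'),
       (N'pear',  N'green'),
       (N'pear',  N'green');

-- The sort key is the per-group row number. TOP (1) takes the first row,
-- and WITH TIES adds every other row with the same sort key (1),
-- which is the first row of each remaining group.
SELECT
  TOP (1) WITH TIES
  FoodType
 ,FoodColour
FROM
  @food_ties
ORDER BY
  ROW_NUMBER() OVER (PARTITION BY FoodType, FoodColour
                     ORDER BY (SELECT 0));
```

This returns three rows, one each for apple/red, apple/green and pear/green. The order of those rows is not guaranteed, since they all share the same sort key.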

The CTE will probably be a little clearer to the next person who reads the code, but TOP may perform slightly better.
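
One thing worth checking against the row counts in the question: DISTINCT, GROUP BY and ROW_NUMBER = 1 all return one row per distinct combination of values, as long as they are applied to the same list of columns. So if DISTINCT on the five selected columns returns about 20,000 rows, either of the queries above should return about 20,000 rows as well, and the other rows really are duplicates on those columns. The sketch below compares the counts on made-up data; the table variable @food_counts is hypothetical, and it should be run as its own batch.

```sql
-- Hypothetical sample rows: 6 rows, 3 distinct (FoodType, FoodColour) pairs.
DECLARE @food_counts TABLE (FoodType NVARCHAR(50), FoodColour NVARCHAR(50));

INSERT INTO @food_counts (FoodType, FoodColour)
VALUES (N'apple', N'red'),
       (N'apple', N'red'),
       (N'apple', N'green'),
       (N'pear',  N'green'),
       (N'pear',  N'green'),
       (N'pear',  N'green');

-- Count the rows produced by each deduplication technique side by side.
SELECT
  (SELECT COUNT(*) FROM @food_counts) AS TotalRows
 ,(SELECT COUNT(*)
   FROM (SELECT DISTINCT FoodType, FoodColour
         FROM @food_counts) AS d) AS DistinctRows
 ,(SELECT COUNT(*)
   FROM (SELECT FoodType, FoodColour
         FROM @food_counts
         GROUP BY FoodType, FoodColour) AS g) AS GroupByRows
 ,(SELECT COUNT(*)
   FROM (SELECT ROW_NUMBER() OVER (PARTITION BY FoodType, FoodColour
                                   ORDER BY (SELECT 0)) AS RowNumber
         FROM @food_counts) AS r
   WHERE r.RowNumber = 1) AS RowNumberRows;
```

Here TotalRows is 6 and the other three counts are all 3. If more rows need to be kept in the real procedure, the column list has to include whatever distinguishes them, for example a key column from the source.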