T-SQL query to flag repeat records

Time: 2009-08-17 10:08:12

Tags: tsql

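I have a table with more than 500,000 records. Each record has a LineNumber field, which is not unique and is not part of the primary key. Each record also has a CreatedOn field.

I need to update all 500,000 records to identify the repeat records.

A repeat is defined as a record for which another record with the same LineNumber exists within the seven days before its CreatedOn.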

[Image not preserved]

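In the picture above, row 4 is a repeat because it occurred only five days after row 1. Row 6 is not a repeat, even though it occurred only four days after row 4: row 4 is itself already a repeat, so row 6 can only be compared with row 1, which is nine days (five plus four) before row 6. Row 6 is therefore not a repeat.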
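The rule above can be easier to reason about when written procedurally first. The following Python sketch is a reference model only, not the T-SQL solution being asked for. It walks the rows in (CreatedOn, JobID) order and remembers, per LineNumber, the most recent non-repeat. The function name `flag_repeats`, the sample dates, and the choice to treat a gap of exactly seven days as a repeat are assumptions; rows with a NULL LineNumber are simply skipped.

```python
from datetime import datetime, timedelta


def flag_repeats(rows, window=timedelta(days=7)):
    """Return {job_id: is_repeat} for rows of (job_id, line_number, created_on).

    A row is a repeat when it falls within `window` of the most recent
    NON-repeat row with the same line number (assumed boundary: inclusive).
    """
    flags = {}
    last_non_repeat = {}  # line_number -> created_on of the latest non-repeat
    for job_id, line_number, created_on in sorted(rows, key=lambda r: (r[2], r[0])):
        if line_number is None:
            continue  # assumption: NULL line numbers are left unflagged
        anchor = last_non_repeat.get(line_number)
        if anchor is not None and created_on <= anchor + window:
            flags[job_id] = True
        else:
            flags[job_id] = False
            last_non_repeat[line_number] = created_on
    return flags


# Hypothetical dates that reproduce the gaps described above:
# row 4 is five days after row 1, row 6 is four days after row 4.
example_rows = [
    (1, "A", datetime(2009, 7, 1)),
    (4, "A", datetime(2009, 7, 6)),
    (6, "A", datetime(2009, 7, 10)),
]
print(flag_repeats(example_rows))  # {1: False, 4: True, 6: False}
```

A model like this can be used to check the output of a set-based query against a small sample before running it on the full table.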

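I don't know how to update the IsRepeat field other than by stepping through the records one at a time with a cursor or something similar.

I don't believe cursors are the way to go, but I'm stuck for any other possible solution.

I think Common Table Expressions might help, but I have no experience with them and don't know where to start.

Essentially, the same process has to be run on the table every day, because the table is truncated and repopulated daily. Once the table has been repopulated, I have to re-flag every record that is a repeat.

Any help would be much appreciated.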

Update

Here is a script that creates the table and inserts the test data:

USE [Test]
GO

/****** Object:  Table [dbo].[Job]    Script Date: 08/18/2009 07:55:25 ******/
IF  EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N'[dbo].[Job]') AND type in (N'U'))
DROP TABLE [dbo].[Job]
GO

USE [Test]
GO

/****** Object:  Table [dbo].[Job]    Script Date: 08/18/2009 07:55:25 ******/
SET ANSI_NULLS ON
GO

SET QUOTED_IDENTIFIER ON
GO

IF NOT EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N'[dbo].[Job]') AND type in (N'U'))
BEGIN
CREATE TABLE [dbo].[Job](
    [JobID] [int] IDENTITY(1,1) NOT NULL,
    [LineNumber] [nvarchar](20) NULL,
    [IsRepeat] [bit] NULL,
    [CreatedOn] [smalldatetime] NOT NULL,
 CONSTRAINT [PK_Job] PRIMARY KEY CLUSTERED 
(
    [JobID] ASC
)WITH (PAD_INDEX  = OFF, STATISTICS_NORECOMPUTE  = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS  = ON, ALLOW_PAGE_LOCKS  = ON) ON [PRIMARY]
) ON [PRIMARY]
END
GO


SET NOCOUNT ON

INSERT INTO dbo.Job VALUES ('1006',NULL,'2009-07-01 07:52:08')
INSERT INTO dbo.Job VALUES ('1019',NULL,'2009-07-01 08:30:01')
INSERT INTO dbo.Job VALUES ('1028',NULL,'2009-07-01 09:30:35')
INSERT INTO dbo.Job VALUES ('1005',NULL,'2009-07-01 10:51:10')
INSERT INTO dbo.Job VALUES ('1005',NULL,'2009-07-02 09:22:30')
INSERT INTO dbo.Job VALUES ('1027',NULL,'2009-07-02 10:27:28')
INSERT INTO dbo.Job VALUES (NULL,NULL,'2009-07-02 11:15:33')
INSERT INTO dbo.Job VALUES ('1029',NULL,'2009-07-02 13:01:13')
INSERT INTO dbo.Job VALUES ('1014',NULL,'2009-07-03 12:05:56')
INSERT INTO dbo.Job VALUES ('1029',NULL,'2009-07-03 13:57:34')
INSERT INTO dbo.Job VALUES ('1025',NULL,'2009-07-03 15:38:54')
INSERT INTO dbo.Job VALUES ('1006',NULL,'2009-07-04 16:32:20')
INSERT INTO dbo.Job VALUES ('1025',NULL,'2009-07-05 13:46:46')
INSERT INTO dbo.Job VALUES ('1029',NULL,'2009-07-05 15:08:35')
INSERT INTO dbo.Job VALUES ('1000',NULL,'2009-07-05 15:19:50')
INSERT INTO dbo.Job VALUES ('1011',NULL,'2009-07-05 16:37:19')
INSERT INTO dbo.Job VALUES ('1019',NULL,'2009-07-05 17:14:09')
INSERT INTO dbo.Job VALUES ('1009',NULL,'2009-07-05 20:55:08')
INSERT INTO dbo.Job VALUES (NULL,NULL,'2009-07-06 08:29:29')
INSERT INTO dbo.Job VALUES ('1002',NULL,'2009-07-07 11:22:38')
INSERT INTO dbo.Job VALUES ('1029',NULL,'2009-07-07 12:25:23')
INSERT INTO dbo.Job VALUES ('1023',NULL,'2009-07-08 09:32:07')
INSERT INTO dbo.Job VALUES ('1005',NULL,'2009-07-08 09:46:33')
INSERT INTO dbo.Job VALUES ('1016',NULL,'2009-07-08 10:09:08')
INSERT INTO dbo.Job VALUES ('1023',NULL,'2009-07-09 10:45:04')
INSERT INTO dbo.Job VALUES ('1027',NULL,'2009-07-09 11:31:23')
INSERT INTO dbo.Job VALUES ('1005',NULL,'2009-07-09 13:10:06')
INSERT INTO dbo.Job VALUES ('1006',NULL,'2009-07-09 15:04:06')
INSERT INTO dbo.Job VALUES ('1010',NULL,'2009-07-09 17:32:16')
INSERT INTO dbo.Job VALUES ('1012',NULL,'2009-07-09 19:51:28')
INSERT INTO dbo.Job VALUES ('1000',NULL,'2009-07-10 15:09:42')
INSERT INTO dbo.Job VALUES ('1025',NULL,'2009-07-10 16:15:31')
INSERT INTO dbo.Job VALUES ('1006',NULL,'2009-07-10 21:55:43')
INSERT INTO dbo.Job VALUES ('1005',NULL,'2009-07-11 08:49:03')
INSERT INTO dbo.Job VALUES ('1022',NULL,'2009-07-11 16:47:21')
INSERT INTO dbo.Job VALUES ('1026',NULL,'2009-07-11 18:23:16')
INSERT INTO dbo.Job VALUES ('1010',NULL,'2009-07-11 19:49:31')
INSERT INTO dbo.Job VALUES ('1029',NULL,'2009-07-12 11:57:26')
INSERT INTO dbo.Job VALUES ('1003',NULL,'2009-07-13 08:32:20')
INSERT INTO dbo.Job VALUES ('1005',NULL,'2009-07-13 09:31:32')
INSERT INTO dbo.Job VALUES ('1021',NULL,'2009-07-14 09:52:54')
INSERT INTO dbo.Job VALUES ('1021',NULL,'2009-07-14 11:22:31')
INSERT INTO dbo.Job VALUES ('1023',NULL,'2009-07-14 11:54:14')
INSERT INTO dbo.Job VALUES (NULL,NULL,'2009-07-14 15:17:08')
INSERT INTO dbo.Job VALUES ('1005',NULL,'2009-07-15 13:27:08')
INSERT INTO dbo.Job VALUES ('1010',NULL,'2009-07-15 14:10:56')
INSERT INTO dbo.Job VALUES ('1011',NULL,'2009-07-15 15:20:50')
INSERT INTO dbo.Job VALUES ('1028',NULL,'2009-07-15 15:39:18')
INSERT INTO dbo.Job VALUES ('1012',NULL,'2009-07-15 16:06:17')
INSERT INTO dbo.Job VALUES ('1017',NULL,'2009-07-16 11:52:08')

SET NOCOUNT OFF
GO

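3 answers:

Answer 0: (score: 1)

Rows where LineNumber is null are ignored. How should IsRepeat be handled in that case?

It works on the test data. Is it efficient enough for production volumes?

Where (LineNumber, CreatedOn) pairs are duplicated, one row is chosen arbitrarily (the one with the smallest JobId).

Basic idea:

  1. Get all pairs of JobIds with the same LineNumber that are more than seven days apart.
  2. Count the number of rows that are more than seven days after the left side, up to and including the right side. (CNT)
  3. Then we know that if JobId X is not a repeat, the next non-repeat is the right side of the pair with X on the left side and CNT = 1.
  4. Use a recursive CTE that starts with the first row for each LineNumber.
  5. The recursive element uses the pairs with their counts to get the next row.
  6. Finally, update: set IsRepeat to 0 for the non-repeats and to 1 for everything else.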

; with AllPairsByLineNumberAtLeast7DaysApart (LineNumber
                , LeftJobId
                , RightJobId
                , BeginCreatedOn
                , EndCreatedOn) as
            (select l.LineNumber
                , l.JobId
                , r.JobId
                , dateadd(day, 7, l.CreatedOn)
                , r.CreatedOn
            from Job l
            inner join Job r
                on l.LineNumber = r.LineNumber
                and dateadd(day, 7, l.CreatedOn) < r.CreatedOn
                and l.JobId <> r.JobId)
        -- Count the rows that fall after BeginCreatedOn,
        -- up to and including EndCreatedOn.
        -- The lower bound is strict so that it matches the pair condition above.
        -- When CreatedOn = EndCreatedOn, include only rows with
        -- JobId <= RightJobId, to handle ties in CreatedOn.
        , AllPairsCount(LineNumber, LeftJobId, RightJobId, Cnt) as
            (select ap.LineNumber, ap.LeftJobId, ap.RightJobId, count(*)
            from AllPairsByLineNumberAtLeast7DaysApart ap
            inner join Job j
                on j.LineNumber = ap.LineNumber
                and ap.BeginCreatedOn < j.CreatedOn
                and (j.CreatedOn < ap.EndCreatedOn
                    or (j.CreatedOn = ap.EndCreatedOn 
                        and j.JobId <= ap.RightJobId))
             group by ap.LineNumber, ap.LeftJobId, ap.RightJobId)
        , Step1 (LineNumber, JobId, CreatedOn, RN) as
            (select LineNumber, JobId, CreatedOn
                , row_number() over 
                    (partition by LineNumber order by CreatedOn, JobId)
            from Job)
        , Results (JobId, LineNumber, CreatedOn) as    
            -- Start with the first rows.
            (select JobId, LineNumber, CreatedOn
            from Step1
            where RN = 1
            and LineNumber is not null
            -- get the next row
            union all
            select j.JobId, j.LineNumber, j.CreatedOn
            from Results r
            inner join AllPairsCount apc on apc.LeftJobId = r.JobId
            inner join Job j
                on j.JobId = apc.RightJobId
                and apc.CNT = 1)
        update j
        set IsRepeat = case when R.JobId is not null then 0 else 1 end
        from Job j
        left outer join Results r
            on j.JobId = R.JobId
        where j.LineNumber is not null
    

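Edit

After shutting down my computer last night I realized I had made things more complicated than they needed to be. Here is a simpler (and, on the test data, slightly more efficient) query.

Basic idea:

1. Generate PotentialSteps (FromJobId, ToJobId). These are the pairs for which, if FromJobId is not a repeat, then ToJobId is not a repeat either. (ToJobId is the first row with the same LineNumber that is more than seven days after FromJobId.)
2. Use a recursive CTE that starts with the first JobId for each LineNumber and then steps, using PotentialSteps, to each subsequent non-repeat JobId.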

      ; with PotentialSteps (FromJobId, ToJobId) as
          (select FromJobId, ToJobId
          from (select f.JobId as FromJobId
                  , t.JobId as ToJobId
                  -- number the candidate rows per starting row, earliest first
                  , row_number() over
                       (partition by f.JobId order by t.CreatedOn, t.JobId) as RN
              from Job f
              inner join Job t
                  on f.LineNumber = t.LineNumber
                  and dateadd(day, 7, f.CreatedOn) < t.CreatedOn) t
              where RN = 1)
      , NonRepeats (JobId) as
          (select JobId
          from (select JobId
                  , row_number() over
                      (partition by LineNumber order by CreatedOn, JobId) as RN
              from Job) Start
          where RN = 1
          union all
          select J.JobId
          from NonRepeats NR
          inner join PotentialSteps PS
              on NR.JobId = PS.FromJobId
          inner join Job J
              on PS.ToJobId = J.JobId)
      update J
      set IsRepeat = case when NR.JobId is not null then 0 else 1 end
      from Job J
      left outer join NonRepeats NR
      on J.JobId = NR.JobId
      where J.LineNumber is not null
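
The stepping idea behind PotentialSteps and NonRepeats can also be traced outside the database. The Python sketch below is only an illustration of that idea, not a replacement for the query: for each row it finds the first later row with the same LineNumber that is more than seven days away, then follows those steps from the first row of each LineNumber. The function name `non_repeats_by_steps` and the second sample (line "X") are hypothetical. The first sample uses the '1005' rows from the test script and assumes the identity column numbered them in insert order.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def non_repeats_by_steps(rows, window=timedelta(days=7)):
    """Return the set of job ids that are not repeats.

    rows: iterable of (job_id, line_number, created_on).
    """
    by_line = defaultdict(list)
    for job_id, line_number, created_on in rows:
        if line_number is not None:  # NULL line numbers are ignored
            by_line[line_number].append((created_on, job_id))

    result = set()
    for group in by_line.values():
        group.sort()  # order by CreatedOn, then JobId
        # Potential step: from each row to the first row more than `window` later.
        step = {
            job_id: next((j for c, j in group if c > created_on + window), None)
            for created_on, job_id in group
        }
        # Walk the chain, starting from the first row of the line number.
        current = group[0][1]
        while current is not None:
            result.add(current)
            current = step[current]
    return result


# LineNumber '1005' from the test script (JobIDs assume insert order).
rows_1005 = [
    (4, "1005", datetime(2009, 7, 1, 10, 51, 10)),
    (5, "1005", datetime(2009, 7, 2, 9, 22, 30)),
    (23, "1005", datetime(2009, 7, 8, 9, 46, 33)),
    (27, "1005", datetime(2009, 7, 9, 13, 10, 6)),
    (34, "1005", datetime(2009, 7, 11, 8, 49, 3)),
    (40, "1005", datetime(2009, 7, 13, 9, 31, 32)),
    (45, "1005", datetime(2009, 7, 15, 13, 27, 8)),
]

# Hypothetical line that needs two steps: 1 -> 3 -> 5.
rows_x = [
    (1, "X", datetime(2009, 7, 1)),
    (2, "X", datetime(2009, 7, 6)),
    (3, "X", datetime(2009, 7, 10)),
    (4, "X", datetime(2009, 7, 12)),
    (5, "X", datetime(2009, 7, 19)),
]

print(sorted(non_repeats_by_steps(rows_1005)))
print(sorted(non_repeats_by_steps(rows_x)))
```

The second sample matters because the test data never needs more than one step per LineNumber, so it cannot show whether the steps are being chained correctly.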
      

Answer 1: (score: -1)

-- Start by marking every row as not a repeat.
UPDATE dbo.Job
SET IsRepeat = 0

-- Then flag rows that have another row with the same LineNumber
-- in the seven days up to and including their own CreatedOn.
UPDATE j
SET IsRepeat = 1
FROM dbo.Job AS j
WHERE EXISTS
   (SELECT 1 FROM dbo.Job AS i WHERE i.LineNumber = j.LineNumber
    AND i.CreatedOn <> j.CreatedOn
    AND i.CreatedOn BETWEEN DATEADD(day, -7, j.CreatedOn) AND j.CreatedOn)

Note: I hope this helps. Please let me know if you run into any discrepancies on a larger data set.
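
One discrepancy can be checked on a very small sample. The Python sketch below models the window test used in this answer (any other row with the same LineNumber in the preceding seven days) and applies it to the three-row example described in the question. The function name `flag_by_window` and the dates are hypothetical. On this sample the window test flags row 6, because row 4 lies inside its seven-day window, whereas the question says row 6 should not count as a repeat since row 4 is itself a repeat.

```python
from datetime import datetime, timedelta


def flag_by_window(rows, window=timedelta(days=7)):
    """Flag a row when ANY other row with the same line number falls in
    [created_on - window, created_on], regardless of that row's own flag.

    rows: list of (job_id, line_number, created_on).
    """
    flags = {}
    for job_id, line_number, created_on in rows:
        flags[job_id] = line_number is not None and any(
            other_line == line_number
            and other_created != created_on
            and created_on - window <= other_created <= created_on
            for _, other_line, other_created in rows
        )
    return flags


# Hypothetical dates: row 4 is five days after row 1,
# row 6 is four days after row 4.
sample = [
    (1, "A", datetime(2009, 7, 1)),
    (4, "A", datetime(2009, 7, 6)),
    (6, "A", datetime(2009, 7, 10)),
]
print(flag_by_window(sample))  # {1: False, 4: True, 6: True}
```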

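Answer 2: (score: -2)

I'm not proud of this. It makes a lot of assumptions (for example, that CreatedOn is just a date and that (LineNumber, CreatedOn) is a key). It may need some tweaking, and it has only been shown to work on the test data.

In other words, I wrote this more out of intellectual curiosity than because I think it is a real solution. The final select could become an update that sets IsRepeat in the base table based on the existence of a row in V4. One last note before I let people see the evil: people are welcome to post, in the comments, test data sets for which it does not work. It might be possible to turn this into a real solution: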

with V1 as (
select t1.LineNumber,t1.CreatedOn,t2.CreatedOn as PrevDate from
T1 t1 inner join T1 t2 on t1.LineNumber = t2.LineNumber and t1.CreatedOn > t2.CreatedOn and DATEDIFF(DAY,t2.CreatedOn,t1.CreatedOn) < 7
), V2 as (
select v1.LineNumber,v1.CreatedOn,V1.PrevDate from V1
union all
select v1.LineNumber,v1.CreatedOn,v2.PrevDate from v1 inner join v2 on V1.LineNumber = v2.LineNumber and v1.PrevDate = v2.CreatedOn
), V3 as (
select LineNumber,CreatedOn,MIN(PrevDate) as PrevDate from V2 group by LineNumber,CreatedOn
), V4 as (
select LineNumber,CreatedOn from V3 where DATEDIFF(DAY,PrevDate,CreatedOn) < 7
)
select
    T1.LineNumber,
    T1.CreatedOn,
    CASE WHEN V4.LineNumber is Null then 0 else 1 end as IsRepeat
from
    T1
        left join
    V4
        on
            T1.LineNumber = V4.LineNumber and
            T1.CreatedOn = V4.CreatedOn
order by T1.CreatedOn,T1.LineNumber
option (maxrecursion 7)