Need an approach for working with small subsets of a large dataset

Asked: 2010-10-05 14:07:22

Tags: sql sql-server tsql

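I am facing a conceptual problem that I am having a hard time getting past. I am hoping the SO folks can give me a nudge in the right direction.

I am doing some ETL work where the source data is very repetitive and very large. I am loading it into a table that is intended for replication, and I only want the most basic information in this target table.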

My source table looks something like this:

[image: source table]

I need my target table to reflect it like this:

[image: target table]

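As you can see, I did not duplicate the InTransit status where it was repeated in the source table. The steps I am trying to figure out how to accomplish are:

  1. Get any new distinct rows entered since the last time the query ran. (Easy)
  2. For each TrackingId, check whether each new status is already the most recent status in the target; if so, ignore it, otherwise go ahead and insert it. That means I have to start with the earliest of the new statuses and work forward from there. (I don't have a *(!# clue how I would do this.)
  3. Do this every 15 minutes so that statuses stay current, so step #2 has to be performant.
  4. My source table can easily contain 100k+ rows, and because this has to run every 15 minutes I need to make sure it is very performant, which is why I really want to avoid cursors.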

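Right now the only way I can see to do this is with a CLR sproc, but I think there may be a better way, so I am hoping you can push me in the right direction.

I am sure I have left out something you may need, so please tell me what information you need and I will happily provide it.

Thanks in advance!

EDIT: OK, my question was not explicit enough. My source table will contain multiple TrackingIds. It may hold 100k+ rows, with multiple TrackingIds and multiple statuses for each TrackingId. I have to update the target table as shown above for each TrackingId, but my source will be a mix of TrackingIds.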

5 answers:

Answer 0: (score: 2)

Here is a solution without self-joins:

WITH    q AS
        (
        SELECT  *,
                ROW_NUMBER() OVER (ORDER BY statusDate) AS rn,
                ROW_NUMBER() OVER (PARTITION BY status ORDER BY statusDate) AS rns
        FROM    tracking
        WHERE   trackingId = @id
        ),
        qs AS
        (
        SELECT  *,
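                -- rn - rns is constant within a run of consecutive equal statuses;
                -- status is part of the key because two different statuses can share the same difference
                ROW_NUMBER() OVER (PARTITION BY status, rn - rns ORDER BY statusDate) AS rnn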
        FROM    q
        )
SELECT  *
FROM    qs
WHERE   rnn = 1
ORDER BY
        statusDate

Here is a script to check it:

DECLARE @tracking TABLE
        (
        id INT NOT NULL PRIMARY KEY,
        trackingId INT NOT NULL,
        status INT,
        statusDate DATETIME
        )

INSERT
INTO    @tracking
SELECT  1, 1, 1, DATEADD(d, 1, '2010-01-01')
UNION ALL
SELECT  2, 1, 2, DATEADD(d, 2, '2010-01-01')
UNION ALL
SELECT  3, 1, 2, DATEADD(d, 3, '2010-01-01')
UNION ALL
SELECT  4, 1, 2, DATEADD(d, 4, '2010-01-01')
UNION ALL
SELECT  5, 1, 3, DATEADD(d, 5, '2010-01-01')
UNION ALL
SELECT  6, 1, 3, DATEADD(d, 6, '2010-01-01')
UNION ALL
SELECT  7, 1, 4, DATEADD(d, 7, '2010-01-01')
UNION ALL
SELECT  8, 1, 2, DATEADD(d, 8, '2010-01-01')
UNION ALL
SELECT  9, 1, 2, DATEADD(d, 9, '2010-01-01')
UNION ALL
SELECT  10, 1, 1, DATEADD(d, 10, '2010-01-01')
;
WITH    q AS
        (
        SELECT  *,
                ROW_NUMBER() OVER (ORDER BY statusDate) AS rn,
                ROW_NUMBER() OVER (PARTITION BY status ORDER BY statusDate) AS rns
        FROM    @tracking
        ),
        qs AS
        (
        SELECT  *,
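                -- without status in the key, rows 5-6 (status 3) and rows 8-9 (status 2)
                -- both get rn - rns = 4 and would be merged into one group
                ROW_NUMBER() OVER (PARTITION BY status, rn - rns ORDER BY statusDate) AS rnn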
        FROM    q
        )
SELECT  *
FROM    qs
WHERE   rnn = 1
ORDER BY
        statusDate
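
The question's edit mentions that the source holds a mix of TrackingIds. As a rough sketch of how the same row-number-difference idea might be extended to that case, every window can be partitioned by the tracking id as well. The table variable `@events` and its sample rows are made up for illustration and are not from the question.

```sql
-- Sketch only: @events and its rows are hypothetical sample data.
DECLARE @events TABLE
        (
        trackingId INT NOT NULL,
        status INT NOT NULL,
        statusDate DATETIME NOT NULL
        );

INSERT INTO @events
SELECT  1, 1, '2010-01-02T00:00:00' UNION ALL
SELECT  1, 2, '2010-01-03T00:00:00' UNION ALL
SELECT  1, 2, '2010-01-04T00:00:00' UNION ALL
SELECT  1, 1, '2010-01-05T00:00:00' UNION ALL
SELECT  2, 2, '2010-01-02T00:00:00' UNION ALL
SELECT  2, 2, '2010-01-03T00:00:00' UNION ALL
SELECT  2, 3, '2010-01-04T00:00:00';

WITH    q AS
        (
        SELECT  trackingId, status, statusDate,
                -- position of the row within its tracking id
                ROW_NUMBER() OVER (PARTITION BY trackingId ORDER BY statusDate) AS rn,
                -- position of the row among rows with the same tracking id and status
                ROW_NUMBER() OVER (PARTITION BY trackingId, status ORDER BY statusDate) AS rns
        FROM    @events
        ),
        qs AS
        (
        SELECT  trackingId, status, statusDate,
                -- (trackingId, status, rn - rns) identifies one run of consecutive equal statuses
                ROW_NUMBER() OVER (PARTITION BY trackingId, status, rn - rns ORDER BY statusDate) AS rnn
        FROM    q
        )
SELECT  trackingId, status, statusDate
FROM    qs
WHERE   rnn = 1          -- keep the first row of every run
ORDER BY
        trackingId, statusDate;
```

With this sample data the query keeps three rows for trackingId 1 (statuses 1, 2, 1) and two rows for trackingId 2 (statuses 2, 3). It assumes statusDate values are unique within a tracking id; ties would make the row numbers nondeterministic.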

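Answer 1: (score: 1)

How well this performs will depend on indexing, especially if you are targeting one TrackingID at a time, but here is one way to get the desired results using a CTE and a self-join: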

CREATE TABLE #foo
(
    TrackingID INT,
    [Status] VARCHAR(32),
    StatusDate SMALLDATETIME
);

INSERT #foo SELECT 1, 'PickedUp',  '2010-10-01 08:15';
INSERT #foo SELECT 1, 'InTransit', '2010-10-02 03:07';
INSERT #foo SELECT 1, 'InTransit', '2010-10-02 10:28';
INSERT #foo SELECT 1, 'Delayed',   '2010-10-03 09:52';
INSERT #foo SELECT 1, 'InTransit', '2010-10-03 20:09';
INSERT #foo SELECT 1, 'AtDest',    '2010-10-04 13:42';
INSERT #foo SELECT 1, 'Deliv',     '2010-10-04 17:05';

WITH src AS
(
    SELECT 
        TrackingID,
        [Status],
        StatusDate, 
        ab = ROW_NUMBER() OVER (ORDER BY [StatusDate])
    FROM #foo
    WHERE TrackingID = 1
),
realsrc AS
(
    SELECT 
        a.TrackingID,
        leftrow         = a.ab,
        rightrow        = b.ab,
        leftstatus      = a.[Status],
        leftstatusdate  = a.StatusDate,
        rightstatus     = b.[Status],
        rightstatusdate = b.StatusDate 
    FROM src AS a
    LEFT OUTER JOIN src AS b
    ON a.ab = b.ab - 1
)
SELECT 
    Id = ROW_NUMBER() OVER (ORDER BY [leftstatusdate]),
    TrackingID,
    [Status] = leftstatus,
    [StatusDate] = leftstatusdate
FROM
    realsrc
WHERE
    rightrow IS NULL
    OR (leftrow = rightrow - 1 AND leftstatus <> rightstatus)
ORDER BY 
    [StatusDate];
GO
DROP TABLE #foo;

If you need to support multiple TrackingIDs in the same query:

CREATE TABLE #foo
(
    TrackingID INT,
    [Status] VARCHAR(32),
    StatusDate SMALLDATETIME
);

INSERT #foo SELECT 1, 'PickedUp',  '2010-10-01 08:15';
INSERT #foo SELECT 1, 'InTransit', '2010-10-02 03:07';
INSERT #foo SELECT 1, 'InTransit', '2010-10-02 10:28';
INSERT #foo SELECT 1, 'Delayed',   '2010-10-03 09:52';
INSERT #foo SELECT 1, 'InTransit', '2010-10-03 20:09';
INSERT #foo SELECT 1, 'AtDest',    '2010-10-04 13:42';
INSERT #foo SELECT 1, 'Deliv',     '2010-10-04 17:05';
INSERT #foo SELECT 2, 'InTransit', '2010-10-02 10:28';
INSERT #foo SELECT 2, 'Delayed',   '2010-10-03 09:52';
INSERT #foo SELECT 2, 'InTransit', '2010-10-03 20:09';
INSERT #foo SELECT 2, 'AtDest',    '2010-10-04 13:42';

WITH src AS
(
    SELECT 
        TrackingID,
        [Status],
        StatusDate, 
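        -- number rows per TrackingID so that a.ab = b.ab - 1 always pairs neighbours of the same TrackingID
        ab = ROW_NUMBER() OVER (PARTITION BY TrackingID ORDER BY [StatusDate])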
    FROM #foo
),
realsrc AS
(
    SELECT 
        a.TrackingID,
        leftrow         = a.ab,
        rightrow        = b.ab,
        leftstatus      = a.[Status],
        leftstatusdate  = a.StatusDate,
        rightstatus     = b.[Status],
        rightstatusdate = b.StatusDate 
    FROM src AS a
    LEFT OUTER JOIN src AS b
    ON a.ab = b.ab - 1
    AND a.TrackingID = b.TrackingID
)
SELECT 
    Id = ROW_NUMBER() OVER (ORDER BY TrackingID, [leftstatusdate]),
    TrackingID,
    [Status] = leftstatus,
    [StatusDate] = leftstatusdate
FROM
    realsrc
WHERE
    rightrow IS NULL
    OR (leftrow = rightrow - 1 AND leftstatus <> rightstatus)
ORDER BY 
    TrackingID, 
    [StatusDate];
GO
DROP TABLE #foo;
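
On SQL Server 2012 or later (which postdates this question), the self-join might be replaced with the `LEAD` window function. The following is only a sketch of that alternative, not the approach used in the answer above; the temp table `#events` and its rows are hypothetical. Like the self-join, it keeps a row when the next row for the same TrackingID is missing or has a different status.

```sql
-- Sketch only; requires SQL Server 2012+ because of LEAD. #events is a hypothetical table.
CREATE TABLE #events
(
    TrackingID INT NOT NULL,
    [Status] VARCHAR(32) NOT NULL,
    StatusDate DATETIME NOT NULL
);

INSERT #events SELECT 1, 'PickedUp',  '2010-10-01T08:15:00';
INSERT #events SELECT 1, 'InTransit', '2010-10-02T03:07:00';
INSERT #events SELECT 1, 'InTransit', '2010-10-02T10:28:00';
INSERT #events SELECT 1, 'Delayed',   '2010-10-03T09:52:00';
INSERT #events SELECT 2, 'InTransit', '2010-10-02T10:28:00';
INSERT #events SELECT 2, 'InTransit', '2010-10-03T09:52:00';
INSERT #events SELECT 2, 'AtDest',    '2010-10-04T13:42:00';

WITH src AS
(
    SELECT
        TrackingID,
        [Status],
        StatusDate,
        -- status of the following row for the same TrackingID (NULL on the last row)
        NextStatus = LEAD([Status]) OVER (PARTITION BY TrackingID ORDER BY StatusDate)
    FROM #events
)
SELECT
    TrackingID,
    [Status],
    StatusDate
FROM
    src
WHERE
    NextStatus IS NULL
    OR NextStatus <> [Status]
ORDER BY
    TrackingID,
    StatusDate;

DROP TABLE #events;
```

Swapping `LEAD` for `LAG` (and testing the previous status instead) would keep the first row of each run rather than the last, which may be closer to what the question describes.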

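Answer 2: (score: 1)

Here you go. I will leave it to you to clean up and optimize. One of the subqueries could go into a view, and the messy date comparisons could be cleaned up. If you are using SQL 2008 R2, cast to DATE instead.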

declare @tbl1 table(
    id int, Trackingid int, Status varchar(50), StatusDate datetime
)

declare @tbl2 table(
    id int, Trackingid int, Status varchar(50), StatusDate datetime
)

----Source data
insert into @tbl1 (id, trackingid, status, statusdate) values(1,1,'PickedUp','10/01/10  1:00') --
insert into @tbl1 (id, trackingid, status, statusdate) values(2,1,'InTransit','10/02/10 1:00') --
insert into @tbl1 (id, trackingid, status, statusdate) values(8,1,'InTransit','10/02/10  3:00')
insert into @tbl1 (id, trackingid, status, statusdate) values(4,1,'Delayed','10/03/10 1:00')
insert into @tbl1 (id, trackingid, status, statusdate) values(5,1,'InTransit','10/03/10 1:01')
insert into @tbl1 (id, trackingid, status, statusdate) values(6,1,'AtDest','10/03/10 2:00')
insert into @tbl1 (id, trackingid, status, statusdate) values(7,1,'Deliv','10/03/10 3:00') --
insert into @tbl1 (id, trackingid, status, statusdate) values(3,2,'InTransit','10/03/10 1:00')
insert into @tbl1 (id, trackingid, status, statusdate) values(9,2,'AtDest','10/04/10 1:00')
insert into @tbl1 (id, trackingid, status, statusdate) values(10,2,'Deliv','10/04/10 1:05')
insert into @tbl1 (id, trackingid, status, statusdate) values(11,1,'Delayed','10/02/10 2:05')

----Target data
insert into @tbl2 (id, trackingid, status, statusdate) values(1,1,'PickedUp','10/01/10  1:00')
insert into @tbl2 (id, trackingid, status, statusdate) values(2,1,'InTransit','10/02/10 1:00')
insert into @tbl2 (id, trackingid, status, statusdate) values(3,1,'Deliv','10/03/10 3:00')


select d.* from
(
    select 
    * ,
    ROW_NUMBER() OVER(PARTITION BY trackingid, CAST((STR( YEAR( statusdate ) ) + '/' +STR( MONTH(statusdate ) ) + '/' +STR( DAY( statusdate ) )) AS DATETIME) ORDER BY statusdate) AS 'RN'
    from @tbl1
) d

where 
not exists
(
    select RN from
    (
        select 
        * ,
        ROW_NUMBER() OVER(PARTITION BY trackingid, CAST((STR( YEAR( statusdate ) ) + '/' +STR( MONTH(statusdate ) ) + '/' +STR( DAY( statusdate ) )) AS DATETIME) ORDER BY statusdate) AS 'RN'
        from @tbl1
    )f where f.RN = d.RN + 1 and d.status = f.status and f.trackingid = d.trackingid and 
    CAST((STR( YEAR( f.statusdate ) ) + '/' +STR( MONTH(f.statusdate ) ) + '/' +STR( DAY( f.statusdate ) )) AS DATETIME) =
            CAST((STR( YEAR( d.statusdate ) ) + '/' +STR( MONTH(d.statusdate ) ) + '/' +STR( DAY( d.statusdate ) )) AS DATETIME)
)

and
not exists 
(
    select 1 from @tbl2 t2
    where (t2.trackingid = d.trackingid
    and t2.statusdate = d.statusdate
    and t2.status = d.status)
)
and (
    not exists
    (
        select 1 from
        (
            select top 1 * from @tbl2 t2 
            where t2.trackingid = d.trackingid
            order by t2.statusdate desc
        ) g
        where g.status = d.status
    )
    or not exists
    (
        select 1 from
        (
            select top 1 * from @tbl2 t2 
            where t2.trackingid = d.trackingid
            and t2.statusdate <= d.statusdate
            order by t2.statusdate desc
        ) g
        where g.status = d.status
    )
)
order by trackingid,statusdate
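
To illustrate the date cleanup mentioned at the top of this answer, here is a small sketch of two common ways to strip the time portion without string concatenation. The table variable `@t` is hypothetical. The `DATE` type exists from SQL Server 2008 onward; the `DATEADD`/`DATEDIFF` form also works on SQL Server 2005.

```sql
-- Sketch only: @t is a hypothetical one-row table.
DECLARE @t TABLE (statusdate DATETIME NOT NULL);

INSERT INTO @t
SELECT '2010-10-03T01:01:00';

SELECT
    statusdate,
    -- SQL Server 2008+: DATE carries no time portion
    CAST(statusdate AS DATE) AS day_as_date,
    -- SQL Server 2005+: whole days since day 0, added back to day 0 (midnight of the same day)
    DATEADD(DAY, DATEDIFF(DAY, 0, statusdate), 0) AS day_as_datetime
FROM @t;
```

Either expression could stand in for the `CAST((STR(YEAR(...)) + '/' + ...) AS DATETIME)` construct in the `PARTITION BY` clauses and comparisons above.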

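Answer 3: (score: 0)

If this is SQL 2005, you can use ROW_NUMBER with either a subquery or a CTE. If the data set is really large and performance is a concern, one of the answers posted above while I was trying to get the code block to work may well be more efficient.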

/**
*  This is just to create a sample table to use in the test query
**/

DECLARE @test TABLE(ID INT, TrackingID INT, Status VARCHAR(20), StatusDate DATETIME)
INSERT    @test
SELECT    1,1,'PickedUp', '01 jan 2010 08:00' UNION
SELECT    2,1,'InTransit', '01 jan 2010 08:01' UNION
SELECT    3,1,'InTransit', '01 jan 2010 08:02' UNION
SELECT    4,1,'Delayed', '01 jan 2010 08:03' UNION
SELECT    5,1,'InTransit', '01 jan 2010 08:04' UNION
SELECT    6,1,'AtDest', '01 jan 2010 08:05' UNION
SELECT    7,1,'Deliv', '01 jan 2010 08:06'


/**
*  This would be the select code to exclude the duplicate entries. 
*  Sorting desc in row_number would get latest instead of first
**/
;WITH n AS
(
    SELECT    ID,
            TrackingID,
            Status,
            StatusDate,
            --For each Status for a tracking ID number by ID (could use date but 2 may be the same)
            ROW_NUMBER() OVER(PARTITION BY TrackingID, Status ORDER BY ID) AS [StatusNumber]
    FROM    @test
)
SELECT    ID,
        TrackingID,
        Status,
        StatusDate
FROM    n
WHERE    StatusNumber = 1
ORDER    BY ID

Answer 4: (score: 0)

I think this example does what you are looking for:

CREATE TABLE dbo.srcStatus (
 Id INT IDENTITY(1,1),
 TrackingId INT NOT NULL,
 [Status] VARCHAR(10) NOT NULL,
 StatusDate DATETIME NOT NULL
);

CREATE TABLE dbo.tgtStatus (
 Id INT IDENTITY(1,1),
 TrackingId INT NOT NULL,
 [Status] VARCHAR(10) NOT NULL,
 StatusDate DATETIME NOT NULL
);

INSERT INTO dbo.srcStatus ( TrackingId, [Status], StatusDate ) VALUES  ( 1,'PickedUp','10/1/2010 8:15 AM');
INSERT INTO dbo.srcStatus ( TrackingId, [Status], StatusDate ) VALUES  ( 1,'InTransit','10/2/2010 3:07 AM');
INSERT INTO dbo.srcStatus ( TrackingId, [Status], StatusDate ) VALUES  ( 1,'InTransit','10/2/2010 10:28 AM');
INSERT INTO dbo.srcStatus ( TrackingId, [Status], StatusDate ) VALUES  ( 2,'PickedUp','10/1/2010 8:15 AM');
INSERT INTO dbo.srcStatus ( TrackingId, [Status], StatusDate ) VALUES  ( 2,'InTransit','10/2/2010 3:07 AM');
INSERT INTO dbo.srcStatus ( TrackingId, [Status], StatusDate ) VALUES  ( 2,'Delayed','10/2/2010 10:28 AM');
INSERT INTO dbo.srcStatus ( TrackingId, [Status], StatusDate ) VALUES  ( 1,'Delayed','10/3/2010 9:52 AM');
INSERT INTO dbo.srcStatus ( TrackingId, [Status], StatusDate ) VALUES  ( 1,'InTransit','10/3/2010 8:09 PM');
INSERT INTO dbo.srcStatus ( TrackingId, [Status], StatusDate ) VALUES  ( 1,'AtDest','10/4/2010 1:42 PM');
INSERT INTO dbo.srcStatus ( TrackingId, [Status], StatusDate ) VALUES  ( 1,'Deliv','10/4/2010 5:05 PM');
INSERT INTO dbo.srcStatus ( TrackingId, [Status], StatusDate ) VALUES  ( 2,'InTransit','10/3/2010 9:52 AM');
INSERT INTO dbo.srcStatus ( TrackingId, [Status], StatusDate ) VALUES  ( 2,'InTransit','10/3/2010 8:09 PM');
INSERT INTO dbo.srcStatus ( TrackingId, [Status], StatusDate ) VALUES  ( 2,'AtDest','10/4/2010 1:42 PM');
INSERT INTO dbo.srcStatus ( TrackingId, [Status], StatusDate ) VALUES  ( 2,'Deliv','10/4/2010 5:05 PM');

WITH    cteSrcTrackingIds
          AS ( SELECT DISTINCT
                        TrackingId
               FROM     dbo.srcStatus
             ),
        cteAllTrackingIds
          AS ( SELECT   TrackingId ,
                        [Status] ,
                        StatusDate
               FROM     dbo.srcStatus
               UNION
               SELECT   tgtStatus.TrackingId ,
                        tgtStatus.[Status] ,
                        tgtStatus.StatusDate
               FROM     cteSrcTrackingIds
                        INNER JOIN dbo.tgtStatus ON cteSrcTrackingIds.TrackingId = tgtStatus.TrackingId
             ),
        cteAllTrackingIdsWithRownums
          AS ( SELECT   TrackingId ,
                        [Status] ,
                        StatusDate ,
                        ROW_NUMBER() OVER ( PARTITION BY TrackingId ORDER BY StatusDate ) AS rownum
               FROM     cteAllTrackingIds
             ),
        cteTrackingIdsWorkingSet
          AS ( SELECT   src.rownum AS [id] ,
                        src2.rownum AS [id2] ,
                        src.TrackingId ,
                        src.[Status] ,
                        src.StatusDate ,
                        ROW_NUMBER() OVER ( PARTITION BY src.TrackingId,
                                            src.rownum ORDER BY src.StatusDate ) AS rownum
               FROM     cteAllTrackingIdsWithRownums AS [src]
                        LEFT OUTER JOIN cteAllTrackingIdsWithRownums AS [src2] ON src.TrackingId = src2.TrackingId
                                                              AND src.rownum < src2.rownum
                                                              AND src.[Status] != src2.[Status]
             ),
        cteTrackingIdsSubset
          AS ( SELECT   id ,
                        TrackingId ,
                        [Status] ,
                        StatusDate ,
                        ROW_NUMBER() OVER ( PARTITION BY TrackingId, id2 ORDER BY id ) AS rownum
               FROM     cteTrackingIdsWorkingSet
               WHERE    rownum = 1
             )
    INSERT  INTO dbo.tgtStatus
            ( TrackingId ,
              [status] ,
              StatusDate
            )
            SELECT  cteTrackingIdsSubset.TrackingId ,
                    cteTrackingIdsSubset.[status] ,
                    cteTrackingIdsSubset.StatusDate
            FROM    cteTrackingIdsSubset
                    LEFT OUTER JOIN dbo.tgtStatus ON cteTrackingIdsSubset.TrackingId = tgtStatus.TrackingId
                                                     AND cteTrackingIdsSubset.[status] = tgtStatus.[status]
                                                     AND cteTrackingIdsSubset.StatusDate = tgtStatus.StatusDate
            WHERE   cteTrackingIdsSubset.rownum = 1
                    AND tgtStatus.id IS NULL
            ORDER BY cteTrackingIdsSubset.TrackingId ,
                    cteTrackingIdsSubset.StatusDate;
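
For comparison, step 2 of the question (skip a new status when it repeats the most recent one) could also be sketched with `OUTER APPLY`, available from SQL Server 2005. This is an illustration under stated assumptions, not a tuned implementation: the temp tables `#src` and `#tgt` and their rows are hypothetical, StatusDate is assumed unique per TrackingId, Status is assumed never to be an empty string, and source rows older than the latest target row are ignored.

```sql
-- Sketch only: #src, #tgt and their rows are hypothetical.
CREATE TABLE #src (TrackingId INT NOT NULL, [Status] VARCHAR(10) NOT NULL, StatusDate DATETIME NOT NULL);
CREATE TABLE #tgt (TrackingId INT NOT NULL, [Status] VARCHAR(10) NOT NULL, StatusDate DATETIME NOT NULL);

INSERT INTO #tgt
SELECT 1, 'PickedUp',  '2010-10-01T08:15:00' UNION ALL
SELECT 1, 'InTransit', '2010-10-02T03:07:00';

INSERT INTO #src
SELECT 1, 'PickedUp',  '2010-10-01T08:15:00' UNION ALL
SELECT 1, 'InTransit', '2010-10-02T03:07:00' UNION ALL
SELECT 1, 'InTransit', '2010-10-02T10:28:00' UNION ALL
SELECT 1, 'Delayed',   '2010-10-03T09:52:00' UNION ALL
SELECT 1, 'InTransit', '2010-10-03T20:09:00' UNION ALL
SELECT 2, 'InTransit', '2010-10-02T10:28:00' UNION ALL
SELECT 2, 'InTransit', '2010-10-03T09:52:00';

INSERT INTO #tgt (TrackingId, [Status], StatusDate)
SELECT  s.TrackingId, s.[Status], s.StatusDate
FROM    #src AS s
OUTER APPLY
        (
        -- latest row already loaded for this TrackingId, if any
        SELECT TOP (1) t.[Status], t.StatusDate
        FROM    #tgt AS t
        WHERE   t.TrackingId = s.TrackingId
        ORDER BY t.StatusDate DESC
        ) AS lastTgt
OUTER APPLY
        (
        -- not-yet-loaded source row immediately before s, if any
        SELECT TOP (1) p.[Status]
        FROM    #src AS p
        WHERE   p.TrackingId = s.TrackingId
                AND p.StatusDate < s.StatusDate
                AND (lastTgt.StatusDate IS NULL OR p.StatusDate > lastTgt.StatusDate)
        ORDER BY p.StatusDate DESC
        ) AS prevSrc
WHERE   -- only rows newer than what the target already holds
        (lastTgt.StatusDate IS NULL OR s.StatusDate > lastTgt.StatusDate)
        -- compare with the previous new row, else with the latest target row, else with nothing
        AND COALESCE(prevSrc.[Status], lastTgt.[Status], '') <> s.[Status];

SELECT TrackingId, [Status], StatusDate FROM #tgt ORDER BY TrackingId, StatusDate;

DROP TABLE #src;
DROP TABLE #tgt;
```

With this sample data, three rows are added to the target: Delayed and the later InTransit for TrackingId 1, and the first InTransit for TrackingId 2. An index on (TrackingId, StatusDate) on both tables would likely matter for the two `TOP (1)` lookups at 100k+ rows.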