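I am facing a conceptual problem that I am having a hard time getting past. I am hoping the SO folks can point me in the right direction.
I am doing some ETL work where the source data is very repetitive and very large. I load it into a table that is used for replication, and I only want the most essential information in this target table.
My source table looks like this:
I need my target table to reflect this:
As you can see, I am not copying the InTransit statuses that are repeated in the source table. The steps I am trying to work out how to implement are
My source table can easily contain 100k+ rows, and this needs to run every 15 minutes, so it has to be very efficient. For that reason I really want to avoid cursors.
Right now the only way I can see to do this is with a CLR sproc, but I suspect there is a better way, so I am hoping you can nudge me in the right direction.
I am sure I have left out something you might need, so please let me know what other information would help and I will happily provide it.
Thanks in advance!
Edit: OK, my question was not clear enough. My source table will contain multiple tracking IDs. It may hold 100k+ rows, with multiple TrackingIds and multiple statuses for each TrackingId. I have to update the target table as shown above for each tracking ID, but my source will be a mix of TrackingIds.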
Answer 0 (score: 2)
Here is a solution without self-joins:
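-- @id is the tracking ID to process.
-- rn numbers all rows by date; rns numbers rows within each status.
-- rn - rns stays constant within a run of consecutive rows that share a status.
WITH q AS
(
    SELECT *,
           ROW_NUMBER() OVER (ORDER BY statusDate) AS rn,
           ROW_NUMBER() OVER (PARTITION BY status ORDER BY statusDate) AS rns
    FROM tracking
    WHERE trackingId = @id
),
qs AS
(
    -- Partition by status too: runs of different statuses can share the same rn - rns.
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY status, rn - rns ORDER BY statusDate) AS rnn
    FROM q
)
SELECT *
FROM qs
WHERE rnn = 1
ORDER BY
    statusDate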
Here is a script to test it:
DECLARE @tracking TABLE
(
id INT NOT NULL PRIMARY KEY,
trackingId INT NOT NULL,
status INT,
statusDate DATETIME
)
INSERT
INTO @tracking
SELECT 1, 1, 1, DATEADD(d, 1, '2010-01-01')
UNION ALL
SELECT 2, 1, 2, DATEADD(d, 2, '2010-01-01')
UNION ALL
SELECT 3, 1, 2, DATEADD(d, 3, '2010-01-01')
UNION ALL
SELECT 4, 1, 2, DATEADD(d, 4, '2010-01-01')
UNION ALL
SELECT 5, 1, 3, DATEADD(d, 5, '2010-01-01')
UNION ALL
SELECT 6, 1, 3, DATEADD(d, 6, '2010-01-01')
UNION ALL
SELECT 7, 1, 4, DATEADD(d, 7, '2010-01-01')
UNION ALL
SELECT 8, 1, 2, DATEADD(d, 8, '2010-01-01')
UNION ALL
SELECT 9, 1, 2, DATEADD(d, 9, '2010-01-01')
UNION ALL
SELECT 10, 1, 1, DATEADD(d, 10, '2010-01-01')
;
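WITH q AS
(
    SELECT *,
           ROW_NUMBER() OVER (ORDER BY statusDate) AS rn,
           ROW_NUMBER() OVER (PARTITION BY status ORDER BY statusDate) AS rns
    FROM @tracking
),
qs AS
(
    -- status must be part of the partition: in this sample the status 3 run (ids 5-6)
    -- and the second status 2 run (ids 8-9) both have rn - rns = 4.
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY status, rn - rns ORDER BY statusDate) AS rnn
    FROM q
)
SELECT *
FROM qs
WHERE rnn = 1
ORDER BY
    statusDate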
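The question's edit says the source mixes many tracking IDs, so the same row-number-difference idea can be extended by adding trackingId to every window. The following is only a sketch, not part of the original answer: the @mixed table variable and its rows are made up for illustration, and it assumes statusDate is unique within each trackingId.

```sql
-- Hypothetical sample data: two tracking IDs interleaved in one table.
DECLARE @mixed TABLE
(
    id INT NOT NULL PRIMARY KEY,
    trackingId INT NOT NULL,
    status INT,
    statusDate DATETIME
);

INSERT INTO @mixed
SELECT 1, 1, 1, '20100101' UNION ALL
SELECT 2, 2, 2, '20100102' UNION ALL
SELECT 3, 1, 2, '20100103' UNION ALL
SELECT 4, 1, 2, '20100104' UNION ALL
SELECT 5, 2, 2, '20100105' UNION ALL
SELECT 6, 1, 1, '20100106' UNION ALL
SELECT 7, 2, 3, '20100107';

WITH q AS
(
    SELECT *,
           -- Position of the row within its tracking ID.
           ROW_NUMBER() OVER (PARTITION BY trackingId ORDER BY statusDate) AS rn,
           -- Position of the row within its tracking ID and status.
           ROW_NUMBER() OVER (PARTITION BY trackingId, status ORDER BY statusDate) AS rns
    FROM @mixed
),
qs AS
(
    SELECT *,
           -- One partition per run of consecutive equal statuses.
           ROW_NUMBER() OVER (PARTITION BY trackingId, status, rn - rns
                              ORDER BY statusDate) AS rnn
    FROM q
)
SELECT id, trackingId, status, statusDate
FROM qs
WHERE rnn = 1
ORDER BY trackingId, statusDate;
```

With this sample data the query keeps ids 1, 3 and 6 for tracking ID 1 and ids 2 and 7 for tracking ID 2, that is, the first row of each run.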
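Answer 1 (score: 1)
How well this performs depends on indexing, especially if you target one TrackingID at a time, but here is one way to get the desired result using CTEs and a self-join: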
CREATE TABLE #foo
(
TrackingID INT,
[Status] VARCHAR(32),
StatusDate SMALLDATETIME
);
INSERT #foo SELECT 1, 'PickedUp', '2010-10-01 08:15';
INSERT #foo SELECT 1, 'InTransit', '2010-10-02 03:07';
INSERT #foo SELECT 1, 'InTransit', '2010-10-02 10:28';
INSERT #foo SELECT 1, 'Delayed', '2010-10-03 09:52';
INSERT #foo SELECT 1, 'InTransit', '2010-10-03 20:09';
INSERT #foo SELECT 1, 'AtDest', '2010-10-04 13:42';
INSERT #foo SELECT 1, 'Deliv', '2010-10-04 17:05';
WITH src AS
(
SELECT
TrackingID,
[Status],
StatusDate,
ab = ROW_NUMBER() OVER (ORDER BY [StatusDate])
FROM #foo
WHERE TrackingID = 1
),
realsrc AS
(
SELECT
a.TrackingID,
leftrow = a.ab,
rightrow = b.ab,
leftstatus = a.[Status],
leftstatusdate = a.StatusDate,
rightstatus = b.[Status],
rightstatusdate = b.StatusDate
FROM src AS a
LEFT OUTER JOIN src AS b
ON a.ab = b.ab - 1
)
SELECT
Id = ROW_NUMBER() OVER (ORDER BY [leftstatusdate]),
TrackingID,
[Status] = leftstatus,
[StatusDate] = leftstatusdate
FROM
realsrc
WHERE
rightrow IS NULL
OR (leftrow = rightrow - 1 AND leftstatus <> rightstatus)
ORDER BY
[StatusDate];
GO
DROP TABLE #foo;
If you need to support multiple TrackingIDs in the same query:
CREATE TABLE #foo
(
TrackingID INT,
[Status] VARCHAR(32),
StatusDate SMALLDATETIME
);
INSERT #foo SELECT 1, 'PickedUp', '2010-10-01 08:15';
INSERT #foo SELECT 1, 'InTransit', '2010-10-02 03:07';
INSERT #foo SELECT 1, 'InTransit', '2010-10-02 10:28';
INSERT #foo SELECT 1, 'Delayed', '2010-10-03 09:52';
INSERT #foo SELECT 1, 'InTransit', '2010-10-03 20:09';
INSERT #foo SELECT 1, 'AtDest', '2010-10-04 13:42';
INSERT #foo SELECT 1, 'Deliv', '2010-10-04 17:05';
INSERT #foo SELECT 2, 'InTransit', '2010-10-02 10:28';
INSERT #foo SELECT 2, 'Delayed', '2010-10-03 09:52';
INSERT #foo SELECT 2, 'InTransit', '2010-10-03 20:09';
INSERT #foo SELECT 2, 'AtDest', '2010-10-04 13:42';
WITH src AS
(
SELECT
TrackingID,
[Status],
StatusDate,
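ab = ROW_NUMBER() OVER (PARTITION BY TrackingID ORDER BY [StatusDate]) -- number rows per TrackingID so "next row" means the next row of the same ID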
FROM #foo
),
realsrc AS
(
SELECT
a.TrackingID,
leftrow = a.ab,
rightrow = b.ab,
leftstatus = a.[Status],
leftstatusdate = a.StatusDate,
rightstatus = b.[Status],
rightstatusdate = b.StatusDate
FROM src AS a
LEFT OUTER JOIN src AS b
ON a.ab = b.ab - 1
AND a.TrackingID = b.TrackingID
)
SELECT
Id = ROW_NUMBER() OVER (ORDER BY TrackingID, [leftstatusdate]),
TrackingID,
[Status] = leftstatus,
[StatusDate] = leftstatusdate
FROM
realsrc
WHERE
rightrow IS NULL
OR (leftrow = rightrow - 1 AND leftstatus <> rightstatus)
ORDER BY
TrackingID,
[StatusDate];
GO
DROP TABLE #foo;
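The answer above notes that performance depends on indexing. As a rough sketch of what that could mean (the #trackingStage table and the index name are hypothetical, not from the original answer), an index keyed on the columns that ROW_NUMBER() partitions and orders by may let SQL Server read the rows already in order instead of sorting them on every run:

```sql
-- Hypothetical staging table with the same columns as #foo above.
CREATE TABLE #trackingStage
(
    TrackingID INT           NOT NULL,
    [Status]   VARCHAR(32)   NOT NULL,
    StatusDate SMALLDATETIME NOT NULL
);

-- Key order matches PARTITION BY TrackingID ORDER BY StatusDate.
CREATE CLUSTERED INDEX IX_trackingStage_TrackingID_StatusDate
    ON #trackingStage (TrackingID, StatusDate);

INSERT #trackingStage SELECT 1, 'PickedUp',  '20101001 08:15';
INSERT #trackingStage SELECT 1, 'InTransit', '20101002 03:07';
INSERT #trackingStage SELECT 2, 'InTransit', '20101002 10:28';

-- The numbering step used by the self-join, now backed by the index.
SELECT
    TrackingID,
    [Status],
    StatusDate,
    ab = ROW_NUMBER() OVER (PARTITION BY TrackingID ORDER BY StatusDate)
FROM #trackingStage;

DROP TABLE #trackingStage;
```

Whether this actually helps depends on the data and the rest of the query, so it is worth comparing the actual execution plans with and without the index.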
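Answer 2 (score: 1)
Here you go. I will leave it to you to clean it up and optimize it. One of the subqueries could go into a view, and the messy date comparison could be cleaned up. If you are on SQL Server 2008 or later, use CAST(... AS DATE) instead.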
declare @tbl1 table(
id int, Trackingid int, Status varchar(50), StatusDate datetime
)
declare @tbl2 table(
id int, Trackingid int, Status varchar(50), StatusDate datetime
)
----Source data
insert into @tbl1 (id, trackingid, status, statusdate) values(1,1,'PickedUp','10/01/10 1:00') --
insert into @tbl1 (id, trackingid, status, statusdate) values(2,1,'InTransit','10/02/10 1:00') --
insert into @tbl1 (id, trackingid, status, statusdate) values(8,1,'InTransit','10/02/10 3:00')
insert into @tbl1 (id, trackingid, status, statusdate) values(4,1,'Delayed','10/03/10 1:00')
insert into @tbl1 (id, trackingid, status, statusdate) values(5,1,'InTransit','10/03/10 1:01')
insert into @tbl1 (id, trackingid, status, statusdate) values(6,1,'AtDest','10/03/10 2:00')
insert into @tbl1 (id, trackingid, status, statusdate) values(7,1,'Deliv','10/03/10 3:00') --
insert into @tbl1 (id, trackingid, status, statusdate) values(3,2,'InTransit','10/03/10 1:00')
insert into @tbl1 (id, trackingid, status, statusdate) values(9,2,'AtDest','10/04/10 1:00')
insert into @tbl1 (id, trackingid, status, statusdate) values(10,2,'Deliv','10/04/10 1:05')
insert into @tbl1 (id, trackingid, status, statusdate) values(11,1,'Delayed','10/02/10 2:05')
----Target data
insert into @tbl2 (id, trackingid, status, statusdate) values(1,1,'PickedUp','10/01/10 1:00')
insert into @tbl2 (id, trackingid, status, statusdate) values(2,1,'InTransit','10/02/10 1:00')
insert into @tbl2 (id, trackingid, status, statusdate) values(3,1,'Deliv','10/03/10 3:00')
select d.* from
(
select
* ,
ROW_NUMBER() OVER(PARTITION BY trackingid, CAST((STR( YEAR( statusdate ) ) + '/' +STR( MONTH(statusdate ) ) + '/' +STR( DAY( statusdate ) )) AS DATETIME) ORDER BY statusdate) AS 'RN'
from @tbl1
) d
where
not exists
(
select RN from
(
select
* ,
ROW_NUMBER() OVER(PARTITION BY trackingid, CAST((STR( YEAR( statusdate ) ) + '/' +STR( MONTH(statusdate ) ) + '/' +STR( DAY( statusdate ) )) AS DATETIME) ORDER BY statusdate) AS 'RN'
from @tbl1
)f where f.RN = d.RN + 1 and d.status = f.status and f.trackingid = d.trackingid and
CAST((STR( YEAR( f.statusdate ) ) + '/' +STR( MONTH(f.statusdate ) ) + '/' +STR( DAY( f.statusdate ) )) AS DATETIME) =
CAST((STR( YEAR( d.statusdate ) ) + '/' +STR( MONTH(d.statusdate ) ) + '/' +STR( DAY( d.statusdate ) )) AS DATETIME)
)
and
not exists
(
select 1 from @tbl2 t2
where (t2.trackingid = d.trackingid
and t2.statusdate = d.statusdate
and t2.status = d.status)
)
and (
not exists
(
select 1 from
(
select top 1 * from @tbl2 t2
where t2.trackingid = d.trackingid
order by t2.statusdate desc
) g
where g.status = d.status
)
or not exists
(
select 1 from
(
select top 1 * from @tbl2 t2
where t2.trackingid = d.trackingid
and t2.statusdate <= d.statusdate
order by t2.statusdate desc
) g
where g.status = d.status
)
)
order by trackingid,statusdate
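The hint about casting to DATE can be illustrated with a small sketch. It assumes SQL Server 2008 or later, where the DATE type exists; the @events table variable and its rows are made up for illustration. Casting a DATETIME to DATE drops the time portion, so it can stand in for the YEAR/MONTH/DAY string concatenation in the PARTITION BY clause:

```sql
-- Hypothetical sample rows; the DATE type requires SQL Server 2008 or later.
DECLARE @events TABLE
(
    id INT, trackingid INT, status VARCHAR(50), statusdate DATETIME
);

INSERT INTO @events (id, trackingid, status, statusdate) VALUES (1, 1, 'InTransit', '20101002 01:00');
INSERT INTO @events (id, trackingid, status, statusdate) VALUES (2, 1, 'InTransit', '20101002 03:00');
INSERT INTO @events (id, trackingid, status, statusdate) VALUES (3, 1, 'InTransit', '20101003 01:00');

SELECT e.*,
       -- Number the rows per tracking ID and calendar day.
       ROW_NUMBER() OVER (PARTITION BY e.trackingid, CAST(e.statusdate AS DATE)
                          ORDER BY e.statusdate) AS RN
FROM @events AS e
ORDER BY e.trackingid, e.statusdate;
```

With these three rows, RN comes out as 1 and 2 for the two rows on 2 October and 1 for the row on 3 October.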
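Answer 3 (score: 0)
If this is SQL 2005, you can use ROW_NUMBER with a subquery or a CTE. If the data set is really large and performance is a concern, one of the answers posted above while I was trying to get the code block to work may well be more efficient.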
/**
* This is just to create a sample table to use in the test query
**/
DECLARE @test TABLE(ID INT, TrackingID INT, Status VARCHAR(20), StatusDate DATETIME)
INSERT @test
SELECT 1,1,'PickedUp', '01 jan 2010 08:00' UNION
SELECT 2,1,'InTransit', '01 jan 2010 08:01' UNION
SELECT 3,1,'InTransit', '01 jan 2010 08:02' UNION
SELECT 4,1,'Delayed', '01 jan 2010 08:03' UNION
SELECT 5,1,'InTransit', '01 jan 2010 08:04' UNION
SELECT 6,1,'AtDest', '01 jan 2010 08:05' UNION
SELECT 7,1,'Deliv', '01 jan 2010 08:06'
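/**
* This would be the select code to exclude the duplicate entries.
* Sorting desc in row_number would get latest instead of first.
* Note: rows are numbered per (TrackingID, Status), so only the first
* occurrence of each status is kept; a status that recurs later
* (InTransit at ID 5, after Delayed) is dropped as well.
**/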
;WITH n AS
(
SELECT ID,
TrackingID,
Status,
StatusDate,
--For each Status for a tracking ID number by ID (could use date but 2 may be the same)
ROW_NUMBER() OVER(PARTITION BY TrackingID, Status ORDER BY ID) AS [StatusNumber]
FROM @test
)
SELECT ID,
TrackingID,
Status,
StatusDate
FROM n
WHERE StatusNumber = 1
ORDER BY ID
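The query above numbers rows per TrackingID and Status, so it keeps only the first time each status appears. If a status that comes back after a different one should be kept (for example InTransit again after Delayed), one possible variation, shown here only as a sketch with a made-up @sample table, is to number the rows per TrackingID and compare each row with the one before it:

```sql
-- Hypothetical sample data, shaped like the @test table above.
DECLARE @sample TABLE (ID INT, TrackingID INT, Status VARCHAR(20), StatusDate DATETIME);

INSERT @sample
SELECT 1, 1, 'PickedUp',  '20100101 08:00' UNION ALL
SELECT 2, 1, 'InTransit', '20100101 08:01' UNION ALL
SELECT 3, 1, 'InTransit', '20100101 08:02' UNION ALL
SELECT 4, 1, 'Delayed',   '20100101 08:03' UNION ALL
SELECT 5, 1, 'InTransit', '20100101 08:04';

WITH n AS
(
    SELECT ID,
           TrackingID,
           Status,
           StatusDate,
           -- Sequence of the row within its tracking ID.
           ROW_NUMBER() OVER (PARTITION BY TrackingID ORDER BY ID) AS Seq
    FROM @sample
)
SELECT cur.ID,
       cur.TrackingID,
       cur.Status,
       cur.StatusDate
FROM n AS cur
LEFT JOIN n AS prev
       ON prev.TrackingID = cur.TrackingID
      AND prev.Seq = cur.Seq - 1
WHERE prev.Seq IS NULL            -- first row for this TrackingID
   OR prev.Status <> cur.Status   -- status changed since the previous row
ORDER BY cur.ID;
```

For this sample the result is IDs 1, 2, 4 and 5: the repeated InTransit at ID 3 is skipped, while the InTransit at ID 5 survives because it follows Delayed.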
Answer 4 (score: 0)
I think this example does what you need:
CREATE TABLE dbo.srcStatus (
Id INT IDENTITY(1,1),
TrackingId INT NOT NULL,
[Status] VARCHAR(10) NOT NULL,
StatusDate DATETIME NOT NULL
);
CREATE TABLE dbo.tgtStatus (
Id INT IDENTITY(1,1),
TrackingId INT NOT NULL,
[Status] VARCHAR(10) NOT NULL,
StatusDate DATETIME NOT NULL
);
INSERT INTO dbo.srcStatus ( TrackingId, [Status], StatusDate ) VALUES ( 1,'PickedUp','10/1/2010 8:15 AM');
INSERT INTO dbo.srcStatus ( TrackingId, [Status], StatusDate ) VALUES ( 1,'InTransit','10/2/2010 3:07 AM');
INSERT INTO dbo.srcStatus ( TrackingId, [Status], StatusDate ) VALUES ( 1,'InTransit','10/2/2010 10:28 AM');
INSERT INTO dbo.srcStatus ( TrackingId, [Status], StatusDate ) VALUES ( 2,'PickedUp','10/1/2010 8:15 AM');
INSERT INTO dbo.srcStatus ( TrackingId, [Status], StatusDate ) VALUES ( 2,'InTransit','10/2/2010 3:07 AM');
INSERT INTO dbo.srcStatus ( TrackingId, [Status], StatusDate ) VALUES ( 2,'Delayed','10/2/2010 10:28 AM');
INSERT INTO dbo.srcStatus ( TrackingId, [Status], StatusDate ) VALUES ( 1,'Delayed','10/3/2010 9:52 AM');
INSERT INTO dbo.srcStatus ( TrackingId, [Status], StatusDate ) VALUES ( 1,'InTransit','10/3/2010 8:09 PM');
INSERT INTO dbo.srcStatus ( TrackingId, [Status], StatusDate ) VALUES ( 1,'AtDest','10/4/2010 1:42 PM');
INSERT INTO dbo.srcStatus ( TrackingId, [Status], StatusDate ) VALUES ( 1,'Deliv','10/4/2010 5:05 PM');
INSERT INTO dbo.srcStatus ( TrackingId, [Status], StatusDate ) VALUES ( 2,'InTransit','10/3/2010 9:52 AM');
INSERT INTO dbo.srcStatus ( TrackingId, [Status], StatusDate ) VALUES ( 2,'InTransit','10/3/2010 8:09 PM');
INSERT INTO dbo.srcStatus ( TrackingId, [Status], StatusDate ) VALUES ( 2,'AtDest','10/4/2010 1:42 PM');
INSERT INTO dbo.srcStatus ( TrackingId, [Status], StatusDate ) VALUES ( 2,'Deliv','10/4/2010 5:05 PM');
WITH cteSrcTrackingIds
AS ( SELECT DISTINCT
TrackingId
FROM dbo.srcStatus
),
cteAllTrackingIds
AS ( SELECT TrackingId ,
[Status] ,
StatusDate
FROM dbo.srcStatus
UNION
SELECT tgtStatus.TrackingId ,
tgtStatus.[Status] ,
tgtStatus.StatusDate
FROM cteSrcTrackingIds
INNER JOIN dbo.tgtStatus ON cteSrcTrackingIds.TrackingId = tgtStatus.TrackingId
),
cteAllTrackingIdsWithRownums
AS ( SELECT TrackingId ,
[Status] ,
StatusDate ,
ROW_NUMBER() OVER ( PARTITION BY TrackingId ORDER BY StatusDate ) AS rownum
FROM cteAllTrackingIds
),
cteTrackingIdsWorkingSet
AS ( SELECT src.rownum AS [id] ,
src2.rownum AS [id2] ,
src.TrackingId ,
src.[Status] ,
src.StatusDate ,
ROW_NUMBER() OVER ( PARTITION BY src.TrackingId,
src.rownum ORDER BY src.StatusDate ) AS rownum
FROM cteAllTrackingIdsWithRownums AS [src]
LEFT OUTER JOIN cteAllTrackingIdsWithRownums AS [src2] ON src.TrackingId = src2.TrackingId
AND src.rownum < src2.rownum
AND src.[Status] != src2.[Status]
),
cteTrackingIdsSubset
AS ( SELECT id ,
TrackingId ,
[Status] ,
StatusDate ,
ROW_NUMBER() OVER ( PARTITION BY TrackingId, id2 ORDER BY id ) AS rownum
FROM cteTrackingIdsWorkingSet
WHERE rownum = 1
)
INSERT INTO dbo.tgtStatus
( TrackingId ,
[status] ,
StatusDate
)
SELECT cteTrackingIdsSubset.TrackingId ,
cteTrackingIdsSubset.[status] ,
cteTrackingIdsSubset.StatusDate
FROM cteTrackingIdsSubset
LEFT OUTER JOIN dbo.tgtStatus ON cteTrackingIdsSubset.TrackingId = tgtStatus.TrackingId
AND cteTrackingIdsSubset.[status] = tgtStatus.[status]
AND cteTrackingIdsSubset.StatusDate = tgtStatus.StatusDate
WHERE cteTrackingIdsSubset.rownum = 1
AND tgtStatus.id IS NULL
ORDER BY cteTrackingIdsSubset.TrackingId ,
cteTrackingIdsSubset.StatusDate;