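Grouping a column with multiple repeating groups of data

Time: 2012-10-17 16:25:44

Tags: tsql date range grouping gaps-and-islands

I need to group some data by location across dates, including identifying the date ranges that have no location. I'm part of the way there: I have managed to generate a list of every date in the range together with its location.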

  • date1 Location1
  • date2 Location1
  • date3 Location1
  • date4 Unknown
  • date5 Unknown
  • date6 Unknown
  • date7 Location2
  • date8 Location2
  • date9 Location2
  • date10 Location2
  • date11 Location1
  • date12 Location1
  • date13 Location1

Using a plain GROUP BY (showing MIN(date) and MAX(date)) I would get something like this:

  • Location1, date1, date13
  • Location2, date7, date10
  • Unknown, date4, date6
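As a rough sketch, the plain grouping described above might look like the query below. The table and column names (Junk, aDate, aLocation) are assumptions borrowed from the sample table in the answer further down; firstDate and lastDate are made-up aliases. Because the grouping key is only the location, the two separate Location1 runs collapse into a single row.

```sql
-- Sketch only: assumes a table Junk(aDate, aLocation) like the one in the answer below.
-- Grouping by location alone merges every run of the same location into one row.
SELECT aLocation,
       MIN(aDate) AS firstDate,  -- earliest date seen for the location
       MAX(aDate) AS lastDate    -- latest date seen for the location
FROM Junk
GROUP BY aLocation;
```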

But I want this:

  • Location1, date1, date3
  • Unknown, date4, date6
  • Location2, date7, date10
  • Location1, date11, date13

I also need to filter out short Unknown ranges, but that is secondary.

I hope this makes sense; it seems like it should be really simple.

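1 Answer:

Answer 0: (score: 1)

Read up on the gaps-and-islands problem and on Itzik Ben-Gan. There is a set-based way to get the result you want.

I was looking at using ROW_NUMBER or RANK, but then stumbled across LAG and LEAD (introduced in SQL Server 2012), which are nice. My solution is below. It can definitely be simplified, but writing it as several CTEs makes my thought process (flawed as it may be) easier to follow. I just slowly transform the data into what I want. To see what each new CTE produces, uncomment the SELECTs one at a time.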

create table Junk
(aDate Datetime,
aLocation varchar(32))

insert into Junk values
('2000', 'Location1'),
('2001', 'Location1'),
('2002', 'Location1'),
('2004', 'Unknown'),
('2005', 'Unknown'),
('2006', 'Unknown'),
('2007', 'Location2'),
('2008', 'Location2'),
('2009', 'Location2'),
('2010', 'Location2'),
('2011', 'Location1'),
('2012', 'Location1'),
('2013', 'Location1'),
('2014', 'Location3')


;WITH StartsMiddlesAndEnds AS
(
    select
    aLocation, 
    aDate, 
    CASE(LAG(aLocation) OVER (ORDER BY aDate, aLocation)) WHEN aLocation THEN 0 ELSE 1 END [isStart],
    CASE(LEAD(aLocation) OVER (ORDER BY aDate, aLocation)) WHEN aLocation THEN 0 ELSE 1 END [isEnd]
    from Junk 
)
--select * from StartsMiddlesAndEnds
,NumberedStartsAndEnds AS --let's get rid of the rows that are in the middle of consecutive date groups
(
    select 
    aLocation,
    aDate,
    isStart,
    isEnd,
    ROW_NUMBER() OVER(ORDER BY aDate, aLocation) i
    FROM StartsMiddlesAndEnds 
    WHERE NOT(isStart = 0 AND isEnd = 0) --it is a middle row
)
--select * from NumberedStartsAndEnds
,CombinedStartAndEnds AS --now let's put the start and end dates in the same row
(
    select
    rangeStart.aLocation,
    rangeStart.aDate [aStart],
    rangeEnd.aDate [aEnd]
    FROM NumberedStartsAndEnds rangeStart
    join NumberedStartsAndEnds rangeEnd ON rangeStart.aLocation = rangeEnd.aLocation
    WHERE rangeStart.i = rangeEnd.i - 1 --consecutive rows
    and rangeStart.isStart = 1
    and rangeEnd.isEnd = 1
)
--select * from CombinedStartAndEnds
,OneDateIntervals AS --don't forget the cases where a single row is both a start and end
(
    select
    aLocation,
    aDate [aStart],
    aDate [aEnd]
    FROM NumberedStartsAndEnds
    WHERE isStart = 1 and isEnd = 1
)
--select * from OneDateIntervals
select aLocation, DATEPART(YEAR, aStart) [start], DATEPART(YEAR, aEnd) [end] from OneDateIntervals
UNION
select aLocation, DATEPART(YEAR, aStart) [start], DATEPART(YEAR, aEnd) [end] from CombinedStartAndEnds
ORDER BY [start] --with UNION, ORDER BY must use a column name or alias from the select list

which produces:

aLocation   start   end
Location1   2000    2002
Unknown     2004    2006
Location2   2007    2010
Location1   2011    2013
Location3   2014    2014
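One way the CTE chain might be shortened, offered as a sketch rather than a definitive version: flag the first row of each island with LAG, turn those flags into an island number with a running SUM, then group on that number. It assumes the Junk table above; Flagged, Numbered and islandId are names made up for this example.

```sql
-- Sketch only: requires SQL Server 2012 or later (LAG and windowed SUM with a frame).
;WITH Flagged AS
(
    SELECT aLocation,
           aDate,
           -- 1 when the previous row (by date) has a different location or does not exist
           CASE WHEN LAG(aLocation) OVER (ORDER BY aDate, aLocation) = aLocation
                THEN 0 ELSE 1 END AS isStart
    FROM Junk
)
,Numbered AS
(
    SELECT aLocation,
           aDate,
           -- running count of starts: every row of one island gets the same number
           SUM(isStart) OVER (ORDER BY aDate, aLocation ROWS UNBOUNDED PRECEDING) AS islandId
    FROM Flagged
)
SELECT aLocation, MIN(aDate) AS aStart, MAX(aDate) AS aEnd
FROM Numbered
GROUP BY aLocation, islandId
ORDER BY aStart;
```

On the sample data this should return the same five ranges, shown as full datetime values rather than years. Single-row islands need no special case here, because MIN and MAX coincide for them.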

Don't have SQL Server 2012? Then you can still get the same StartsMiddlesAndEnds CTE using ROW_NUMBER:

;WITH NumberedRows AS
(
    SELECT aLocation, aDate, ROW_NUMBER() OVER (ORDER BY aDate, aLocation) [i] FROM Junk
)
,StartsMiddlesAndEnds AS
(
    select
    currentRow.aLocation, 
    currentRow.aDate, 
    CASE upperRow.aLocation WHEN currentRow.aLocation THEN 0 ELSE 1 END [isStart],
    CASE lowerRow.aLocation WHEN currentRow.aLocation THEN 0 ELSE 1 END [isEnd]
    from
    NumberedRows currentRow
    left outer join NumberedRows upperRow on upperRow.i = currentRow.i-1
    left outer join NumberedRows lowerRow on lowerRow.i = currentRow.i+1
)
--select * from StartsMiddlesAndEnds
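
Another option that avoids LAG and LEAD altogether is the widely used row-number-difference form of gaps-and-islands, sketched here under the assumption that the Junk table above is in place. The difference between a row's overall position and its position within its own location stays constant inside one consecutive run, so it can serve as a grouping key. NumberedRuns and grp are names made up for this example.

```sql
-- Sketch only: uses ROW_NUMBER alone, so it does not need SQL Server 2012.
;WITH NumberedRuns AS
(
    SELECT aLocation,
           aDate,
           -- overall position minus position within the location:
           -- constant across a consecutive run of the same location
           ROW_NUMBER() OVER (ORDER BY aDate, aLocation)
         - ROW_NUMBER() OVER (PARTITION BY aLocation ORDER BY aDate) AS grp
    FROM Junk
)
SELECT aLocation, MIN(aDate) AS aStart, MAX(aDate) AS aEnd
FROM NumberedRuns
GROUP BY aLocation, grp
ORDER BY aStart;
```

For the secondary requirement in the question, dropping short Unknown ranges, one possibility is a HAVING clause between GROUP BY and ORDER BY, for example `HAVING NOT (aLocation = 'Unknown' AND DATEDIFF(DAY, MIN(aDate), MAX(aDate)) < @minDays)`, where @minDays is a hypothetical threshold variable you would declare yourself.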