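I am building a data warehouse that contains multiple layers storing the same data. All data in one of the middle layers is versioned with a start date and an end date, as if it were a type 2 slowly changing dimension. The problem arises when querying these tables. A table usually has more columns than the query selects, so adjacent versions in the query result have different start and end dates but are otherwise identical. I want to combine these versions so that the result shows the dates on which the queried columns changed, not the dates on which the table rows changed.
I have some SQL that almost works: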
create table versions
(id int
, name varchar(100) Not null
, RowStartDate datetime Not null
, RowEndDate datetime Not null
, primary key (id,RowStartDate)
, check (RowStartDate < RowEndDate));
insert into versions values
(1,'A','2014-01-01','9999-12-31')
,(2,'B','2014-01-01','2014-12-31')
,(2,'B','2014-12-31','9999-12-31')
,(3,'C','2014-01-01','2014-12-31')
,(3,'CC','2014-12-31','2015-12-31')
,(3,'CC','2015-12-31','9999-12-31')
,(4,'D','2014-01-01','2014-12-31')
,(4,'DD','2014-12-31','2015-12-31')
,(4,'DD','2015-12-31','2016-12-31')
,(4,'D','2016-12-31','9999-12-31')
,(5,'E','2014-01-01','2014-12-31')
,(5,'E','2014-12-31','2015-12-31')
,(5,'E','2015-12-31','2016-12-31')
,(5,'E','2016-12-31','2017-12-31')
,(5,'E','2017-12-31','9999-12-31')
;
WITH CTE_detect_duplicates AS (SELECT [id]
,[name]
,[RowStartDate]
,[RowEndDate]
,LAST_VALUE(RowEndDate) OVER (PARTITION BY id, name ORDER BY RowStartDate, RowEndDate ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) as LastEndDate
,rank() OVER (PARTITION BY id, name ORDER BY RowStartDate, RowEndDate) as duplicateNumber
FROM versions
)
SELECT [id]
,[name]
,[RowStartDate]
,LastEndDate as RowEndDate
FROM CTE_detect_duplicates
WHERE duplicateNumber = 1
The problem is that it returns two rows for id 4 when three are needed.
Actual:
id  name  RowStartDate             RowEndDate
4   D     2014-01-01 00:00:00.000  9999-12-31 00:00:00.000
4   DD    2014-12-31 00:00:00.000  2016-12-31 00:00:00.000
Expected:
id  name  RowStartDate             RowEndDate
4   D     2014-01-01 00:00:00.000  2014-12-31 00:00:00.000
4   DD    2014-12-31 00:00:00.000  2016-12-31 00:00:00.000
4   D     2016-12-31 00:00:00.000  9999-12-31 00:00:00.000
The value D is not correct during the period in which the value DD is correct, so the version dates on the first row (4, 'D') of my query result are wrong.
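I would like to be able to remove these duplicates in pure SQL or in an inline table-valued function (I have a generator that produces multi-statement table-valued functions for this, but the generated functions perform poorly). Does anyone have any ideas?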
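Answer 0 (score: 1)
The following query, built from several CTEs, collapses the date ranges of consecutive versions and removes the duplicate values.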
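1. First, a rank is assigned to each row within its id group, ordered by RowStartDate.
2. Next, for each row, the maximum rank of the run of consecutive ranks that share the same value for NAME is determined (next_rank_no). So for the sample data, row 1 of id = 5 will have next_rank_no = 5 and row 2 of id = 4 will have next_rank_no = 3. This version handles only the NAME column. If you want to handle additional columns, they must be included in the join conditions as well. For example, to include a LOCATION column, the join conditions would read: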
left join sorted_versions sv2 on sv2.id = sv1.id and sv2.rank_no > sv1.rank_no and sv2.name = sv1.name and sv2.location = sv1.location
left join sorted_versions sv3 on sv3.id = sv1.id and sv3.rank_no > sv1.rank_no and (sv3.name <> sv1.name or sv3.location <> sv1.location)
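3. Finally, the first row of each id is selected. Then, recursively, the row that follows each next_rank_no is selected as the start of the next run, and its end date is taken from the row at its own next_rank_no.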
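To check step 2 in isolation, a query along the following lines may help. It is only a sketch: it repeats the sorted_versions and next_rank definitions from the full query (without the unused self-join, and with the nested COALESCE flattened into a single call) and lists next_rank_no for ids 4 and 5 of the sample data.

```sql
-- Sketch: steps 1 and 2 only, to inspect next_rank_no for the sample data.
with sorted_versions as -- step 1: rank rows within each id by RowStartDate
(
    select id, name, RowStartDate, RowEndDate,
           rank() over (partition by id order by RowStartDate) rank_no
    from versions
),
next_rank as -- step 2: last rank of the run of equal names starting at each row
(
    select sv1.id, sv1.rank_no,
           coalesce(min(sv3.rank_no) - 1, max(sv2.rank_no), sv1.rank_no) next_rank_no
    from sorted_versions sv1
    left join sorted_versions sv2 on sv2.id = sv1.id and sv2.rank_no > sv1.rank_no and sv2.name = sv1.name
    left join sorted_versions sv3 on sv3.id = sv1.id and sv3.rank_no > sv1.rank_no and sv3.name <> sv1.name
    group by sv1.id, sv1.rank_no
)
select id, rank_no, next_rank_no
from next_rank
where id in (4, 5)
order by id, rank_no;
```

For the sample data this gives next_rank_no values 1, 3, 3, 4 for the four rows of id 4, and 5 for every row of id 5. The complete query, which adds the recursive step 3, is: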
with sorted_versions as --ranks are assigned within each id group
(
select
v1.id,
v1.name,
v1.RowStartDate,
v1.RowEndDate,
rank() over (partition by v1.id order by v1.RowStartDate) rank_no
from versions v1
left join versions v2 on (v1.id = v2.id and v2.RowStartDate = v1.RowEndDate)
),
next_rank as --the maximum rank of the range of ranks which has the same value for NAME
(
select
sv1.id id, sv1.rank_no rank_no, COALESCE(min(sv3.rank_no)-1 , COALESCE(max(sv2.rank_no), sv1.rank_no)) next_rank_no
from sorted_versions sv1
left join sorted_versions sv2 on sv2.id = sv1.id and sv2.rank_no > sv1.rank_no and sv2.name = sv1.name
left join sorted_versions sv3 on sv3.id = sv1.id and sv3.rank_no > sv1.rank_no and sv3.name <> sv1.name
group by sv1.id, sv1.rank_no
),
versions_cte as --the rowenddate of the "maximum rank" is selected
(
select sv.id, sv.name, sv.rowstartdate, sv3.rowenddate, nr.next_rank_no rank_no
from sorted_versions sv
inner join next_rank nr on sv.id = nr.id and sv.rank_no = nr.rank_no and sv.rank_no = 1
inner join sorted_versions sv3 on nr.id = sv3.id and nr.next_rank_no = sv3.rank_no
union all
select
sv2.id,
sv2.name,
sv2.rowstartdate,
sv3.rowenddate,
nr.next_rank_no
from versions_cte vc
inner join sorted_versions sv2 on sv2.id = vc.id and sv2.rank_no = vc.rank_no + 1
inner join next_rank nr on sv2.id = nr.id and sv2.rank_no = nr.rank_no
inner join sorted_versions sv3 on nr.id = sv3.id and nr.next_rank_no = sv3.rank_no
)
select id, name, rowstartdate, rowenddate
from versions_cte
order by id, rowstartdate;
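As an alternative that avoids the recursion, the same result can probably be reached with window functions alone, using the pattern often called "gaps and islands". The sketch below is not part of the query above. It assumes SQL Server 2012 or later (for LAG and a windowed SUM with ORDER BY) and assumes that consecutive versions of an id are contiguous, as they are in the sample data. The names flagged, numbered, is_new_run and run_no are made up for illustration.

```sql
-- Sketch only: assumes SQL Server 2012+ and contiguous versions per id.
WITH flagged AS (
    SELECT id, name, RowStartDate, RowEndDate,
           -- 1 when this row starts a new run: either it is the first version
           -- of the id, or its name differs from the previous version's name
           CASE WHEN LAG(name) OVER (PARTITION BY id ORDER BY RowStartDate) = name
                THEN 0 ELSE 1 END AS is_new_run
    FROM versions
),
numbered AS (
    SELECT id, name, RowStartDate, RowEndDate,
           -- the running total of the flags numbers the runs 1, 2, 3, ... per id
           SUM(is_new_run) OVER (PARTITION BY id ORDER BY RowStartDate
                                 ROWS UNBOUNDED PRECEDING) AS run_no
    FROM flagged
)
SELECT id, name,
       MIN(RowStartDate) AS RowStartDate,  -- start of the first row in the run
       MAX(RowEndDate)   AS RowEndDate     -- end of the last row in the run
FROM numbered
GROUP BY id, name, run_no
ORDER BY id, MIN(RowStartDate);
```

For the sample data this returns three rows for id 4 (D, DD, D) with the expected dates, and a single row for id 5. Since it is a single SELECT, it should also fit in an inline table-valued function. To track more columns, each one would have to be compared with its LAG value in the CASE expression and added to the GROUP BY; nullable columns would need a NULL-aware comparison. If versions of an id can have gaps between them, MIN and MAX would span the gap, so the flag would also need to compare RowStartDate with the previous RowEndDate.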