I have the following dataset in a data lake. It is used as the source for a dimension, and I want to migrate the historical data into that dimension.
For example:
    Primarykey  Checksum  DateFrom  Dateto     ActiveFlag
    1           11        01:00     03:00      False
    1           22        03:00     05:00      False
    1           22        05:00     07:00      False
    1           11        07:00     09:00      False
    1           11        09:00     12/31/999  TRUE
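Note that the datalake table has several columns that are not part of the dimension, so when we recalculate the checksum, rows can show the same value but with different datefrom and dateto.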
    WITH base AS (
        SELECT
            Primary_key,
            checksum,
            FIRST_VALUE(datefrom) OVER (PARTITION BY Primary_key, checksum ORDER BY datefrom) AS Datefrom,
            LAST_VALUE(dateto) OVER (PARTITION BY Primary_key, checksum ORDER BY datefrom) AS Dateto,
            ROW_NUMBER() OVER (PARTITION BY Primary_key, checksum ORDER BY datefrom) AS latest_record
        FROM Datalake.user
    )
    SELECT * FROM base WHERE latest_record = 1
The data comes out as:
    Primarykey  Checksum  DateFrom  Dateto
    1           11        01:00     12/31/999
    1           22        03:00     07:00
But the expected result is:
    Primarykey  Checksum  DateFrom  Dateto
    1           11        01:00     03:00
    1           22        03:00     07:00
    1           11        07:00     12/31/999
I have tried several approaches within a single query. Are there any good suggestions?
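Answer 0 (score: 0)
You get only two rows because the partition is built from two columns, Primarykey and checksum, and the data contains only two combinations of them. The extra row you want in the expected output has the same Primarykey and checksum (1, 11) as the first row of the expected output.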
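To make the partition argument concrete, here is a small sketch that counts how many rows fall into each (primary key, checksum) partition for the sample data from the question. It uses an in-memory SQLite database only so that it runs anywhere; the table name `user_history` is hypothetical and not taken from the question.

```python
import sqlite3

# In-memory database holding the sample rows from the question.
# The table name user_history is a hypothetical stand-in for Datalake.user.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE user_history ("
    "primary_key INTEGER, checksum INTEGER, "
    "datefrom TEXT, dateto TEXT, active_flag TEXT)"
)
conn.executemany(
    "INSERT INTO user_history VALUES (?, ?, ?, ?, ?)",
    [
        (1, 11, "01:00", "03:00", "False"),
        (1, 22, "03:00", "05:00", "False"),
        (1, 22, "05:00", "07:00", "False"),
        (1, 11, "07:00", "09:00", "False"),
        (1, 11, "09:00", "12/31/999", "TRUE"),
    ],
)

# One output row per (primary_key, checksum) partition, with its row count.
partitions = conn.execute(
    """
    SELECT primary_key, checksum, COUNT(*) AS rows_in_partition
    FROM user_history
    GROUP BY primary_key, checksum
    ORDER BY checksum
    """
).fetchall()

print(partitions)  # [(1, 11, 3), (1, 22, 2)]
```

Both runs of checksum 11 (01:00 to 03:00, and 07:00 onward) land in the same partition, so a window defined over this partition alone cannot tell them apart.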
If you include ActiveFlag in the partition, that is the one thing I can see in the data that would give you the extra row:
    WITH base AS (
        SELECT
            primary_key,
            checksum,
            FIRST_VALUE(datefrom) OVER (PARTITION BY primary_key, checksum, active_flag ORDER BY datefrom) AS datefrom,
            LAST_VALUE(dateto) OVER (PARTITION BY primary_key, checksum, active_flag ORDER BY datefrom) AS dateto,
            ROW_NUMBER() OVER (PARTITION BY primary_key, checksum, active_flag ORDER BY datefrom) AS latest_record
        FROM Datalake.user
    )
    SELECT * FROM base WHERE latest_record = 1
Answer 1 (score: 0)
Try this code. It should work in both Snowflake and Oracle. The idea is to start a separate group whenever the checksum changes in date order.
**Snowflake**:
    WITH base AS (
        SELECT
            Primarykey,
            checksum,
            FIRST_VALUE(datefrom) OVER (PARTITION BY Primarykey, checksum, checksum_group ORDER BY datefrom) AS Datefrom,
            -- Explicit frame so LAST_VALUE sees the whole partition, whatever the engine's default frame is
            LAST_VALUE(dateto) OVER (
                PARTITION BY Primarykey, checksum, checksum_group ORDER BY datefrom
                ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
            ) AS Dateto,
            ROW_NUMBER() OVER (PARTITION BY Primarykey, checksum, checksum_group ORDER BY datefrom) AS latest_record
        FROM (
            SELECT
                Primarykey,
                checksum,
                checksum_prev,
                datefrom,
                dateto,
                -- Carry forward the row number at which the checksum last changed
                LAST_VALUE(CASE WHEN checksum <> checksum_prev THEN group1 END) IGNORE NULLS OVER (
                    ORDER BY group1
                    ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
                ) AS checksum_group
            FROM (
                SELECT
                    Primarykey,
                    checksum,
                    datefrom,
                    dateto,
                    -- These windows are not partitioned by Primarykey, so this assumes a single key
                    LAG(checksum, 1, 0) OVER (ORDER BY datefrom) AS checksum_prev,
                    -- Number rows in date order so group1 lines up with the LAG above
                    ROW_NUMBER() OVER (ORDER BY datefrom) AS group1
                FROM Datalake.user
            )
        )
    )
    SELECT * FROM base WHERE latest_record = 1
**Oracle**:
    WITH base AS (
        SELECT
            Primarykey,
            checksum,
            FIRST_VALUE(datefrom) OVER (PARTITION BY Primarykey, checksum, checksum_group ORDER BY datefrom) AS Datefrom,
            -- Oracle's default frame ends at the current row, so state the full frame explicitly
            LAST_VALUE(dateto) OVER (
                PARTITION BY Primarykey, checksum, checksum_group ORDER BY datefrom
                ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
            ) AS Dateto,
            ROW_NUMBER() OVER (PARTITION BY Primarykey, checksum, checksum_group ORDER BY datefrom) AS latest_record
        FROM (
            SELECT
                Primarykey,
                checksum,
                checksum_prev,
                datefrom,
                dateto,
                -- Carry forward the row number at which the checksum last changed
                LAST_VALUE(CASE WHEN checksum <> checksum_prev THEN group1 END) IGNORE NULLS
                    OVER (ORDER BY group1 ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS checksum_group
            FROM (
                SELECT
                    Primarykey,
                    checksum,
                    datefrom,
                    dateto,
                    -- These windows are not partitioned by Primarykey, so this assumes a single key
                    LAG(checksum, 1, 0) OVER (ORDER BY datefrom) AS checksum_prev,
                    -- ROWNUM follows retrieval order, so number rows in date order instead
                    ROW_NUMBER() OVER (ORDER BY datefrom) AS group1
                FROM Datalake.user)))
    SELECT * FROM base WHERE latest_record = 1
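The grouping idea in this answer (start a new group whenever the checksum changes in date order) can also be written as a running sum over a change flag. The following is a minimal sketch rather than a drop-in replacement: it runs against an in-memory SQLite database (3.25 or later, for window functions) purely so that it is self-contained, the table name `user_history` is hypothetical, and the rows are the sample data from the question. Partitioning by `primary_key` is an assumption about how multiple keys should be handled.

```python
import sqlite3

# Sample rows from the question; user_history is a hypothetical table name.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE user_history ("
    "primary_key INTEGER, checksum INTEGER, datefrom TEXT, dateto TEXT)"
)
conn.executemany(
    "INSERT INTO user_history VALUES (?, ?, ?, ?)",
    [
        (1, 11, "01:00", "03:00"),
        (1, 22, "03:00", "05:00"),
        (1, 22, "05:00", "07:00"),
        (1, 11, "07:00", "09:00"),
        (1, 11, "09:00", "12/31/999"),
    ],
)

query = """
WITH flagged AS (
    -- is_new = 1 on the first row of a key and whenever the checksum changes
    SELECT primary_key, checksum, datefrom, dateto,
           CASE WHEN checksum = LAG(checksum) OVER (
                    PARTITION BY primary_key ORDER BY datefrom)
                THEN 0 ELSE 1 END AS is_new
    FROM user_history
),
grouped AS (
    -- Running sum of the flags gives each consecutive run its own group id
    SELECT primary_key, checksum, datefrom, dateto,
           SUM(is_new) OVER (
               PARTITION BY primary_key ORDER BY datefrom
               ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS grp
    FROM flagged
),
ranged AS (
    -- First datefrom and last dateto of each run, with explicit full frames
    SELECT primary_key, checksum,
           FIRST_VALUE(datefrom) OVER (
               PARTITION BY primary_key, grp ORDER BY datefrom
               ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS range_from,
           LAST_VALUE(dateto) OVER (
               PARTITION BY primary_key, grp ORDER BY datefrom
               ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS range_to,
           ROW_NUMBER() OVER (
               PARTITION BY primary_key, grp ORDER BY datefrom) AS rn
    FROM grouped
)
SELECT primary_key, checksum, range_from, range_to
FROM ranged
WHERE rn = 1
ORDER BY primary_key, range_from
"""

result = conn.execute(query).fetchall()
for row in result:
    print(row)
```

For the sample data this yields three rows, one per consecutive run of the same checksum, which matches the expected output in the question. The same SQL should need only minor changes on Snowflake or Oracle, but that has not been verified here.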
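Answer 2 (score: 0)
I adjusted the query so that it works on the whole dataset. It was failing on the full data because the primary key was missing. The modified, working query: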