Question

I've built a Data Factory pipeline that moves data from a Data Lake into a data warehouse. I chose SCD type 1 for my dimensions.

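My pipeline contains the following activities:

[Stored procedure] Clear the staging tables;
[Stored procedure] Get the timestamp of the last successful update;
[U-SQL] Extract the dimension data from the filtered files (those modified since the last successful update) in Azure Data Lake, transform it, and output it to a csv file;
[Copy Data] Load the csv into the SQL Data Warehouse staging dimension table;
[Stored procedure] Merge the data from the staging table into the production table;
[U-SQL] Extract the fact data from the files (those modified since the last successful update) in Azure Data Lake, transform it, and output it to a csv file;
[Copy Data] Load the csv into the SQL Data Warehouse fact table;
[Stored procedure] Update the timestamp of the successful update.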

The problem with this pipeline is that I end up with duplicated fact entries in my warehouse if I run the pipeline twice.

Question

How can I efficiently prevent duplicated rows in a fact table, considering all the unsupported features in Azure SQL Data Warehouse?

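Update

I have also read another piece of information about the indexes (and statistics) of a warehouse and how they must be rebuilt after an update.

Considering that, the simplest thing I could think of was to apply to the facts the same principle I am using for the dimensions. I can load all the new facts into a staging table, and then use an index on the fact table to insert only the facts that do not already exist (the facts can't be updated right now).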
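
The insert-only idea described in the update could be sketched roughly as below. This is a minimal illustration, not a tested implementation: the table names stage_fact and fact and the columns bk, col1, and col2 are hypothetical placeholders, and the sketch assumes that a business key uniquely identifies a fact row and that the staging table itself contains no duplicate keys.

```sql
-- Hypothetical names: stage_fact (staging table), fact (production fact table),
-- bk (business key), col1/col2 (non-key columns).
-- Insert only the staged facts whose business key is not yet in the fact table,
-- so that re-running the load does not add the same rows twice.
insert into fact (bk, col1, col2)
select  s.bk,
        s.col1,
        s.col2
from    stage_fact s
where   not exists (
            select  1
            from    fact f
            where   f.bk = s.bk);
```

If the same key can appear more than once in the staged files, the staging table would need to be de-duplicated before this statement runs.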

Answer 1

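Do the lifting in Azure SQL Data Warehouse... your performance will improve dramatically, and your problem will go away.

How many rows are in your filtered files? If it is in the millions to tens of millions, I think you can probably avoid the filter at the data lake stage. The performance of Polybase + SQL should overcome the additional data volume.

If you can avoid the filter, use the following logic and throw away the U-SQL processing: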

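Ingest the files into a staging table with a suitable hash distribution
Take the latest version of each row (suitable for SCD1)
Merge the stage into the fact using a query like this:

BK = business key column(s). COLn = non-key columns.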

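-- Get latest row for each business key to eliminate duplicates.
-- The row_number() alias cannot be referenced in the WHERE clause of the
-- same SELECT, so it is computed in a derived table and filtered outside.

create table stage2 with (heap, distribution = hash(bk)) as
select  bk,
        col1,
        col2
from    (
            select  bk,
                    col1,
                    col2,
                    row_number() over (partition by bk order by [timestamp] desc) as rownum
            from    stage
        ) latest
where   rownum = 1;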

-- Merge the stage into a copy of the dimension

create table dimension_copy with (heap,distribution=replicate) as

select    s.bk,
          s.col1,
          s.col2
from      stage2 s
where     not exists (
              select  1
              from    dimension d
              where   d.bk = s.bk)

union

select   d.bk,
         case when s.bk is null then d.col1 else s.col1 end,
         case when s.bk is null then d.col2 else s.col2 end
from     dimension d
         left outer join stage2 s on s.bk = d.bk;

-- Switch the merged copy with the original 

alter table dimension_copy switch to dimension with (truncate_target=on);

-- Force distribution of replicated table across nodes

select top 1 * from dimension;
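
The queries above are written against a dimension. For an insert-only fact table, one possible adaptation of the same CTAS approach is sketched below. The names fact, fact_new, and fact_old, the clustered columnstore index, and the hash distribution on bk are all assumptions made for illustration. The sketch reuses the stage2 table from above and assumes existing facts are never updated.

```sql
-- Hypothetical names: fact (current fact table), fact_new (rebuilt copy),
-- fact_old (temporary name for the table being replaced).
-- Build a new copy containing every existing fact plus only those staged
-- rows whose business key is not already present.
create table fact_new
with (clustered columnstore index, distribution = hash(bk)) as
select  f.bk,
        f.col1,
        f.col2
from    fact f
union all
select  s.bk,
        s.col1,
        s.col2
from    stage2 s
where   not exists (
            select  1
            from    fact f
            where   f.bk = s.bk);

-- Swap the rebuilt copy in place of the original, then drop the old table.
rename object fact to fact_old;
rename object fact_new to fact;
drop table fact_old;
```

Rebuilding the whole table on every load may be expensive for a large fact table; in that case a plain insert filtered with not exists may be the cheaper option.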
