Question

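We have a number of database table merge steps in our Azure Data Factory v2 solution. We merge tables within a single instance of an Azure SQL Server database. The source and target tables are in different database schemas. Each source is defined either as a select over a single table or as a join of two tables.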

My question is which of the scenarios described below is better from a performance perspective.

Scenario 1 (per table)

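A Stored Procedure activity in the pipeline invokes a stored procedure that does all the work: it upserts the target table with all of the source data. An example of such a stored procedure: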

create or alter procedure dwh.fill_lnk_cemvypdet_cemstr2c_table_with_stage_data as
    merge
        dwh.lnk_cemvypdet_cemstr2c as target
    using

        (select
                t.sa_hashkey cemvypdet_hashkey,
                t.sa_timestamp load_date,
                t.sa_source record_source,
                d.sa_hashkey cemstr2c_hashkey
            from
                egje.cemvypdet t
            join
                egje.cemstr2c d
            on
                t.id_mstr = d.id_mstr)
        as source
        on target.cemvypdet_hashkey = source.cemvypdet_hashkey
            and target.cemstr2c_hashkey = source.cemstr2c_hashkey
        when not matched then
            insert(
                cemvypdet_hashkey,
                cemstr2c_hashkey,
                record_source,
                load_date,
                last_seen_date)
            values(
                source.cemvypdet_hashkey,
                source.cemstr2c_hashkey,
                source.record_source,
                source.load_date,
                source.load_date)
        when matched then
            update set last_seen_date = source.load_date;

Scenario 2 (per row)

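A Copy activity declares the stored procedure to invoke on its Sink tab, so that the activity invokes the stored procedure for every row of the source.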

create or alter procedure dwh.fill_lnk_cemvypdet_cemstr2c_subset_table_row_with_stage_data
@lnk_cemvypdet_cemstr2c_subset dwh.lnk_cemvypdet_cemstr2c_subset_type readonly
as
    merge
        dwh.lnk_cemvypdet_cemstr2c_subset as target
    using

    @lnk_cemvypdet_cemstr2c_subset
        as source
        on target.cemvypdet_hashkey = source.cemvypdet_hashkey
            and target.cemstr2c_hashkey = source.cemstr2c_hashkey
        when not matched then
            insert(
                hashkey,
                cemvypdet_hashkey,
                cemstr2c_hashkey,
                record_source,
                load_date,
                last_seen_date)
            values(
                source.hashkey,
                source.cemvypdet_hashkey,
                source.cemstr2c_hashkey,
                source.record_source,
                source.load_date,
                source.load_date)
        when matched then
            update set last_seen_date = source.load_date;

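The parameter @lnk_cemvypdet_cemstr2c_subset has the type dwh.lnk_cemvypdet_cemstr2c_subset_type, which is defined as a table type that follows the structure of the target table.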
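The question does not show the table type itself. A minimal sketch of what it could look like follows, together with a manual call of the procedure. The column data types and the sample values are assumptions made for illustration; only the column names come from the procedure above. The type lists just the columns the procedure reads from its parameter, so last_seen_date is left out here even though the target table has it.

```sql
-- Sketch only: column data types are assumptions, not taken from the real schema.
create type dwh.lnk_cemvypdet_cemstr2c_subset_type as table
(
    hashkey           char(32)    not null,
    cemvypdet_hashkey char(32)    not null,
    cemstr2c_hashkey  char(32)    not null,
    record_source     varchar(50) not null,
    load_date         datetime2   not null
);
go  -- batch separator (SSMS/sqlcmd), so the type exists before it is used below

-- Manual test call: fill a table variable of that type and pass it to the procedure.
declare @rows dwh.lnk_cemvypdet_cemstr2c_subset_type;

insert into @rows (hashkey, cemvypdet_hashkey, cemstr2c_hashkey, record_source, load_date)
values ('h1', 'a1', 'b1', 'egje', sysutcdatetime());  -- made-up sample values

exec dwh.fill_lnk_cemvypdet_cemstr2c_subset_table_row_with_stage_data
    @lnk_cemvypdet_cemstr2c_subset = @rows;
```

Because the parameter is a table-valued parameter, a single call can carry any number of rows; how many rows arrive per call depends on the caller.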

Answer 1

Scenario 1 should perform better, but consider the following optimizations:

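Create an index on the join columns of the source table that is unique and covering.
Create a unique clustered index on the join columns of the target table.
Parameterize all literal values in the ON clause and in the WHEN clauses.
Merge subsets of data from the source into the target table by using OFFSET and ROWS FETCH NEXT, or by defining a view on the source or target that returns the filtered rows and referencing that view as the source or target table. In addition, using the WITH clause of the TOP clause to filter rows out of the source or target table is not recommended, because it can produce incorrect results.
To optimize the merge operation further, try different batch sizes. Here is why.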
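As a rough illustration of the two index recommendations, applied to the tables from Scenario 1: the index names below are hypothetical, and the column choices are inferred from the MERGE statement in the question rather than from the real schema.

```sql
-- Target table: unique clustered index on the columns used in the MERGE ON clause.
-- Assumes the table does not already have a clustered index (for example, a clustered primary key).
create unique clustered index ucx_lnk_cemvypdet_cemstr2c
    on dwh.lnk_cemvypdet_cemstr2c (cemvypdet_hashkey, cemstr2c_hashkey);

-- Source tables: covering indexes on the join column id_mstr.
-- The INCLUDE lists carry the other columns the source query selects,
-- so the query can be answered from the indexes alone.
-- Add UNIQUE only if the data guarantees that id_mstr is unique in that table.
create index ix_cemvypdet_id_mstr
    on egje.cemvypdet (id_mstr)
    include (sa_hashkey, sa_timestamp, sa_source);

create index ix_cemstr2c_id_mstr
    on egje.cemstr2c (id_mstr)
    include (sa_hashkey);
```

Whether these indexes help in practice depends on table sizes and existing indexes, so comparing the execution plan of the MERGE before and after is a sensible check.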

Azure Data Factory V2: Copy or Stored Procedure activity for SQL merge
