Question

I have two datasets, shown below, and need to merge them based on date-range logic. Any suggestions? The driving table is A.

Table A
UID  Start Date           End Date             A_Val
1    1980-01-01 00:00:00  1980-02-01 00:00:00  A
1    1980-02-02 00:00:00  1980-03-10 00:00:00  B
1    1980-03-11 00:00:00  1980-03-24 00:00:00  C

Table B
UID  Start Date           End Date             B_Val
1    1980-01-10 00:00:00  1980-02-01 00:00:00  G
1    1980-02-02 00:00:00  1980-03-01 00:00:00  H
1    1980-03-02 00:00:00  1980-03-24 00:00:00  I

The result/output needs to be as follows:

UID  Start Date           End Date             A_Val  B_Val
1    1980-01-01 00:00:00  1980-01-09 00:00:00  A      NULL
1    1980-01-10 00:00:00  1980-02-01 00:00:00  A      G
1    1980-02-02 00:00:00  1980-03-01 00:00:00  B      H
1    1980-03-02 00:00:00  1980-03-10 00:00:00  B      I
1    1980-03-11 00:00:00  1980-03-24 00:00:00  C      I

Answer 1

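You can do this in several ways; here is one:

find the minimum and maximum dates over the whole set (subquery T),
create one entry per day using a hierarchical query (subquery D),
join the data from A and B,
assign group numbers to continuous periods that have the same A_VAL and B_VAL (subquery G),
group the data using the assigned group number.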

with 
  -- T: overall date window across both tables
  T as (select min(start_date) sd, max(end_date) ed 
          from (select start_date, end_date from a union all
                select start_date, end_date from b)),
  -- D: one row per day in that window (Oracle hierarchical query)
  D as (select sd + level - 1 dt from t connect by sd + level - 1 <= ed), 
  -- G: consecutive days with the same (a_val, b_val) get the same grp
  G as (select dt, a_val, b_val,
               row_number() over (order by dt) -
               row_number() over (partition by a_val, b_val order by dt) grp
          from d
          left join a on dt between a.start_date and a.end_date
          left join b on dt between b.start_date and b.end_date)
-- grp is only unique within one (a_val, b_val) pair, so group by all three
select min(dt) sd, max(dt) ed, a_val, b_val
  from g group by a_val, b_val, grp order by sd

Result:

SD          ED          A_VAL B_VAL
----------- ----------- ----- -----
1980-01-01  1980-01-09  A     
1980-01-10  1980-02-01  A     G
1980-02-02  1980-03-01  B     H
1980-03-02  1980-03-10  B     I
1980-03-11  1980-03-24  C     I
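
For readers who want to trace the logic outside the database, here is a minimal pure-Python sketch of the same steps on the sample data from the question. The function names `lookup` and `merge_ranges` are made up for this illustration, and the sketch assumes that ranges within one table do not overlap. It groups on the value pair together with the row-number difference, because the difference alone can repeat across different pairs.

```python
from collections import defaultdict
from datetime import date, timedelta

# Sample rows from the question: (start, end, value), both ends inclusive.
A = [(date(1980, 1, 1), date(1980, 2, 1), "A"),
     (date(1980, 2, 2), date(1980, 3, 10), "B"),
     (date(1980, 3, 11), date(1980, 3, 24), "C")]
B = [(date(1980, 1, 10), date(1980, 2, 1), "G"),
     (date(1980, 2, 2), date(1980, 3, 1), "H"),
     (date(1980, 3, 2), date(1980, 3, 24), "I")]


def lookup(rows, day):
    """Return the value whose range covers `day`, or None (like a LEFT JOIN)."""
    for start, end, val in rows:
        if start <= day <= end:
            return val
    return None


def merge_ranges(a_rows, b_rows):
    # T: overall minimum start and maximum end across both tables.
    sd = min(r[0] for r in a_rows + b_rows)
    ed = max(r[1] for r in a_rows + b_rows)
    # D: one entry per day in that window.
    days = [sd + timedelta(n) for n in range((ed - sd).days + 1)]
    # G: overall row number minus row number within the (a_val, b_val) pair.
    seen = defaultdict(int)     # per-pair running row number
    groups = defaultdict(list)  # (pair, grp) -> list of days
    for overall_rn, day in enumerate(days, start=1):
        pair = (lookup(a_rows, day), lookup(b_rows, day))
        seen[pair] += 1
        grp = overall_rn - seen[pair]
        groups[(pair, grp)].append(day)
    # Final step: collapse each group to its first and last day.
    merged = [(min(ds), max(ds), a, b) for ((a, b), _), ds in groups.items()]
    return sorted(merged, key=lambda row: row[0])


for sd, ed, a_val, b_val in merge_ranges(A, B):
    print(sd, ed, a_val, b_val)
```

Expanding to one entry per day is simple but scales with the length of the date window, so this is only a sketch for small ranges.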

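If you do this for a single UID, filter the data for that UID first. If you do it for many UIDs, you have to take the UID into account in the partitioning and grouping.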
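
As a rough illustration of the multi-UID case, the following Python sketch handles each UID separately, which plays the role of adding the UID to the partitioning and grouping. It is not the SQL from the answer: instead of the row-number difference it uses `itertools.groupby`, which merges only adjacent equal keys. The function names and the rows for UID 2 are hypothetical and exist only for this example.

```python
from datetime import date, timedelta
from itertools import groupby


def covering_value(rows, uid, day):
    """Return the value of the row for `uid` whose range covers `day`, else None."""
    for row_uid, start, end, val in rows:
        if row_uid == uid and start <= day <= end:
            return val
    return None


def merge_ranges_by_uid(a_rows, b_rows):
    all_rows = a_rows + b_rows
    merged = []
    for uid in sorted({r[0] for r in all_rows}):
        # Date window for this UID only.
        sd = min(r[1] for r in all_rows if r[0] == uid)
        ed = max(r[2] for r in all_rows if r[0] == uid)
        days = (sd + timedelta(n) for n in range((ed - sd).days + 1))

        def pair(day, uid=uid):
            return (covering_value(a_rows, uid, day),
                    covering_value(b_rows, uid, day))

        # Each run of consecutive days with the same pair becomes one row.
        for (a_val, b_val), run in groupby(days, key=pair):
            run = list(run)
            merged.append((uid, run[0], run[-1], a_val, b_val))
    return merged


# Rows are (uid, start, end, value). UID 1 is the question's data;
# UID 2 is invented for this example.
A = [(1, date(1980, 1, 1), date(1980, 2, 1), "A"),
     (1, date(1980, 2, 2), date(1980, 3, 10), "B"),
     (1, date(1980, 3, 11), date(1980, 3, 24), "C"),
     (2, date(1980, 1, 1), date(1980, 1, 5), "X")]
B = [(1, date(1980, 1, 10), date(1980, 2, 1), "G"),
     (1, date(1980, 2, 2), date(1980, 3, 1), "H"),
     (1, date(1980, 3, 2), date(1980, 3, 24), "I"),
     (2, date(1980, 1, 3), date(1980, 1, 5), "Y")]

for row in merge_ranges_by_uid(A, B):
    print(*row)
```

In the SQL version, the equivalent change would presumably also require matching on the UID in the joins, so that days for one UID never pick up values from another.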
