How do I select columns whenever they change?

Date: 2016-03-18 18:10:20

Tags: sql postgresql netezza

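I'm trying to create a slowly changing dimension (a Type 2 dimension) and am a bit lost on how to write the logic. Say we have a source table with the format Person | Country | Department | Login Time. I want to build a dimension table with Person | Country | Department | Eff Start time | Eff End Time.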

The data might look like this:

Person | Country | Department | Login Time
------------------------------------------
Bob    | CANADA  | Marketing  | 2009-01-01
Bob    | CANADA  | Marketing  | 2009-02-01
Bob    | USA     | Marketing  | 2009-03-01
Bob    | USA     | Sales      | 2009-04-01
Bob    | MEX     | Product    | 2009-05-01
Bob    | MEX     | Product    | 2009-06-01
Bob    | MEX     | Product    | 2009-07-01
Bob    | CANADA  | Marketing  | 2009-08-01

The Type 2 dimension I want looks like this:

Person | Country | Department | Eff Start time | Eff End Time
------------------------------------------------------------------
Bob    | CANADA  | Marketing  | 2009-01-01     | 2009-03-01
Bob    | USA     | Marketing  | 2009-03-01     | 2009-04-01
Bob    | USA     | Sales      | 2009-04-01     | 2009-05-01
Bob    | MEX     | Product    | 2009-05-01     | 2009-08-01
Bob    | CANADA  | Marketing  | 2009-08-01     | NULL 

Assume Bob's name, country, and department have not been updated since 2009-08-01, so the effective end time is left as NULL.

What function would work best here? This is on Netezza, which uses a flavor of Postgres.

Obviously GROUP BY won't work here, because the same grouping can show up again later (I added Bob | CANADA | Marketing as the last row to show this).
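
To make the problem concrete, here is a minimal sketch of the naive GROUP BY in PostgreSQL-style SQL. The inline VALUES list and the column names first_login and last_login are assumptions for illustration; on Netezza the rows would more likely come from a real table.

```sql
-- Sample rows copied from the question; the inline VALUES list is an
-- assumption standing in for the real source table.
with so (person, country, department, login_time) as (
  values
    ('Bob', 'CANADA', 'Marketing', date '2009-01-01'),
    ('Bob', 'CANADA', 'Marketing', date '2009-02-01'),
    ('Bob', 'USA',    'Marketing', date '2009-03-01'),
    ('Bob', 'USA',    'Sales',     date '2009-04-01'),
    ('Bob', 'MEX',    'Product',   date '2009-05-01'),
    ('Bob', 'MEX',    'Product',   date '2009-06-01'),
    ('Bob', 'MEX',    'Product',   date '2009-07-01'),
    ('Bob', 'CANADA', 'Marketing', date '2009-08-01')
)
-- A plain GROUP BY merges both CANADA / Marketing stretches into one row.
select
  person
  ,country
  ,department
  ,min(login_time) as first_login
  ,max(login_time) as last_login
from so
group by person, country, department
order by first_login;
```

This returns four rows instead of five, and the CANADA / Marketing row spans 2009-01-01 through 2009-08-01, overlapping the USA and MEX periods.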

Edit

Including a hash column on Person, Country, and Department would make sense, right? I'm thinking of using logic along these lines:

SELECT PERSON, COUNTRY, DEPARTMENT
FROM table t1
where 
    person = person 
    AND t1.hash <> hash_function(person, country, department)

1 answer:

Answer 0 (score: 1)

Answer

create table so (
  person varchar(32)
  ,country varchar(32)
  ,department varchar(32)
  ,login_time date
) distribute on random;

insert into so values ('Bob','CANADA','Marketing','2009-01-01');
insert into so values ('Bob','CANADA','Marketing','2009-02-01');
insert into so values ('Bob','USA','Marketing','2009-03-01');
insert into so values ('Bob','USA','Sales','2009-04-01');
insert into so values ('Bob','MEX','Product','2009-05-01');
insert into so values ('Bob','MEX','Product','2009-06-01');
insert into so values ('Bob','MEX','Product','2009-07-01');
insert into so values ('Bob','CANADA','Marketing','2009-08-01');

/* ************************************************************************** */

with prm as ( --Create an ordinal primary key.
  select
    *
    ,row_number() over (
      partition by person
      order by login_time
    ) rwn
  from
    so
), chn as ( --Chain events to their previous and next event.
  select
    cur.rwn
    ,cur.person
    ,cur.country
    ,cur.department
    ,cur.login_time cur_login
    ,case
      when
        cur.country = prv.country
        and cur.department = prv.department
        then 1
      else 0
    end prv_equal
    ,case
      when
        (
          cur.country = nxt.country
          and cur.department = nxt.department
        ) or nxt.rwn is null --No next record should be equivalent to matching.
        then 1
      else 0
    end nxt_equal
    ,case prv_equal
      when 0 then cur_login
      else null
    end eff_login_start_sparse
    ,case
      when eff_login_start_sparse is null
        then max(eff_login_start_sparse) over (
          partition by cur.person
          order by rwn
          rows unbounded preceding --The secret sauce.
        )
      else eff_login_start_sparse
    end eff_login_start
    ,case nxt_equal
      when 0 then cur_login
      else null
    end eff_login_end
  from
    prm cur
    left outer join prm nxt on
      cur.person = nxt.person
      and cur.rwn + 1 = nxt.rwn
    left outer join prm prv on
      cur.person = prv.person
      and cur.rwn - 1 = prv.rwn
), grp as ( --Group by login starts.
  select
    person
    ,country
    ,department
    ,eff_login_start
    ,max(eff_login_end) eff_login_end
  from
    chn
  group by
    person
    ,country
    ,department
    ,eff_login_start
), led as ( --Change the effective end to be the next start, if desired.
  select
    person
    ,country
    ,department
    ,eff_login_start
    ,case
      when eff_login_end is null
        then null
      else
        lead(eff_login_start) over (
          partition by person
          order by eff_login_start
        )
    end eff_login_end
  from
    grp
)
select * from led order by eff_login_start;

This code returns the following table.

 PERSON | COUNTRY | DEPARTMENT | EFF_LOGIN_START | EFF_LOGIN_END
--------+---------+------------+-----------------+---------------
 Bob    | CANADA  | Marketing  | 2009-01-01      | 2009-03-01
 Bob    | USA     | Marketing  | 2009-03-01      | 2009-04-01
 Bob    | USA     | Sales      | 2009-04-01      | 2009-05-01
 Bob    | MEX     | Product    | 2009-05-01      | 2009-08-01
 Bob    | CANADA  | Marketing  | 2009-08-01      |

Explanation

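I've had to solve this problem four or five times over the past few years and kept neglecting to write it down formally. I'm glad to have the chance to do so, so this is a great question.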

When I attempt this, I like to write the problem out in matrix form. Here is the input, assuming all values share the same key in the SCD.

 Cv | Ce
----|----
 A  | 10
 A  | 11
 B  | 14
 C  | 16
 D  | 18
 D  | 25
 D  | 34
 A  | 40

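Here Cv is the value we need to compare (again, assume the SCD's key value is the same for every row in this data; we will partition by the key value the whole time, so it is irrelevant to the solution) and Ce is the event time.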

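First, we need an ordinal primary key. I've called it Ck in the table. This allows us to join the table to itself to get the previous and next events. I'll call those columns Pk (previous key), Nk (next key), Pv (previous value), and Nv (next value).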

 Cv | Ce | Ck | Pk | Pv | Nk | Nv |
----|----|----|----|----|----|----|
 A  | 10 | 1  |    |    | 2  | A  |
 A  | 11 | 2  | 1  | A  | 3  | B  |
 B  | 14 | 3  | 2  | A  | 4  | C  |
 C  | 16 | 4  | 3  | B  | 5  | D  |
 D  | 18 | 5  | 4  | C  | 6  | D  |
 D  | 25 | 6  | 5  | D  | 7  | D  |
 D  | 34 | 7  | 6  | D  | 8  | A  |
 A  | 40 | 8  | 7  | D  |    |    |
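
The Pk, Pv, Nk, and Nv columns can also be sketched with lag() and lead() instead of two self-joins. This is an alternative formulation in PostgreSQL-style SQL, not the query from the answer; the CTE names ev and keyed and the inline VALUES list are assumptions for illustration.

```sql
-- Toy data from the matrix above: cv = compared value, ce = event time.
with ev (cv, ce) as (
  values ('A', 10), ('A', 11), ('B', 14), ('C', 16),
         ('D', 18), ('D', 25), ('D', 34), ('A', 40)
), keyed as (
  -- Ordinal key in event-time order. A real SCD would also
  -- partition by the dimension key (person in the question).
  select cv, ce, row_number() over (order by ce) as ck
  from ev
)
select
  cv
  ,ce
  ,ck
  ,lag(ck)  over (order by ck) as pk  -- null on the first row
  ,lag(cv)  over (order by ck) as pv
  ,lead(ck) over (order by ck) as nk  -- null on the last row
  ,lead(cv) over (order by ck) as nv
from keyed
order by ck;
```

The window functions read each neighbor directly, so the explicit join conditions on rwn + 1 and rwn - 1 are not needed in this form.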

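Now we need some columns to tell whether we are at the beginning or the end of a contiguous block of events. I'll call these Pc and Nc, for "contiguous". Pc is true when Pv = Cv, with 1 meaning true and 0 meaning false. Nc is defined similarly, except that the null case defaults to true (we'll see why in a minute).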

 Cv | Ce | Ck | Pk | Pv | Nk | Nv | Pc | Nc |
----|----|----|----|----|----|----|----|----|
 A  | 10 | 1  |    |    | 2  | A  | 0  | 1  |
 A  | 11 | 2  | 1  | A  | 3  | B  | 1  | 0  |
 B  | 14 | 3  | 2  | A  | 4  | C  | 0  | 0  |
 C  | 16 | 4  | 3  | B  | 5  | D  | 0  | 0  |
 D  | 18 | 5  | 4  | C  | 6  | D  | 0  | 1  |
 D  | 25 | 6  | 5  | D  | 7  | D  | 1  | 1  |
 D  | 34 | 7  | 6  | D  | 8  | A  | 1  | 0  |
 A  | 40 | 8  | 7  | D  |    |    | 0  | 1  |

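Now you can start to see that a row where Pc, Nc is 1,1 is a completely useless record. We know this intuitively, because Bob's MEX / Product combination on row 6 adds almost no information when building the SCD.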

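So let's get rid of the useless information. I'll add two new columns here: an almost-complete effective start time called Sn, and the actually complete effective end time called Ee. Sn is populated with Ce when Pc is 0, and Ee is populated with Ce when Nc is 0.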

 Cv | Ce | Ck | Pk | Pv | Nk | Nv | Pc | Nc | Sn | Ee |
----|----|----|----|----|----|----|----|----|----|----|
 A  | 10 | 1  |    |    | 2  | A  | 0  | 1  | 10 |    |
 A  | 11 | 2  | 1  | A  | 3  | B  | 1  | 0  |    | 11 |
 B  | 14 | 3  | 2  | A  | 4  | C  | 0  | 0  | 14 | 14 |
 C  | 16 | 4  | 3  | B  | 5  | D  | 0  | 0  | 16 | 16 |
 D  | 18 | 5  | 4  | C  | 6  | D  | 0  | 1  | 18 |    |
 D  | 25 | 6  | 5  | D  | 7  | D  | 1  | 1  |    |    |
 D  | 34 | 7  | 6  | D  | 8  | A  | 1  | 0  |    | 34 |
 A  | 40 | 8  | 7  | D  |    |    | 0  | 1  | 40 |    |
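
The Pc, Nc, Sn, and Ee columns can be sketched in PostgreSQL-style SQL as below. The answer's query refers to aliases such as prv_equal inside the same select list, which Netezza accepts but PostgreSQL does not, so this sketch computes each layer in its own CTE. The names ev, nbr, and flg are assumptions for illustration, and cv is assumed never to be null.

```sql
-- Toy data from the matrix above: cv = compared value, ce = event time.
with ev (cv, ce) as (
  values ('A', 10), ('A', 11), ('B', 14), ('C', 16),
         ('D', 18), ('D', 25), ('D', 34), ('A', 40)
), nbr as (
  -- Previous and next values in event-time order.
  select
    cv
    ,ce
    ,lag(cv)  over (order by ce) as pv
    ,lead(cv) over (order by ce) as nv
  from ev
), flg as (
  select
    cv
    ,ce
    -- pv is null on the first row, so the comparison is unknown and pc = 0.
    ,case when pv = cv then 1 else 0 end as pc
    -- No next row counts as "still contiguous" (assumes cv itself is never null).
    ,case when nv = cv or nv is null then 1 else 0 end as nc
  from nbr
)
select
  cv
  ,ce
  ,pc
  ,nc
  ,case when pc = 0 then ce end as sn  -- a block starts here
  ,case when nc = 0 then ce end as ee  -- a block ends here
from flg
order by ce;
```

Rows where both sn and ee come back null are exactly the 1,1 rows that carry no information for the SCD.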

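This looks really close, but we still have the problem that we can't group by Cv (person / country / department). What we need is for Sn to fill in all of those nulls with the previous Sn value. You could join this table to itself on rwn < rwn and take the maximum, but I'm going to be lazy and use Netezza's analytic functions with the rows unbounded preceding clause. It is a shortcut for the method I just described. So we'll create another column called Es, for effective start, defined as follows.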

case
  when Sn is null
    then max(Sn) over (
      partition by k --key value of the SCD
      order by Ck
      rows unbounded preceding
    )
  else Sn
end Es

With this definition, we get the following.

 Cv | Ce | Ck | Pk | Pv | Nk | Nv | Pc | Nc | Sn | Ee | Es |
----|----|----|----|----|----|----|----|----|----|----|----|
 A  | 10 | 1  |    |    | 2  | A  | 0  | 1  | 10 |    | 10 |
 A  | 11 | 2  | 1  | A  | 3  | B  | 1  | 0  |    | 11 | 10 |
 B  | 14 | 3  | 2  | A  | 4  | C  | 0  | 0  | 14 | 14 | 14 |
 C  | 16 | 4  | 3  | B  | 5  | D  | 0  | 0  | 16 | 16 | 16 |
 D  | 18 | 5  | 4  | C  | 6  | D  | 0  | 1  | 18 |    | 18 |
 D  | 25 | 6  | 5  | D  | 7  | D  | 1  | 1  |    |    | 18 |
 D  | 34 | 7  | 6  | D  | 8  | A  | 1  | 0  |    | 34 | 18 |
 A  | 40 | 8  | 7  | D  |    |    | 0  | 1  | 40 |    | 40 |
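
For comparison, here is a minimal sketch of the self-join approach mentioned before the Es definition, in PostgreSQL-style SQL. The table name t and the inline VALUES list are assumptions for illustration. It joins on prv.ck <= cur.ck rather than a strict less-than so that a row's own Sn is included and no CASE is needed.

```sql
-- ck and sn copied from the matrix above; nulls are cast so that the
-- column type is unambiguous.
with t (ck, sn) as (
  values (1, 10), (2, cast(null as integer)), (3, 14), (4, 16),
         (5, 18), (6, cast(null as integer)), (7, cast(null as integer)), (8, 40)
)
select
  cur.ck
  ,max(prv.sn) as es  -- latest non-null start at or before this row
from t cur
join t prv
  on prv.ck <= cur.ck  -- a real SCD would also match on the dimension key
group by cur.ck
order by cur.ck;
```

Taking the maximum works only because Sn holds event times, which never decrease as Ck grows, so the largest earlier Sn is also the most recent one. The result matches the Es column above: 10, 10, 14, 16, 18, 18, 18, 40.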

The rest is trivial. Group by Es and take the maximum of Ee to get this table.

 Cv | Es | Ee |
----|----|----|
 A  | 10 | 11 |
 B  | 14 | 14 |
 C  | 16 | 16 |
 D  | 18 | 34 |
 A  | 40 |    |

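If you want the effective end time to be populated with the next start, join the table to itself again or use the lead() window function to grab it.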
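
As a minimal sketch of that last step in PostgreSQL-style SQL, lead() can be applied directly to the grouped table. The CTE name grp mirrors the answer's query, while the inline VALUES list and the column name next_start are assumptions for illustration.

```sql
-- Grouped rows copied from the table above (cv, es, ee).
with grp (cv, es, ee) as (
  values ('A', 10, 11), ('B', 14, 14), ('C', 16, 16),
         ('D', 18, 34), ('A', 40, cast(null as integer))
)
select
  cv
  ,es
  -- The next block's start becomes this block's end; lead() yields null
  -- on the last row, which leaves the current record open-ended.
  ,lead(es) over (order by es) as next_start
from grp
order by es;
```

This yields 14, 16, 18, 40, and null as the end times, the same shape as the EFF_LOGIN_END column in the result table. With several dimension keys, the window would also need partition by on the key, as in the answer's led step.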