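Redshift PostgreSQL DISTINCT ON operator

Time: 2016-02-19 17:59:23

Tags: postgresql distinct amazon-redshift distinct-on postgresql-8.0

I have a data set that I am trying to parse to look at multi-touch attribution. The data set consists of leads who responded to marketing campaigns, together with their marketing source.

Each lead can respond to multiple campaigns, and I want to get their first marketing source and their last marketing source in the same table.

I was thinking I could create two tables and then select from both of them. The first table would hold the most recent marketing source for each person (using email as their unique ID).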

create table temp.multitouch1 as (
select distinct on (email) email, date, market_source as last_source 
from sf.campaignmember
where date >= '1/1/2016' ORDER BY DATE DESC);

Then I would create a second table, again deduplicated by email, but this time with the first source.

create table temp.multitouch2 as (
select distinct on (email) email, date, market_source as first_source 
from sf.campaignmember
where date >= '1/1/2016' ORDER BY DATE ASC);

Finally, I wanted to simply select the email and join the first and last market sources onto it, each in its own column.

select a.email, a.last_source, b.first_source, a.date 
from temp.multitouch1 a
left join temp.multitouch2 b on b.email = a.email

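Since DISTINCT ON does not work on Redshift's version of PostgreSQL, I was hoping someone had an idea for solving this another way.

EDIT 2/22: For more context, I am dealing with people and the campaigns they have responded to. Each record is a "campaign response", and each person can have more than one campaign response, from multiple sources. I am trying to write a select statement that dedupes by person and then has columns for the first campaign/marketing source they responded to and the last campaign/marketing source they responded to, respectively.

EDIT 2/24: The ideal output is a table with 4 columns: email, last_source, first_source, date.

The first and last source columns would be the same for people with only 1 campaign member record, and different for everyone with more than 1 campaign member record.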

2 Answers:

Answer 0: (score: 4)

I believe you can use row_number() inside case expressions, like this:

SELECT
      email
    , MIN(first_source) AS first_source
    , MIN(date) first_date
    , MAX(last_source) AS last_source
    , MAX(date) AS last_date
FROM (
      SELECT
            email
          , date
          , CASE
                  WHEN ROW_NUMBER() OVER (PARTITION BY email ORDER BY date ASC) = 1 THEN market_source
                  ELSE NULL
            END AS first_source
          , CASE
                  WHEN ROW_NUMBER() OVER (PARTITION BY email ORDER BY date DESC) = 1 THEN market_source
                  ELSE NULL
            END AS last_source
      FROM sf.campaignmember
      WHERE date >= '2016-01-01'
      ) s
WHERE first_source IS NOT NULL
      OR last_source IS NOT NULL
GROUP BY
      email

Tested here: SQL Fiddle

PostgreSQL 9.3 Schema Setup

CREATE TABLE campaignmember
    (email varchar(3), date timestamp, market_source varchar(1))
;

INSERT INTO campaignmember
    (email, date, market_source)
VALUES
    ('a@a', '2016-01-02 00:00:00', 'x'),
    ('a@a', '2016-01-03 00:00:00', 'y'),
    ('a@a', '2016-01-04 00:00:00', 'z'),
    ('b@b', '2016-01-02 00:00:00', 'x')
;

Query 1

SELECT
      email
    , MIN(first_source) AS first_source
    , MIN(date) first_date
    , MAX(last_source) AS last_source
    , MAX(date) AS last_date
FROM (
      SELECT
            email
          , date
          , CASE
                  WHEN ROW_NUMBER() OVER (PARTITION BY email ORDER BY date ASC) = 1 THEN market_source
                  ELSE NULL
            END AS first_source
          , CASE
                  WHEN ROW_NUMBER() OVER (PARTITION BY email ORDER BY date DESC) = 1 THEN market_source
                  ELSE NULL
            END AS last_source
      FROM campaignmember
      WHERE date >= '2016-01-01'
      ) s
WHERE first_source IS NOT NULL
      OR last_source IS NOT NULL
GROUP BY
      email

Results

| email | first_source |                first_date | last_source |                 last_date |
|-------|--------------|---------------------------|-------------|---------------------------|
|   a@a |            x | January, 02 2016 00:00:00 |           z | January, 04 2016 00:00:00 |
|   b@b |            x | January, 02 2016 00:00:00 |           x | January, 02 2016 00:00:00 |
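To check the logic locally without a Redshift cluster, the query above can be run against Python's built-in sqlite3 module. This is only a sketch under stated assumptions: SQLite (3.25 or newer, for window functions) stands in for Redshift, dates are stored as ISO text, and the table has no sf. schema prefix.

```python
import sqlite3

# Assumption: SQLite >= 3.25 stands in for Redshift; dates are ISO-formatted text,
# so string comparison orders them chronologically.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE campaignmember (email TEXT, date TEXT, market_source TEXT);
INSERT INTO campaignmember (email, date, market_source) VALUES
    ('a@a', '2016-01-02 00:00:00', 'x'),
    ('a@a', '2016-01-03 00:00:00', 'y'),
    ('a@a', '2016-01-04 00:00:00', 'z'),
    ('b@b', '2016-01-02 00:00:00', 'x');
""")

QUERY = """
SELECT email,
       MIN(first_source) AS first_source,
       MIN(date)         AS first_date,
       MAX(last_source)  AS last_source,
       MAX(date)         AS last_date
FROM (
    SELECT email, date,
           -- keep market_source only on the earliest row per email
           CASE WHEN ROW_NUMBER() OVER (PARTITION BY email ORDER BY date ASC) = 1
                THEN market_source END AS first_source,
           -- keep market_source only on the latest row per email
           CASE WHEN ROW_NUMBER() OVER (PARTITION BY email ORDER BY date DESC) = 1
                THEN market_source END AS last_source
    FROM campaignmember
    WHERE date >= '2016-01-01'
) s
WHERE first_source IS NOT NULL OR last_source IS NOT NULL
GROUP BY email
ORDER BY email
"""

rows = conn.execute(QUERY).fetchall()
for row in rows:
    print(row)
```

The aggregates work because MIN and MAX ignore NULLs, so each group is left with exactly one non-NULL first_source and one non-NULL last_source.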

And a small extension to the request: counting the number of contact points.

SELECT
      email
    , MIN(first_source) AS first_source
    , MIN(date) first_date
    , MAX(last_source) AS last_source
    , MAX(date) AS last_date
    , MAX(numof) AS Numberof_Contacts 
FROM (
      SELECT
            email
          , date
          , CASE
                  WHEN ROW_NUMBER() OVER (PARTITION BY email ORDER BY date ASC) = 1 THEN market_source
                  ELSE NULL
            END AS first_source
          , CASE
                  WHEN ROW_NUMBER() OVER (PARTITION BY email ORDER BY date DESC) = 1 THEN market_source
                  ELSE NULL
            END AS last_source
          , COUNT(*) OVER (PARTITION BY email) as numof
      FROM campaignmember
      WHERE date >= '2016-01-01'
      ) s
WHERE first_source IS NOT NULL
      OR last_source IS NOT NULL
GROUP BY
      email

Answer 1: (score: 0)

You can use the good old left join groupwise maximum.

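-- Returns the first and the last response per email (two rows when they differ).
SELECT DISTINCT c1.email, c1.date, c1.market_source
FROM sf.campaignmember c1
  LEFT JOIN sf.campaignmember c2  -- an earlier response from the same email
    ON c1.email = c2.email AND c2.date >= '2016-01-01'
       AND (c2.date < c1.date OR (c2.date = c1.date AND c2.id < c1.id))
  LEFT JOIN sf.campaignmember c3  -- a later response from the same email
    ON c1.email = c3.email AND c3.date >= '2016-01-01'
       AND (c3.date > c1.date OR (c3.date = c1.date AND c3.id > c1.id))
WHERE c1.date >= '2016-01-01'
      AND (c2.email IS NULL OR c3.email IS NULL)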

This assumes you have a unique id column; if (date, email) is already unique, the id is not needed.
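The same anti-join idea can also be arranged to return one row per email with the four columns the question asks for. The following is a rough sketch rather than a tuned Redshift query: it runs on Python's built-in sqlite3, the id column and the sample rows are assumptions made for illustration, and the aliases f, l, e and t are hypothetical names.

```python
import sqlite3

# Assumption: an integer id column exists and is used only to break date ties.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE campaignmember (
    id INTEGER PRIMARY KEY, email TEXT, date TEXT, market_source TEXT);
INSERT INTO campaignmember (id, email, date, market_source) VALUES
    (1, 'a@a', '2016-01-02', 'x'),
    (2, 'a@a', '2016-01-03', 'y'),
    (3, 'a@a', '2016-01-04', 'z'),
    (4, 'b@b', '2016-01-02', 'x'),
    (5, 'c@c', '2015-12-31', 'w'),
    (6, 'c@c', '2016-01-05', 'y'),
    (7, 'c@c', '2016-01-05', 'z');
""")

QUERY = """
SELECT f.email,
       l.market_source AS last_source,
       f.market_source AS first_source,
       l.date
FROM campaignmember f                      -- candidate first response
JOIN campaignmember l ON l.email = f.email -- candidate last response
LEFT JOIN campaignmember e                 -- anything earlier than f
       ON e.email = f.email AND e.date >= '2016-01-01'
      AND (e.date < f.date OR (e.date = f.date AND e.id < f.id))
LEFT JOIN campaignmember t                 -- anything later than l
       ON t.email = l.email AND t.date >= '2016-01-01'
      AND (t.date > l.date OR (t.date = l.date AND t.id > l.id))
WHERE f.date >= '2016-01-01' AND l.date >= '2016-01-01'
  AND e.id IS NULL                         -- nothing earlier: f is the first
  AND t.id IS NULL                         -- nothing later: l is the last
ORDER BY f.email
"""

rows = conn.execute(QUERY).fetchall()
for row in rows:
    print(row)
```

The date filter on the joined copies sits in the ON clauses rather than in WHERE, since a WHERE condition on e or t would discard exactly the rows where the anti-join found no match. The self-join pairs every response with every other response from the same email, so on a large table the window-function approach is likely to be cheaper.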