Is there an Apache Pig equivalent of multiple COUNT(DISTINCT CASE WHEN ...) statements?

Asked: 2013-10-17 16:45:28

Tags: sql hadoop case apache-pig

I'm new to Apache Pig and trying to learn it. Is there an equivalent of SQL's COUNT(DISTINCT CASE WHEN ...) in Apache Pig?

For example, I'm trying to do something like this:

CREATE TABLE email_profile AS
SELECT user_id
, COUNT(DISTINCT CASE WHEN email_code = 'C' THEN message_id ELSE NULL END) AS clickthroughs
, COUNT(DISTINCT CASE WHEN email_code = 'O' THEN message_id ELSE NULL END) AS opened_messages
, COUNT(DISTINCT message_id) AS total_messages_received
FROM email_campaigns
 GROUP BY user_id;

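I can't simply use FILTER email_campaigns BY email_code == 'C', because that would drop the rows needed for the other cases. Is there a way to do this in a single nested FOREACH block?

Thanks!

Edit:

As requested, here is some sample data. The fields are used_id, email_code, message_id: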

user1@example.com    O     111
user1@example.com    C     111
user2@example.com    O     111
user1@example.com    O     222
user2@example.com    O     333

Expected output:

user1@example.com    2    1    2
user2@example.com    2    0    2

1 answer:

Answer 0 (score: 3)

You can GROUP by used_id and then do the filtering inside a nested FOREACH. See the comments in my code for more details.

Something like this:

-- Firstly we group so the FOREACH is applied per used_id
A = GROUP email_campaigns BY used_id ;
B = FOREACH A {
        -- We need these three lines to accomplish the:
        -- DISTINCT CASE WHEN email_code = 'C' THEN message_id ELSE NULL END
        -- First, we get only cases where email_code == 'C'
        click_filt = FILTER email_campaigns BY email_code == 'C' ;
        -- Since we only want unique message_ids, we need to project it out
        click_proj = FOREACH click_filt GENERATE message_id ;
        -- Now we can find all unique message_ids for a given filter
        click_dist = DISTINCT click_proj ;

        opened_filt = FILTER email_campaigns BY email_code == 'O' ;
        opened_proj = FOREACH opened_filt GENERATE message_id ;
        opened_dist = DISTINCT opened_proj ;

        total_proj = FOREACH email_campaigns GENERATE message_id ;
        total_dist = DISTINCT total_proj ;
        GENERATE group AS used_id, COUNT(click_dist) AS clickthroughs,
                                   COUNT(opened_dist) AS opened_messages,
                                   COUNT(total_dist) AS total_messages_received ;
}

The output of B should be:

(user1@example.com,1,2,2)
(user2@example.com,0,2,2)
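
The script above assumes a relation named email_campaigns already exists. As a rough end-to-end sketch, the version below adds a LOAD and a DUMP around the same logic, and uses bag projection (relation.field) in place of the inner FOREACH ... GENERATE, which may help on older Pig releases where a FOREACH nested inside another FOREACH is not accepted. The file name email_campaigns.tsv, the tab delimiter, the declared field types, and the *_ids aliases are assumptions made for illustration; they are not from the question.

```pig
-- Hypothetical input: the sample rows stored as a tab-separated file.
-- Path, delimiter and field types are assumptions; adjust to the real data.
email_campaigns = LOAD 'email_campaigns.tsv' USING PigStorage('\t')
    AS (used_id:chararray, email_code:chararray, message_id:int);

-- One group (and therefore one output row) per used_id
A = GROUP email_campaigns BY used_id;
B = FOREACH A {
        -- Clicks: filter the group's bag, project message_id, deduplicate
        click_filt  = FILTER email_campaigns BY email_code == 'C';
        click_ids   = click_filt.message_id;
        click_dist  = DISTINCT click_ids;

        -- Opens: same three steps with a different filter
        opened_filt = FILTER email_campaigns BY email_code == 'O';
        opened_ids  = opened_filt.message_id;
        opened_dist = DISTINCT opened_ids;

        -- Total: no filter, just distinct message_ids in the group
        total_ids   = email_campaigns.message_id;
        total_dist  = DISTINCT total_ids;

        GENERATE group AS used_id,
                 COUNT(click_dist)  AS clickthroughs,
                 COUNT(opened_dist) AS opened_messages,
                 COUNT(total_dist)  AS total_messages_received;
}

-- Print the per-user counts
DUMP B;
```

With the sample data from the question, this should produce the same two tuples as shown above. COUNT returns 0 for an empty bag, which is why a user with no click rows still gets a row with 0 clickthroughs.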

Let me know if you need further clarification on what is happening.