I'm new to Apache Pig and trying to learn it. Is there an equivalent of SQL's COUNT(DISTINCT CASE WHEN ...) in Apache Pig?
For example, I'm trying to do something like this:
CREATE TABLE email_profile AS
SELECT user_id
, COUNT(DISTINCT CASE WHEN email_code = 'C' THEN message_id ELSE NULL END) AS clickthroughs
, COUNT(DISTINCT CASE WHEN email_code = 'O' THEN message_id ELSE NULL END) AS opened_messages
, COUNT(DISTINCT message_id) AS total_messages_received
FROM email_campaigns
GROUP BY user_id;
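I can't simply use FILTER email_campaigns BY email_code == 'C', because that throws away the rows needed for the other cases. Is there a way to do all of this in a single nested FOREACH block?
Thanks!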
Edit:
As requested, here is some sample data. The fields are used_id, email_code, and message_id.
user1@example.com O 111
user1@example.com C 111
user2@example.com O 111
user1@example.com O 222
user2@example.com O 333
Expected output (user, opened messages, clickthroughs, total messages received):
user1@example.com 2 1 2
user2@example.com 2 0 2
Answer 0 (score: 3)
You can do the filtering inside a nested FOREACH after you GROUP on used_id. See the comments in my code for more details.
Something like:
-- First we group, so the nested FOREACH is applied once per used_id
A = GROUP email_campaigns BY used_id ;
B = FOREACH A {
    -- We need these three lines to accomplish the:
    -- DISTINCT CASE WHEN email_code = 'C' THEN message_id ELSE NULL END
    -- First, we keep only the rows where email_code == 'C'
    click_filt = FILTER email_campaigns BY email_code == 'C' ;
    -- Since we only want unique message_ids, we need to project it out
    click_proj = FOREACH click_filt GENERATE message_id ;
    -- Now we can find all unique message_ids for the given filter
    click_dist = DISTINCT click_proj ;

    -- Same three steps for opened messages (email_code == 'O')
    opened_filt = FILTER email_campaigns BY email_code == 'O' ;
    opened_proj = FOREACH opened_filt GENERATE message_id ;
    opened_dist = DISTINCT opened_proj ;

    -- No filter for the total: project and de-duplicate every message_id
    total_proj = FOREACH email_campaigns GENERATE message_id ;
    total_dist = DISTINCT total_proj ;

    GENERATE group AS used_id, COUNT(click_dist) AS clickthroughs,
             COUNT(opened_dist) AS opened_messages,
             COUNT(total_dist) AS total_messages_received ;
}
The output of B should be:
(user1@example.com,1,2,2)
(user2@example.com,0,2,2)
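If you want to sanity-check those numbers without a Pig installation, here is a minimal sketch in plain Python (not Pig) that applies the same filter, project, distinct, count logic to the sample rows from the question. The function name email_profile and the in-memory rows list are just assumptions for illustration; they are not part of the Pig script.

```python
from collections import defaultdict

# Sample rows from the question: (used_id, email_code, message_id)
rows = [
    ("user1@example.com", "O", "111"),
    ("user1@example.com", "C", "111"),
    ("user2@example.com", "O", "111"),
    ("user1@example.com", "O", "222"),
    ("user2@example.com", "O", "333"),
]


def email_profile(rows):
    """Return {used_id: (clickthroughs, opened_messages, total_messages_received)}.

    Hypothetical helper: one set per user and per case plays the role of
    FILTER + projection + DISTINCT in the nested FOREACH.
    """
    clicked = defaultdict(set)
    opened = defaultdict(set)
    total = defaultdict(set)
    for used_id, email_code, message_id in rows:
        total[used_id].add(message_id)        # COUNT(DISTINCT message_id)
        if email_code == "C":
            clicked[used_id].add(message_id)  # DISTINCT ... WHEN email_code = 'C'
        elif email_code == "O":
            opened[used_id].add(message_id)   # DISTINCT ... WHEN email_code = 'O'
    # A user with no matching rows gets an empty set, so the count is 0.
    return {u: (len(clicked[u]), len(opened[u]), len(total[u])) for u in total}


profile = email_profile(rows)
for used_id in sorted(profile):
    print((used_id,) + profile[used_id])
```

This prints one tuple per user in the same column order as the GENERATE statement: clickthroughs, opened_messages, total_messages_received.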
Let me know if you need any further clarification on what is going on.