Filtering a column by another relation in PIG

Time: 2015-09-21 20:36:44

Tags: apache-pig

Suppose I have the following data in PIG.

Now I need to filter raw down to the rows where process_date eq max_date. Here is what I have so far:

DUMP raw;
(2015-09-15T22:11:00.000-07:00,1)
(2015-09-15T22:12:00.000-07:00,2)
(2015-09-15T23:11:00.000-07:00,3)
(2015-09-16T21:02:00.000-07:00,4)
(2015-09-15T00:02:00.000-07:00,5)
(2015-09-17T08:02:00.000-07:00,5)
(2015-09-17T09:02:00.000-07:00,5)
(2015-09-17T09:02:00.000-07:00,1)
(2015-09-17T19:02:00.000-07:00,1)

DESCRIBE raw;
raw: {process_date: chararray,id: int}

A = GROUP raw BY id;
DESCRIBE A;
A: {group: int,raw: {(process_date: chararray,id: int)}}
DUMP A;

(1,{(2015-09-15T22:11:00.000-07:00,1),(2015-09-17T09:02:00.000-07:00,1),(2015-09-17T19:02:00.000-07:00,1)})
(2,{(2015-09-15T22:12:00.000-07:00,2)})
(3,{(2015-09-15T23:11:00.000-07:00,3)})
(4,{(2015-09-16T21:02:00.000-07:00,4)})
(5,{(2015-09-15T00:02:00.000-07:00,5),(2015-09-17T08:02:00.000-07:00,5),(2015-09-17T09:02:00.000-07:00,5)})


B = FOREACH A {generate raw,MAX(raw.process_date) AS max_date;}
DUMP B;
({(2015-09-15T22:11:00.000-07:00,1),(2015-09-17T09:02:00.000-07:00,1),(2015-09-17T19:02:00.000-07:00,1)},2015-09-17T19:02:00.000-07:00)
({(2015-09-15T22:12:00.000-07:00,2)},2015-09-15T22:12:00.000-07:00)
({(2015-09-15T23:11:00.000-07:00,3)},2015-09-15T23:11:00.000-07:00)
({(2015-09-16T21:02:00.000-07:00,4)},2015-09-16T21:02:00.000-07:00)
({(2015-09-15T00:02:00.000-07:00,5),(2015-09-17T08:02:00.000-07:00,5),(2015-09-17T09:02:00.000-07:00,5)},2015-09-17T09:02:00.000-07:00)

DESCRIBE B;
B: {raw: {(process_date: chararray,id: int)},max_date: chararray}

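I then tried the following filter, but it is not working:

C = FOREACH B {filtered = FILTER raw BY REGEX_EXTRACT(process_date,'(\\d{4}-\\d{2}-\\d{2})',1) eq REGEX_EXTRACT(max_date,'(\\d{4}-\\d{2}-\\d{2})',1)}

The exception I get is:

Invalid field projection. Projected field [max_date] does not exist in schema: process_date:chararray,id:int

Is there a way to do this kind of filtering? Basically, I need to filter raw by the latest date.

Expected output: for each id, the rows that fall on that id's latest date (date only, not time).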
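
The error message suggests that the nested FILTER only sees the schema of the tuples inside the raw bag (process_date, id), so max_date, which is a field of B itself, is not visible there. One possible workaround, given only as a sketch, is to flatten the bag so that every row carries its group's max_date, and then filter at the top level. The aliases flat, latest and result are hypothetical names introduced for illustration. The sketch assumes every process_date is an ISO-8601 string with the same UTC offset, so that MAX on chararray and the first 10 characters behave like a date comparison.

```pig
-- Assumes A is the grouped relation from above:
--   A: {group: int, raw: {(process_date: chararray, id: int)}}

-- Flatten the bag so each row sits next to its group's latest timestamp.
flat = FOREACH A GENERATE
           FLATTEN(raw) AS (process_date:chararray, id:int),
           MAX(raw.process_date) AS max_date;

-- Compare only the date part (first 10 characters, e.g. 2015-09-17).
latest = FILTER flat BY SUBSTRING(process_date, 0, 10) == SUBSTRING(max_date, 0, 10);

-- Drop the helper column so the result has the same shape as raw.
result = FOREACH latest GENERATE process_date, id;
DUMP result;
```

SUBSTRING is used here in place of REGEX_EXTRACT only because the date prefix has a fixed width. The REGEX_EXTRACT expressions from the attempt above should work equally well in the top-level FILTER.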

0 answers:

No answers yet