Q: How to ... from PIG

Time: 2018-01-31 07:08:46

Tags: hadoop nested apache-pig bag

Initially I have a structure like this:

+-------+-------+----+----+----+-----+
| time  | type  | s1 | s2 | id | p1  |
+-------+-------+----+----+----+-----+
| 10:30 | send  | a  | b  |  1 | 110 |
| 10:35 | send  | c  | d  |  1 | 120 |
| 10:31 | reply | e  | f  |  3 | 221 |
| 10:33 | reply | a  | c  |  1 | 210 |
| 10:34 | send  | a  | a  |  3 | 113 |
| 10:32 | reply | c  | d  |  3 | 157 |
+-------+-------+----+----+----+-----+

I would like to normalize the table:

  1. Group the entries by id
  2. Within each group, find the oldest entry of type send
  3. Replace s1 and s2 of the other entries with the values from that oldest send entry

Expected result:

+-------+-------+----+----+----+-----+
| time  | type  | s1 | s2 | id | p1  |
+-------+-------+----+----+----+-----+
| 10:30 | send  | a  | b  |  1 | 110 |
| 10:35 | send  | a  | b  |  1 | 120 |
| 10:33 | reply | a  | b  |  1 | 210 |
| 10:31 | reply | a  | a  |  3 | 221 |
| 10:34 | send  | a  | a  |  3 | 113 |
| 10:32 | reply | a  | a  |  3 | 157 |
+-------+-------+----+----+----+-----+

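This is how I tried to solve the problem:

events_groupby_id = GROUP events BY id;
events_normalized = FOREACH events_groupby_id {
   -- keep only the send entries of this group
   f_reqs = FILTER events BY type matches 'send';
   -- sort the filtered bag (f_reqs, not events) so that req is the oldest send entry
   o_reqs = ORDER f_reqs BY time ASC;
   req = LIMIT o_reqs 1;
   GENERATE req, events;
};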
    

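I am stuck here, because events_normalized turned out to be a complex structure with nested bags, and I do not know how to flatten it correctly.

events_normalized | req:bag{:tuple()} | events:bag{:tuple()}

From here, what should I do to get the data structure I want? I would really appreciate any help. Thanks.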

1 answer:

Answer 0: (score: 1)

You can un-nest the bags in events_normalized using FLATTEN:

events_flattened = FOREACH events_normalized GENERATE 
    FLATTEN(req), 
    FLATTEN(events);

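This creates a cross product between req and events, but since req contains only one tuple, you end up with exactly one record per original entry. The schema of events_flattened is: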

req::time | req::type | req::s1 | req::s2 | req::id | req::p1 | events::time | events::type | events::s1 | events::s2 | events::id | events::p1
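
To see the cross product concretely, the fragment below inspects the flattened rows of the group with id 3. It is only a sketch that continues from the events_flattened relation defined above. It assumes that req really holds the oldest send entry, that is, that the nested ORDER sorts f_reqs rather than events. The alias id3_rows is a hypothetical name used only for illustration.

```pig
-- Print the schema of the flattened relation.
DESCRIBE events_flattened;

-- id3_rows is a hypothetical alias: keep only the rows that came from group id 3.
id3_rows = FILTER events_flattened BY events::id == 3;
DUMP id3_rows;

-- Assuming req is the oldest send entry of the group (10:34, send, a, a),
-- each of the three original entries is paired with that single req tuple.
-- The rows should have this shape (row order is not guaranteed):
-- (10:34,send,a,a,3,113,10:31,reply,e,f,3,221)
-- (10:34,send,a,a,3,113,10:34,send,a,a,3,113)
-- (10:34,send,a,a,3,113,10:32,reply,c,d,3,157)
```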

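Now you can reference the fields you want to keep, using events for the original entry and req for the replacement values from the oldest send-type entry: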

final = FOREACH events_flattened GENERATE 
    events::time AS time, 
    events::type AS type, 
    req::s1 AS s1, 
    req::s2 AS s2, 
    events::id AS id, 
    events::p1 AS p1;
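
Putting the steps together, a minimal end-to-end sketch could look like the following. The input path 'events.csv', the comma delimiter, and the column types are assumptions that do not appear in the question. The nested ORDER is applied to f_reqs here, because req only holds the oldest send entry if the filtered bag is the one being sorted.

```pig
-- Assumption: a comma-separated file at the hypothetical path 'events.csv'
-- with the six columns shown in the question.
events = LOAD 'events.csv' USING PigStorage(',')
    AS (time:chararray, type:chararray, s1:chararray, s2:chararray, id:int, p1:int);

events_groupby_id = GROUP events BY id;

events_normalized = FOREACH events_groupby_id {
    -- keep only the send entries of this group
    f_reqs = FILTER events BY type == 'send';
    -- sort the filtered bag, then keep the oldest send entry
    o_reqs = ORDER f_reqs BY time ASC;
    req = LIMIT o_reqs 1;
    GENERATE req, events;
};

-- one output row per (req, original entry) pair
events_flattened = FOREACH events_normalized GENERATE
    FLATTEN(req),
    FLATTEN(events);

-- take s1 and s2 from req, everything else from the original entry
final = FOREACH events_flattened GENERATE
    events::time AS time,
    events::type AS type,
    req::s1 AS s1,
    req::s2 AS s2,
    events::id AS id,
    events::p1 AS p1;

DUMP final;
```

With the sample data from the question, this should reproduce the expected table, although the row order may differ. One caveat: FLATTEN of an empty bag produces no rows, so a group without any send entry would disappear from the output entirely.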