Joining on multiple fields in Pig

Time: 2014-01-07 00:53:19

Tags: apache-pig

I am learning Pig and am not sure how to do the following. I have a file that stores metadata about a series of chat messages:

12345 13579
23456 24680
19350 20283
28394 20384
10384 29475
.
.
.

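The first column is the sender's ID and the second column is the receiver's ID. What I want to do is count the messages sent from men to women, men to men, women to men, and women to women. So I have another file that stores each user's ID and sex: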

12345 M
23456 F
34567 M
45678 M
.
.
.

So the Pig script might start like this:

messages = load 'messages.txt' as (from:int, to:int);
users = load 'users.txt' as (id:int,sex:chararray);

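From there, I am really not sure what the next step should be. I can join messages to users on one column, but I don't know how to join on both columns and then do the subsequent grouping.

Any suggestions or hints would be very helpful.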

1 Answer:

Answer 0 (score: 1)

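I think what you want is to join users twice, once on the sender and once on the receiver, and then group by the pair of sexes and count.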

joinedSenderRaw = JOIN users BY id, messages BY from;

joinedSender = FOREACH joinedSenderRaw
    GENERATE messages::from as sender_id,
             users::sex as sender_sex,
             messages::to as receiver_id;

joinedAllRaw = JOIN joinedSender BY receiver_id, users BY id;

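-- Give every field a plain alias so later steps can refer to it without a relation prefix.
joinedAll = FOREACH joinedAllRaw
    GENERATE joinedSender::sender_id AS sender_id,
             joinedSender::sender_sex AS sender_sex,
             joinedSender::receiver_id AS receiver_id,
             users::sex AS receiver_sex;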

grouped = GROUP joinedAll BY (sender_sex, receiver_sex);

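-- "group" is the (sender_sex, receiver_sex) key; the bag named joinedAll holds the matching rows.
result = FOREACH grouped
    GENERATE group.sender_sex AS sender_sex,
             group.receiver_sex AS receiver_sex,
             COUNT(joinedAll) AS message_count;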

I haven't tested it, but something like this should work.
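
Two details may matter when running this on real data. PigStorage, the default loader, splits fields on tabs, so a space-delimited file needs an explicit delimiter. An inner JOIN also drops every message whose sender or receiver is missing from users.txt. The following is a minimal, untested sketch of the same join, group, and count approach under two assumptions that the question does not confirm: that both files are space-delimited, and that messages involving unknown users should still be counted (under a null sex). The relation names withSender, withBoth, and counts are hypothetical.

```pig
-- Assumption: both files are space-delimited (PigStorage defaults to tab).
messages = LOAD 'messages.txt' USING PigStorage(' ') AS (from:int, to:int);
users    = LOAD 'users.txt'    USING PigStorage(' ') AS (id:int, sex:chararray);

-- Attach the sender's sex. LEFT OUTER keeps messages whose sender is not in users.txt.
withSenderRaw = JOIN messages BY from LEFT OUTER, users BY id;
withSender = FOREACH withSenderRaw
    GENERATE messages::to AS receiver_id,
             users::sex   AS sender_sex;

-- Attach the receiver's sex the same way.
withBothRaw = JOIN withSender BY receiver_id LEFT OUTER, users BY id;
withBoth = FOREACH withBothRaw
    GENERATE withSender::sender_sex AS sender_sex,
             users::sex             AS receiver_sex;

-- One output row per (sender_sex, receiver_sex) pair.
grouped = GROUP withBoth BY (sender_sex, receiver_sex);
counts = FOREACH grouped
    GENERATE FLATTEN(group) AS (sender_sex, receiver_sex),
             -- COUNT_STAR also counts rows whose first field is null; COUNT would skip them.
             COUNT_STAR(withBoth) AS message_count;

DUMP counts;
```

If messages involving unknown users should be excluded instead, removing both LEFT OUTER keywords gives plain inner joins, as in the answer above.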