Join on a condition, filter by time range, and limit in Pig

Time: 2015-06-17 08:22:24

Tags: hadoop apache-pig

I have relation A and relation B. For each row in A, there may be multiple matching rows in B.

Say:

A = (id1, type, location, gender, startDateTime)
B = (id2, type, location, gender, registerStartDateTime, registerEndDateTime, value)

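I need to join A and B on (type, location, gender) where (startDateTime > registerStartDateTime) and (startDateTime < registerEndDateTime).

This join may return multiple rows from B with different values. I want to select only the first returned row for the final output.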

output = Join A by (type, location, gender), B by (type, location, gender)

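How do I add the datetime range condition to the join above? And how do I restrict the result to a single row from B when performing the join?

In SQL: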

SELECT 
a.id, b.value
FROM
    a, b
WHERE
    a.type = b.type
        AND a.location = b.location
        AND a.gender = b.gender
        AND a.startDateTime between b.registerStartDateTime and b.registerEndDateTime 
limit 1;

How can I do the same thing in Pig?

1 answer:

Answer 0 (score: 1)

Try this:

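-- Schemas (notation only, not Pig statements):
-- A = (id1, type, location, gender, startDateTime)
-- B = (id2, type, location, gender, registerStartDateTime, registerEndDateTime, value)

-- `output` is listed among Pig's reserved keywords, so the join result is named `joined` here.
joined = JOIN A BY (type, location, gender), B BY (type, location, gender);

-- Pig's JOIN only supports equality keys, so the range condition is applied as a FILTER afterwards.
filteroutput = FILTER joined BY (startDateTime > registerStartDateTime) AND (startDateTime < registerEndDateTime);

/* Optionally sort first so that LIMIT picks a deterministic row:
sortoutput = ORDER filteroutput BY startDateTime;
limitoutput = LIMIT sortoutput 1;
*/

limitoutput = LIMIT filteroutput 1;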
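
Note that a LIMIT of 1 on the filtered relation keeps a single row overall, which matches the SQL in the question. If the goal is instead one matching B row for each row of A, a nested FOREACH is one way to do it. The following is only a sketch under assumptions: the file names `a.tsv` and `b.tsv` are hypothetical, all fields are loaded as chararray, and the timestamps are assumed to be in a format that sorts lexicographically (for example `yyyy-MM-dd HH:mm:ss`), since chararray comparison is plain string comparison. Ordering by registerStartDateTime to decide which row counts as "first" is also an assumption.

```pig
-- Hypothetical input files and schemas; adjust to the real data.
A = LOAD 'a.tsv' AS (id1:chararray, type:chararray, location:chararray,
                     gender:chararray, startDateTime:chararray);
B = LOAD 'b.tsv' AS (id2:chararray, type:chararray, location:chararray,
                     gender:chararray, registerStartDateTime:chararray,
                     registerEndDateTime:chararray, value:chararray);

-- Equi-join on the three shared keys.
joined = JOIN A BY (type, location, gender), B BY (type, location, gender);

-- Apply the range condition after the join; the A:: and B:: prefixes
-- make explicit which relation each field came from.
in_range = FILTER joined BY (A::startDateTime > B::registerStartDateTime)
                        AND (A::startDateTime < B::registerEndDateTime);

-- Group the surviving rows by the A-side id, then keep one row per group.
grouped = GROUP in_range BY A::id1;
first_per_a = FOREACH grouped {
    -- Assumed tie-break: earliest registration window first.
    sorted = ORDER in_range BY B::registerStartDateTime;
    top1 = LIMIT sorted 1;
    GENERATE FLATTEN(top1);
};

DUMP first_per_a;
```

Rows of A with no B row in range are dropped by this approach, just as they would be by the inner join in the SQL version.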