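I need to perform a filter operation in Pig. There are two relations, 'a' and 'b'. Each has an ID and a timestamp, along with many other fields. I want to return 'a' with every row removed that has a certain time difference from its matching row in 'b'. The rows in both relations are unordered, but they are matched on ID.
The problem is that after the operation I am stuck with a relation that still carries a bunch of fields from 'b'. The only way I have seen to drop columns is the FOREACH statement below, but the actual relations have many fields, which makes the code bulky, hard to maintain, and error-prone.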
a = LOAD 'a.txt' AS (a1:chararray, aTime:chararray, aId:chararray);
b = LOAD 'b.txt' AS (b1:chararray, bTime:chararray, bId:chararray);
--Match a and b rows together based on matching IDs
ab = JOIN a BY aId LEFT OUTER, b BY bId;
--Remove rows with a big difference in timestamps
ab = FILTER ab BY timeDifference(a::aTime, b::bTime) < 60;
--ab has columns from b left over, I only want columns from relation a
aResult = FOREACH ab GENERATE
a1 AS a1,
aTime AS aTime,
aId AS aId; --This list is way too awkward
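Is there a better way in Pig to do this kind of operation, i.e. removing rows from one relation based on columns in a different relation?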