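I have two datasets. The first contains usernames, their assigned IDs, and the time period during which each ID is valid:
data1: {username: chararray, id: chararray, start_time: datetime, stop_time: datetime}
The second contains timestamped events generated by users, who are identified by ID:
data2: {user_id: chararray, event_data: chararray, event_time: datetime}
I am trying to join the two sets so that I can match each event to a username within the period when the ID was valid. Essentially, in SQL terms, I want to apply the following condition to the join: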
WHERE (data1.id = data2.user_id) AND (data2.event_time > data1.start_time) AND (data2.event_time < data1.stop_time)
I tried the following script:
joined = JOIN data1 BY id, data2 BY user_id;
matched = FILTER joined BY (SecondsBetween(start_time, event_time) < (long) 0) AND (SecondsBetween(event_time, stop_time) < (long) 0);
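The problem is that when I try to run it, I get the error "Error 0: Scalar has more than one row in the output". I'm not sure what the error means or how to fix it.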
Answer 0: (score: 1)
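How are you loading your data? I ran your code against very simple one-row test data and it gave me no problems. My code and test data are below.
Pig script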
tmp_data1 = LOAD 'data1.txt' USING PigStorage('\t') AS (username:chararray, id:chararray, start_time:chararray, stop_time:chararray);
tmp_data2 = LOAD 'data2.txt' USING PigStorage('\t') AS (user_id:chararray, event_data:chararray, event_time:chararray);
data1 = FOREACH tmp_data1 GENERATE
    username, id, ToDate(start_time, 'yyyy-MM-dd HH:mm:ss') AS start_time, ToDate(stop_time, 'yyyy-MM-dd HH:mm:ss') AS stop_time;
data2 = FOREACH tmp_data2 GENERATE
    user_id, event_data, ToDate(event_time, 'yyyy-MM-dd HH:mm:ss') AS event_time;
joined = JOIN data1 BY id, data2 BY user_id;
matched = FILTER joined BY (SecondsBetween(start_time, event_time) < (long) 0) AND (SecondsBetween(event_time, stop_time) < (long) 0);
dump matched;
data1.txt (should be tab-separated)
abc abc 2015-01-01 00:00:00 2015-01-02 00:00:00
data2.txt (should be tab-separated)
abc abc 2015-01-01 01:00:00
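If the script you actually ran differs from what you posted, one thing worth checking is how the joined fields are referenced. As far as I know, this error usually appears when a relation is written with a dot, as in data1.start_time, because Pig then treats data1 as a scalar that must contain exactly one row. After a JOIN, fields are addressed with the `::` prefix instead. The following is only a sketch under that assumption, not a confirmed diagnosis; the alias named_events is a hypothetical name I made up for illustration.

```pig
-- Sketch only: assumes data1 and data2 are loaded as in the script above.
joined = JOIN data1 BY id, data2 BY user_id;

-- Prints the schema of the joined relation, so the prefixed field
-- names (data1::username, data2::event_time, ...) can be checked.
DESCRIBE joined;

-- Reference joined fields with '::'. Writing data1.start_time here
-- would instead ask Pig to read the relation data1 as a scalar.
matched = FILTER joined BY
    (SecondsBetween(data1::start_time, data2::event_time) < 0L) AND
    (SecondsBetween(data2::event_time, data1::stop_time) < 0L);

-- Keep only the columns of interest (named_events is a hypothetical alias).
named_events = FOREACH matched GENERATE
    data1::username   AS username,
    data2::event_data AS event_data,
    data2::event_time AS event_time;

DUMP named_events;
```

With the one-row test files above, the event at 01:00:00 falls inside the validity window, so I would expect a single row to survive the filter.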