Performing a complex join using datetimes

Time: 2015-03-25 19:36:31

Tags: apache-pig

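I have two datasets. One contains usernames, the ID assigned to each user, and the time period during which that ID is valid:

data1: {username: chararray, id: chararray, start_time: datetime, stop_time: datetime}

The other contains timestamped events generated by users, who are identified by ID:

data2: {user_id: chararray, event_data: chararray, event_time: datetime}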

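I am trying to join these two sets so that I can match each event to a username within the period during which the ID was valid. In SQL terms, I want to apply the following condition to the join: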

WHERE (data1.id = data2.user_id) AND (data2.event_time > data1.start_time) AND (data2.event_time < data1.stop_time)

I tried the following script:

joined = JOIN data1 BY id, data2 BY user_id;
matched = FILTER joined BY (SecondsBetween(start_time, event_time) < (long) 0) AND (SecondsBetween(event_time, stop_time) < (long) 0);

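The problem is that when I try to run it, I get the error "ERROR 0: Scalar has more than one row in the output". I am not sure what the error means or how to resolve it.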

1 answer:

Answer 0: (score: 1)

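How are you loading the data? I ran your code against very simple one-row test data and it did not give me any problems. My code and test data are below.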

Pig script

tmp_data1 = LOAD 'data1.txt' USING PigStorage('\t') AS (username:chararray, id:chararray, start_time:chararray, stop_time:chararray);
tmp_data2 = LOAD 'data2.txt' USING PigStorage('\t') AS (user_id:chararray, event_data:chararray, event_time:chararray);
data1 = FOREACH tmp_data1 GENERATE
    username, id, ToDate(start_time, 'yyyy-MM-dd HH:mm:ss') AS start_time, ToDate(stop_time, 'yyyy-MM-dd HH:mm:ss') AS stop_time; 
data2 = FOREACH tmp_data2 GENERATE
    user_id, event_data, ToDate(event_time, 'yyyy-MM-dd HH:mm:ss') AS event_time; 
joined = JOIN data1 BY id, data2 BY user_id;
matched = FILTER joined BY (SecondsBetween(start_time, event_time) < (long) 0) AND (SecondsBetween(event_time, stop_time) < (long) 0);
dump matched;

data1.txt (fields should be tab-delimited)

abc abc 2015-01-01 00:00:00 2015-01-02 00:00:00

data2.txt (fields should be tab-delimited)

abc abc 2015-01-01 01:00:00
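
A note on the error itself: in Pig, "Scalar has more than one row in the output" is typically raised when a relation is referenced as a scalar, that is, with dot syntax such as data1.start_time, and that relation holds more than one row. Whether that is what happened in your script is an assumption on my part, since the snippet you posted does not use dot syntax. If it is, the sketch below shows the same join with the fields disambiguated using :: instead. The result relation and its projected columns are my own addition for illustration.

```pig
-- Assumes data1 and data2 already exist with the schemas from the question.
joined = JOIN data1 BY id, data2 BY user_id;

-- After a JOIN, fields are addressed as alias::field.
-- Writing alias.field would instead be treated as a scalar lookup on the relation.
-- SecondsBetween(a, b) < 0 is used here, as in the question, to mean "a is earlier than b".
matched = FILTER joined BY
    (SecondsBetween(data1::start_time, data2::event_time) < 0L) AND
    (SecondsBetween(data2::event_time, data1::stop_time) < 0L);

-- Hypothetical projection: keep only the username and the event columns.
result = FOREACH matched GENERATE
    data1::username   AS username,
    data2::event_data AS event_data,
    data2::event_time AS event_time;

DUMP result;
```

A scalar reference only fails once the referenced relation has more than one row, so one-row test files like mine would not reproduce it. That may be why the problem appears only on your real data.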