We are using pig-0.11.0-cdh4.3.0 on a CDH4 cluster, and we need to de-duplicate some web logs. The idea for the solution, expressed in SQL, is something like this:
SELECT
    T1.browser,
    T1.click_type,
    T1.referrer,
    T1.datetime,
    T2.datetime
FROM
    My_Table T1
    INNER JOIN My_Table T2 ON
        T2.browser = T1.browser AND
        T2.click_type = T1.click_type AND
        T2.referrer = T1.referrer AND
        T2.datetime > T1.datetime AND
        T2.datetime <= DATEADD(mi, 1, T1.datetime)
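I took the SQL above from the question "SQL find duplicate records occuring within 1 minute of each other". I was hoping to implement a similar solution in Pig, but I found that Pig apparently does not support JOIN on expressions (only on fields), which is what the join above requires. Do you know how to de-duplicate events that occur within 1 minute of each other in Pig? Thanks!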
Answer 0 (score: 0)
Off the top of my head, something like this could work, but it needs testing:
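-- `records` is the loaded log relation (renamed from `input`, which is a reserved keyword in Pig).
view = FOREACH records GENERATE browser, click_type, referrer, datetime,
    GetYear(datetime) AS year, GetMonth(datetime) AS month, GetDay(datetime) AS day,
    GetHour(datetime) AS hour, GetMinute(datetime) AS minute;
-- One group per (browser, click_type, referrer) and calendar minute.
grp = GROUP view BY (browser, click_type, referrer, year, month, day, hour, minute);
uniq = FOREACH grp {
    -- Keep a single (arbitrary) event from each group.
    top = LIMIT view 1;
    GENERATE FLATTEN(top.(browser, click_type, referrer, datetime));
};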
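The reasoning: if one event occurs at 12:03:45 and another at 12:03:59, they end up in the same group, whereas events at 12:04:59 and 12:05:00 end up in different groups even though they are only one second apart.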
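To make the bucket boundary concrete, here is a small plain-Python sketch (not Pig) that mimics the grouping key. The function names and the timestamps are hypothetical and chosen only for illustration.

```python
from datetime import datetime


def minute_bucket(ts):
    """Mimic the GROUP key: truncate a timestamp to its calendar minute."""
    return (ts.year, ts.month, ts.day, ts.hour, ts.minute)


def same_bucket(a, b):
    """True when both timestamps would land in the same group."""
    return minute_bucket(a) == minute_bucket(b)


# Illustrative timestamps (hypothetical data, not taken from real logs).
a = datetime(2013, 7, 1, 12, 3, 45)
b = datetime(2013, 7, 1, 12, 3, 59)
c = datetime(2013, 7, 1, 12, 4, 59)
d = datetime(2013, 7, 1, 12, 5, 0)

print(same_bucket(a, b), (b - a).total_seconds())  # True 14.0
print(same_bucket(c, d), (d - c).total_seconds())  # False 1.0
```

The second pair is one second apart but straddles a minute boundary, so grouping by calendar minute would not treat it as a duplicate.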
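To get an exact 60-second difference, you would need to write a UDF that iterates over a sorted bag grouped on (browser, click_type, referrer) and removes the unwanted rows.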
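As a rough sketch of the logic such a UDF might implement, here it is in plain Python (Pig can register Python UDFs through Jython, so something along these lines could probably be adapted). The function name, the (datetime, payload) tuple layout, and the rule "drop an event that falls within the window of the last event that was kept" are all assumptions; the question does not say how chains of near-duplicates should be handled.

```python
from datetime import datetime, timedelta


def drop_near_duplicates(events, window=timedelta(minutes=1)):
    """Return the events of ONE (browser, click_type, referrer) group
    with near-duplicates removed.

    `events` is an iterable of (datetime, payload) tuples (hypothetical layout).
    Assumption: an event is dropped when it occurs within `window` of the
    last event that was kept.
    """
    kept = []
    # Sort by timestamp so each event is compared with the last kept one.
    for when, payload in sorted(events, key=lambda e: e[0]):
        if not kept or when - kept[-1][0] > window:
            kept.append((when, payload))
    return kept


# Illustrative data (hypothetical).
sample = [
    (datetime(2013, 7, 1, 12, 3, 45), "a"),
    (datetime(2013, 7, 1, 12, 3, 59), "b"),  # 14 s after "a": dropped
    (datetime(2013, 7, 1, 12, 4, 59), "c"),  # 74 s after "a": kept
    (datetime(2013, 7, 1, 12, 5, 0), "d"),   # 1 s after "c": dropped
]

for when, payload in drop_near_duplicates(sample):
    print(when.isoformat(), payload)
```

Comparing against the last kept event (rather than the previous event) means a long burst of clicks collapses to one row per minute instead of one row in total. Whether that is the desired behaviour depends on the data.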
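Answer 1 (score: 0)
You can do something like this, using the parameters you need:
-- Assumes `grpd` is `records` grouped on the relevant fields
-- and `time` is a numeric field expressed in seconds.
top3 = foreach grpd {
    sorted = filter records by time < 60;
    top = limit sorted 2;
    generate group, flatten(top);
};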
Answer 2 (score: 0)
Here is another approach:
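records_group = group records by (browser, click_type, referrer);
-- Attach each group's latest datetime to every record in that group.
with_min = FOREACH records_group
    GENERATE FLATTEN(records), MAX(records.datetime) AS maxDt;
-- Assumes datetime is numeric (e.g. epoch seconds), so the difference is in seconds.
filterRecords = filter with_min by (maxDt - $2) < 60;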
$2 is the position of the datetime field; change it accordingly.
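To see what the filter above keeps, here is a plain-Python sketch of the same comparison for a single group. It assumes timestamps are epoch seconds; the function name and sample values are hypothetical.

```python
def near_latest(timestamps, window_seconds=60):
    """Keep the timestamps (epoch seconds) that are less than
    `window_seconds` older than the latest timestamp in the group."""
    if not timestamps:
        return []
    latest = max(timestamps)
    return [t for t in timestamps if latest - t < window_seconds]


# Illustrative group (hypothetical values).
group = [1000, 1040, 1075, 1090]
print(near_latest(group))  # [1040, 1075, 1090]
```

Note that this keeps every record close to the group's latest event rather than one record per burst, so a further step would still be needed to reduce them to a single row.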
Answer 3 (score: 0)
Aleks and Marq,
records_group = group records by (browser, click_type, referrer);
-- `maxDt` replaces the original alias `max`, which clashes with the built-in MAX.
with_min = FOREACH records_group
    GENERATE FLATTEN(records), MAX(records.datetime) AS maxDt;
-- Assumes datetime is numeric, so the difference can be taken directly.
with_min = FOREACH with_min GENERATE browser, click_type, referrer,
    ABS(maxDt - datetime) AS maxDtGroup;
regroup = group with_min by (browser, click_type, referrer, maxDtGroup);
Regrouping on maxDtGroup is the key; after that, filter each group down to its top 1 record.