Pig: removing duplicate events that occur within 1 minute of each other

Date: 2013-07-16 12:47:30

Tags: hadoop mapreduce apache-pig

We are using pig-0.11.0-cdh4.3.0 on a CDH4 cluster, and we need to deduplicate some web logs. The idea for the solution (expressed in SQL) is this:

SELECT
     T1.browser,
     T1.click_type,
     T1.referrer,
     T1.datetime,
     T2.datetime
FROM
     My_Table T1
INNER JOIN My_Table T2 ON
     T2.browser = T1.browser AND
     T2.click_type = T1.click_type AND
     T2.referrer = T1.referrer AND
     T2.datetime > T1.datetime AND
     T2.datetime <= DATEADD(mi, 1, T1.datetime)

I grabbed the SQL above from the question "find duplicate records occuring within 1 minute of each other". I was hoping I could implement a similar solution in Pig, but I found that Pig apparently does not support JOIN on expressions (only on fields), which the join above requires. Do you know how to remove events within 1 minute of each other in Pig? Thanks!

4 Answers:

Answer 0 (score: 0)

Off the top of my head, something like this could work, but it needs testing:

view = FOREACH input GENERATE browser, click_type, referrer, datetime,
       GetYear(datetime) AS year, GetMonth(datetime) AS month,
       GetDay(datetime) AS day, GetHour(datetime) AS hour,
       GetMinute(datetime) AS minute;
grp = GROUP view BY (browser, click_type, referrer, year, month, day, hour, minute);
uniq = FOREACH grp {
    top = LIMIT view 1;
    GENERATE FLATTEN(top.(browser, click_type, referrer, datetime));
};

The reasoning: if one event happened at 12:03:45 and another at 12:03:59, they land in the same group. Note, however, that this is minute bucketing rather than a true 60-second window: events at 12:03:59 and 12:04:01 would land in different groups even though they are only two seconds apart.
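The minute-bucketing idea above can be sketched outside Pig as well. This is a minimal Python illustration (the sample events and field layout are hypothetical, assumed only for the demo): group records by (browser, click_type, referrer) plus the calendar minute, and keep the first record of each bucket, mirroring the GROUP ... LIMIT 1 pattern.

```python
from datetime import datetime
from itertools import groupby

# Hypothetical sample events: (browser, click_type, referrer, datetime)
events = [
    ("firefox", "ad", "ref1", datetime(2013, 7, 16, 12, 3, 45)),
    ("firefox", "ad", "ref1", datetime(2013, 7, 16, 12, 3, 59)),
    ("firefox", "ad", "ref1", datetime(2013, 7, 16, 12, 4, 1)),
]

def bucket_key(e):
    """Dedup key: the grouping fields plus the calendar minute."""
    browser, click_type, referrer, dt = e
    return (browser, click_type, referrer,
            dt.year, dt.month, dt.day, dt.hour, dt.minute)

# Stable sort keeps earlier events first within each bucket.
events.sort(key=bucket_key)
uniq = [next(grp) for _, grp in groupby(events, key=bucket_key)]
print(len(uniq))  # 2: the 12:03 bucket and the 12:04 bucket
```

As the output shows, the 12:03:59 event is dropped as a duplicate of 12:03:45, but 12:04:01 survives in its own minute bucket, which is exactly the edge case noted above.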

To get an exact 60-second difference, you would need to write a UDF that iterates over the sorted bag grouped on (browser, click_type, referrer) and removes the unwanted rows.
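The logic such a UDF would need can be sketched in plain Python (the function name and 60-second default are assumptions for illustration, not an actual Pig UDF): walk the time-sorted timestamps of one group and keep an event only if it is more than the window length after the last event kept.

```python
from datetime import datetime, timedelta

def dedupe_within_window(timestamps, window_seconds=60):
    """Given timestamps sorted ascending, keep an event only if it is
    more than `window_seconds` after the last event that was kept."""
    kept = []
    last_kept = None
    for ts in timestamps:
        if last_kept is None or (ts - last_kept) > timedelta(seconds=window_seconds):
            kept.append(ts)
            last_kept = ts
    return kept

times = [datetime(2013, 7, 16, 12, 3, 45),
         datetime(2013, 7, 16, 12, 3, 59),
         datetime(2013, 7, 16, 12, 4, 1),
         datetime(2013, 7, 16, 12, 5, 30)]
print(dedupe_within_window(times))  # keeps 12:03:45 and 12:05:30
```

Unlike the minute-bucketing approach, this correctly drops 12:04:01 (only 16 seconds after the kept 12:03:45). A real Pig UDF would apply the same loop to the sorted bag inside each (browser, click_type, referrer) group.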

Answer 1 (score: 0)

You can do something like this with the required parameters:
    top3 = FOREACH grpd {
        sorted = FILTER records BY time < 60;
        top    = LIMIT sorted 2;
        GENERATE group, FLATTEN(top);
    };

Answer 2 (score: 0)

This would be another approach:

    records_group = GROUP records BY (browser, click_type, referrer);

    with_min = FOREACH records_group
               GENERATE FLATTEN(records), MAX(records.datetime) AS maxDt;

    filterRecords = FILTER with_min BY (maxDt - $2) < 60;

$2 is the position of the datetime field; change it accordingly.

Answer 3 (score: 0)

Aleks and Marq,

  records_group = GROUP records BY (browser, click_type, referrer);

  with_min = FOREACH records_group
             GENERATE FLATTEN(records), MAX(records.datetime) AS max;

  with_min = FOREACH with_min GENERATE browser, click_type, referrer,
             ABS(max - datetime) AS maxDtgroup;

  regroup = GROUP with_min BY (browser, click_type, referrer, maxDtgroup);

Re-grouping on maxDtgroup is the key; then filter out the top 1 record from each group.