Filtering out duplicates in Pig

Date: 2017-05-29 07:45:23

Tags: hadoop apache-pig

I have records like these:

12:-64:12033:24:0:0:1495532058:1384:0:0:0:102
23:-64:8820:24:0:0:1495532126:2788:0:0:0:102
23:-64:8826:24:0:0:1495532132:3064:0:0:0:102
23:-64:8826:24:0:0:1495532132:3065:0:0:0:102

I want to filter out the duplicate (identical) rows in Pig.

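Note: I do not want to remove or delete the duplicate rows. I need to pick out the duplicate rows and store them in a variable. Any help would be appreciated.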

out1 = GROUP out BY ($1,$7,$11);
records3 = FOREACH out1 {
    top_record1 = LIMIT out 1;
    final_rec = DISTINCT top_record1;
    GENERATE FLATTEN(final_rec);
};

1 answer:

Answer 0: (score: 0)

This should do it.

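-- Load colon-delimited records; fields are addressed by position ($0, $1, ...).
a = load 'file' using PigStorage(':');
-- Group on the fields that define a duplicate.
b = group a by ($1, $7, $11);
-- Keep the bag of original rows alongside its size so it can be flattened later.
c = foreach b generate a, COUNT(a) as cnt;
-- Keep only the groups that contain more than one row.
d = filter c by cnt > 1;
-- Unnest the surviving bags: e holds the duplicate rows.
e = foreach d generate flatten(a);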
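
To sanity-check the group, count, and filter steps without a Pig installation, the same logic can be mirrored in plain Python. This is only a sketch, not Pig code: the helper name `duplicate_rows` and the alternative key `(0, 2, 6)` are assumptions made for illustration, and the records are the four sample lines from the question.

```python
from collections import defaultdict


def duplicate_rows(lines, key_fields=(1, 7, 11), sep=":"):
    """Return every row whose key is shared by at least one other row."""
    groups = defaultdict(list)
    for line in lines:
        fields = line.split(sep)
        # Mirrors GROUP a BY ($1, $7, $11): build the key from positional fields.
        key = tuple(fields[i] for i in key_fields)
        groups[key].append(line)
    # Mirrors FILTER ... BY cnt > 1 followed by FLATTEN of the surviving bags.
    return [row for rows in groups.values() if len(rows) > 1 for row in rows]


# The four sample records from the question.
records = [
    "12:-64:12033:24:0:0:1495532058:1384:0:0:0:102",
    "23:-64:8820:24:0:0:1495532126:2788:0:0:0:102",
    "23:-64:8826:24:0:0:1495532132:3064:0:0:0:102",
    "23:-64:8826:24:0:0:1495532132:3065:0:0:0:102",
]

# Key used in the answer: ($1, $7, $11).
dups_answer_key = duplicate_rows(records)
# Hypothetical alternative key ($0, $2, $6), chosen only for illustration.
dups_other_key = duplicate_rows(records, key_fields=(0, 2, 6))

print(dups_answer_key)
print(dups_other_key)
```

On these four sample lines the key ($1, $7, $11) finds no duplicates, because field $7 differs in every row (1384, 2788, 3064, 3065). The last two lines only group together under a key that leaves out $7. Which rows count as duplicates therefore depends entirely on the fields chosen for the GROUP key.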