I have records like these:
12:-64:12033:24:0:0:1495532058:1384:0:0:0:102
23:-64:8820:24:0:0:1495532126:2788:0:0:0:102
23:-64:8826:24:0:0:1495532132:3064:0:0:0:102
23:-64:8826:24:0:0:1495532132:3065:0:0:0:102
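I want to filter out the duplicate or identical rows in Pig.
Note: I don't want to remove or delete the duplicate rows. I need to select the duplicate rows and store them in a variable. Any help would be appreciated.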
out1 = GROUP out BY ($1,$7,$11);
records3 = FOREACH out1 {
top_record1 = LIMIT out 1;
final_rec = DISTINCT top_record1;
GENERATE FLATTEN(final_rec);
};
Answer 0 (score: 0)
This should do it.
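-- Load the colon-delimited records.
a = load 'file' using PigStorage(':');
-- Group rows that share the same key fields.
b = group a by ($1, $7, $11);
-- Keep the bag of grouped rows next to its size so it can be flattened later.
c = foreach b generate a, COUNT(a) as cnt;
-- Keep only the keys that occur more than once.
d = filter c by cnt > 1;
-- Expand the bags back into the original duplicate rows.
e = foreach d generate flatten(a);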
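To sanity-check the group, count, and filter logic outside Pig, here is a minimal Python sketch run against the four sample rows from the question. The function name `duplicate_rows` and its parameters are hypothetical, chosen only for illustration; the default key positions mirror `($1, $7, $11)`.

```python
from collections import defaultdict

# Sample rows from the question, colon-delimited like the Pig input.
ROWS = [
    "12:-64:12033:24:0:0:1495532058:1384:0:0:0:102",
    "23:-64:8820:24:0:0:1495532126:2788:0:0:0:102",
    "23:-64:8826:24:0:0:1495532132:3064:0:0:0:102",
    "23:-64:8826:24:0:0:1495532132:3065:0:0:0:102",
]


def duplicate_rows(rows, key_fields=(1, 7, 11), sep=":"):
    """Return every row whose key fields occur in more than one row."""
    groups = defaultdict(list)
    for row in rows:
        fields = row.split(sep)
        # Build the grouping key from the chosen 0-based positions.
        key = tuple(fields[i] for i in key_fields)
        groups[key].append(row)
    # Keep only groups with more than one member, then flatten them.
    return [
        row
        for members in groups.values()
        if len(members) > 1
        for row in members
    ]


if __name__ == "__main__":
    print(duplicate_rows(ROWS))                     # key ($1, $7, $11)
    print(duplicate_rows(ROWS, key_fields=(0, 2)))  # key ($0, $2)
```

Note that in the sample data `$7` is different in every row, so grouping by `($1, $7, $11)` finds no duplicates there. Grouping by `($0, $2)` instead returns the last two rows. Pick the key fields that define "duplicate" for your data.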