How to remove duplicates on a column in Pig

Time: 2017-11-03 06:02:39

Tags: hadoop bigdata apache-pig

Can anyone help me remove the old records from my CSV file and keep only the most recent record for each key, using Pig?

Example input:

Key1 sta DATE

XXXXX P38 17-10-2017

XXXXX P38 12-10-2017

YYYYY P38 11-10-2017

YYYYY P38 23-09-2017

YYYYY P38 14-09-2017

ZZZZZ P38 25-10-2017

ZZZZZ P38 10-10-2017

My expected output is:

Key1 sta DATE

XXXXX P38 17-10-2017

YYYYY P38 11-10-2017

ZZZZZ P38 25-10-2017

The header should also be included in the output.

Please suggest how I can achieve this.

2 answers:

Answer 0 (score: 1)

A nested FOREACH can be used for this case:

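-- The schema was cut off in the original answer; these fields are assumed from the question's columns.
A = LOAD '....' AS (key1:chararray, sta:chararray, date:chararray);
-- Parse the dd-MM-yyyy text so that ORDER compares dates rather than strings.
-- Note: the stored date column is then a datetime, not the original text.
A1 = FOREACH A GENERATE key1, sta, ToDate(date, 'dd-MM-yyyy') AS date;
B =
    FOREACH (GROUP A1 BY key1) {
        orderd = ORDER A1 BY date DESC;  -- newest row first within each key
        ltsrow = LIMIT orderd 1;         -- keep only the newest row
        GENERATE FLATTEN(ltsrow);
    };
-- '-schema' makes PigStorage write the schema and header files next to the output.
STORE B into 'output' using PigStorage('\t', '-schema');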

To understand nested FOREACH, see: https://shrikantbang.wordpress.com/2014/01/14/apache-pig-group-by-nested-foreach-join-example/ https://community.mapr.com/thread/22034-apache-pig-nested-foreach-explaination

For saving the output together with its schema, see: https://hadoopified.wordpress.com/2012/04/22/pigstorage-options-schema-and-source-tagging/
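For readers who want to check the logic outside Pig, here is a rough Python sketch of what the group / order / limit pattern computes on the sample rows from the question. It is only an illustration, not Pig code; `latest_per_key` and `parse_date` are hypothetical helper names. It also shows why the date has to be compared as a date: ordering the raw dd-MM-yyyy text picks the wrong row for YYYYY.

```python
from datetime import datetime
from itertools import groupby

# Sample rows copied from the question: (key1, sta, date as dd-MM-yyyy text)
ROWS = [
    ("XXXXX", "P38", "17-10-2017"),
    ("XXXXX", "P38", "12-10-2017"),
    ("YYYYY", "P38", "11-10-2017"),
    ("YYYYY", "P38", "23-09-2017"),
    ("YYYYY", "P38", "14-09-2017"),
    ("ZZZZZ", "P38", "25-10-2017"),
    ("ZZZZZ", "P38", "10-10-2017"),
]


def parse_date(text):
    """Parse a dd-MM-yyyy string into a datetime (hypothetical helper)."""
    return datetime.strptime(text, "%d-%m-%Y")


def latest_per_key(rows, sort_key):
    """Group rows by key1, order each group descending by sort_key, keep one row."""
    result = []
    ordered = sorted(rows, key=lambda r: r[0])  # groupby needs input sorted by key
    for _, group in groupby(ordered, key=lambda r: r[0]):
        # Equivalent of ORDER ... DESC followed by LIMIT 1 inside the nested FOREACH.
        top = sorted(group, key=sort_key, reverse=True)[0]
        result.append(top)
    return result


by_date = latest_per_key(ROWS, lambda r: parse_date(r[2]))  # compare real dates
by_text = latest_per_key(ROWS, lambda r: r[2])              # compare raw strings

for row in by_date:
    print(" ".join(row))
```

With parsed dates the result matches the expected output in the question. With raw strings, the YYYYY group returns 23-09-2017, because "23" sorts after "11" as text even though September is earlier than October.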

答案 1 :(得分:1)

The following should work for you.

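a = load 'pig.txt' USING PigStorage(' ') AS (name:chararray,code:chararray,x1:chararray);
-- 'MM' is the month; lowercase 'mm' would be read as minutes and break the ordering.
b = FOREACH a GENERATE name,code,ToDate(x1,'dd-MM-yyyy') AS x1;
grpd = GROUP b BY name;
firstrecords = FOREACH grpd {
        sorted = order b by x1 desc;      -- newest date first within each name
        toprecord    = limit sorted 1;    -- keep only the newest row
        -- toprecord already carries name, so the group key is not repeated here
        generate FLATTEN(toprecord);
};
dump firstrecords;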
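One detail worth checking in the ToDate pattern: in Java-style date patterns the month is uppercase MM, while lowercase mm means minutes. The Python sketch below is only an analogy (Python uses %m for month and %M for minute, and `parse` is a hypothetical helper), but it shows the same failure mode on the YYYYY rows from the question: when the middle field is read as minutes, every row falls into the default month and the "latest" row is chosen by day alone.

```python
from datetime import datetime

# The three YYYYY rows from the question, as dd-MM-yyyy text.
dates = ["11-10-2017", "23-09-2017", "14-09-2017"]


def parse(text, pattern):
    """Parse text with the given strptime pattern (hypothetical helper)."""
    return datetime.strptime(text, pattern)


# Correct: the middle field is the month (%m here, MM in a Java-style pattern).
latest_ok = max(dates, key=lambda d: parse(d, "%d-%m-%Y"))

# Mistake: the middle field is read as minutes (%M here, mm in a Java-style
# pattern), so the month is never set and the day decides the ordering.
latest_bad = max(dates, key=lambda d: parse(d, "%d-%M-%Y"))

print(latest_ok)   # 11-10-2017
print(latest_bad)  # 23-09-2017
```

The second result is the row the question explicitly does not want, which is why the pattern should use MM.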