Apache Pig: GROUP operator and variable schema

Time: 2014-11-13 23:48:00

Tags: apache-pig

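I have the following problem. Here is a snippet of my sample.txt. The first column is the id, but each row can have a different number of columns.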

id1 100 200 300 400 500

id2 10 20 30

id1 800 900 600

id3 10 20 30 40 50 60 70 80 90 100

id1 1 2 3 4 5 6 7 8 9

id2 40 50 60 70 80 90

id3 200

sample = LOAD 'sample.txt' [how should I specify the schema here?]

sample_grpd = GROUP sample BY $0;

sample_result = FOREACH sample_grpd GENERATE group, FLATTEN(TOBAG([what should go here?]))

After grouping by id, the result should be:

id1 100 200 300 400 500 800 900 600 1 2 3 4 5 6 7 8 9

id2 10 20 30 40 50 60 70 80 90

id3 10 20 30 40 50 60 70 80 90 100 200

Any help with this would be greatly appreciated!

1 answer:

Answer 0: (score: 0)

This is a tricky problem; in the end I solved it with a UDF.

input.txt:

id1 100 200 300 400 500
id2 10 20 30
id1 800 900 600
id3 10 20 30 40 50 60 70 80 90 100
id1 1 2 3 4 5 6 7 8 9
id2 40 50 60 70 80 90
id3 200

PigScript:

REGISTER removeduplicate.jar;
A = LOAD 'input.txt' USING PigStorage(' ');
B = GROUP A by $0;
C = FOREACH B GENERATE $0 AS myid:chararray,$1 AS (B:{T:(f1:chararray)});
D = FOREACH C GENERATE myid,BagToString(B) AS concatString;
E = FOREACH D GENERATE myid,mypackage.REMOVEDUPLICATE(concatString) AS finalString;
F = FOREACH E GENERATE myid,FLATTEN(STRSPLIT(finalString,'_',40)) AS result;
STORE F INTO 'output' USING PigStorage(' ');

Output:

id1 100 200 300 400 500 800 900 600 1 2 3 4 5 6 7 8 9
id2 10 20 30 40 50 60 70 80 90
id3 10 20 30 40 50 60 70 80 90 100 200
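The script's stages can be traced by hand for a single group. The plain-Java sketch below does this for id2 without any Pig dependency. It assumes that BagToString uses its default "_" delimiter and concatenates every field of every tuple in the bag, and that the tuples arrive in input order (Pig does not guarantee the order of tuples inside a bag, so the real output order may differ). The class name PipelineTrace and its helper methods are illustrative only; they are not part of Pig or of the answer's jar.

```java
import java.util.Arrays;
import java.util.List;

public class PipelineTrace {
    // Stand-in for BagToString(B): all fields of all tuples joined with "_".
    static String bagToString(List<String[]> bag) {
        StringBuilder sb = new StringBuilder();
        for (String[] tuple : bag) {
            for (String field : tuple) {
                if (sb.length() > 0) {
                    sb.append('_');
                }
                sb.append(field);
            }
        }
        return sb.toString();
    }

    // Stand-in for the REMOVEDUPLICATE UDF: drop every "<id>_" occurrence.
    static String removeDuplicate(String joined) {
        String prefix = joined.split("_")[0] + "_";
        return joined.replace(prefix, "");
    }

    public static void main(String[] args) {
        // The two id2 rows from input.txt, split on spaces like PigStorage(' ').
        List<String[]> bag = Arrays.asList(
                "id2 10 20 30".split(" "),
                "id2 40 50 60 70 80 90".split(" "));

        String joined = bagToString(bag);         // relation D
        String cleaned = removeDuplicate(joined); // relation E
        String[] fields = cleaned.split("_", 40); // relation F, like STRSPLIT(..., '_', 40)

        System.out.println(joined);  // id2_10_20_30_id2_40_50_60_70_80_90
        System.out.println(cleaned); // 10_20_30_40_50_60_70_80_90
        System.out.println("id2 " + String.join(" ", fields));
    }
}
```

Note that the limit of 40 passed to STRSPLIT caps the number of resulting fields, so a group with more than 40 values would keep its remaining values joined in the last field.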

UDF code: the Java class below is compiled and packaged as removeduplicate.jar.
REMOVEDUPLICATE.java

    package mypackage;

    import java.io.IOException;
    import org.apache.pig.EvalFunc;
    import org.apache.pig.data.Tuple;

    public class REMOVEDUPLICATE extends EvalFunc<String> {
        @Override
        public String exec(Tuple arg0) throws IOException {
            // Return null for empty or null input instead of failing the job.
            if (arg0 == null || arg0.size() == 0 || arg0.get(0) == null) {
                return null;
            }
            try {
                String input = (String) arg0.get(0);
                // The first "_"-separated token is the group id, e.g. "id1_".
                String duplicateString = input.split("_")[0] + "_";
                // Remove every occurrence of the id token from the joined string.
                return input.replace(duplicateString, "");
            } catch (Exception e) {
                throw new IOException("Caught exception while processing the input row", e);
            }
        }
    }
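One caveat with this UDF: String.replace works on substrings, so it also removes the id when it appears at the end of a longer value. The sketch below contrasts that with a token-based variant that only drops whole tokens equal to the id. It is a possible alternative, not the answer's method; the class name TokenStrip, the method names, and the id "1" are hypothetical, chosen only to make the difference visible. The token-based version still drops any value that is exactly equal to the id.

```java
import java.util.ArrayList;
import java.util.List;

public class TokenStrip {
    // Substring replace, same logic as the REMOVEDUPLICATE UDF.
    static String stripBySubstring(String joined) {
        String prefix = joined.split("_")[0] + "_";
        return joined.replace(prefix, "");
    }

    // Token-based alternative: drop only whole tokens equal to the id.
    static String stripByToken(String joined) {
        String[] tokens = joined.split("_");
        String id = tokens[0];
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if (!token.equals(id)) {
                kept.add(token);
            }
        }
        return String.join("_", kept);
    }

    public static void main(String[] args) {
        // Hypothetical group with id "1" and values 11, 21, 31 from two rows.
        String joined = "1_11_21_1_31";
        System.out.println(stripBySubstring(joined)); // 1231 (values corrupted)
        System.out.println(stripByToken(joined));     // 11_21_31
    }
}
```

With the sample data in the question the two versions behave the same, because no value ends with an id such as "id1".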