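I have the following problem. Below is a snippet of my sample.txt. The first column is the id, but each row can have a different number of columns.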
id1 100 200 300 400 500
id2 10 20 30
id1 800 900 600
id3 10 20 30 40 50 60 70 80 90 100
id1 1 2 3 4 5 6 7 8 9
id2 40 50 60 70 80 90
id3 200
sample = LOAD 'sample.txt' [how should I specify the schema here?]
sample_grpd = GROUP sample BY $0;
sample_result = FOREACH sample_grpd GENERATE group, FLATTEN(TOBAG([what should go here?]));
I want to group by id so that the result is:
id1 100 200 300 400 500 800 900 600 1 2 3 4 5 6 7 8 9
id2 10 20 30 40 50 60 70 80 90
id3 10 20 30 40 50 60 70 80 90 100 200
Any help with this would be greatly appreciated!
Answer 0 (score: 0)
This is a tricky problem. In the end I solved it with a UDF.
input.txt
id1 100 200 300 400 500
id2 10 20 30
id1 800 900 600
id3 10 20 30 40 50 60 70 80 90 100
id1 1 2 3 4 5 6 7 8 9
id2 40 50 60 70 80 90
id3 200
PigScript:
REGISTER removeduplicate.jar;
A = LOAD 'input.txt' USING PigStorage(' ');
B = GROUP A by $0;
C = FOREACH B GENERATE $0 AS myid:chararray,$1 AS (B:{T:(f1:chararray)});
D = FOREACH C GENERATE myid,BagToString(B) AS concatString;
E = FOREACH D GENERATE myid,mypackage.REMOVEDUPLICATE(concatString) AS finalString;
F = FOREACH E GENERATE myid,FLATTEN(STRSPLIT(finalString,'_',40)) AS result;
STORE F INTO 'output' USING PigStorage(' ');
Output:
id1 100 200 300 400 500 800 900 600 1 2 3 4 5 6 7 8 9
id2 10 20 30 40 50 60 70 80 90
id3 10 20 30 40 50 60 70 80 90 100 200
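For reference, the transformation the script is meant to perform can be sketched outside Pig. The following plain-Java illustration is not part of the original answer, and the class and method names (GroupById, groupById) are made up for this example. It groups the rows by their first column and appends the remaining columns in input order, which is the result shown above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, for illustration only (not a Pig API).
public class GroupById {

    // Groups space-separated rows by their first column, keeping input order.
    public static Map<String, List<String>> groupById(List<String> lines) {
        Map<String, List<String>> grouped = new LinkedHashMap<>();
        for (String line : lines) {
            String[] parts = line.trim().split("\\s+");
            if (parts[0].isEmpty()) {
                continue; // skip blank lines
            }
            List<String> values =
                    grouped.computeIfAbsent(parts[0], k -> new ArrayList<>());
            // Everything after the id belongs to that id's value list.
            for (int i = 1; i < parts.length; i++) {
                values.add(parts[i]);
            }
        }
        return grouped;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "id1 100 200 300 400 500",
                "id2 10 20 30",
                "id1 800 900 600",
                "id3 10 20 30 40 50 60 70 80 90 100",
                "id1 1 2 3 4 5 6 7 8 9",
                "id2 40 50 60 70 80 90",
                "id3 200");
        groupById(lines).forEach((id, values) ->
                System.out.println(id + " " + String.join(" ", values)));
    }
}
```

Unlike this single-process sketch, Pig does not necessarily preserve the order of rows inside a grouped bag, so the order of values in the real output may differ.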
UDF code: the following Java class is compiled and packaged as removeduplicate.jar
REMOVEDUPLICATE.java
package mypackage;
import java.io.IOException;
import org.apache.commons.lang.StringUtils;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;
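public class REMOVEDUPLICATE extends EvalFunc<String> {
    @Override
    public String exec(Tuple arg0) throws IOException {
        // Nothing to do for an empty tuple or a null field.
        if (arg0 == null || arg0.size() == 0 || arg0.get(0) == null) {
            return null;
        }
        try {
            // Input is the BagToString output, e.g. "id1_100_200_id1_800".
            String input = (String) arg0.get(0);
            // The first token is the group id, which repeats once per original row.
            String duplicateString = input.split("_")[0] + "_";
            // Remove every "<id>_" occurrence so that only the values remain.
            // This is a plain substring match, so a value ending in the id text is affected too.
            return input.replace(duplicateString, "");
        } catch (Exception e) {
            throw new IOException("Caught exception while processing the input row ", e);
        }
    }
}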
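One caveat with the UDF above: it removes the id by substring replacement, so a value that happens to end with the id text gets clipped as well. A token-based variant avoids that. The sketch below is plain Java with no Pig dependency so it can be run on its own; the class and method names (StripGroupId, stripId) are hypothetical, and it assumes no value is exactly equal to the id, since such a value would be dropped.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-alone variant of the UDF's string logic (not a Pig API).
public class StripGroupId {

    // Takes "id_v1_v2_id_v3" and returns "v1_v2_v3".
    public static String stripId(String concatenated) {
        if (concatenated == null || concatenated.isEmpty()) {
            return concatenated;
        }
        String[] tokens = concatenated.split("_");
        String id = tokens[0]; // the first token is the group id
        List<String> values = new ArrayList<>();
        for (int i = 1; i < tokens.length; i++) {
            // Compare whole tokens, so "aid1" is kept even when the id is "id1".
            if (!tokens[i].equals(id)) {
                values.add(tokens[i]);
            }
        }
        return String.join("_", values);
    }

    public static void main(String[] args) {
        System.out.println(stripId("id1_100_200_id1_800_900"));
        System.out.println(stripId("id1_aid1_5"));
    }
}
```

The same loop could be placed inside exec() in place of the replace() call if values containing the id text are possible in your data.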