DISTINCT on a combination of multiple columns in Pig

Date: 2014-04-16 09:41:29

Tags: apache-pig

I want to perform a DISTINCT operation on a subset of columns.

A = LOAD 'data' AS(a1,a2,a3,a4,a5,a6);

DUMP A;

(1, 2, 3, 4,5,5_1)

(1, 2, 3, 4,5,5_1)

(1, 2, 3, 4,6,6_1)

(1 ,2, 4, 4,7,7_1)

(1, 2, 4, 4,8,8_1) 

-- insert DISTINCT operation on a1,a2,a3,a4 here:

-- ...
DUMP A_unique;

(1, 2, 3, 4,5,5_1)

(1, 2, 4, 4,7,7_1)

I have already looked at this question:

  

How to perform a DISTINCT in Pig Latin on a subset of columns?

and tried the following two approaches:

Method 1

DATA = LOAD '/usr/local/Input.txt' AS (a1,a2,a3,a4,a5,a6);
DATA2 = FOREACH DATA GENERATE TOTUPLE(a1,a2,a3,a4) AS combined, a5 AS a5, a6 AS a6;
grouped_by_a5_a6 = GROUP DATA2 BY combined;

grouped_and_distinct = FOREACH grouped_by_a5_a6 {
    combined_unique = LIMIT DATA2 1;
    GENERATE FLATTEN(combined_unique);
};

Method 2

DATA = LOAD '/usr/local/Input.txt' AS (a1,a2,a3,a4,a5,a6) ;        
A2 = FOREACH DATA GENERATE TOTUPLE(a1,a2,a3,a4) AS combined, a5 as a5,a6 as a6 ;

grouped_by_a5_a6 = GROUP A2 BY (a5,a6);

grouped_and_distinct = FOREACH grouped_by_a5_a6 {

        combined_unique = DISTINCT A2.combined;

        GENERATE FLATTEN(combined_unique);
};

But the result I get is:

(1, 2, 3, 4,5,5_1)
(1, 2, 3, 4,6,6_1)
(1, 2, 4, 4,7,7_1)
(1, 2, 4, 4,8,8_1) 

instead of:

(1, 2, 3, 4,5,5_1)
(1, 2, 4, 4,7,7_1)

What is wrong with the code above?

1 answer:

Answer 0: (score: 0)

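What you are expecting is not a DISTINCT result on those fields. DISTINCT removes only rows that are identical in every column, and your rows differ in a5 and a6. To get the desired output, you have to apply a filter that keeps one row per (a1,a2,a3,a4) combination.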
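
One possible way to keep a single row per key combination is sketched below. It is a minimal sketch, not necessarily what the answer above had in mind: it uses GROUP with a nested ORDER and LIMIT rather than a FILTER statement. It assumes the same input path and schema as in the question; the aliases `grouped`, `ordered`, and `first_row` are made up for illustration, and ordering by a5 is an assumption about which row should survive.

```pig
-- Assumption: same file and untyped schema as in the question.
DATA = LOAD '/usr/local/Input.txt' AS (a1, a2, a3, a4, a5, a6);

-- Group on the four key columns only, so rows that differ
-- only in a5/a6 end up in the same bag.
grouped = GROUP DATA BY (a1, a2, a3, a4);

-- Keep exactly one row from each bag.
-- Without the ORDER, LIMIT 1 would return an arbitrary row per group;
-- ordering by a5 (an assumed tie-breaker) makes the choice repeatable.
A_unique = FOREACH grouped {
    ordered   = ORDER DATA BY a5;
    first_row = LIMIT ordered 1;
    GENERATE FLATTEN(first_row);
};

DUMP A_unique;
```

Grouping by (a5,a6), as in Method 2, puts every distinct (a5,a6) pair in its own group, so the nested DISTINCT never sees more than one combination at a time. Grouping by the key columns themselves avoids that.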