Pig - FilterFunc does not receive the whole tuple

Time: 2014-03-20 16:47:07

Tags: hadoop user-defined-functions bigdata apache-pig

I have a problem with Pig's FilterFuncs.

But first, let me give you the context.

A = LOAD 'pig/hado/start_extrait2.csv' USING PigStorage(';') as (DAT_START:chararray, COD_IPUSER:chararray, NDI_START:chararray);

hado_search_file = LOAD 'pig/hado/recherche_hado.csv' USING PigStorage(';') as (DATE_HADO:chararray, IP_RECHERCHEE:chararray);

result2 = JOIN hado_search_file by IP_RECHERCHEE LEFT OUTER, A by COD_IPUSER;

Let's try to visualize the "result2" variable:

describe result2;

{hado_search_file::DATE_HADO: chararray,hado_search_file::IP_RECHERCHEE: chararray,A::COD_IPUSER: chararray,A::DAT_START: chararray,A::NDI_START: chararray}

dump result2;

(2014/03/10 00:00:00,192.168.2.67,,,)
(2014/03/10 00:00:00,79.92.147.88,79.92.147.88,2014/03/10 00:00:00,0385578168)
(2014/03/10 00:00:00,79.92.147.88,79.92.147.88,2014/03/10 00:00:00,0385578168)
(2014/03/10 00:00:01,79.92.147.88,79.92.147.88,2014/03/10 00:00:00,0385578168)
(2014/03/10 00:00:01,79.92.147.88,79.92.147.88,2014/03/10 00:00:00,0385578168)

Then I try to use a FilterFunc:

flt = FILTER result2 BY dateInferiorOrNull();

The beginning of the code is:

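import java.io.IOException;

import org.apache.pig.FilterFunc;
import org.apache.pig.data.Tuple;

public class dateInferiorOrNull extends FilterFunc {

    @Override
    public Boolean exec(Tuple input) throws IOException {

        // Print the tuple the UDF actually receives
        System.out.println(input);

        ...

    }
}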

I expected the printed output to be the same as the "dump result2" I ran earlier, but instead I get something like this:

(2014/03/10 00:00:00,79.92.147.88)

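Only the first two fields are taken!

When I try to display the tuple size, the program says the tuple has size 2!

So it seems that the filter function does not receive the whole tuple as input.

Why is that?

Thanks in advance for your help.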

1 answer:

Answer 0: (score: -1)

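From the input tuple (the one you pass to the UDF as its argument), take the columns you need in the result, add them to a DataBag, and return that DataBag from the UDF. The UDF's output is then a bag, which you flatten in your Pig script.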
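
A different approach that may be worth trying, sketched under assumptions: a Pig UDF receives the fields named in its call packed into the input tuple, so calling the filter as dateInferiorOrNull(hado_search_file::DATE_HADO, A::DAT_START) would make exec() see exactly those two fields as input.get(0) and input.get(1), instead of relying on what a no-argument call receives. The comparison rule itself is not shown in the question, so the class below is only a guess based on the function name; DateInferiorOrNullSketch and keep are hypothetical names, and the code deliberately uses only the Java standard library so it can be tried without Pig on the classpath.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

public class DateInferiorOrNullSketch {

    // Timestamp layout as it appears in the dump output, e.g. "2014/03/10 00:00:00".
    private static final DateTimeFormatter FMT =
            DateTimeFormatter.ofPattern("yyyy/MM/dd HH:mm:ss");

    // ASSUMED rule (not taken from the question): keep the row when the
    // joined side is missing, or when datStart is not later than dateHado.
    public static boolean keep(String dateHado, String datStart) {
        if (datStart == null || datStart.isEmpty()) {
            // Unmatched row produced by the LEFT OUTER join.
            return true;
        }
        LocalDateTime hado = LocalDateTime.parse(dateHado, FMT);
        LocalDateTime start = LocalDateTime.parse(datStart, FMT);
        return !start.isAfter(hado);
    }

    public static void main(String[] args) {
        // Inside a real FilterFunc, these two values would come from
        // (String) input.get(0) and (String) input.get(1).
        System.out.println(keep("2014/03/10 00:00:00", null));
        System.out.println(keep("2014/03/10 00:00:01", "2014/03/10 00:00:00"));
    }
}
```

Keeping the rule in a plain static method like this also makes it easy to unit-test separately from the Pig script; the FilterFunc's exec() would then only unpack the two fields and delegate.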