How to check in PIG

Time: 2015-08-03 07:04:57

Tags: apache-pig

I have the following dataset, and I need to perform some steps based on each car's company name.

            (23,Nissan,12.43)
            (23,Nissan Car,16.43)
            (23,Honda Car,13.23)
            (23,Toyota Car,17.0)
            (24,Honda,45.0)
            (24,Toyota,12.43)
            (24,Nissan Car,12.43)


          A = LOAD 'data.txt' AS (code:int, name:chararray, rating:double);
          G = GROUP A by (code, REGEX_EXTRACT(name,'(?i)(^.+?\\b)\\s*(Car)*$',1));
            DUMP G;

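I am grouping the cars by code and base company name, so that, for example, all 'Nissan' and 'Nissan Car' records fall into one group, and likewise for the other companies.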

    /* Grouped data based on code and company's first name*/ 
            ((23,Nissan),{(23,Nissan,12.43),(23,Nissan Car,16.43)})
            ((23,Honda),{(23,Honda Car,13.23)})
            ((23,Toyota),{(23,Toyota Car,17.0)})
            ((24,Nissan),{(24,Nissan Car,12.43)})
            ((24,Honda),{(24,Honda,45.0)})
            ((24,Toyota),{(24,Toyota,12.43)})

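Now I want to filter each group depending on whether it contains a tuple whose name equals the group's company name. If it does, keep only that tuple and ignore the others in the group; if no such tuple exists, keep all of the group's tuples.

The output should be: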

            ((23,Nissan),{(23,Nissan,12.43)})  // Since this group contains a row with group's name i.e. Nissan
            ((23,Honda),{(23,Honda Car,13.23)})
            ((23,Toyota),{(23,Toyota Car,17.0)})
            ((24,Nissan),{(24,Nissan Car,12.43)})
            ((24,Honda),{(24,Honda,45.0)})
            ((24,Toyota),{(24,Toyota,12.43)})

            R = FOREACH G { OW = FILTER A BY name==group.$1; IF COUNT(OW) > 0}

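Can someone help me with what to do next? After filtering by the group name, how do I find the count of the filtered tuples and get the required data?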

1 Answer:

Answer 0: (score: 1)

OK. Let's take the records below as your input.

23,Nissan,12.43
23,Nissan Car,16.43
23,Honda Car,13.23
23,Toyota Car,17.0
24,Honda,45.0
24,Toyota,12.43
25,Toyato Car,23.8
25,Toyato Car,17.2
24,Nissan Car,12.43 

For the input above, assume the following intermediate output, where each tuple is (code, brand_name, name, rating):

((23,Honda),{(23,Honda,Honda Car,13.23)})
((23,Nissan),{(23,Nissan,Nissan,12.43),(23,Nissan,Nissan Car,16.43)})
((23,Toyota),{(23,Toyota,Toyota Car,17.0)})
((24,Honda),{(24,Honda,Honda,45.0)})
((24,Nissan),{(24,Nissan,Nissan Car,12.43)})
((24,Toyota),{(24,Toyota,Toyota,12.43)})
((25,Toyato),{(25,Toyato,Toyato Car,23.8),(25,Toyato,Toyato Car,17.2)})

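From that intermediate output, your requirement translates into the following result. Each row is (code, brand_name, count): the count is the number of tuples whose name equals the brand name if there are any, and otherwise the number of remaining tuples in the group.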

(23,Honda,1)
(23,Nissan,1)
(23,Toyota,1)
(24,Honda,1)
(24,Nissan,1)
(24,Toyota,1)
(25,Toyato,2)

Here is the code:

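-- Load the comma-separated input.
nissan_load = LOAD '/user/cloudera/inputfiles/nissan.txt' USING PigStorage(',') AS (code:int, name:chararray, rating:double);

-- Derive the brand name (the name without a trailing "Car") and keep the original name next to it.
nissan_each = FOREACH nissan_load GENERATE code, TRIM(REGEX_EXTRACT(name,'(?i)(^.+?\\b)\\s*(Car)*$',1)) AS brand_name, name, rating;

nissan_grp = GROUP nissan_each BY (code, brand_name);


nissan_final_each = FOREACH nissan_grp {
             -- 1 for each tuple whose name equals the brand name, 0 otherwise.
             A = FOREACH nissan_each GENERATE (brand_name == TRIM(name) ? 1 : 0) AS cnt;
             B = (int)SUM(A);

             -- 1 for each tuple whose name differs from the brand name, 0 otherwise.
             C = FOREACH nissan_each GENERATE (brand_name != TRIM(name) ? 1 : 0) AS extra_cnt;
             D = SUM(C);

             -- Use the exact-match count when there is one; otherwise count the remaining tuples.
             GENERATE FLATTEN(group) AS (code, brand_name), (SUM(A.cnt) != 0 ? B : D) AS final_cnt;
 };


DUMP nissan_final_each;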

Try the code with different inputs.
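If the tuples themselves are needed rather than only the counts, one possible approach is to count the exact-name matches per group, attach that count to every row, and then filter on it. The following is an untested sketch under assumptions: the relation names (`cars`, `cars_brand`, `by_brand`, `marked`, `kept`) and the `exact_cnt` field are illustrative and not part of the original answer, and the input path and regular expression are simply reused from it.

```pig
-- Hypothetical relation names; path and regex are reused from the answer above.
cars = LOAD '/user/cloudera/inputfiles/nissan.txt' USING PigStorage(',')
       AS (code:int, name:chararray, rating:double);

-- Keep the derived brand name next to the original name.
cars_brand = FOREACH cars GENERATE
    code,
    TRIM(REGEX_EXTRACT(name, '(?i)(^.+?\\b)\\s*(Car)*$', 1)) AS brand_name,
    name,
    rating;

by_brand = GROUP cars_brand BY (code, brand_name);

-- For every row, record how many rows in its group carry exactly the brand name.
marked = FOREACH by_brand {
    exact = FILTER cars_brand BY TRIM(name) == brand_name;
    GENERATE FLATTEN(cars_brand), COUNT(exact) AS exact_cnt;
};

-- Keep the exact-name rows when the group has any; otherwise keep every row of the group.
kept = FILTER marked BY (exact_cnt == 0) OR (TRIM(name) == brand_name);

DUMP kept;
```

This yields flat rows rather than bags. If the grouped shape from the question is wanted, `kept` could be grouped again by `(code, brand_name)`.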