How to use DESCRIBE and DUMP inside a nested FOREACH

Asked: 2014-08-08 16:31:14

Tags: hadoop foreach apache-pig

I'm new to Pig, and sometimes I need to inspect the schema of a relation inside a nested FOREACH. For example:

A = LOAD 'data' AS (url:chararray,outlink:chararray);

DUMP A;
(www.ccc.com,www.hjk.com)
(www.ddd.com,www.xyz.org)
(www.aaa.com,www.cvn.org)
(www.www.com,www.kpt.net)
(www.www.com,www.xyz.org)
(www.ddd.com,www.xyz.org)

B = GROUP A BY url;

DUMP B;
(www.aaa.com,{(www.aaa.com,www.cvn.org)})
(www.ccc.com,{(www.ccc.com,www.hjk.com)})
(www.ddd.com,{(www.ddd.com,www.xyz.org),(www.ddd.com,www.xyz.org)})
(www.www.com,{(www.www.com,www.kpt.net),(www.www.com,www.xyz.org)})

X = FOREACH B {
        FA = FILTER A BY outlink == 'www.xyz.org';
        PA = FA.outlink;
        DA = DISTINCT PA;
        GENERATE group, COUNT(DA);
}

DUMP X;
(www.aaa.com,0)
(www.ccc.com,0)
(www.ddd.com,1)
(www.www.com,1)

I'd like to know what the structures of FA, PA, and DA are. I tried using DESCRIBE inside the FOREACH block, but it fails with an error:

ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: <line 13, column 13>  Syntax error, unexpected symbol at or near 'FA'

Is there a way to get the schema and structure of a relation inside a nested FOREACH, just for learning purposes?

1 answer:

Answer 0: (score: 2)

Run the script several times, projecting FA, PA, and DA in turn in the GENERATE statement. Sample code that projects FA:

X = FOREACH B {
    FA = FILTER A BY outlink == 'www.xyz.org';
    --PA = FA.outlink;
    --DA = DISTINCT PA;
    GENERATE group, FA;
}

DUMP X;
DESCRIBE X;
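
The same approach should carry over to the other two nested aliases. As a rough sketch, reusing the relations A and B from the question (the alias names XP and XD are made up here for illustration), each run promotes one nested alias to a top-level field so that DESCRIBE and DUMP can see it:

```pig
-- Relations from the question.
A = LOAD 'data' AS (url:chararray, outlink:chararray);
B = GROUP A BY url;

-- Run for PA: project the outlink column of the filtered bag.
-- XP is a hypothetical alias name, not part of the original script.
XP = FOREACH B {
    FA = FILTER A BY outlink == 'www.xyz.org';
    PA = FA.outlink;
    GENERATE group, PA;
}
DESCRIBE XP;
DUMP XP;

-- Run for DA: project the de-duplicated outlinks.
-- XD is likewise a hypothetical alias name.
XD = FOREACH B {
    FA = FILTER A BY outlink == 'www.xyz.org';
    PA = FA.outlink;
    DA = DISTINCT PA;
    GENERATE group, DA;
}
DESCRIBE XD;
DUMP XD;
```

In each case the second field of the result should be a bag, so DESCRIBE reports the bag's inner tuple schema and DUMP prints its contents for every group.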