Apache Pig: applying LIMIT with a reference to a top-level field inside a foreach

Date: 2013-01-21 16:35:32

Tags: hadoop apache-pig

I'm having trouble referencing a "parent" field inside a foreach.

grunt> describe METRICS_SOURCE_WITH_CNT
METRICS_SOURCE_WITH_CNT: 
{group: (hostname: chararray,site_guid: chararray,timestamp: long),
JOIN_FIELDS_ONLY: {(timestamp: long, unique_pageviews: long)},cnt: long}

Note that cnt is the total number of tuples.

METRICS_SOURCE_TOP3 = foreach METRICS_SOURCE_WITH_CNT {

    SORTED = ORDER JOIN_FIELDS_ONLY by unique_pageviews DESC;
    TOPK = LIMIT SORTED 10;

    REVSORTED = ORDER JOIN_FIELDS_ONLY by unique_pageviews ASC;
    BOTTOMK = LIMIT REVSORTED cnt;

    generate TOPK, BOTTOMK;
}

However, when I apply the second LIMIT, Pig seems to think the cnt field lives inside REVSORTED, when it is actually a "parent" field.

Invalid field projection. Projected field [cnt] does not exist in schema: timestamp:long,....

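I tried referencing the field by position ($x), but that did not work. Pig always assumes the referenced field is within the scope of the relation being LIMITed.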

1 Answer:

Answer 0 (score: 1)

You need to use the Pig dereference operator, which lets you refer to the parent with a dot (.). Using your example:

METRICS_SOURCE_TOP3 = foreach METRICS_SOURCE_WITH_CNT {

    SORTED = ORDER JOIN_FIELDS_ONLY by unique_pageviews DESC;
    TOPK = LIMIT SORTED 10;

    REVSORTED = ORDER JOIN_FIELDS_ONLY by unique_pageviews ASC;
    BOTTOMK = LIMIT REVSORTED METRICS_SOURCE_WITH_CNT.cnt;

    generate TOPK, BOTTOMK;
}

It is also worth noting that before Pig 0.10, scalars were not supported in the LIMIT statement, so this kind of statement would fail.
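
For reference, here is a minimal sketch of a scalar used in a top-level LIMIT, which is the feature Pig 0.10 added. The file name 'pageviews.txt' and all aliases are hypothetical and only there for illustration. As far as I know, a relation referenced as a scalar this way has to produce exactly one row at run time, which is why the count is computed with GROUP ... ALL first.

```pig
-- Hypothetical input file and aliases, for illustration only.
views   = LOAD 'pageviews.txt' AS (hostname:chararray, unique_pageviews:long);

-- Collapse the whole relation into a single group so the count is one row.
grouped = GROUP views ALL;
total   = FOREACH grouped GENERATE COUNT(views) AS cnt;

sorted  = ORDER views BY unique_pageviews ASC;

-- total.cnt is read as a scalar (Pig 0.10+); this keeps roughly the lowest tenth of the rows.
bottom  = LIMIT sorted total.cnt / 10;

DUMP bottom;
```

If the same pattern is tried inside a nested foreach, it may be worth checking whether the relation being dereferenced really has a single row, since a scalar drawn from a multi-row relation is expected to fail at run time.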