How to get the intersection of a bag and a tuple in Pig?

Time: 2014-07-11 08:01:36

Tags: apache-pig

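I have one bag like this (url:chararray, mal:float) and another like this (url:chararray, links:chararray). I want to parse the links field and intersect the first bag with the parsed links: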

src = LOAD 'hbase://$collection' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('info:url anchors:links', '-loadKey true') AS (id:bytearray, url:chararray, links:chararray);
mals = LOAD '/tmp/prepare' as (url:chararray, mal:float);

urls = FILTER src BY (links IS NOT null);

urls2 = FOREACH urls GENERATE TOKENIZE(links, '\t') as links, id, url;
processed = FOREACH urls2 {
    grouped = COGROUP links BY $0, mals BY url;
    intersected = FILTER grouped BY NOT IsEmpty(urls) AND NOT IsEmpty(links4);
    weights = FOREACH intersected GENERATE mal;
    GENERATE id, AVG(weights) as mal;
};

This code does not work. The parser fails with:

[main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: <file ./Rank.pig, line 11, column 19>  [query, statement, foreach_statement, foreach_complex_statement, foreach_clause_complex, foreach_plan_complex, nested_blk, nested_command_list, nested_command, expr, add_expr, multi_expr, cast_expr, unary_expr, expr_eval, var_expr, projectable_expr, func_eval, recoverFromMismatchedToken] mismatched input 'links' expecting LEFT_PAREN

I am using Pig 0.11.0.

As far as I understand, links is a tuple and mals is a bag, so they cannot be cogrouped. How can I create a bag from links to pass to COGROUP?

UPD: Sample dataset:

/tmp/prepare: 
http://1 1.0
http://2 0.9
http://3 0.8
http://4 0.0

HBase:
id: ID
url: http://4
links: http://1 http://2 http://3

Expected output:

{(id: ID, mal: 0.9)}
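
One possible way to get this result, offered as an untested sketch rather than a confirmed fix: as far as I know, a nested FOREACH block in Pig 0.11 accepts only a small set of operators (FILTER, DISTINCT, ORDER, LIMIT and a few others), and COGROUP is not one of them, which would explain the parser error. TOKENIZE also returns a bag rather than a tuple, so the links can be flattened into one row per link at the top level and joined against mals there. The alias names link_rows, joined and by_id are made up for illustration; everything else is taken from the script above.

```pig
-- Same loads as in the question; $collection is the question's script parameter.
src = LOAD 'hbase://$collection'
      USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('info:url anchors:links', '-loadKey true')
      AS (id:bytearray, url:chararray, links:chararray);
mals = LOAD '/tmp/prepare' AS (url:chararray, mal:float);

urls = FILTER src BY links IS NOT NULL;

-- TOKENIZE yields a bag of one-field tuples; FLATTEN turns it into one row per link.
-- The '\t' delimiter is the one used in the question and is assumed to match the data.
link_rows = FOREACH urls GENERATE id, FLATTEN(TOKENIZE(links, '\t')) AS link;

-- An inner join keeps only links that also appear in mals (the intersection).
joined = JOIN link_rows BY link, mals BY url;

-- Average the matched weights for each source row.
by_id = GROUP joined BY link_rows::id;
processed = FOREACH by_id GENERATE group AS id, AVG(joined.mals::mal) AS mal;
```

With the sample data, the row ID would match http://1, http://2 and http://3, so the average should come out at roughly 0.9. Rows whose links match nothing in mals are dropped by the inner join; a LEFT OUTER join would keep them with a null mal.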

0 answers:

No answers yet