Pig Latin is slow

Time: 2012-06-19 10:54:08

Tags: apache-pig

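I'm running a Pig script, and everything is fast until it reaches the FOREACH ... GENERATE FLATTEN(...) line.

Is there any reason why that line should run so slowly? (It causes the whole script to time out on a fairly powerful cluster.)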

extended = FOREACH kRecords GENERATE *, NORMALIZE(query) AS query_norm:chararray;
-- DESCRIBE extended;
-- extended: {query: chararray,url: chararray,query_norm: chararray}

-- GROUP by both query and url
grouped = GROUP extended BY (query_norm, url);
-- DESCRIBE grouped;
-- grouped: {group: (query_norm: chararray,url: chararray),extended: {(query: chararray,url: chararray,query_norm: chararray)}}

-- Remove multiple items per record (but at the expense of duplicating records)
-- THE LINE BELOW IS THE SLOW ONE!!!
flattened = FOREACH grouped GENERATE FLATTEN(extended.query_norm), FLATTEN(extended.url);
-- THE LINE ABOVE IS THE SLOW ONE!!!

-- Remove duplicates
result = DISTINCT flattened;

Thanks, 百里

1 answer:

Answer 0 (score: 2)

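When you use 2 FLATTEN(...) operators after GENERATE, you get the Cartesian product of the 2 bags. So if the bag produced by GROUP has N elements, running 2 FLATTEN(...) operators on that same bag generates N * N rows for that group, which can put a heavy load on CPU, disk, and network. See the following example: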

CODE:

inpt = load '/pig_fun/input/group.txt' as (c1, c2);
grp = group inpt by (c1, c2);
flt = foreach grp generate FLATTEN(inpt.c1), FLATTEN(inpt.c2);

INPUT:

1       a
1       a
1       b
1       b
1       c

OUTPUT:

(1,a)
(1,a)
(1,a)
(1,a)
(1,b)
(1,b)
(1,b)
(1,b)
(1,c)

Note how the 2 records for (1,a) and the 2 records for (1,b) each produce 4 output records, while the single record for (1,c) produces only 1 output record.
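If the goal is only the distinct (query_norm, url) pairs, as the trailing DISTINCT in the question suggests, the N * N blow-up can be avoided. The following is a sketch under that assumption; it reuses the relation names from the question (extended, grouped), and the new aliases (pairs, result_a, flattened_b, result_b, result_c) are made up for illustration. Any one of the three options should yield the same set of pairs:

```pig
-- Option A: skip GROUP entirely; project the two fields and de-duplicate.
pairs    = FOREACH extended GENERATE query_norm, url;
result_a = DISTINCT pairs;

-- Option B: keep the GROUP, but FLATTEN the bag only once.
-- Projecting both fields in a single bag projection keeps them paired
-- tuple by tuple, so a group of N tuples yields N rows instead of N * N.
flattened_b = FOREACH grouped GENERATE FLATTEN(extended.(query_norm, url));
result_b    = DISTINCT flattened_b;

-- Option C: the group key is already (query_norm, url), so flattening
-- the key gives exactly one row per group and no DISTINCT is needed.
result_c = FOREACH grouped GENERATE FLATTEN(group);
```

Option A is likely the cheapest of the three, since it leaves the de-duplication to DISTINCT and never builds the per-group bags.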