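I'm running a Pig script, and everything is fast until I reach the FOREACH ... GENERATE FLATTEN(...) line.
Is there a reason this line should run so slowly? (It causes the whole script to time out on a fairly powerful cluster.)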
extended = FOREACH kRecords GENERATE *, NORMALIZE(query) AS query_norm:chararray;
-- DESCRIBE extended;
-- extended: {query: chararray,url: chararray,query_norm: chararray}
-- GROUP by both query and url
grouped = GROUP extended BY (query_norm, url);
-- DESCRIBE grouped;
-- grouped: {group: (query_norm: chararray,url: chararray),extended: {(query: chararray,url: chararray,query_norm: chararray)}}
-- Remove multiple items per record (but at the expense of duplicating records)
-- THE LINE BELOW IS THE SLOW ONE!!!
flattened = FOREACH grouped GENERATE FLATTEN(extended.query_norm), FLATTEN(extended.url);
-- THE LINE ABOVE IS THE SLOW ONE!!!
-- Remove duplicates
result = DISTINCT flattened;
Thanks, 百里
Answer 0 (score: 2)
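When you use two FLATTEN(...) operators after GENERATE, you get the Cartesian product of the two bags. So if a bag produced by GROUP has N elements, running two FLATTEN(...) operators on that same bag produces N * N rows for that group, which can put a heavy load on CPU, disk, and network. See the following example: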
CODE:
inpt = load '/pig_fun/input/group.txt' as (c1, c2);
grp = group inpt by (c1, c2);
flt = foreach grp generate FLATTEN(inpt.c1), FLATTEN(inpt.c2);
INPUT:
1 a
1 a
1 b
1 b
1 c
OUTPUT:
(1,a)
(1,a)
(1,a)
(1,a)
(1,b)
(1,b)
(1,b)
(1,b)
(1,c)
Notice how the 2 records for (1,a) and the 2 records for (1,b) each produce 4 output records, while the single record for (1,c) produces only 1.
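For the script in the question, the cross product can probably be avoided in one of two ways. This is a sketch, not tested against the original data. It assumes the goal is the set of distinct (query_norm, url) pairs, and the alias result_from_keys is a name made up here for illustration.

```pig
-- Option 1: project both fields in a single bag projection, so one FLATTEN
-- emits N rows per group instead of N * N.
flattened = FOREACH grouped GENERATE FLATTEN(extended.(query_norm, url));
result = DISTINCT flattened;

-- Option 2: the data is grouped by (query_norm, url), so each group key
-- already holds exactly one distinct pair. Flattening the key avoids
-- touching the bag at all. (result_from_keys is a hypothetical alias.)
result_from_keys = FOREACH grouped GENERATE FLATTEN(group) AS (query_norm, url);
```

Since every record in a group shares the same query_norm and url, option 2 should give the same rows as the original FLATTEN + DISTINCT combination without generating the duplicates first.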