In the code below, does renaming fields after the join hurt the script's run time? Is it optimized in Pig, or does it really go through every record?
-- tables A: (f1, f2, id) and B: (g1, g2, id) to be joined by id
C = JOIN A BY id, B by id;
C = FOREACH C GENERATE A::f1 AS f1, A::f2 AS f2, B::id AS id, B::g1 AS g1, B::g2 AS g2;
Does the FOREACH command go through every record of C? If so, is there a way to optimize it?
Thanks.
Answer 0 (score: 9)
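Don't worry about optimizing this. Renaming the fields may add a slight overhead, but it will not trigger an additional Map/Reduce job. The field projection happens in the reduce phase of the JOIN job.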
Consider the following two pieces of code and the Map Reduce plans that explain produces for them.
A = load 'first' using PigStorage() as (f1, f2, id);
B = load 'second' using PigStorage() as (g1, g2, id);
C = join A by id, B by id;
store C into 'output';
#--------------------------------------------------
# Map Reduce Plan
#--------------------------------------------------
MapReduce node scope-30
Map Plan
Union[tuple] - scope-31
|
|---C: Local Rearrange[tuple]{bytearray}(false) - scope-20
| | |
| | Project[bytearray][2] - scope-21
| |
| |---A: New For Each(false,false,false)[bag] - scope-7
| | |
| | Project[bytearray][0] - scope-1
| | |
| | Project[bytearray][1] - scope-3
| | |
| | Project[bytearray][2] - scope-5
| |
| |---A: Load(hdfs://location/first:PigStorage) - scope-0
|
|---C: Local Rearrange[tuple]{bytearray}(false) - scope-22
| |
| Project[bytearray][2] - scope-23
|
|---B: New For Each(false,false,false)[bag] - scope-15
| |
| Project[bytearray][0] - scope-9
| |
| Project[bytearray][1] - scope-11
| |
| Project[bytearray][2] - scope-13
|
|---B: Load(hdfs://location/second:PigStorage) - scope-8--------
Reduce Plan
C: Store(hdfs://location/output:org.apache.pig.builtin.PigStorage) - scope-27
|
|---POJoinPackage(true,true)[tuple] - scope-32--------
Global sort: false
----------------
A = load 'first' using PigStorage() as (f1, f2, id);
B = load 'second' using PigStorage() as (g1, g2, id);
C = join A by id, B by id;
C = foreach C generate A::f1 as f1, -- This
A::f2 as f2, -- section
B::id as id, -- is
B::g1 as g1, -- different
B::g2 as g2; --
store C into 'output';
#--------------------------------------------------
# Map Reduce Plan
#--------------------------------------------------
MapReduce node scope-41
Map Plan
Union[tuple] - scope-42
|
|---C: Local Rearrange[tuple]{bytearray}(false) - scope-20
| | |
| | Project[bytearray][2] - scope-21
| |
| |---A: New For Each(false,false,false)[bag] - scope-7
| | |
| | Project[bytearray][0] - scope-1
| | |
| | Project[bytearray][1] - scope-3
| | |
| | Project[bytearray][2] - scope-5
| |
| |---A: Load(hdfs://location/first:PigStorage) - scope-0
|
|---C: Local Rearrange[tuple]{bytearray}(false) - scope-22
| |
| Project[bytearray][2] - scope-23
|
|---B: New For Each(false,false,false)[bag] - scope-15
| |
| Project[bytearray][0] - scope-9
| |
| Project[bytearray][1] - scope-11
| |
| Project[bytearray][2] - scope-13
|
|---B: Load(hdfs://location/second:PigStorage) - scope-8--------
Reduce Plan
C: Store(hdfs://location/output:org.apache.pig.builtin.PigStorage) - scope-38
|
|---C: New For Each(false,false,false,false,false)[bag] - scope-37
| |
| Project[bytearray][0] - scope-27
| |
| Project[bytearray][1] - scope-29
| |
| Project[bytearray][5] - scope-31
| |
| Project[bytearray][3] - scope-33
| |
| Project[bytearray][4] - scope-35
|
|---POJoinPackage(true,true)[tuple] - scope-43--------
Global sort: false
----------------
The difference is in the Reduce plan. Without the rename:
Reduce Plan
C: Store(hdfs://location/output:org.apache.pig.builtin.PigStorage) - scope-27
|
|---POJoinPackage(true,true)[tuple] - scope-32--------
Global sort: false
With the rename:
Reduce Plan
C: Store(hdfs://location/output:org.apache.pig.builtin.PigStorage) - scope-38
|
|---C: New For Each(false,false,false,false,false)[bag] - scope-37
| |
| Project[bytearray][0] - scope-27
| |
| Project[bytearray][1] - scope-29
| |
| Project[bytearray][5] - scope-31
| |
| Project[bytearray][3] - scope-33
| |
| Project[bytearray][4] - scope-35
|
|---POJoinPackage(true,true)[tuple] - scope-43--------
Global sort: false
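In short, there are other things in your script to optimize before you worry about the rename. Because of the join, you have to go through every record anyway, so the rename is just a cheap extra step.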
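If you want to check this on your own script, one way is to run explain on the final alias and count the MapReduce nodes in the output. The following is a minimal sketch that reuses the relation names and input paths from the example above. It assumes explain is called on the alias rather than on a script containing a store statement, so the Store location shown in the plan will likely differ from 'output'.

```pig
-- Sketch only: the same join and rename as above, followed by explain on the alias.
A = load 'first' using PigStorage() as (f1, f2, id);
B = load 'second' using PigStorage() as (g1, g2, id);
C = join A by id, B by id;
C = foreach C generate A::f1 as f1, A::f2 as f2, B::id as id, B::g1 as g1, B::g2 as g2;

-- Prints the logical, physical, and Map Reduce plans for C.
-- If the rename stayed in the join's job, there is still a single
-- "MapReduce node" entry, with the extra "New For Each" under "Reduce Plan".
explain C;
```

A second MapReduce node in that output would mean some other statement in the script forced an extra job; in the plans shown above, the rename alone does not.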