Does renaming fields after a JOIN cost time?

Time: 2012-08-06 18:15:24

Tags: apache-pig

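In the code below, does renaming the fields after the join hurt the script's running time? Does Pig optimize it away, or does it really go through every record?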

-- tables A: (f1, f2, id)  and B: (g1, g2, id) to be joined by id
C = JOIN A BY id, B by id;
C = FOREACH C GENERATE A::f1 AS f1, A::f2 AS f2, B::id AS id, B::g1 AS g1, B::g2 AS g2;

Does the FOREACH statement pass over every record of C? If so, is there a way to optimize it?

Thanks.

1 answer:

Answer 0 (score: 9)

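Don't worry about optimizing this. Renaming the fields may add a slight overhead, but it will not trigger an additional Map/Reduce job. The field projection is done in the reducer, right after the JOIN.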

Consider the following two scripts and the Map Reduce plans that explain gives for them.

Without renaming

A = load 'first' using PigStorage() as (f1, f2, id);
B = load 'second' using PigStorage() as (g1, g2, id);

C = join A by id, B by id;

store C into 'output';

#--------------------------------------------------
# Map Reduce Plan                                  
#--------------------------------------------------
MapReduce node scope-30
Map Plan
Union[tuple] - scope-31
|
|---C: Local Rearrange[tuple]{bytearray}(false) - scope-20
|   |   |
|   |   Project[bytearray][2] - scope-21
|   |
|   |---A: New For Each(false,false,false)[bag] - scope-7
|       |   |
|       |   Project[bytearray][0] - scope-1
|       |   |
|       |   Project[bytearray][1] - scope-3
|       |   |
|       |   Project[bytearray][2] - scope-5
|       |
|       |---A: Load(hdfs://location/first:PigStorage) - scope-0
|
|---C: Local Rearrange[tuple]{bytearray}(false) - scope-22
    |   |
    |   Project[bytearray][2] - scope-23
    |
    |---B: New For Each(false,false,false)[bag] - scope-15
        |   |
        |   Project[bytearray][0] - scope-9
        |   |
        |   Project[bytearray][1] - scope-11
        |   |
        |   Project[bytearray][2] - scope-13
        |
        |---B: Load(hdfs://location/second:PigStorage) - scope-8--------
Reduce Plan
C: Store(hdfs://location/output:org.apache.pig.builtin.PigStorage) - scope-27
|
|---POJoinPackage(true,true)[tuple] - scope-32--------
Global sort: false
----------------

With renaming

A = load 'first' using PigStorage() as (f1, f2, id);
B = load 'second' using PigStorage() as (g1, g2, id);

C = join A by id, B by id;
C = foreach C generate A::f1 as f1,  -- This
                       A::f2 as f2,  -- section
                       B::id as id,  -- is
                       B::g1 as g1,  -- different
                       B::g2 as g2;  --

store C into 'output';

#--------------------------------------------------
# Map Reduce Plan                                  
#--------------------------------------------------
MapReduce node scope-41
Map Plan
Union[tuple] - scope-42
|
|---C: Local Rearrange[tuple]{bytearray}(false) - scope-20
|   |   |
|   |   Project[bytearray][2] - scope-21
|   |
|   |---A: New For Each(false,false,false)[bag] - scope-7
|       |   |
|       |   Project[bytearray][0] - scope-1
|       |   |
|       |   Project[bytearray][1] - scope-3
|       |   |
|       |   Project[bytearray][2] - scope-5
|       |
|       |---A: Load(hdfs://location/first:PigStorage) - scope-0
|
|---C: Local Rearrange[tuple]{bytearray}(false) - scope-22
    |   |
    |   Project[bytearray][2] - scope-23
    |
    |---B: New For Each(false,false,false)[bag] - scope-15
        |   |
        |   Project[bytearray][0] - scope-9
        |   |
        |   Project[bytearray][1] - scope-11
        |   |
        |   Project[bytearray][2] - scope-13
        |
        |---B: Load(hdfs://location/second:PigStorage) - scope-8--------
Reduce Plan
C: Store(hdfs://location/output:org.apache.pig.builtin.PigStorage) - scope-38
|
|---C: New For Each(false,false,false,false,false)[bag] - scope-37
    |   |
    |   Project[bytearray][0] - scope-27
    |   |
    |   Project[bytearray][1] - scope-29
    |   |
    |   Project[bytearray][5] - scope-31
    |   |
    |   Project[bytearray][3] - scope-33
    |   |
    |   Project[bytearray][4] - scope-35
    |
    |---POJoinPackage(true,true)[tuple] - scope-43--------
Global sort: false
----------------

The difference is in the Reduce plan. Without renaming:

Reduce Plan
C: Store(hdfs://location/output:org.apache.pig.builtin.PigStorage) - scope-27
|
|---POJoinPackage(true,true)[tuple] - scope-32--------
Global sort: false

With renaming:

Reduce Plan
C: Store(hdfs://location/output:org.apache.pig.builtin.PigStorage) - scope-38
|
|---C: New For Each(false,false,false,false,false)[bag] - scope-37
    |   |
    |   Project[bytearray][0] - scope-27
    |   |
    |   Project[bytearray][1] - scope-29
    |   |
    |   Project[bytearray][5] - scope-31
    |   |
    |   Project[bytearray][3] - scope-33
    |   |
    |   Project[bytearray][4] - scope-35
    |
    |---POJoinPackage(true,true)[tuple] - scope-43--------
Global sort: false
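
The Project indices in the plan with renaming are positions in the joined tuple: positions 0-2 come from A and 3-5 from B, so Project[5] is B::id and Project[3] is B::g1. A minimal sketch for checking that layout yourself with describe, assuming the same 'first' and 'second' inputs as above. The alias D is only an illustration and is not part of the original scripts, and no output is shown because its exact formatting depends on the Pig version.

```pig
-- Same inputs as in the scripts above.
A = load 'first' using PigStorage() as (f1, f2, id);
B = load 'second' using PigStorage() as (g1, g2, id);

C = join A by id, B by id;

-- Print the schema of the joined relation. Fields are listed in tuple order:
-- A's three fields first (positions 0-2), then B's (positions 3-5).
describe C;

-- Positional references use the same layout: $5 is B::id, $3 is B::g1.
-- D is a hypothetical alias used here only for illustration.
D = foreach C generate $0 as f1, $1 as f2, $5 as id, $3 as g1, $4 as g2;
describe D;
```

This projection is the same one as the rename in the script above; it just picks the fields by position instead of by their A:: and B:: names.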

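In short, there are other things in your script to optimize before you worry about the renaming. Because of the join, you have to go through every record anyway, so the renaming is just a cheap extra step.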
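
To run the same check on your own script, here is a minimal sketch that uses explain on the final alias. It assumes the inputs from the scripts above and MapReduce execution mode. The alias D is introduced here for illustration instead of reassigning C.

```pig
A = load 'first' using PigStorage() as (f1, f2, id);
B = load 'second' using PigStorage() as (g1, g2, id);

C = join A by id, B by id;

-- The rename under test; D is a hypothetical alias for this sketch.
D = foreach C generate A::f1 as f1, A::f2 as f2, B::id as id,
                       B::g1 as g1, B::g2 as g2;

-- Print the execution plans for D without running the job.
explain D;
```

If the rename appears only as one extra New For Each inside the existing Reduce Plan, and the number of MapReduce nodes is unchanged, it has not added a job.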