When choosing between a JOIN and a FILTER in Pig, which one is more performance-intensive?
Answer 0 (score: 1)
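A join is the more expensive of the two, because every tuple in the first relation has to be matched against the tuples of the second relation. Consider the following example: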
A = LOAD 'data1' AS (a1:int,a2:int,a3:int);
DUMP A;
(1,2,3)
(4,2,1)
(8,3,4)
(4,3,3)
(7,2,5)
(8,4,3)
B = LOAD 'data2' AS (b1:int,b2:int);
DUMP B;
(2,4)
(8,9)
(1,3)
(2,7)
(2,9)
(4,6)
(4,9)
X = JOIN A BY a1, B BY b1;
DUMP X;
(1,2,3,1,3)
(4,2,1,4,6)
(4,3,3,4,6)
(4,2,1,4,9)
(4,3,3,4,9)
(8,3,4,8,9)
(8,4,3,8,9)
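To produce the join X, conceptually we go through the tuples of B for each tuple of A, so the work grows with the size of both relations. For a filter, we only make a single pass over one dataset and evaluate the filter condition on each tuple, which is much cheaper: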
X = FILTER A BY a3 == 3;
DUMP X;
(1,2,3)
(4,3,3)
(8,4,3)
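
To make the difference concrete, here is a rough Python sketch (not Pig) that counts how many tuple checks each operation needs on the sample data above. It models the join as the naive nested loop described in this answer; that is an assumption for illustration only, since Pig itself runs a join as a distributed job and does not literally loop this way. The helper names `nested_loop_join` and `filter_rows` are made up for this example.

```python
# Sample data copied from the Pig relations A and B above.
A = [(1, 2, 3), (4, 2, 1), (8, 3, 4), (4, 3, 3), (7, 2, 5), (8, 4, 3)]
B = [(2, 4), (8, 9), (1, 3), (2, 7), (2, 9), (4, 6), (4, 9)]


def nested_loop_join(left, right):
    """Naive join on the first field of each tuple.

    Returns the joined rows and the number of tuple comparisons made.
    This is a conceptual model, not how Pig executes a JOIN.
    """
    rows = []
    comparisons = 0
    for l in left:
        for r in right:
            comparisons += 1
            if l[0] == r[0]:
                rows.append(l + r)
    return rows, comparisons


def filter_rows(rows, predicate):
    """Single pass over one dataset; returns kept rows and checks made."""
    kept = []
    checks = 0
    for row in rows:
        checks += 1
        if predicate(row):
            kept.append(row)
    return kept, checks


# Equivalent of: X = JOIN A BY a1, B BY b1;
joined, join_cost = nested_loop_join(A, B)

# Equivalent of: X = FILTER A BY a3 == 3;
filtered, filter_cost = filter_rows(A, lambda t: t[2] == 3)

print(len(joined), "joined rows,", join_cost, "comparisons")
print(len(filtered), "filtered rows,", filter_cost, "checks")
```

On this data the modeled join needs 6 x 7 = 42 comparisons to produce its 7 rows, while the filter needs only 6 checks, one per tuple of A.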