I have the following data sample:
AGE,EDU,SEX,SALARY
67,10th,Male,<=50K
17,10th,Female,<=50K
40,Assoc-voc,Male,>50K
35,Assoc-voc,Male,<=50K
57,Assoc-voc,Male,<=50K
49,Assoc-voc,Male,>50K
42,Bachelors,Male,>50K
30,Bachelors,Male,>50K
23,Bachelors,Female,<=50K
==============================================
I created the following Pig Latin / Hadoop script:
sensitive = LOAD '/mdsba' using PigStorage(',') as (AGE,EDU,SEX,SALARY);
-- Filter the data by salary
Data_filter1 = FILTER sensitive by (SALARY matches '<=50K');
Data_filter2 = FILTER sensitive by (SALARY matches '>50K');
--group both filters
B = foreach (group Data_filter1 by (AGE, EDU, SEX))
    generate Data_filter1;
C = foreach (group Data_filter2 by (AGE, EDU, SEX))
    generate Data_filter2;
Dump B ;
Dump C ;
==============================================
Is there a way to determine whether B, C, Data_filter1, or Data_filter2 runs in the Map phase or the Reduce phase? I ask because the following report is generated at the end of the job:
Elapsed: 35sec
Diagnostics:
Average Map Time: 12sec
Average Shuffle Time: 10sec
Average Merge Time: 0sec
Average Reduce Time: 2sec
Thanks very much.
Answer 0 (score: 0)
Yes. When you launch the job, you will see a log line like this:
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - detailed locations: M: Alias1[73,14] C: Alias2[20, 9] R: Alias3[90, 78]
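M stands for mapper, C for combiner, and R for reducer; each alias is listed under the phase it runs in. In general, though, a single script runs on both the mapper and the reducer. In your script, the FILTER statements come before the GROUP and so run on the map side, while the GROUP forces a shuffle and the FOREACH ... GENERATE that follows it runs on the reduce side.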
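Another way to check, without reading the launcher log, is Pig's EXPLAIN operator. It prints the logical, physical, and MapReduce plans for an alias, and the MapReduce plan shows which operators belong to the map plan and which to the reduce plan. A minimal sketch, assuming these lines are appended to the script from the question, where the aliases B and C are already defined:

```pig
-- Print the execution plans for the grouped relations from the question.
-- In the MapReduce plan section of the output, each operator appears
-- under either the map plan or the reduce plan.
EXPLAIN B;
EXPLAIN C;
```

The exact layout of the plan output varies between Pig versions, so treat it as a guide to where each step runs rather than a fixed format.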