Pig "MAX" command with pig-0.13.0 on Hadoop-2.4.0

Date: 2014-07-30 00:23:29

Tags: hadoop apache-pig

I have a Pig script from Hortonworks that works fine with pig-0.9.2.15 and Hadoop-1.0.3.16. But when I run it on Hadoop-2.4.0 with pig-0.12.1 (recompiled with -Dhadoopversion=23) or pig-0.13.0, it does not work.

The following line appears to be the problem:

max_runs = FOREACH grp_data GENERATE group as grp, MAX(runs.runs) as max_runs;

Here is the whole script:

batting = load 'pig_data/Batting.csv' using PigStorage(',');
runs = FOREACH batting GENERATE $0 as playerID, $1 as year, $8 as runs;
grp_data = GROUP runs by (year);
max_runs = FOREACH grp_data GENERATE group as grp, MAX(runs.runs) as max_runs;
join_max_run = JOIN max_runs by ($0, max_runs), runs by (year,runs);
join_data = FOREACH join_max_run GENERATE $0 as year, $2 as playerID, $1 as runs;
STORE join_data INTO './join_data';

Here is the hadoop error message:

  

2014-07-29 18:03:02,957 [main] ERROR org.apache.pig.tools.pigstats.PigStats - ERROR 0: org.apache.pig.backend.executionengine.ExecException: ERROR 0: Exception while executing (Name: grp_data: Local Rearrange[tuple]{bytearray}(false) - scope-34 Operator Key: scope-34): org.apache.pig.backend.executionengine.ExecException: ERROR 2106: Error executing an algebraic function
2014-07-29 18:03:02,958 [main] ERROR org.apache.pig.tools.pigstats.mapreduce.MRPigStatsUtil - 1 map reduce job(s) failed!

What should I do if I still want to use the MAX function? Thanks!

Here is the complete output:

  

14/07/29 17:50:11 INFO pig.ExecTypeProvider: Trying ExecType : LOCAL
14/07/29 17:50:11 INFO pig.ExecTypeProvider: Trying ExecType : MAPREDUCE
14/07/29 17:50:11 INFO pig.ExecTypeProvider: Picked MAPREDUCE as the ExecType
2014-07-29 17:50:12,104 [main] INFO org.apache.pig.Main - Apache Pig version 0.13.0 (r1606446) compiled Jun 29 2014, 02:27:58
2014-07-29 17:50:12,104 [main] INFO org.apache.pig.Main - Logging error messages to: /root/hadooptestingsuite/scripts/tests/pig_test/hadoop2/pig_1406677812103.log
2014-07-29 17:50:13,050 [main] INFO org.apache.pig.impl.util.Utils - Default bootup file /root/.pigbootup not found
2014-07-29 17:50:13,415 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
2014-07-29 17:50:13,415 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
2014-07-29 17:50:13,415 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://namenode.cmda.hadoop.com:8020
2014-07-29 17:50:14,302 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to map-reduce job tracker at: namenode.cmda.hadoop.com:8021
2014-07-29 17:50:14,990 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
2014-07-29 17:50:15,570 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
2014-07-29 17:50:15,665 [main] WARN org.apache.pig.newplan.BaseOperatorPlan - Encountered Warning IMPLICIT_CAST_TO_DOUBLE 1 time(s).
2014-07-29 17:50:15,705 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.textoutputformat.separator is deprecated. Instead, use mapreduce.output.textoutputformat.separator
2014-07-29 17:50:15,791 [main] INFO org.apache.pig.tools.pigstats.ScriptState - Pig features used in the script: HASH_JOIN,GROUP_BY
2014-07-29 17:50:15,873 [main] INFO org.apache.pig.newplan.logical.optimizer.LogicalPlanOptimizer - {RULES_ENABLED=[AddForEach, ColumnMapKeyPrune, GroupByConstParallelSetter, LimitOptimizer, LoadTypeCastInserter, MergeFilter, MergeForEach, PartitionFilterOptimizer, PushDownForEachFlatten, PushUpFilter, SplitFilter, StreamTypeCastInserter], RULES_DISABLED=[FilterLogicExpressionSimplifier]}
2014-07-29 17:50:16,319 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler - File concatenation threshold: 100 optimistic? false
2014-07-29 17:50:16,377 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.CombinerOptimizer - Choosing to move algebraic foreach to combiner
2014-07-29 17:50:16,410 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler$LastInputStreamingOptimizer - Rewrite: POPackage->POForEach to POPackage(JoinPackager)
2014-07-29 17:50:16,417 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 3
2014-07-29 17:50:16,418 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - Merged 1 map-reduce splittees.
2014-07-29 17:50:16,418 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - Merged 1 out of total 3 MR operators.
2014-07-29 17:50:16,418 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size after optimization: 2
2014-07-29 17:50:16,493 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
2014-07-29 17:50:16,575 [main] INFO org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at namenode.cmda.hadoop.com/10.0.3.1:8050
2014-07-29 17:50:16,973 [main] INFO org.apache.pig.tools.pigstats.mapreduce.MRScriptState - Pig script settings are added to the job
2014-07-29 17:50:17,007 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.job.reduce.markreset.buffer.percent is deprecated. Instead, use mapreduce.reduce.markreset.buffer.percent
2014-07-29 17:50:17,007 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3
2014-07-29 17:50:17,007 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.output.compress is deprecated. Instead, use mapreduce.output.fileoutputformat.compress
2014-07-29 17:50:17,020 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Reduce phase detected, estimating # of required reducers.
2014-07-29 17:50:17,020 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Using reducer estimator: org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.InputSizeReducerEstimator
2014-07-29 17:50:17,064 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.InputSizeReducerEstimator - BytesPerReducer=1000000000 maxReducers=999 totalInputFileSize=6398990
2014-07-29 17:50:17,067 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting Parallelism to 1
2014-07-29 17:50:17,067 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.reduce.tasks is deprecated. Instead, use mapreduce.job.reduces
2014-07-29 17:50:17,068 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - This job cannot be converted run in-process
2014-07-29 17:50:17,068 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - creating jar file Job2337803902169382273.jar
2014-07-29 17:50:20,957 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - jar file Job2337803902169382273.jar created
2014-07-29 17:50:20,957 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.jar is deprecated. Instead, use mapreduce.job.jar
2014-07-29 17:50:21,001 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting up multi store job
2014-07-29 17:50:21,036 [main] INFO org.apache.pig.data.SchemaTupleFrontend - Key [pig.schematuple] is false, will not generate code.
2014-07-29 17:50:21,036 [main] INFO org.apache.pig.data.SchemaTupleFrontend - Starting process to move generated code to distributed cacche
2014-07-29 17:50:21,046 [main] INFO org.apache.pig.data.SchemaTupleFrontend - Setting key [pig.schematuple.classes] with classes to deserialize []
2014-07-29 17:50:21,310 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 1 map-reduce job(s) waiting for submission.
2014-07-29 17:50:21,311 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.job.tracker.http.address is deprecated. Instead, use mapreduce.jobtracker.http.address
2014-07-29 17:50:21,332 [JobControl] INFO org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at namenode.cmda.hadoop.com/10.0.3.1:8050
2014-07-29 17:50:21,366 [JobControl] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
2014-07-29 17:50:22,606 [JobControl] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
2014-07-29 17:50:22,606 [JobControl] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
2014-07-29 17:50:22,629 [JobControl] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths (combined) to process : 1
2014-07-29 17:50:22,729 [JobControl] INFO org.apache.hadoop.mapreduce.JobSubmitter - number of splits:1
2014-07-29 17:50:22,745 [JobControl] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
2014-07-29 17:50:23,026 [JobControl] INFO org.apache.hadoop.mapreduce.JobSubmitter - Submitting tokens for job: job_1406677482986_0003
2014-07-29 17:50:23,258 [JobControl] INFO org.apache.hadoop.yarn.client.api.impl.YarnClientImpl - Submitted application application_1406677482986_0003
2014-07-29 17:50:23,340 [JobControl] INFO org.apache.hadoop.mapreduce.Job - The url to track the job: http://namenode.cmda.hadoop.com:8088/proxy/application_1406677482986_0003/
2014-07-29 17:50:23,340 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - HadoopJobId: job_1406677482986_0003
2014-07-29 17:50:23,340 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Processing aliases batting,grp_data,max_runs,runs
2014-07-29 17:50:23,340 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - detailed locations: M: batting[3,10],runs[5,7],max_runs[7,11],grp_data[6,11] C: max_runs[7,11],grp_data[6,11] R: max_runs[7,11]
2014-07-29 17:50:23,340 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - More information at: http://namenode.cmda.hadoop.com:50030/jobdetails.jsp?jobid=job_1406677482986_0003
2014-07-29 17:50:23,357 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete
2014-07-29 17:50:23,357 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1406677482986_0003]
2014-07-29 17:51:15,564 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 50% complete
2014-07-29 17:51:15,564 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1406677482986_0003]
2014-07-29 17:51:18,582 [main] WARN org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Ooops! Some job has failed! Specify -stop_on_failure if you want Pig to stop immediately on failure.
2014-07-29 17:51:18,582 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - job job_1406677482986_0003 has failed! Stop running all dependent jobs
2014-07-29 17:51:18,582 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
2014-07-29 17:51:18,825 [main] ERROR org.apache.pig.tools.pigstats.PigStats - ERROR 0: org.apache.pig.backend.executionengine.ExecException: ERROR 0: Exception while executing (Name: grp_data: Local Rearrange[tuple]{bytearray}(false) - scope-73 Operator Key: scope-73): org.apache.pig.backend.executionengine.ExecException: ERROR 2106: Error executing an algebraic function
2014-07-29 17:51:18,825 [main] ERROR org.apache.pig.tools.pigstats.mapreduce.MRPigStatsUtil - 1 map reduce job(s) failed!
2014-07-29 17:51:18,826 [main] INFO org.apache.pig.tools.pigstats.mapreduce.SimplePigStats - Script Statistics:

     

HadoopVersion  PigVersion  UserId  StartedAt            FinishedAt           Features
2.4.0          0.13.0      root    2014-07-29 17:50:16  2014-07-29 17:51:18  HASH_JOIN,GROUP_BY

     

Failed!

     

Failed Jobs:
JobId                   Alias                           Feature               Message               Outputs
job_1406677482986_0003  batting,grp_data,max_runs,runs  MULTI_QUERY,COMBINER  Message: Job failed!

     

Input(s):
Failed to read data from "hdfs://namenode.cmda.hadoop.com:8020/user/root/pig_data/Batting.csv"

     

Output(s):

     

Counters:
Total records written : 0
Total bytes written : 0
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0

     

Job DAG: job_1406677482986_0003 -> null, null

     

2014-07-29 17:51:18,826 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Failed!
2014-07-29 17:51:18,827 [main] ERROR org.apache.pig.tools.grunt.GruntParser - ERROR 2106: Error executing an algebraic function
Details at logfile: /root/hadooptestingsuite/scripts/tests/pig_test/hadoop2/pig_1406677812103.log
2014-07-29 17:51:18,828 [main] ERROR org.apache.pig.tools.grunt.GruntParser - ERROR 2244: Job scope-58 failed, hadoop does not return any error message
Details at logfile: /root/hadooptestingsuite/scripts/tests/pig_test/hadoop2/pig_1406677812103.log

3 answers:

Answer 0 (score: 1):

Try casting the result of the MAX function:

max_runs = FOREACH grp_data GENERATE group as grp, (int)MAX(runs.runs) as max_runs;

Hope it works.

Answer 1 (score: 1):

You should use data types in your load statement:

runs = FOREACH batting GENERATE $0 as playerID:chararray, $1 as year:int, $8 as runs:int;

If for some reason that does not help, try an explicit cast:

max_runs = FOREACH grp_data GENERATE group as grp, MAX((int)runs.runs) as max_runs;
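
A quick way to confirm the casts took effect is to inspect the schema and a small sample before grouping (DESCRIBE and DUMP are standard Pig commands; the relation names are the ones from the script above):

DESCRIBE runs;   -- should now report year: int and runs: int rather than bytearray
DUMP max_runs;   -- spot-check a few (year, max_runs) pairs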

Answer 2 (score: 0):

Thanks to @BigData and @Mikko Kupsu for the hints. The issue did indeed come down to data type casting.

After specifying the data type of each column as shown below, everything runs fine.

batting = 
    LOAD '/user/root/pig_data/Batting.csv' USING PigStorage(',')
    AS (playerID: CHARARRAY, yearID: INT, stint: INT, teamID: CHARARRAY, lgID: CHARARRAY,
    G: INT, G_batting: INT, AB: INT, R: INT, H: INT, two_B: INT, three_B: INT, HR: INT, RBI: INT, 
    SB: INT, CS: INT, BB:INT, SO: INT, IBB: INT, HBP: INT, SH: INT, SF: INT, GIDP: INT, G_old: INT);
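
With that schema in place, the rest of the original script can stay essentially the same. Here is a minimal sketch of the remaining statements, assuming the column names declared above (yearID and R correspond to the original positional references $1 and $8):

-- sketch only: same logic as the original script, rewritten against the typed schema
runs = FOREACH batting GENERATE playerID, yearID AS year, R AS runs;
grp_data = GROUP runs BY (year);
max_runs = FOREACH grp_data GENERATE group AS grp, MAX(runs.runs) AS max_runs;
join_max_run = JOIN max_runs BY ($0, max_runs), runs BY (year, runs);
join_data = FOREACH join_max_run GENERATE $0 AS year, $2 AS playerID, $1 AS runs;
STORE join_data INTO './join_data';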