I have a Pig script from Hortonworks that works with Pig 0.9.2.15 and Hadoop 1.0.3.16. But when I run it on Hadoop 2.4.0 with Pig 0.12.1 (recompiled with -Dhadoopversion=23) or Pig 0.13.0, it does not work.
The following line appears to be the problem:
max_runs = FOREACH grp_data GENERATE group as grp, MAX(runs.runs) as max_runs;
Here is the whole script:
batting = load 'pig_data/Batting.csv' using PigStorage(',');
runs = FOREACH batting GENERATE $0 as playerID, $1 as year, $8 as runs;
grp_data = GROUP runs by (year);
max_runs = FOREACH grp_data GENERATE group as grp, MAX(runs.runs) as max_runs;
join_max_run = JOIN max_runs by ($0, max_runs), runs by (year,runs);
join_data = FOREACH join_max_run GENERATE $0 as year, $2 as playerID, $1 as runs;
STORE join_data INTO './join_data';
Here is the Hadoop error message:
2014-07-29 18:03:02,957 [main] ERROR org.apache.pig.tools.pigstats.PigStats - ERROR 0: org.apache.pig.backend.executionengine.ExecException: ERROR 0: Exception while executing (Name: grp_data: Local Rearrange[tuple]{bytearray}(false) - scope-34 Operator Key: scope-34): org.apache.pig.backend.executionengine.ExecException: ERROR 2106: Error executing an algebraic function
2014-07-29 18:03:02,958 [main] ERROR org.apache.pig.tools.pigstats.mapreduce.MRPigStatsUtil - 1 map reduce job(s) failed!
What should I do if I still want to use the MAX function? Thanks!
Here is the complete output:
14/07/29 17:50:11 INFO pig.ExecTypeProvider: Trying ExecType : LOCAL
14/07/29 17:50:11 INFO pig.ExecTypeProvider: Trying ExecType : MAPREDUCE
14/07/29 17:50:11 INFO pig.ExecTypeProvider: Picked MAPREDUCE as the ExecType
2014-07-29 17:50:12,104 [main] INFO org.apache.pig.Main - Apache Pig version 0.13.0 (r1606446) compiled Jun 29 2014, 02:27:58
2014-07-29 17:50:12,104 [main] INFO org.apache.pig.Main - Logging error messages to: /root/hadooptestingsuite/scripts/tests/pig_test/hadoop2/pig_1406677812103.log
2014-07-29 17:50:13,050 [main] INFO org.apache.pig.impl.util.Utils - Default bootup file /root/.pigbootup not found
2014-07-29 17:50:13,415 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
2014-07-29 17:50:13,415 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
2014-07-29 17:50:13,415 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://namenode.cmda.hadoop.com:8020
2014-07-29 17:50:14,302 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to map-reduce job tracker at: namenode.cmda.hadoop.com:8021
2014-07-29 17:50:14,990 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
2014-07-29 17:50:15,570 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
2014-07-29 17:50:15,665 [main] WARN org.apache.pig.newplan.BaseOperatorPlan - Encountered Warning IMPLICIT_CAST_TO_DOUBLE 1 time(s).
2014-07-29 17:50:15,705 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.textoutputformat.separator is deprecated. Instead, use mapreduce.output.textoutputformat.separator
2014-07-29 17:50:15,791 [main] INFO org.apache.pig.tools.pigstats.ScriptState - Pig features used in the script: HASH_JOIN,GROUP_BY
2014-07-29 17:50:15,873 [main] INFO org.apache.pig.newplan.logical.optimizer.LogicalPlanOptimizer - {RULES_ENABLED=[AddForEach, ColumnMapKeyPrune, GroupByConstParallelSetter, LimitOptimizer, LoadTypeCastInserter, MergeFilter, MergeForEach, PartitionFilterOptimizer, PushDownForEachFlatten, PushUpFilter, SplitFilter, StreamTypeCastInserter], RULES_DISABLED=[FilterLogicExpressionSimplifier]}
2014-07-29 17:50:16,319 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler - File concatenation threshold: 100 optimistic? false
2014-07-29 17:50:16,377 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.CombinerOptimizer - Choosing to move algebraic foreach to combiner
2014-07-29 17:50:16,410 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler$LastInputStreamingOptimizer - Rewrite: POPackage->POForEach to POPackage(JoinPackager)
2014-07-29 17:50:16,417 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 3
2014-07-29 17:50:16,418 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - Merged 1 map-reduce splittees.
2014-07-29 17:50:16,418 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - Merged 1 out of total 3 MR operators.
2014-07-29 17:50:16,418 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size after optimization: 2
2014-07-29 17:50:16,493 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
2014-07-29 17:50:16,575 [main] INFO org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at namenode.cmda.hadoop.com/10.0.3.1:8050
2014-07-29 17:50:16,973 [main] INFO org.apache.pig.tools.pigstats.mapreduce.MRScriptState - Pig script settings are added to the job
2014-07-29 17:50:17,007 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.job.reduce.markreset.buffer.percent is deprecated. Instead, use mapreduce.reduce.markreset.buffer.percent
2014-07-29 17:50:17,007 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3
2014-07-29 17:50:17,007 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.output.compress is deprecated. Instead, use mapreduce.output.fileoutputformat.compress
2014-07-29 17:50:17,020 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Reduce phase detected, estimating # of required reducers.
2014-07-29 17:50:17,020 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Using reducer estimator: org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.InputSizeReducerEstimator
2014-07-29 17:50:17,064 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.InputSizeReducerEstimator - BytesPerReducer=1000000000 maxReducers=999 totalInputFileSize=6398990
2014-07-29 17:50:17,067 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting Parallelism to 1
2014-07-29 17:50:17,067 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.reduce.tasks is deprecated. Instead, use mapreduce.job.reduces
2014-07-29 17:50:17,068 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - This job cannot be converted run in-process
2014-07-29 17:50:17,068 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - creating jar file Job2337803902169382273.jar
2014-07-29 17:50:20,957 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - jar file Job2337803902169382273.jar created
2014-07-29 17:50:20,957 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.jar is deprecated. Instead, use mapreduce.job.jar
2014-07-29 17:50:21,001 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting up multi store job
2014-07-29 17:50:21,036 [main] INFO org.apache.pig.data.SchemaTupleFrontend - Key [pig.schematuple] is false, will not generate code.
2014-07-29 17:50:21,036 [main] INFO org.apache.pig.data.SchemaTupleFrontend - Starting process to move generated code to distributed cacche
2014-07-29 17:50:21,046 [main] INFO org.apache.pig.data.SchemaTupleFrontend - Setting key [pig.schematuple.classes] with classes to deserialize []
2014-07-29 17:50:21,310 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 1 map-reduce job(s) waiting for submission.
2014-07-29 17:50:21,311 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.job.tracker.http.address is deprecated. Instead, use mapreduce.jobtracker.http.address
2014-07-29 17:50:21,332 [JobControl] INFO org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at namenode.cmda.hadoop.com/10.0.3.1:8050
2014-07-29 17:50:21,366 [JobControl] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
2014-07-29 17:50:22,606 [JobControl] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
2014-07-29 17:50:22,606 [JobControl] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
2014-07-29 17:50:22,629 [JobControl] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths (combined) to process : 1
2014-07-29 17:50:22,729 [JobControl] INFO org.apache.hadoop.mapreduce.JobSubmitter - number of splits:1
2014-07-29 17:50:22,745 [JobControl] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
2014-07-29 17:50:23,026 [JobControl] INFO org.apache.hadoop.mapreduce.JobSubmitter - Submitting tokens for job: job_1406677482986_0003
2014-07-29 17:50:23,258 [JobControl] INFO org.apache.hadoop.yarn.client.api.impl.YarnClientImpl - Submitted application application_1406677482986_0003
2014-07-29 17:50:23,340 [JobControl] INFO org.apache.hadoop.mapreduce.Job - The url to track the job: http://namenode.cmda.hadoop.com:8088/proxy/application_1406677482986_0003/
2014-07-29 17:50:23,340 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - HadoopJobId: job_1406677482986_0003
2014-07-29 17:50:23,340 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Processing aliases batting,grp_data,max_runs,runs
2014-07-29 17:50:23,340 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - detailed locations: M: batting[3,10],runs[5,7],max_runs[7,11],grp_data[6,11] C: max_runs[7,11],grp_data[6,11] R: max_runs[7,11]
2014-07-29 17:50:23,340 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - More information at: http://namenode.cmda.hadoop.com:50030/jobdetails.jsp?jobid=job_1406677482986_0003
2014-07-29 17:50:23,357 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete
2014-07-29 17:50:23,357 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1406677482986_0003]
2014-07-29 17:51:15,564 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 50% complete
2014-07-29 17:51:15,564 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1406677482986_0003]
2014-07-29 17:51:18,582 [main] WARN org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Ooops! Some job has failed! Specify -stop_on_failure if you want Pig to stop immediately on failure.
2014-07-29 17:51:18,582 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - job job_1406677482986_0003 has failed! Stop running all dependent jobs
2014-07-29 17:51:18,582 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
2014-07-29 17:51:18,825 [main] ERROR org.apache.pig.tools.pigstats.PigStats - ERROR 0: org.apache.pig.backend.executionengine.ExecException: ERROR 0: Exception while executing (Name: grp_data: Local Rearrange[tuple]{bytearray}(false) - scope-73 Operator Key: scope-73): org.apache.pig.backend.executionengine.ExecException: ERROR 2106: Error executing an algebraic function
2014-07-29 17:51:18,825 [main] ERROR org.apache.pig.tools.pigstats.mapreduce.MRPigStatsUtil - 1 map reduce job(s) failed!
2014-07-29 17:51:18,826 [main] INFO org.apache.pig.tools.pigstats.mapreduce.SimplePigStats - Script Statistics:
HadoopVersion   PigVersion  UserId  StartedAt   FinishedAt  Features
2.4.0   0.13.0  root    2014-07-29 17:50:16 2014-07-29 17:51:18 HASH_JOIN,GROUP_BY
Failed!
Failed Jobs:
JobId   Alias   Feature Message Outputs
job_1406677482986_0003  batting,grp_data,max_runs,runs  MULTI_QUERY,COMBINER    Message: Job failed!
Input(s):
Failed to read data from "hdfs://namenode.cmda.hadoop.com:8020/user/root/pig_data/Batting.csv"
Output(s):
Counters:
Total records written : 0
Total bytes written : 0
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0
Job DAG: job_1406677482986_0003 -> null,null
2014-07-29 17:51:18,826 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Failed!
2014-07-29 17:51:18,827 [main] ERROR org.apache.pig.tools.grunt.GruntParser - ERROR 2106: Error executing an algebraic function
Details at logfile: /root/hadooptestingsuite/scripts/tests/pig_test/hadoop2/pig_1406677812103.log
2014-07-29 17:51:18,828 [main] ERROR org.apache.pig.tools.grunt.GruntParser - ERROR 2244: Job scope-58 failed, hadoop does not return any error message
Details at logfile: /root/hadooptestingsuite/scripts/tests/pig_test/hadoop2/pig_1406677812103.log
Answer 0 (score: 1)
Try the MAX function with a cast:
max_runs = FOREACH grp_data GENERATE group as grp, (int)MAX(runs.runs) as max_runs;
Hope it works.
Answer 1 (score: 1)
You should use data types in your load statement.
runs = FOREACH batting GENERATE $0 as playerID:chararray, $1 as year:int, $8 as runs:int;
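For the load-statement route, a minimal sketch could look like the following (untested; the column names follow the accepted answer further down, and only the fields up to R, i.e. the $8 this script needs, are declared; if your Pig version complains about the remaining columns, list the full schema as in that answer):
-- Hypothetical sketch: declare types in the LOAD itself so the runs column is an int from the start.
batting = LOAD 'pig_data/Batting.csv' USING PigStorage(',')
    AS (playerID:chararray, yearID:int, stint:int, teamID:chararray, lgID:chararray,
        G:int, G_batting:int, AB:int, R:int);
-- with a typed schema the projection can use names instead of positions
runs = FOREACH batting GENERATE playerID, yearID AS year, R AS runs;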
If for some reason this doesn't help, try explicit casting:
max_runs = FOREACH grp_data GENERATE group as grp, MAX((int)runs.runs) as max_runs;
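As a quick sanity check before launching the job (just a sketch, run from the Grunt shell), DESCRIBE shows whether the runs field is still a bytearray or already an int:
-- inspect the inferred schemas; runs should report runs: int rather than bytearray
DESCRIBE runs;
DESCRIBE grp_data;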
Answer 2 (score: 0)
Thanks to @BigData and @Mikko Kupsu for the tips. The problem indeed had something to do with data type casting.
After specifying the data type of each column as shown below, everything runs well.
batting =
LOAD '/user/root/pig_data/Batting.csv' USING PigStorage(',')
AS (playerID: CHARARRAY, yearID: INT, stint: INT, teamID: CHARARRAY, lgID: CHARARRAY,
G: INT, G_batting: INT, AB: INT, R: INT, H: INT, two_B: INT, three_B: INT, HR: INT, RBI: INT,
SB: INT, CS: INT, BB:INT, SO: INT, IBB: INT, HBP: INT, SH: INT, SF: INT, GIDP: INT, G_old: INT);
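For completeness, here is a sketch of the rest of the script against that schema, referencing the typed columns by name (same logic as the script in the question; untested):
runs = FOREACH batting GENERATE playerID, yearID AS year, R AS runs;
grp_data = GROUP runs BY year;
max_runs = FOREACH grp_data GENERATE group AS grp, MAX(runs.runs) AS max_runs;
join_max_run = JOIN max_runs BY (grp, max_runs), runs BY (year, runs);
join_data = FOREACH join_max_run GENERATE $0 AS year, $2 AS playerID, $1 AS runs;
STORE join_data INTO './join_data';
With the schema declared, MAX operates on int values rather than bytearrays, which appears to be what was tripping up the algebraic MAX in the combiner.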