我在google n-gram数据集上运行了以下命令:
inp = LOAD 'link to file' AS (ngram:chararray, year:int, occurences:float, books:float);
filter_input = FILTER inp BY (occurences >= 400) AND (books >= 8);
groupinp = GROUP filter_input BY ngram;
sum_occ = FOREACH groupinp GENERATE FLATTEN(group) as ngram, SUM(filter_input.occurences) / SUM(filter_input.books) AS ntry;
roundto = FOREACH sum_occ GENERATE sum_occ.ngram, ROUND_TO( sum_occ.ntry , 2 );
但是我收到以下错误:
DUMP roundto;
601062 [main] WARN org.apache.pig.newplan.BaseOperatorPlan - Encountered Warning IMPLICIT_CAST_TO_FLOAT 2 time(s).
18/04/06 01:46:03 WARN newplan.BaseOperatorPlan: Encountered Warning IMPLICIT_CAST_TO_FLOAT 2 time(s).
601067 [main] INFO org.apache.pig.tools.pigstats.ScriptState - Pig features used in the script: GROUP_BY,FILTER
18/04/06 01:46:03 INFO pigstats.ScriptState: Pig features used in the script: GROUP_BY,FILTER
601111 [main] INFO org.apache.pig.data.SchemaTupleBackend - Key [pig.schematuple] was not set... will not generate code.
18/04/06 01:46:03 INFO data.SchemaTupleBackend: Key [pig.schematuple] was not set... will not generate code.
601111 [main] INFO org.apache.pig.newplan.logical.optimizer.LogicalPlanOptimizer - {RULES_ENABLED=[AddForEach, ColumnMapKeyPrune, ConstantCalculator, GroupByConstParallelSetter, LimitOptimizer, LoadTypeCastInserter, MergeFilter, MergeForEach, NestedLimitOptimizer, PartitionFilterOptimizer, PredicatePushdownOptimizer, PushDownForEachFlatten, PushUpFilter, SplitFilter, StreamTypeCastInserter]}
18/04/06 01:46:03 INFO optimizer.LogicalPlanOptimizer: {RULES_ENABLED=[AddForEach, ColumnMapKeyPrune, ConstantCalculator, GroupByConstParallelSetter, LimitOptimizer, LoadTypeCastInserter, MergeFilter, MergeForEach, NestedLimitOptimizer, PartitionFilterOptimizer, PredicatePushdownOptimizer, PushDownForEachFlatten, PushUpFilter, SplitFilter, StreamTypeCastInserter]}
601238 [main] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezLauncher - Tez staging directory is /tmp/temp-336429202 and resources directory is /tmp/temp-336429202
18/04/06 01:46:03 INFO tez.TezLauncher: Tez staging directory is /tmp/temp-336429202 and resources directory is /tmp/temp-336429202
601239 [main] INFO org.apache.pig.backend.hadoop.executionengine.tez.plan.TezCompiler - File concatenation threshold: 100 optimistic? false
18/04/06 01:46:03 INFO plan.TezCompiler: File concatenation threshold: 100 optimistic? false
601241 [main] INFO org.apache.pig.backend.hadoop.executionengine.util.CombinerOptimizerUtil - Choosing to move algebraic foreach to combiner
18/04/06 01:46:03 INFO util.CombinerOptimizerUtil: Choosing to move algebraic foreach to combiner
601265 [main] INFO org.apache.pig.builtin.PigStorage - Using PigTextInputFormat
18/04/06 01:46:03 INFO builtin.PigStorage: Using PigTextInputFormat
18/04/06 01:46:03 INFO input.FileInputFormat: Total input files to process : 1
601285 [main] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
18/04/06 01:46:03 INFO util.MapRedUtil: Total input paths to process : 1
601285 [main] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths (combined) to process : 1
18/04/06 01:46:03 INFO util.MapRedUtil: Total input paths (combined) to process : 1
18/04/06 01:46:03 INFO hadoop.MRInputHelpers: NumSplits: 1, SerializedSize: 408
601322 [main] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezJobCompiler - Local resource: joda-time-2.9.4.jar
18/04/06 01:46:03 INFO tez.TezJobCompiler: Local resource: joda-time-2.9.4.jar
601322 [main] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezJobCompiler - Local resource: pig-0.17.0-core-h2.jar
18/04/06 01:46:03 INFO tez.TezJobCompiler: Local resource: pig-0.17.0-core-h2.jar
601322 [main] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezJobCompiler - Local resource: antlr-runtime-3.4.jar
18/04/06 01:46:03 INFO tez.TezJobCompiler: Local resource: antlr-runtime-3.4.jar
601322 [main] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezJobCompiler - Local resource: automaton-1.11-8.jar
18/04/06 01:46:03 INFO tez.TezJobCompiler: Local resource: automaton-1.11-8.jar
601402 [main] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezDagBuilder - For vertex - scope-141: parallelism=1, memory=1536, java opts=-Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN -Xmx1229m -Dlog4j.configuratorClass=org.apache.tez.common.TezLog4jConfigurator -Dlog4j.configuration=tez-container-log4j.properties -Dyarn.app.container.log.dir=<LOG_DIR> -Dtez.root.logger=INFO,CLA
18/04/06 01:46:03 INFO tez.TezDagBuilder: For vertex - scope-141: parallelism=1, memory=1536, java opts=-Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN -Xmx1229m -Dlog4j.configuratorClass=org.apache.tez.common.TezLog4jConfigurator -Dlog4j.configuration=tez-container-log4j.properties -Dyarn.app.container.log.dir=<LOG_DIR> -Dtez.root.logger=INFO,CLA
601402 [main] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezDagBuilder - Processing aliases: filter_input,groupinp,inp,sum_occ
18/04/06 01:46:03 INFO tez.TezDagBuilder: Processing aliases: filter_input,groupinp,inp,sum_occ
601402 [main] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezDagBuilder - Detailed locations: inp[1,6],inp[-1,-1],filter_input[2,15],sum_occ[4,10],groupinp[3,11]
18/04/06 01:46:03 INFO tez.TezDagBuilder: Detailed locations: inp[1,6],inp[-1,-1],filter_input[2,15],sum_occ[4,10],groupinp[3,11]
601402 [main] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezDagBuilder - Pig features in the vertex:
18/04/06 01:46:03 INFO tez.TezDagBuilder: Pig features in the vertex:
601449 [main] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezDagBuilder - Set auto parallelism for vertex scope-142
18/04/06 01:46:03 INFO tez.TezDagBuilder: Set auto parallelism for vertex scope-142
601450 [main] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezDagBuilder - For vertex - scope-142: parallelism=1, memory=3072, java opts=-Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN -Xmx2458m -Dlog4j.configuratorClass=org.apache.tez.common.TezLog4jConfigurator -Dlog4j.configuration=tez-container-log4j.properties -Dyarn.app.container.log.dir=<LOG_DIR> -Dtez.root.logger=INFO,CLA
18/04/06 01:46:03 INFO tez.TezDagBuilder: For vertex - scope-142: parallelism=1, memory=3072, java opts=-Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN -Xmx2458m -Dlog4j.configuratorClass=org.apache.tez.common.TezLog4jConfigurator -Dlog4j.configuration=tez-container-log4j.properties -Dyarn.app.container.log.dir=<LOG_DIR> -Dtez.root.logger=INFO,CLA
601450 [main] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezDagBuilder - Processing aliases: roundto,sum_occ
18/04/06 01:46:03 INFO tez.TezDagBuilder: Processing aliases: roundto,sum_occ
601450 [main] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezDagBuilder - Detailed locations: sum_occ[4,10],roundto[6,10]
18/04/06 01:46:03 INFO tez.TezDagBuilder: Detailed locations: sum_occ[4,10],roundto[6,10]
601450 [main] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezDagBuilder - Pig features in the vertex: GROUP_BY
18/04/06 01:46:03 INFO tez.TezDagBuilder: Pig features in the vertex: GROUP_BY
601489 [main] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezJobCompiler - Total estimated parallelism is 2
18/04/06 01:46:04 INFO tez.TezJobCompiler: Total estimated parallelism is 2
601531 [PigTezLauncher-0] INFO org.apache.pig.tools.pigstats.tez.TezScriptState - Pig script settings are added to the job
18/04/06 01:46:04 INFO tez.TezScriptState: Pig script settings are added to the job
18/04/06 01:46:04 INFO client.TezClient: Tez Client Version: [ component=tez-api, version=0.8.4, revision=300391394352b074b85b529e870816a72c6f314a, SCM-URL=scm:git:https://git-wip-us.apache.org/repos/asf/tez.git, buildTime=2018-03-21T23:55:28Z ]
18/04/06 01:46:04 INFO client.RMProxy: Connecting to ResourceManager at ip-172-31-28-12.ec2.internal/172.31.28.12:8032
18/04/06 01:46:04 INFO client.TezClient: Using org.apache.tez.dag.history.ats.acls.ATSHistoryACLPolicyManager to manage Timeline ACLs
18/04/06 01:46:04 INFO impl.TimelineClientImpl: Timeline service address: http://ip-172-31-28-12.ec2.internal:8188/ws/v1/timeline/
18/04/06 01:46:04 INFO client.TezClient: Session mode. Starting session.
18/04/06 01:46:04 INFO client.TezClientUtils: Using tez.lib.uris value from configuration: hdfs:///apps/tez/tez.tar.gz
18/04/06 01:46:04 INFO client.TezClientUtils: Using tez.lib.uris.classpath value from configuration: null
18/04/06 01:46:04 INFO client.TezClient: Tez system stage directory hdfs://ip-172-31-28-12.ec2.internal:8020/tmp/temp-336429202/.tez/application_1522978297921_0003 doesn't exist and is created
18/04/06 01:46:04 INFO acls.ATSHistoryACLPolicyManager: Created Timeline Domain for History ACLs, domainId=Tez_ATS_application_1522978297921_0003
18/04/06 01:46:04 INFO impl.YarnClientImpl: Submitted application application_1522978297921_0003
18/04/06 01:46:04 INFO client.TezClient: The url to track the Tez Session: http://ip-172-31-28-12.ec2.internal:20888/proxy/application_1522978297921_0003/
607861 [PigTezLauncher-0] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezJob - Submitting DAG PigLatin:DefaultJobName-0_scope-2
18/04/06 01:46:10 INFO tez.TezJob: Submitting DAG PigLatin:DefaultJobName-0_scope-2
18/04/06 01:46:10 INFO client.TezClient: Submitting dag to TezSession, sessionName=PigLatin:DefaultJobName, applicationId=application_1522978297921_0003, dagName=PigLatin:DefaultJobName-0_scope-2, callerContext={ context=PIG, callerType=PIG_SCRIPT_ID, callerId=PIG-default-d73e19dc-5287-4ee2-a85d-e931327011dc }
18/04/06 01:46:10 INFO client.TezClient: Submitted dag to TezSession, sessionName=PigLatin:DefaultJobName, applicationId=application_1522978297921_0003, dagName=PigLatin:DefaultJobName-0_scope-2
18/04/06 01:46:10 INFO client.RMProxy: Connecting to ResourceManager at ip-172-31-28-12.ec2.internal/172.31.28.12:8032
608409 [PigTezLauncher-0] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezJob - Submitted DAG PigLatin:DefaultJobName-0_scope-2. Application id: application_1522978297921_0003
18/04/06 01:46:10 INFO tez.TezJob: Submitted DAG PigLatin:DefaultJobName-0_scope-2. Application id: application_1522978297921_0003
608528 [main] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezLauncher - HadoopJobId: job_1522978297921_0003
18/04/06 01:46:11 INFO tez.TezLauncher: HadoopJobId: job_1522978297921_0003
609410 [Timer-1] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: status=RUNNING, progress=TotalTasks: 2 Succeeded: 0 Running: 0 Failed: 0 Killed: 0, diagnostics=, counters=null
18/04/06 01:46:11 INFO tez.TezJob: DAG Status: status=RUNNING, progress=TotalTasks: 2 Succeeded: 0 Running: 0 Failed: 0 Killed: 0, diagnostics=, counters=null
629410 [Timer-1] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: status=RUNNING, progress=TotalTasks: 2 Succeeded: 0 Running: 1 Failed: 0 Killed: 0, diagnostics=, counters=null
18/04/06 01:46:31 INFO tez.TezJob: DAG Status: status=RUNNING, progress=TotalTasks: 2 Succeeded: 0 Running: 1 Failed: 0 Killed: 0, diagnostics=, counters=null
646404 [pool-1-thread-1] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezSessionManager - Shutting down Tez session org.apache.tez.client.TezClient@3a371843
18/04/06 01:46:48 INFO tez.TezSessionManager: Shutting down Tez session org.apache.tez.client.TezClient@3a371843
2018-04-06 01:46:48 Shutting down Tez session , sessionName=PigLatin:DefaultJobName, applicationId=application_1522978297921_0003
18/04/06 01:46:48 INFO client.TezClient: Shutting down Tez Session, sessionName=PigLatin:DefaultJobName, applicationId=application_1522978297921_0003
如何修复此错误?转储命令适用于除roundto之外的前一行。什么是Tez客户端?
答案 0 :(得分:1)
我无法复制您的输出,因为我在尝试此行时会收到错误:
roundto = FOREACH sum_occ GENERATE sum_occ.ngram, ROUND_TO( sum_occ.ntry , 2 );
您无需使用dot operator来引用这些字段(例如sum_occ.ngram
),因为它们没有嵌套在元组或包中。尝试不带点运算符的上述行:
roundto = FOREACH sum_occ GENERATE ngram, ROUND_TO( ntry , 2 );
要回答第二个问题,MapReduce和Tez都是可用于运行Pig脚本的框架。 Tez有时可以加快Pig脚本运行所需的时间。您可以通过pig -x mapreduce
或pig -x tez
启动Pig shell来显式使用MapReduce或Tez。 MapReduce是默认值,因此如果您尚未指定Tez,则必须将您的Hadoop集群设置为在Tez中运行Pig。