My data looks like this:
STN--- WBAN YEARMODA TEMP DEWP SLP STP VISIB WDSP MXSPD GUST MAX MIN PRCP SNDP FRSHTT
030050 99999 19291029 46.7 4 42.0 4 990.9 4 9999.9 0 10.9 4 13.0 4 13.0 999.9 46.9* 44.1 99.99 999.9 010000
030050 99999 19291030 43.5 4 33.5 4 1015.4 4 9999.9 0 12.4 4 14.3 4 18.1 999.9 46.9 42.1 0.00I 999.9 000000
030050 99999 19291031 43.7 4 37.3 4 1026.8 4 9999.9 0 12.4 4 4.5 4 8.9 999.9 46.9* 37.9 0.00I 999.9 000000
030050 99999 19291101 49.2 4 45.5 4 1019.9 4 9999.9 0 6.2 4 8.2 4 13.0 999.9 51.1* 46.0 99.99 999.9 010000
030050 99999 19291102 47.0 4 44.5 4 1013.6 4 9999.9 0 7.8 4 6.2 4 8.9 999.9 51.1 44.1 0.00I 999.9 000000
030050 99999 19291103 44.0 4 36.0 4 1009.2 4 9999.9 0 10.9 4 8.0 4 8.9 999.9 50.0 42.1 0.00I 999.9 000000
I want to get the average temperature for each month, in this case months 10 and 11.
First, I load the data with:
RAW_LOGS = LOAD 'data' as (line:chararray);
Then I use a regular expression to split each line into separate fields:
LOGS_BASE = FOREACH RAW_LOGS GENERATE
FLATTEN(
REGEX_EXTRACT_ALL(line, '^(\\d+)\\s+(\\d+)\\s+(\\d{4})(\\d{2})(\\d{2})\\s+(\\d+\\.\\d).*$')
)
as (
STN: int,
WBAN: int,
YEAR: int,
MONTH: int,
DAY: int,
TEMP: float
);
Next, I get rid of the top tuple, which came from the header line:
no_nulls = FILTER LOGS_BASE BY STN is not null;
Then I group the data by STN, WBAN, YEAR, and MONTH:
grouped = group no_nulls by STN..MONTH;
Finally, I try to generate an average and get an error:
C = FOREACH grouped GENERATE AVG(LOGS_BASE.TEMP);
ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1045:
<line 17, column 29> Could not infer the matching function for org.apache.pig.builtin.AVG as multiple or none of them fit. Please use an explicit cast.
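I think the problem may be in my regular expression, because it returns TEMP as a string even though I declared it as a float, but I could be wrong.
Edit: I changed C to: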
C = FOREACH grouped GENERATE AVG(no_nulls.TEMP);
Now I get this error:
HadoopVersion PigVersion UserId StartedAt FinishedAt Features
1.0.3 0.9.2-amzn hadoop 2013-04-20 19:55:25 2013-04-20 19:57:21 GROUP_BY,FILTER
Failed!
Failed Jobs:
JobId Alias Feature Message Outputs
job_201304201942_0001 C,LOGS_BASE,RAW_LOGS,grouped,no_nulls GROUP_BY,COMBINER Message: Job failed! Error - # of failed Map Tasks exceeded allowed limit. FailedCount: 1. LastFailedTask: task_201304201942_0001_m_000000 hdfs://10.254.106.85:9000/tmp/temp413183623/tmp1677272203,
The log has more information:
org.apache.pig.backend.executionengine.ExecException: ERROR 2106: Error while computing average in Initial
at org.apache.pig.builtin.FloatAvg$Initial.exec(FloatAvg.java:99)
at org.apache.pig.builtin.FloatAvg$Initial.exec(FloatAvg.java:75)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:216)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:253)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.getNext(PhysicalOperator.java:334)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:332)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:284)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:290)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange.getNext(POLocalRearrange.java:256)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.runPipeline(PigGenericMapBase.java:267)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:262)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:64)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:771)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:375)
at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1132)
at org.apache.hadoop.mapred.Child.main(Child.java:249)
Caused by: java.lang.ClassCastException: java.lang.String cannot be cast to java.lang.Float
at org.apache.pig.builtin.FloatAvg$Initial.exec(FloatAvg.java:86)
... 19 more
Pig Stack Trace
---------------
ERROR 2997: Unable to recreate exception from backed error: org.apache.pig.backend.executionengine.ExecException: ERROR 2106: Error while computing average in Initial
org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1066: Unable to open iterator for alias C. Backend error : Unable to recreate exception from backed error: org.apache.pig.backend.executionengine.ExecException: ERROR 2106: Error while computing average in Initial
at org.apache.pig.PigServer.openIterator(PigServer.java:890)
at org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:679)
at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:303)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:189)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:165)
at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:69)
at org.apache.pig.Main.run(Main.java:500)
at org.apache.pig.Main.main(Main.java:114)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:187)
Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 2997: Unable to recreate exception from backed error: org.apache.pig.backend.executionengine.ExecException: ERROR 2106: Error while computing average in Initial
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.Launcher.getErrorMessages(Launcher.java:221)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.Launcher.getStats(Launcher.java:151)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.launchPig(MapReduceLauncher.java:354)
at org.apache.pig.PigServer.launchPlan(PigServer.java:1313)
at org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:1298)
at org.apache.pig.PigServer.storeEx(PigServer.java:995)
at org.apache.pig.PigServer.store(PigServer.java:962)
at org.apache.pig.PigServer.openIterator(PigServer.java:875)
Answer 0 (score: 0)
My guess is that the error occurs because grouped does not contain LOGS_BASE; it contains no_nulls. Try
C = FOREACH grouped GENERATE AVG(no_nulls.TEMP);
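and see whether that fixes it.
If that does not work, try adding dump RAW_LOGS after the first line and commenting out everything else, and make sure the output looks right. Then uncomment the second line and run dump LOGS_BASE, and repeat for the remaining lines. It is always good to sanity-check each piece of a Pig script.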
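The step-by-step check described above can be sketched roughly as follows. This is only an illustration: the alias raw_sample and the LIMIT of 5 are assumptions added here to keep the dumped output small, and they are not part of the original script.

```pig
-- Step 1: load the raw lines and inspect a few of them.
RAW_LOGS = LOAD 'data' AS (line:chararray);
raw_sample = LIMIT RAW_LOGS 5;   -- hypothetical alias, keeps the dump short
DUMP raw_sample;

-- Step 2: once the raw lines look right, re-enable the LOGS_BASE statement
-- from the question and inspect it the same way:
-- DESCRIBE LOGS_BASE;           -- prints the declared schema
-- DUMP LOGS_BASE;               -- prints the parsed tuples

-- Step 3: repeat for no_nulls, grouped, and C, one alias at a time.
```

Working through one alias at a time narrows a failure down to the statement that introduced it, rather than to the final DUMP that triggered the job.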
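Answer 1 (score: -1)
It turned out that TEMP was being treated as a String rather than a Float. I applied the code used here and got it working. Even though I told Pig to treat the TEMP column as a float, it was still reading it as a chararray. The fix ended up being one line: putting (tuple(int,int,int,int,int,float)) in front of my REGEX_EXTRACT_ALL call. This is what the code looks like: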
LOGS_BASE = FOREACH RAW_LOGS GENERATE
FLATTEN(
(tuple(int,int,int,int,int,float))
REGEX_EXTRACT_ALL(line, '^(\\d+)\\s+(\\d+)\\s+(\\d{4})(\\d{2})(\\d{2})\\s+(-?\\d+\\.\\d).*$')
)
as (
STN: int,
WBAN: int,
YEAR: int,
MONTH: int,
DAY: int,
TEMP: float
);
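Putting the pieces of this thread together, the whole script might look like the sketch below. It combines the cast from this answer with the no_nulls alias suggested in the other answer. The alias monthly_avg and the field name avg_temp are hypothetical names chosen for illustration, and the script has only been reasoned through against the sample rows in the question, not run on the full data set.

```pig
-- Load each line of the file as a single string.
RAW_LOGS = LOAD 'data' AS (line:chararray);

-- Parse the fields; the explicit tuple cast converts the matched strings
-- into the numeric types declared in the AS clause.
LOGS_BASE = FOREACH RAW_LOGS GENERATE
    FLATTEN(
        (tuple(int,int,int,int,int,float))
        REGEX_EXTRACT_ALL(line, '^(\\d+)\\s+(\\d+)\\s+(\\d{4})(\\d{2})(\\d{2})\\s+(-?\\d+\\.\\d).*$')
    ) AS (STN:int, WBAN:int, YEAR:int, MONTH:int, DAY:int, TEMP:float);

-- Drop rows that did not parse, such as the header line.
no_nulls = FILTER LOGS_BASE BY STN IS NOT NULL;

-- One group per station and month.
grouped = GROUP no_nulls BY (STN, WBAN, YEAR, MONTH);

-- Average TEMP within each group; the bag is named after the grouped alias.
monthly_avg = FOREACH grouped GENERATE
    FLATTEN(group) AS (STN, WBAN, YEAR, MONTH),
    AVG(no_nulls.TEMP) AS avg_temp;

DUMP monthly_avg;
```

Note that the regular expression here also accepts an optional leading minus sign (-?), so days with negative temperatures are parsed instead of being filtered out as nulls.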