Pig GROUP BY and AVG function

Time: 2013-04-20 01:10:16

Tags: hadoop amazon-web-services apache-pig elastic-map-reduce

My data looks like this:

STN--- WBAN   YEARMODA    TEMP       DEWP      SLP        STP       VISIB      WDSP     MXSPD   GUST    MAX     MIN   PRCP   SNDP   FRSHTT
030050 99999  19291029    46.7  4    42.0  4   990.9  4  9999.9  0   10.9  4   13.0  4   13.0  999.9    46.9*   44.1  99.99  999.9  010000
030050 99999  19291030    43.5  4    33.5  4  1015.4  4  9999.9  0   12.4  4   14.3  4   18.1  999.9    46.9    42.1   0.00I 999.9  000000
030050 99999  19291031    43.7  4    37.3  4  1026.8  4  9999.9  0   12.4  4    4.5  4    8.9  999.9    46.9*   37.9   0.00I 999.9  000000
030050 99999  19291101    49.2  4    45.5  4  1019.9  4  9999.9  0    6.2  4    8.2  4   13.0  999.9    51.1*   46.0  99.99  999.9  010000
030050 99999  19291102    47.0  4    44.5  4  1013.6  4  9999.9  0    7.8  4    6.2  4    8.9  999.9    51.1    44.1   0.00I 999.9  000000
030050 99999  19291103    44.0  4    36.0  4  1009.2  4  9999.9  0   10.9  4    8.0  4    8.9  999.9    50.0    42.1   0.00I 999.9  000000

I want to get the average temperature for each month, in this case months 10 and 11.

First, I load the data using:

RAW_LOGS = LOAD 'data' as (line:chararray);

Then I use a regular expression to split each line into separate fields:

LOGS_BASE = FOREACH RAW_LOGS GENERATE 
    FLATTEN( 
       REGEX_EXTRACT_ALL(line, '^(\\d+)\\s+(\\d+)\\s+(\\d{4})(\\d{2})(\\d{2})\\s+(\\d+\\.\\d).*$')  
    ) 
    as (
      STN: int, 
      WBAN: int, 
      YEAR: int, 
      MONTH: int,
      DAY: int,
      TEMP: float
  );

Next, I get rid of the top tuple, which held the header line and therefore came back as nulls:

no_nulls = FILTER LOGS_BASE BY STN is not null;

Then I group the data by STN, WBAN, YEAR, and MONTH:

grouped = group no_nulls by STN..MONTH;

Finally, I try to generate an average and get an error:

C = FOREACH grouped GENERATE AVG(LOGS_BASE.TEMP);

ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1045:
<line 17, column 29> Could not infer the matching function for org.apache.pig.builtin.AVG as    multiple or none of them fit. Please use an explicit cast.

I think the error may be in my regex, because it returns TEMP as a string even though I declared it as a float, but I could be wrong.

Edit: I changed C to:

C = FOREACH grouped GENERATE AVG(no_nulls.TEMP);

Now I get this error:

HadoopVersion   PigVersion      UserId  StartedAt       FinishedAt      Features
1.0.3   0.9.2-amzn      hadoop  2013-04-20 19:55:25     2013-04-20 19:57:21     GROUP_BY,FILTER

Failed!

Failed Jobs:
JobId   Alias   Feature Message Outputs
job_201304201942_0001   C,LOGS_BASE,RAW_LOGS,grouped,no_nulls   GROUP_BY,COMBINER       Message: Job failed! Error - # of failed Map Tasks exceeded allowed limit. FailedCount: 1. LastFailedTask: task_201304201942_0001_m_000000 hdfs://10.254.106.85:9000/tmp/temp413183623/tmp1677272203,

The log has more information:

org.apache.pig.backend.executionengine.ExecException: ERROR 2106: Error while computing average in Initial
    at org.apache.pig.builtin.FloatAvg$Initial.exec(FloatAvg.java:99)
    at org.apache.pig.builtin.FloatAvg$Initial.exec(FloatAvg.java:75)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:216)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:253)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.getNext(PhysicalOperator.java:334)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:332)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:284)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:290)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange.getNext(POLocalRearrange.java:256)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.runPipeline(PigGenericMapBase.java:267)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:262)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:64)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:771)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:375)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1132)
    at org.apache.hadoop.mapred.Child.main(Child.java:249)
Caused by: java.lang.ClassCastException: java.lang.String cannot be cast to java.lang.Float
    at org.apache.pig.builtin.FloatAvg$Initial.exec(FloatAvg.java:86)
    ... 19 more

Pig Stack Trace
---------------
ERROR 2997: Unable to recreate exception from backed error: org.apache.pig.backend.executionengine.ExecException: ERROR 2106: Error while computing average in Initial

org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1066: Unable to open iterator for alias C. Backend error : Unable to recreate exception from backed error: org.apache.pig.backend.executionengine.ExecException: ERROR 2106: Error while computing average in Initial
    at org.apache.pig.PigServer.openIterator(PigServer.java:890)
    at org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:679)
    at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:303)
    at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:189)
    at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:165)
    at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:69)
    at org.apache.pig.Main.run(Main.java:500)
    at org.apache.pig.Main.main(Main.java:114)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:187)
Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 2997: Unable to recreate exception from backed error: org.apache.pig.backend.executionengine.ExecException: ERROR 2106: Error while computing average in Initial
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.Launcher.getErrorMessages(Launcher.java:221)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.Launcher.getStats(Launcher.java:151)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.launchPig(MapReduceLauncher.java:354)
    at org.apache.pig.PigServer.launchPlan(PigServer.java:1313)
    at org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:1298)
    at org.apache.pig.PigServer.storeEx(PigServer.java:995)
    at org.apache.pig.PigServer.store(PigServer.java:962)
    at org.apache.pig.PigServer.openIterator(PigServer.java:875)

2 answers:

Answer 0: (score: 0)

My guess is that this happens because grouped does not contain LOGS_BASE; it contains no_nulls. Try

C = FOREACH grouped GENERATE AVG(no_nulls.TEMP);

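and see if that fixes it.

If that doesn't work, try adding dump RAW_LOGS after the first line and commenting out everything else. Make sure the output looks right, then uncomment the second line, run dump LOGS_BASE, and repeat for the rest of the lines. It is always good to sanity-check each piece of a Pig script.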
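
The step-by-step check described above could look roughly like the sketch below. It reuses the statements from the question; the relation names raw_sample and base_sample and the LIMIT of 5 rows are made up for this illustration. One caveat worth keeping in mind: DESCRIBE prints the declared schema, so it may not reveal a runtime type mismatch by itself.

```pig
-- Step 1: load the raw lines and look at a handful of them.
RAW_LOGS = LOAD 'data' AS (line:chararray);
raw_sample = LIMIT RAW_LOGS 5;     -- raw_sample: hypothetical name for this sketch
DUMP raw_sample;

-- Step 2: once step 1 looks right, add the extraction and inspect it.
LOGS_BASE = FOREACH RAW_LOGS GENERATE
    FLATTEN(
        REGEX_EXTRACT_ALL(line, '^(\\d+)\\s+(\\d+)\\s+(\\d{4})(\\d{2})(\\d{2})\\s+(\\d+\\.\\d).*$')
    ) AS (STN:int, WBAN:int, YEAR:int, MONTH:int, DAY:int, TEMP:float);
DESCRIBE LOGS_BASE;                -- shows the schema as declared in the AS clause
base_sample = LIMIT LOGS_BASE 5;   -- base_sample: hypothetical name for this sketch
DUMP base_sample;

-- Step 3: repeat for each later relation.
no_nulls = FILTER LOGS_BASE BY STN IS NOT NULL;
grouped = GROUP no_nulls BY (STN, WBAN, YEAR, MONTH);
DESCRIBE grouped;                  -- the bag field is named after the grouped relation (no_nulls)
```

Sampling with LIMIT before each DUMP keeps the output short on a large input. The DESCRIBE on grouped is a quick way to confirm which bag name an aggregate such as AVG has to reference.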

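Answer 1: (score: -1)

It turns out TEMP was being treated as a String rather than a Float. I applied the code used here and got it working. Even though I told Pig to treat the TEMP column as a float, it was still reading it as a chararray. This ended up being a one-line fix: putting (tuple(int,int,int,int,int,float)) in front of my REGEX_EXTRACT_ALL call. The pattern below also adds -? in front of the temperature digits, so negative temperatures match as well. Here is what the code looks like: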

LOGS_BASE = FOREACH RAW_LOGS GENERATE 
    FLATTEN( 
        (tuple(int,int,int,int,int,float))
       REGEX_EXTRACT_ALL(line, '^(\\d+)\\s+(\\d+)\\s+(\\d{4})(\\d{2})(\\d{2})\\s+(-?\\d+\\.\\d).*$')  
    ) 
    as (
      STN: int, 
      WBAN: int, 
      YEAR: int, 
      MONTH: int,
      DAY: int,
      TEMP: float
  );
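
For completeness, here is a sketch of how the whole script might look with this fix applied, combining the statements from the question with the cast from this answer. The relation name monthly_avg and the field name AVG_TEMP are made up for this illustration, and flattening the group key into the output is an assumption about the desired result, not something the original script did.

```pig
-- Load each record as a single line of text.
RAW_LOGS = LOAD 'data' AS (line:chararray);

-- Extract the fields and cast the resulting tuple explicitly,
-- so that TEMP reaches AVG as a float instead of a chararray.
LOGS_BASE = FOREACH RAW_LOGS GENERATE
    FLATTEN(
        (tuple(int,int,int,int,int,float))
        REGEX_EXTRACT_ALL(line, '^(\\d+)\\s+(\\d+)\\s+(\\d{4})(\\d{2})(\\d{2})\\s+(-?\\d+\\.\\d).*$')
    ) AS (STN:int, WBAN:int, YEAR:int, MONTH:int, DAY:int, TEMP:float);

-- Drop rows where the pattern did not match (for example, the header line).
no_nulls = FILTER LOGS_BASE BY STN IS NOT NULL;

-- Group by station and month, then average the temperature in each group.
grouped = GROUP no_nulls BY (STN, WBAN, YEAR, MONTH);
monthly_avg = FOREACH grouped GENERATE
    FLATTEN(group) AS (STN, WBAN, YEAR, MONTH),   -- keep the key next to the average
    AVG(no_nulls.TEMP) AS AVG_TEMP;               -- AVG_TEMP: hypothetical field name

DUMP monthly_avg;
```

Emitting the group key alongside the average makes each output row identify its station and month. The statement C in the question produced only the bare averages.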