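I have a course assignment to count the number of entries that match a specific condition.
The dataset for the problem has the following schema.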
data1 = LOAD '/answers.csv' USING PigStorage(',') AS (qid:long,qt:long,tag:chararray,at:long);
qid = question ID, qt = question start time (in epoch time), at = answer end time (in epoch time).
Sample dataset:
sn qid qt tag at
1 563355 1235000081 php,error,gd,image-processing 1235000501
2 563355 1235000081 php,error,gd,image-processing 1235000551
3 563356 1235000140 lisp,scheme,subjective,clojure 1235000177
4 563356 1235000140 lisp,scheme,subjective,clojure 1235001545
5 563356 1235000140 lisp,scheme,subjective,clojure 1235002457
6 563356 1235000140 lisp,scheme,subjective,clojure 1235002809
7 563356 1235000140 lisp,scheme,subjective,clojure 1235003266
8 563356 1235000140 lisp,scheme,subjective,clojure 1235007817
9 563356 1235000140 lisp,scheme,subjective,clojure 1235007913
10 563356 1235000140 lisp,scheme,subjective,clojure 1235020626
11 563356 1235000140 lisp,scheme,subjective,clojure 1235040652
I need to find the number of answers given within 1 hour.
Approach: Pig version 0.15.0
Find the difference between qt and at in hours:
A = FOREACH data1 GENERATE HoursBetween(ToDate(qt),ToDate(at)) AS diffhours;
B = FOREACH (FILTER A BY diffhours < 1) GENERATE diffhours;
C = GROUP B ALL;
D = FOREACH C GENERATE COUNT(B.diffhours) ;
But when I dump D, the job fails with the following messages:
2016-04-06 01:13:17,736 [LocalJobRunner Map Task Executor #0] WARN org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigHadoopLogger - org.apache.pig.builtin.Utf8StorageConverter(FIELD_DISCARDED_TYPE_CONVERSION_FAILED): Unable to interpret value [112, 114, 111, 103, 114, 97, 109, 109, 105, 110, 103] in field being converted to int, caught NumberFormatException <For input string: "programming"> field discarded
2016-04-06 01:13:17,736 [LocalJobRunner Map Task Executor #0] WARN org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigHadoopLogger - org.apache.pig.builtin.Utf8StorageConverter(FIELD_DISCARDED_TYPE_CONVERSION_FAILED): Unable to interpret value [115, 117, 98, 106, 101, 99, 116, 105, 118, 101, 34] in field being converted to int, caught NumberFormatException <For input string: "subjective""> field discarded
And finally I get these:
Pig Stack Trace
---------------
ERROR 1200: <line 6, column 0> Syntax error, unexpected symbol at or near 'D'
Failed to parse: <line 6, column 0> Syntax error, unexpected symbol at or near 'D'
at org.apache.pig.parser.QueryParserDriver.parse(QueryParserDriver.java:244)
at org.apache.pig.parser.QueryParserDriver.parse(QueryParserDriver.java:182)
at org.apache.pig.PigServer$Graph.validateQuery(PigServer.java:1707)
at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:1680)
at org.apache.pig.PigServer.registerQuery(PigServer.java:623)
at org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:1082)
at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:505)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:230)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:205)
at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:66)
at org.apache.pig.Main.run(Main.java:565)
at org.apache.pig.Main.main(Main.java:177)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
================================================================================
Pig Stack Trace
---------------
ERROR 1066: Unable to open iterator for alias D
org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1066: Unable to open iterator for alias D
at org.apache.pig.PigServer.openIterator(PigServer.java:935)
at org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:754)
at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:376)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:230)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:205)
at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:66)
at org.apache.pig.Main.run(Main.java:565)
at org.apache.pig.Main.main(Main.java:177)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
Caused by: java.io.IOException: Job terminated with anomalous status FAILED
at org.apache.pig.PigServer.openIterator(PigServer.java:927)
... 13 more
I can't understand what the problem is.
Answer 0: (score: 2)
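The embedded commas in the tag field are causing all the problems. Because you defined only four fields in the schema, Pig cannot read the data with the schema you defined.
PigStorage is a very simple loader that does not handle special cases such as embedded delimiters or escaped control characters; it splits on every instance of the delimiter, regardless of context.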
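The effect of splitting on every comma can be reproduced outside Pig. The following is only an illustrative Python sketch, not Pig's actual loader code; the raw line format (a quoted tag field containing commas) is an assumption inferred from the trailing quote in the warning about `subjective"`, and the function names are made up for this example.

```python
import csv

# Assumed raw CSV line: the tag field is quoted and contains commas.
RAW_LINE = '563355,1235000081,"php,error,gd,image-processing",1235000501'


def naive_split(line):
    """Split on every comma, ignoring quotes (a plain delimiter split)."""
    return line.split(",")


def quote_aware_split(line):
    """Split with a CSV parser that keeps quoted commas inside one field."""
    return next(csv.reader([line]))


# A four-field schema only looks at the first four pieces of the naive split,
# so a tag word lands in the position meant for the numeric `at` field.
print(naive_split(RAW_LINE)[:4])
# ['563355', '1235000081', '"php', 'error']

print(quote_aware_split(RAW_LINE))
# ['563355', '1235000081', 'php,error,gd,image-processing', '1235000501']
```

With the naive split, the fourth position holds a tag word instead of a timestamp, which is consistent with the type-conversion warnings in the question.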
Use Piggybank's CSVExcelStorage() to handle the embedded commas in the field.
REGISTER /usr/lib/pig/piggybank.jar;
DEFINE CSVExcelStorage org.apache.pig.piggybank.storage.CSVExcelStorage;
data1 = LOAD '/answers.csv' USING CSVExcelStorage() AS (qid:long,qt:long,tag:chararray,at:long);
Then run the rest of your script. This should give you the expected result.
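To have a number to compare the Pig output against, the count for the sample rows in the question can be worked out by hand or with a small script. This is a rough Python cross-check, not part of the Pig solution; it assumes the file quotes the tag field, that qt and at are epoch seconds, and that "within 1 hour" means a difference under 3600 seconds. The name count_within is hypothetical.

```python
import csv
import io

# The 11 sample rows from the question, written as assumed raw CSV
# (qid, qt, quoted tag list, at).
SAMPLE = """563355,1235000081,"php,error,gd,image-processing",1235000501
563355,1235000081,"php,error,gd,image-processing",1235000551
563356,1235000140,"lisp,scheme,subjective,clojure",1235000177
563356,1235000140,"lisp,scheme,subjective,clojure",1235001545
563356,1235000140,"lisp,scheme,subjective,clojure",1235002457
563356,1235000140,"lisp,scheme,subjective,clojure",1235002809
563356,1235000140,"lisp,scheme,subjective,clojure",1235003266
563356,1235000140,"lisp,scheme,subjective,clojure",1235007817
563356,1235000140,"lisp,scheme,subjective,clojure",1235007913
563356,1235000140,"lisp,scheme,subjective,clojure",1235020626
563356,1235000140,"lisp,scheme,subjective,clojure",1235040652
"""


def count_within(csv_text, limit_seconds=3600):
    """Count rows whose answer time is less than limit_seconds after the question time."""
    count = 0
    for qid, qt, tag, at in csv.reader(io.StringIO(csv_text)):
        if int(at) - int(qt) < limit_seconds:
            count += 1
    return count


print(count_within(SAMPLE))  # 7
```

If the Pig result differs from this, one thing worth checking is units: ToDate with a long argument expects milliseconds, while the sample values look like epoch seconds.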
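Answer 1: (score: 0)
The GROUP operator groups together tuples that have the same group key (key field). The COUNT function computes the number of elements in a bag. COUNT requires a preceding GROUP ALL statement for global counts and a GROUP BY statement for group counts.
In your case, you are calling COUNT on B, where the data is filtered rather than grouped. You need to call COUNT on the variable in which the data is grouped.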
D = FOREACH C GENERATE COUNT(B.diffhours);