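I am trying to parse a fairly simple JSON file using Pig and Twitter's Elephant Bird library, but it has turned into a very painful debugging process.
The JSON has the following structure: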
oid_id: (oid:chararray),
bookmarks: {(
oid_id:(oid:chararray),
id:chararray,
creator: chararray,
position:chararray,
creationdate:(date:chararray)
)},
lastaction:(date:chararray),
settings:(preferredlanguage:chararray),
userid:chararray
Example of a row:
{"oid_id":{"oid":"573239f905474a686e2333f0"},"bookmarks":[{"id":"LEGONINX106W0079264","creator":"player","position":96,"creationdate":{"date":"2016-12-26T09:37:36.916Z"},"oid_id":{"oid":"5860e4e0ca6baf9032edc0d0"}},{"id":"ONEPERCENTMW0128677","creator":"player","position":0.08,"creationdate":{"date":"2018-12-18T15:42:33.956Z"},"oid_id":{"oid":"5c191569faf8474953758930"}}],"lastaction":{"date":"2018-12-18T15:42:28.107Z"},"settings":{"preferredlanguage":"vf","preferredvideoquality":"hd"},"userid":"ocs_32a6ad6dd242d5e3842f9211fd236723_1461773211"}
Here is my code (inspired by this tutorial: https://acadgild.com/blog/determining-popular-hashtags-in-twitter-using-pig)
register /path/to/json-simple-1.1.1.jar
register /path/to/elephant-bird-core-4.17.jar
register /path/to/elephant-bird-pig-4.17.jar
register /path/to/elephant-bird-hadoop-compat-4.17.jar
define JsonLoaderEB com.twitter.elephantbird.pig.load.JsonLoader;
A = LOAD 'file.json' USING JsonLoaderEB('-nestedLoad=true') as myMap;
describe A;
input_table:{ myMap:bytearray}
B = foreach A generate flatten(myMap#'bookmarks') as (bookmark:map[]);
describe B;
B: {bookmark: map[]}
When I dump the relation above, I can see that all the data has been loaded successfully:
([{"oid_id":{"oid":"5860e4e0ca6baf9032edc0d0"},"creator":"player","creationdate":{"date":"2016-12-26T09:37:36.916Z"},"id":"LEGONINX106W0079264","position":96},{"oid_id":{"oid":"5c191569faf8474953758930"},"creator":"player","creationdate":{"date":"2018-12-18T15:42:33.956Z"},"id":"ONEPERCENTMW0128677","position":0.08}])
Now I extract creationdate, creator, id and position from the bookmarks.
C = foreach B generate bookmark#'creationdate' as date_fact, bookmark#'creator' as creator, bookmark#'id' as id, bookmark#'position' as position;
C: {date_fact: bytearray, creator: bytearray, id: bytearray, position: bytearray}
Dumping this relation produces the following error:
ERROR 1066: Unable to open iterator for alias C. Backend error : Vertex failed, vertexName=scope-41, vertexId=vertex_1542613138136_672188_2_00, diagnostics=[Task failed, taskId=task_1542613138136_672188_2_00_000000, diagnostics=[TaskAttempt 0 failed, info=[Error: Error while running task ( failure ) : attempt_1542613138136_672188_2_00_000000_0:org.apache.pig.backend.executionengine.ExecException: ERROR 0: Exception while executing (Name: C: Store(hdfs://sandbox/tmp/temp-1543074195/tmp277240455:org.apache.pig.impl.io.InterStorage) - scope-40 Operator Key: scope-40): org.apache.pig.backend.executionengine.ExecException: ERROR 0: Exception while executing [POMapLookUp (Name: POMapLookUp[bytearray] - scope-28 Operator Key: scope-28) children: null at [null[4,31]]]: java.lang.ClassCastException: java.lang.String cannot be cast to java.util.Map
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:315)
    at org.apache.pig.backend.hadoop.executionengine.tez.plan.operator.POStoreTez.getNextTuple(POStoreTez.java:123)
    at org.apache.pig.backend.hadoop.executionengine.tez.runtime.PigProcessor.runPipeline(PigProcessor.java:376)
    at org.apache.pig.backend.hadoop.executionengine.tez.runtime.PigProcessor.run(PigProcessor.java:241)
    at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:370)
    at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:73)
    at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:61)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
    at org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:61)
    at org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:37)
    at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 0: Exception while executing [POMapLookUp (Name: POMapLookUp[bytearray] - scope-28 Operator Key: scope-28) children: null at [null[4,31]]]: java.lang.ClassCastException: java.lang.String cannot be cast to java.util.Map
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.getNext(PhysicalOperator.java:364)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:406)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNextTuple(POForEach.java:323)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:305)
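To make the intended result explicit, the extraction attempted in relation C can be sketched outside Pig. The Python snippet below is only an illustration, not part of the Pig job: it uses the sample row from this question reduced to the fields of interest, and the function name extract_bookmarks is made up for the example. It shows that bookmarks is a list of objects, so each entry has to be visited before its keys can be looked up.

```python
import json

# Sample row from the question, reduced to the fields extracted in relation C.
ROW = (
    '{"bookmarks":['
    '{"id":"LEGONINX106W0079264","creator":"player","position":96,'
    '"creationdate":{"date":"2016-12-26T09:37:36.916Z"}},'
    '{"id":"ONEPERCENTMW0128677","creator":"player","position":0.08,'
    '"creationdate":{"date":"2018-12-18T15:42:33.956Z"}}]}'
)


def extract_bookmarks(line):
    """Return one (date, creator, id, position) tuple per bookmark in a JSON line."""
    record = json.loads(line)
    rows = []
    # "bookmarks" is a JSON array, so iterate over it before reading keys.
    for bookmark in record.get("bookmarks", []):
        rows.append((
            bookmark["creationdate"]["date"],  # nested object, one level down
            bookmark["creator"],
            bookmark["id"],
            bookmark["position"],
        ))
    return rows


if __name__ == "__main__":
    for row in extract_bookmarks(ROW):
        print(row)
```

This is the shape of output I expect from C: one row per bookmark, with the date taken from the nested creationdate object.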
Answer 0 (score: 0)
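Your script gives correct results for me, even for the table_extraction relation, so the problem may come from your raw data.
Could you remove or correct the following object? It looks invalid: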
"oid":"5c191393faf8475cb76ee0d5"
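One way to look for such records is to scan the input outside Pig before loading it. The following Python sketch is an assumption-laden illustration, not something taken from the question: it assumes one JSON document per line, and the function name find_suspect_lines is hypothetical. It reports lines that fail to parse and lines where an entry in bookmarks is not a JSON object.

```python
import json


def find_suspect_lines(lines):
    """Yield (line_number, reason) for lines that are not valid JSON or whose
    'bookmarks' value is not a list of JSON objects."""
    for number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            yield number, f"invalid JSON: {exc.msg}"
            continue
        if not isinstance(record, dict):
            yield number, "top-level value is not an object"
            continue
        bookmarks = record.get("bookmarks", [])
        if not isinstance(bookmarks, list) or not all(
            isinstance(entry, dict) for entry in bookmarks
        ):
            yield number, "bookmarks is not a list of objects"


if __name__ == "__main__":
    # Small in-memory sample; for a real check, pass an open file handle instead.
    sample = [
        '{"bookmarks":[{"id":"a"}]}',
        '{"bookmarks":["oops"]}',
        '{"oid":"5c19"',
    ]
    for number, reason in find_suspect_lines(sample):
        print(number, reason)
```

To check a local copy of the data, the same function can be called with an open file, for example `find_suspect_lines(open('file.json'))`. A bookmark entry that turns out to be a string rather than an object would be consistent with the "java.lang.String cannot be cast to java.util.Map" message, although that is only a guess until the offending line is found.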