In Hadoop, we have a list of Avro files stored under '/datasets/xyz/storm/information/':
-rw-r----- 3 storm XYZ 5570959 2015-10-01 01:46 /datasets/xyz/storm/information/storm_1443681972122.avro
-rw-r----- 3 storm XYZ 5571687 2015-10-01 01:46 /datasets/xyz/storm/information/storm_1443681973303.avro
-rw-r----- 3 storm XYZ 5632194 2015-10-01 01:46 /datasets/xyz/storm/information/storm_1443681975019.avro
What works:
a = LOAD '/datasets/xyz/storm/information/storm_1443681975019.avro' USING AvroStorage();
The Avro schema defined in each Avro file describes records of the following format:
{header: (metadata_uuid: chararray,publishDate: chararray,eventDate: chararray),raw_data: chararray}
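For reference, a Pig tuple schema like the one above maps onto an Avro record schema expressed in JSON. A minimal sketch of that mapping (the record names `DataRecord` and `Header` and the nullable unions are assumptions here, not read from the files):

```python
import json

# Hypothetical Avro JSON schema matching the Pig-style layout
# {header: (metadata_uuid, publishDate, eventDate), raw_data}.
# Making every field nullable (union with "null") is an assumption.
avro_schema = {
    "type": "record",
    "name": "DataRecord",  # assumed record name
    "fields": [
        {"name": "header", "type": ["null", {
            "type": "record",
            "name": "Header",  # assumed record name
            "fields": [
                {"name": "metadata_uuid", "type": ["null", "string"]},
                {"name": "publishDate", "type": ["null", "string"]},
                {"name": "eventDate", "type": ["null", "string"]},
            ],
        }]},
        {"name": "raw_data", "type": ["null", "string"]},
    ],
}

# Serialize to the single-line JSON string form that schema-taking
# loaders expect.
print(json.dumps(avro_schema))
```

Note the difference in notation: Pig uses `chararray` and tuple parentheses, while Avro JSON uses `string` and nested `record` definitions.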
I want to load the data from all the Avro files into the alias 'a' at once, so I tried the following:
a = LOAD '/datasets/xyz/storm/information/' USING AvroStorage();
The exception I got was:
ERROR 2245: Cannot get schema from loadFunc org.apache.pig.builtin.AvroStorage
I also tried supplying the schema explicitly, as follows:
a = LOAD '/datasets/xyz/storm/information/' USING AvroStorage('schema', '{"header": ("metadata_uuid": "chararray","publishDate": "chararray","eventDate": "chararray"),"raw_data": "chararray"}');
Can you tell me the right way to do this?
Thanks!
Answer 0 (score: 1)
The schema I had supplied was incorrect, and so was its format. I removed the 'schema' argument from the AvroStorage parameters and changed the script as follows:
a = LOAD '/datasets/xyz/storm/information/' USING AvroStorage('{"type" : "record","name" : "DataRecord","namespace" : "com.bestbuy.sim.appTalkProjects.adobe.adobeClickStreamBDPSA.util","doc" : "Schema for com.bestbuy.sim.appTalkProjects.adobe.adobeClickStreamBDPSA.util.DataRecord","fields" : [ {"name" : "header","type" : [ "null", {"type" : "record","name" : "Header","doc" : "Schema for com.bestbuy.sim.appTalkProjects.adobe.adobeClickStreamBDPSA.util.Header","fields" : [ {"name" : "metadata_uuid","type" : [ "null", "string" ]}, {"name" : "publishDate","type" : [ "null", "string" ]}, {"name" : "eventDate","type" : [ "null", "string" ]} ]} ]}, {"name" : "raw_data","type" : [ "null", "string" ]} ]}');
With this, the load succeeded.
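Since AvroStorage takes the schema as a raw JSON string, a quick sanity check before running the Pig script is to confirm that the string parses as JSON and has the expected top-level fields. A small check using only the Python standard library (the schema string is the one from the working script above):

```python
import json

# Schema string passed to AvroStorage() in the working script above.
schema_str = '{"type" : "record","name" : "DataRecord","namespace" : "com.bestbuy.sim.appTalkProjects.adobe.adobeClickStreamBDPSA.util","doc" : "Schema for com.bestbuy.sim.appTalkProjects.adobe.adobeClickStreamBDPSA.util.DataRecord","fields" : [ {"name" : "header","type" : [ "null", {"type" : "record","name" : "Header","doc" : "Schema for com.bestbuy.sim.appTalkProjects.adobe.adobeClickStreamBDPSA.util.Header","fields" : [ {"name" : "metadata_uuid","type" : [ "null", "string" ]}, {"name" : "publishDate","type" : [ "null", "string" ]}, {"name" : "eventDate","type" : [ "null", "string" ]} ]} ]}, {"name" : "raw_data","type" : [ "null", "string" ]} ]}'

# If this raises, the schema would also fail inside AvroStorage.
schema = json.loads(schema_str)

# List the top-level field names of the record.
field_names = [f["name"] for f in schema["fields"]]
print(field_names)  # → ['header', 'raw_data']
```

A malformed string (like the Pig-style tuple syntax tried earlier in the question) fails `json.loads` immediately, which is a much faster feedback loop than launching a Pig job.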