Create a Hive table from an existing folder of Avro files

Date: 2015-12-14 22:02:38

Tags: hadoop hive avro

I have a series of Avro folders under an HDFS directory: /gobblin. I hand-created an .avsc file based on the source structure as I know it.

How can I create a table in Hive over the Avro data already in HDFS, using the .avsc file I have?

Thanks.

Update #1: I wrote a CREATE TABLE script to build the Hive table:

CREATE EXTERNAL TABLE Claims(
     PlanID int,
     ClaimID int,
     ClaimAmount int,
     PhysicianID string,
     ClaimType string,
     CreateDate string,
     ModifyDate string)
ROW FORMAT
SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS
INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION 'hdfs://hostname.com/gobblin/job-output/Claims/Claims'
TBLPROPERTIES ('avro.schema.url'='hdfs://hostname.com/gobblin/claims.avsc');
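Since avro.schema.url is set, my understanding is that the AvroSerDe takes the column definitions from the referenced schema, so the explicit column list above is effectively ignored (it is also missing MemberID, which the schema does have). A minimal variant of the same DDL that relies only on the schema file would be (untested sketch):

CREATE EXTERNAL TABLE Claims
ROW FORMAT
SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS
INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION 'hdfs://hostname.com/gobblin/job-output/Claims/Claims'
TBLPROPERTIES ('avro.schema.url'='hdfs://hostname.com/gobblin/claims.avsc');

On Hive 0.14 and later, STORED AS AVRO can reportedly replace the SERDE/INPUTFORMAT/OUTPUTFORMAT lines.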

The DDL runs, but I can't query any data. Under /gobblin/job-output/Claims/Claims, each folder contains a series of serialized subfolders holding the Avro files. I want that data to show up in the table. How do I make this work?
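One thing I suspect is that Hive is not descending into the nested subdirectories under the table location. A commonly suggested workaround (a sketch I have not verified on this cluster) is to enable recursive input directories before querying:

SET mapred.input.dir.recursive=true;
SET hive.mapred.supports.subdirectories=true;
SELECT COUNT(*) FROM Claims;

Alternatively, if the leaf folders follow a fixed naming scheme, the table could be declared PARTITIONED BY and each folder added with ALTER TABLE ... ADD PARTITION, but I have not tried that either.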

Thanks.

Update #2: Here is my .avsc file:

{"namespace": "claim.avro",
 "type": "record",
 "name": "claim",
 "fields": [
     {"name": "MemberID", "type": "int"},
     {"name": "PlanID", "type": "int"},
     {"name": "ClaimID", "type": "int"},
     {"name": "ClaimAmount", "type": "int"},
     {"name": "PhysicianID", "type": ["string", "null"]},
     {"name": "ClaimType", "type": ["string", "null"]},
     {"name": "CreateDate", "type": ["string", "null"]},
     {"name": "ModifyDate", "type": ["string", "null"]}
 ]
}

And here is the stack trace:

Status: Failed
Vertex failed, vertexName=Map 1, vertexId=vertex_1449174649821_0008_3_00, diagnostics=[Task failed, taskId=task_1449174649821_0008_3_00_000001, diagnostics=[TaskAttempt 0 failed, info=[Error: Failure while running task:java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: java.io.IOException: org.apache.avro.AvroTypeException: Found Claims.Claims, expecting Claim.avro.Claim
        at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:171)
        at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:137)
        at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:344)
        at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:179)
        at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:171)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
        at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:171)
        at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:167)
        at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: java.io.IOException: org.apache.avro.AvroTypeException: Found Claims.Claims, expecting Claim.avro.Claim
        at org.apache.hadoop.hive.ql.exec.tez.MapRecordSource.pushRecord(MapRecordSource.java:71)
        at org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.run(MapRecordProcessor.java:310)
        at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:148)
        ... 14 more
Caused by: java.io.IOException: org.apache.avro.AvroTypeException: Found Claims.Claims, expecting Claim.avro.Claim
        at org.apache.hadoop.hive.io.HiveIOExceptionHandlerChain.handleRecordReaderNextException(HiveIOExceptionHandlerChain.java:121)
        at org.apache.hadoop.hive.io.HiveIOExceptionHandlerUtil.handleRecordReaderNextException(HiveIOExceptionHandlerUtil.java:77)
        at org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.doNext(HiveContextAwareRecordReader.java:355)
        at org.apache.hadoop.hive.ql.io.HiveRecordReader.doNext(HiveRecordReader.java:79)
        at org.apache.hadoop.hive.ql.io.HiveRecordReader.doNext(HiveRecordReader.java:33)
        at org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.next(HiveContextAwareRecordReader.java:116)
        at org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecordReader.next(TezGroupedSplitsInputFormat.java:141)
        at org.apache.tez.mapreduce.lib.MRReaderMapred.next(MRReaderMapred.java:113)
        at org.apache.hadoop.hive.ql.exec.tez.MapRecordSource.pushRecord(MapRecordSource.java:61)
        ... 16 more
Caused by: org.apache.avro.AvroTypeException: Found Claims.Claims, expecting Claim.avro.Claim
        at org.apache.avro.io.ResolvingDecoder.doAction(ResolvingDecoder.java:231)
        at org.apache.avro.io.parsing.Parser.advance(Parser.java:88)
        at org.apache.avro.io.ResolvingDecoder.readFieldOrder(ResolvingDecoder.java:127)
        at org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:176)
        at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:151)
        at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:142)
        at org.apache.avro.file.DataFileStream.next(DataFileStream.java:233)
        at org.apache.avro.file.DataFileStream.next(DataFileStream.java:220)
        at org.apache.hadoop.hive.ql.io.avro.AvroGenericRecordReader.next(AvroGenericRecordReader.java:153)
        at org.apache.hadoop.hive.ql.io.avro.AvroGenericRecordReader.next(AvroGenericRecordReader.java:52)
        at org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.doNext(HiveContextAwareRecordReader.java:350)
        ... 22 more
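From the exception, the data files were written with a record whose full name is Claims.Claims, while the reader schema I supplied resolves to Claim.avro.Claim. Avro schema resolution requires the reader's record name (or one of its aliases) to match the writer's, so the hand-written .avsc probably needs its name and namespace changed to match, along these lines (untested sketch; the field list is unchanged):

{"namespace": "Claims",
 "type": "record",
 "name": "Claims",
 "fields": [
     {"name": "MemberID", "type": "int"},
     {"name": "PlanID", "type": "int"},
     {"name": "ClaimID", "type": "int"},
     {"name": "ClaimAmount", "type": "int"},
     {"name": "PhysicianID", "type": ["string", "null"]},
     {"name": "ClaimType", "type": ["string", "null"]},
     {"name": "CreateDate", "type": ["string", "null"]},
     {"name": "ModifyDate", "type": ["string", "null"]}
 ]
}

Alternatively, if avro-tools is available, the writer's exact schema can be extracted from one of the .avro files with its getschema command and referenced by avro.schema.url instead of the hand-written file.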
