Sqoop import as Avro data files returns NULL for all values when an external Avro table is created in Hive

Time: 2015-10-23 19:35:35

Tags: hadoop hive oozie sqoop avro

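I am trying to import Oracle DB data into HDFS with a Sqoop free-form query import that joins two tables, using '--as-avrodatafile' and scheduling the job through Oozie. Here is the content of my workflow.xml: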

<?xml version="1.0" encoding="UTF-8"?>
<workflow-app xmlns="uri:oozie:workflow:0.2" name="sqoop-freeform-wf">
    <start to="sqoop-freeform-node"/>

    <action name="sqoop-freeform-node">
        <sqoop xmlns="uri:oozie:sqoop-action:0.2">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <prepare>
                <delete path="/apps/hive/warehouse/loc_avro_import"/>
            </prepare>
            <arg>import</arg>
            <arg>--connect</arg>
            <arg>jdbc:oracle:thin:@connection-string:1521:ORCL</arg>
            <arg>--username</arg>
            <arg>comcast</arg>
            <arg>--password</arg>
            <arg>comcast123</arg>
            <arg>--query</arg>
            <arg>select location.location_id, location.street1,location_meta.display_name from location join location_meta on location.location_id=location_meta.location_id WHERE $CONDITIONS</arg>
            <arg>--target-dir</arg>
            <arg>/apps/hive/warehouse/loc_avro_import</arg>
            <arg>--split-by</arg>
            <arg>location.location_id</arg>
            <arg>--as-avrodatafile</arg>
            <arg>-m</arg>
            <arg>1</arg>
        </sqoop>
        <ok to="end"/>
        <error to="fail"/>
    </action>

    <kill name="fail">
        <message>Sqoop free form failed</message>
    </kill>
    <end name="end"/>
</workflow-app>

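The Oozie job runs successfully and creates the Avro files, along with the _SUCCESS flag, on HDFS under the directory /apps/hive/warehouse/loc_avro_import. I then create an external table over this path with the following Hive script: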

CREATE external TABLE avro_location(LOCATION_ID string, STREET1 string, DISPLAY_NAME string)
  ROW FORMAT SERDE
  'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
  STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
  OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
  location
  '/apps/hive/warehouse/loc_avro_import';

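The table is also created successfully, but when I retrieve the records from the Hive shell, it returns the same number of rows as the free-form query does when run in Oracle, yet every value in every row is NULL. I also tried looking at Hive's INFO log, using the following command: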

hive --hiveconf hive.root.logger=INFO,console

Here is the output I get:

hive> select * from avro_location;
15/10/23 15:12:02 [main]: INFO log.PerfLogger: <PERFLOG method=Driver.run from=org.apache.hadoop.hive.ql.Driver>
15/10/23 15:12:02 [main]: INFO log.PerfLogger: <PERFLOG method=TimeToSubmit from=org.apache.hadoop.hive.ql.Driver>
15/10/23 15:12:02 [main]: INFO log.PerfLogger: <PERFLOG method=compile from=org.apache.hadoop.hive.ql.Driver>
15/10/23 15:12:02 [main]: INFO log.PerfLogger: <PERFLOG method=parse from=org.apache.hadoop.hive.ql.Driver>
15/10/23 15:12:02 [main]: INFO parse.ParseDriver: Parsing command: select * from avro_location
15/10/23 15:12:02 [main]: INFO parse.ParseDriver: Parse Completed
15/10/23 15:12:02 [main]: INFO log.PerfLogger: </PERFLOG method=parse start=1445627522004 end=1445627522004 duration=0 from=org.apache.hadoop.hive.ql.Driver>
15/10/23 15:12:02 [main]: INFO log.PerfLogger: <PERFLOG method=semanticAnalyze from=org.apache.hadoop.hive.ql.Driver>
15/10/23 15:12:02 [main]: INFO parse.CalcitePlanner: Starting Semantic Analysis
15/10/23 15:12:02 [main]: INFO parse.CalcitePlanner: Completed phase 1 of Semantic Analysis
15/10/23 15:12:02 [main]: INFO parse.CalcitePlanner: Get metadata for source tables
15/10/23 15:12:02 [main]: INFO parse.CalcitePlanner: Get metadata for subqueries
15/10/23 15:12:02 [main]: INFO parse.CalcitePlanner: Get metadata for destination tables
15/10/23 15:12:02 [main]: INFO ql.Context: New scratch dir is hdfs://sandbox.hortonworks.com:8020/tmp/hive/root/061a4722-0a70-4c28-8b5c-1bf82b63d09f/hive_2015-10-23_15-12-02_004_2341151357389322335-1
15/10/23 15:12:02 [main]: INFO parse.CalcitePlanner: Completed getting MetaData in Semantic Analysis
15/10/23 15:12:02 [main]: INFO parse.BaseSemanticAnalyzer: Not invoking CBO because the statement has too few joins
15/10/23 15:12:02 [main]: INFO avro.AvroSerDe: columnComments is 
15/10/23 15:12:02 [main]: INFO avro.AvroSerDe: Avro schema is {"type":"record","name":"avro_location","namespace":"default","fields":[{"name":"location_id","type":["null","string"],"default":null},{"name":"street1","type":["null","string"],"default":null},{"name":"display_name","type":["null","string"],"default":null}]}
15/10/23 15:12:02 [main]: INFO common.FileUtils: Creating directory if it doesn't exist: hdfs://sandbox.hortonworks.com:8020/tmp/hive/root/061a4722-0a70-4c28-8b5c-1bf82b63d09f/hive_2015-10-23_15-12-02_004_2341151357389322335-1/-mr-10000/.hive-staging_hive_2015-10-23_15-12-02_004_2341151357389322335-1
15/10/23 15:12:02 [main]: INFO parse.CalcitePlanner: Set stats collection dir : hdfs://sandbox.hortonworks.com:8020/tmp/hive/root/061a4722-0a70-4c28-8b5c-1bf82b63d09f/hive_2015-10-23_15-12-02_004_2341151357389322335-1/-mr-10000/.hive-staging_hive_2015-10-23_15-12-02_004_2341151357389322335-1/-ext-10002
15/10/23 15:12:02 [main]: INFO ppd.OpProcFactory: Processing for FS(2)
15/10/23 15:12:02 [main]: INFO ppd.OpProcFactory: Processing for SEL(1)
15/10/23 15:12:02 [main]: INFO ppd.OpProcFactory: Processing for TS(0)
15/10/23 15:12:02 [main]: INFO parse.CalcitePlanner: Completed plan generation
15/10/23 15:12:02 [main]: INFO ql.Driver: Semantic Analysis Completed
15/10/23 15:12:02 [main]: INFO log.PerfLogger: </PERFLOG method=semanticAnalyze start=1445627522005 end=1445627522040 duration=35 from=org.apache.hadoop.hive.ql.Driver>
15/10/23 15:12:02 [main]: INFO avro.AvroSerDe: columnComments is 
15/10/23 15:12:02 [main]: INFO avro.AvroSerDe: Avro schema is {"type":"record","name":"avro_location","namespace":"default","fields":[{"name":"location_id","type":["null","string"],"default":null},{"name":"street1","type":["null","string"],"default":null},{"name":"display_name","type":["null","string"],"default":null}]}
15/10/23 15:12:02 [main]: INFO exec.TableScanOperator: Initializing operator TS[0]
15/10/23 15:12:02 [main]: INFO exec.TableScanOperator: Initialization Done 0 TS done is reset.
15/10/23 15:12:02 [main]: INFO exec.TableScanOperator: Operator 0 TS initialized
15/10/23 15:12:02 [main]: INFO exec.TableScanOperator: Initializing children of 0 TS
15/10/23 15:12:02 [main]: INFO exec.SelectOperator: Initializing child 1 SEL
15/10/23 15:12:02 [main]: INFO exec.SelectOperator: Initializing operator SEL[1]
15/10/23 15:12:02 [main]: INFO exec.SelectOperator: SELECT struct<location_id:string,street1:string,display_name:string>
15/10/23 15:12:02 [main]: INFO exec.SelectOperator: Initialization Done 1 SEL done is reset.
15/10/23 15:12:02 [main]: INFO exec.SelectOperator: Operator 1 SEL initialized
15/10/23 15:12:02 [main]: INFO exec.SelectOperator: Initializing children of 1 SEL
15/10/23 15:12:02 [main]: INFO exec.ListSinkOperator: Initializing child 3 OP
15/10/23 15:12:02 [main]: INFO exec.ListSinkOperator: Initializing operator OP[3]
15/10/23 15:12:02 [main]: INFO exec.ListSinkOperator: Initialization Done 3 OP done is reset.
15/10/23 15:12:02 [main]: INFO exec.ListSinkOperator: Operator 3 OP initialized
15/10/23 15:12:02 [main]: INFO ql.Driver: Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:avro_location.location_id, type:string, comment:null), FieldSchema(name:avro_location.street1, type:string, comment:null), FieldSchema(name:avro_location.display_name, type:string, comment:null)], properties:null)
15/10/23 15:12:02 [main]: INFO log.PerfLogger: </PERFLOG method=compile start=1445627522003 end=1445627522041 duration=38 from=org.apache.hadoop.hive.ql.Driver>
15/10/23 15:12:02 [main]: INFO log.PerfLogger: <PERFLOG method=acquireReadWriteLocks from=org.apache.hadoop.hive.ql.Driver>
15/10/23 15:12:02 [main]: INFO lockmgr.DbTxnManager: Setting lock request transaction to txnid:0 for queryId=root_20151023151202_5e68efe1-1176-485b-9014-301c99198012
15/10/23 15:12:02 [main]: INFO lockmgr.DbLockManager: Requesting: queryId=root_20151023151202_5e68efe1-1176-485b-9014-301c99198012 LockRequest(component:[LockComponent(type:SHARED_READ, level:TABLE, dbname:default, tablename:avro_location)], txnid:0, user:root, hostname:ip-sandbox.hortonworks.com)
15/10/23 15:12:02 [main]: INFO lockmgr.DbLockManager: Response to queryId=root_20151023151202_5e68efe1-1176-485b-9014-301c99198012 LockResponse(lockid:78, state:ACQUIRED)
15/10/23 15:12:02 [main]: INFO log.PerfLogger: </PERFLOG method=acquireReadWriteLocks start=1445627522041 end=1445627522050 duration=9 from=org.apache.hadoop.hive.ql.Driver>
15/10/23 15:12:02 [main]: INFO log.PerfLogger: <PERFLOG method=Driver.execute from=org.apache.hadoop.hive.ql.Driver>
15/10/23 15:12:02 [main]: INFO ql.Driver: Starting command(queryId=root_20151023151202_5e68efe1-1176-485b-9014-301c99198012): select * from avro_location
15/10/23 15:12:02 [main]: INFO hooks.ATSHook: Created ATS Hook
15/10/23 15:12:02 [main]: INFO log.PerfLogger: <PERFLOG method=PreHook.org.apache.hadoop.hive.ql.hooks.ATSHook from=org.apache.hadoop.hive.ql.Driver>
15/10/23 15:12:02 [main]: INFO log.PerfLogger: </PERFLOG method=PreHook.org.apache.hadoop.hive.ql.hooks.ATSHook start=1445627522050 end=1445627522050 duration=0 from=org.apache.hadoop.hive.ql.Driver>
15/10/23 15:12:02 [main]: INFO log.PerfLogger: </PERFLOG method=TimeToSubmit start=1445627522003 end=1445627522050 duration=47 from=org.apache.hadoop.hive.ql.Driver>
15/10/23 15:12:02 [main]: INFO log.PerfLogger: <PERFLOG method=runTasks from=org.apache.hadoop.hive.ql.Driver>
15/10/23 15:12:02 [main]: INFO log.PerfLogger: </PERFLOG method=runTasks start=1445627522051 end=1445627522051 duration=0 from=org.apache.hadoop.hive.ql.Driver>
15/10/23 15:12:02 [main]: INFO hooks.ATSHook: Created ATS Hook
15/10/23 15:12:02 [main]: INFO log.PerfLogger: <PERFLOG method=PostHook.org.apache.hadoop.hive.ql.hooks.ATSHook from=org.apache.hadoop.hive.ql.Driver>
15/10/23 15:12:02 [main]: INFO log.PerfLogger: </PERFLOG method=PostHook.org.apache.hadoop.hive.ql.hooks.ATSHook start=1445627522051 end=1445627522051 duration=0 from=org.apache.hadoop.hive.ql.Driver>
15/10/23 15:12:02 [main]: INFO log.PerfLogger: </PERFLOG method=Driver.execute start=1445627522050 end=1445627522051 duration=1 from=org.apache.hadoop.hive.ql.Driver>
OK
15/10/23 15:12:02 [main]: INFO ql.Driver: OK
15/10/23 15:12:02 [main]: INFO log.PerfLogger: <PERFLOG method=releaseLocks from=org.apache.hadoop.hive.ql.Driver>
15/10/23 15:12:02 [main]: INFO log.PerfLogger: </PERFLOG method=releaseLocks start=1445627522052 end=1445627522118 duration=66 from=org.apache.hadoop.hive.ql.Driver>
15/10/23 15:12:02 [main]: INFO log.PerfLogger: </PERFLOG method=Driver.run start=1445627522003 end=1445627522118 duration=115 from=org.apache.hadoop.hive.ql.Driver>
15/10/23 15:12:02 [main]: INFO mapred.FileInputFormat: Total input paths to process : 1
15/10/23 15:12:02 [main]: INFO avro.AvroGenericRecordReader: Found the avro schema in the job: {"type":"record","name":"avro_location","namespace":"default","fields":[{"name":"location_id","type":["null","string"],"default":null},{"name":"street1","type":["null","string"],"default":null},{"name":"display_name","type":["null","string"],"default":null}]}
15/10/23 15:12:02 [main]: INFO avro.AvroDeserializer: Adding new valid RRID :678ef2a1:150961947b5:-7fff
NULL    NULL    NULL
NULL    NULL    NULL
NULL    NULL    NULL
NULL    NULL    NULL
NULL    NULL    NULL
NULL    NULL    NULL
NULL    NULL    NULL
NULL    NULL    NULL
NULL    NULL    NULL
NULL    NULL    NULL
NULL    NULL    NULL
NULL    NULL    NULL
NULL    NULL    NULL
NULL    NULL    NULL
NULL    NULL    NULL
NULL    NULL    NULL
NULL    NULL    NULL
NULL    NULL    NULL
NULL    NULL    NULL
NULL    NULL    NULL
NULL    NULL    NULL
NULL    NULL    NULL
NULL    NULL    NULL
NULL    NULL    NULL
NULL    NULL    NULL
NULL    NULL    NULL
NULL    NULL    NULL
NULL    NULL    NULL
NULL    NULL    NULL
NULL    NULL    NULL
15/10/23 15:12:02 [main]: INFO exec.TableScanOperator: 0 finished. closing... 
15/10/23 15:12:02 [main]: INFO exec.SelectOperator: 1 finished. closing... 
15/10/23 15:12:02 [main]: INFO exec.ListSinkOperator: 3 finished. closing... 
15/10/23 15:12:02 [main]: INFO exec.ListSinkOperator: 3 Close done
15/10/23 15:12:02 [main]: INFO exec.SelectOperator: 1 Close done
15/10/23 15:12:02 [main]: INFO exec.TableScanOperator: 0 Close done
Time taken: 0.115 seconds, Fetched: 30 row(s)
15/10/23 15:12:02 [main]: INFO CliDriver: Time taken: 0.115 seconds, Fetched: 30 row(s)
15/10/23 15:12:02 [main]: INFO log.PerfLogger: <PERFLOG method=releaseLocks from=org.apache.hadoop.hive.ql.Driver>
15/10/23 15:12:02 [main]: INFO log.PerfLogger: </PERFLOG method=releaseLocks start=1445627522136 end=1445627522136 duration=0 from=org.apache.hadoop.hive.ql.Driver>

As shown above, it fetches 30 rows, but all the values are NULL. Can anyone help me figure out how to fix this?

2 answers:

Answer 0: (score: 0)

I think you need to specify the Avro schema location for the table, like this: TBLPROPERTIES('avro.schema.url'='/apps/hive/warehouse/loc_avro_import/schemaname.avsc');
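
For context, here is a minimal sketch of what a full table definition with that property might look like. Two things in it are assumptions, not details from the thread: the table name avro_location_avsc and the schema path hdfs:///user/hive/schemas/loc_avro_import.avsc are hypothetical. The schema is kept outside the data directory on purpose: the workflow's prepare step deletes /apps/hive/warehouse/loc_avro_import on every run, and a non-data file sitting in the table location would likely be picked up by Hive as input. One plausible reason for the NULLs (unconfirmed here) is that the Avro files carry Oracle's upper-case field names while the schema Hive derives from the column list, visible in the log, uses lower-case names, so no field matches and each column falls back to its null default.

```sql
-- Hypothetical table name and schema path; adjust both to your environment.
-- No column list is given: the AvroSerDe takes the columns from the .avsc file.
CREATE EXTERNAL TABLE avro_location_avsc
  ROW FORMAT SERDE
  'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
  STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
  OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
  LOCATION
  '/apps/hive/warehouse/loc_avro_import'
  TBLPROPERTIES (
    'avro.schema.url'='hdfs:///user/hive/schemas/loc_avro_import.avsc');
```

Because the table is external, dropping it later removes only the metadata and leaves the imported Avro files in place.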

Answer 1: (score: 0)

The Sqoop import generates a local .avsc file. Copy it to the schema folder, then query the table.
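
Once the .avsc file is in HDFS, the existing table can probably be pointed at it without being recreated. A minimal sketch, assuming the generated schema was copied to the hypothetical path hdfs:///user/hive/schemas/loc_avro_import.avsc (neither the directory nor the file name comes from the thread):

```sql
-- Hypothetical schema path; use wherever the Sqoop-generated .avsc was copied.
ALTER TABLE avro_location SET TBLPROPERTIES (
  'avro.schema.url'='hdfs:///user/hive/schemas/loc_avro_import.avsc');

-- Check that the columns are now read from the schema file.
DESCRIBE avro_location;

-- Spot-check a few rows for non-NULL values.
SELECT * FROM avro_location LIMIT 5;
```

If DESCRIBE reports the columns defined in the .avsc, the SerDe is reading the schema file instead of deriving one from the table's column list.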