Storing data in Hive in ORC format via a Spark RDD

Asked: 2015-08-14 10:03:16

Tags: hadoop apache-spark hive rdd orc

As per my requirement, I want to store files from HDFS into a Hive table in ORC format. I am using Spark 1.2.1 and Hive 0.14.0.

I followed this documentation: http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.0/bk_spark-quickstart/content/ch_orc-spark-quickstart.html

Everything went smoothly, and I don't see any exceptions in the spark-shell.

I created an ORC table in Hive as follows:

hiveContext.sql("create table person_orc_table (name STRING, age INT) stored as orc")

I can run a query and see the results, as follows:

scala> hiveContext.sql("SELECT * from morePeople").collect.foreach(println)
15/08/14 09:25:06 INFO ParseDriver: Parsing command: SELECT * from morePeople
15/08/14 09:25:06 INFO ParseDriver: Parse Completed
15/08/14 09:25:06 INFO OrcFileOperator: Qualified file list: 
15/08/14 09:25:06 INFO OrcFileOperator: hdfs://sandbox.hortonworks.com:8020/user/root/people.orc/part-r-0-1439544199994.orc
15/08/14 09:25:06 INFO OrcFileOperator: hdfs://sandbox.hortonworks.com:8020/user/root/people.orc/part-r-1-1439544200299.orc
15/08/14 09:25:06 INFO MemoryStore: ensureFreeSpace(278167) called with curMem=965233, maxMem=278302556
15/08/14 09:25:06 INFO MemoryStore: Block broadcast_6 stored as values in memory (estimated size 271.6 KB, free 264.2 MB)
15/08/14 09:25:06 INFO MemoryStore: ensureFreeSpace(42885) called with curMem=1243400, maxMem=278302556
15/08/14 09:25:06 INFO MemoryStore: Block broadcast_6_piece0 stored as bytes in memory (estimated size 41.9 KB, free 264.2 MB)
15/08/14 09:25:06 INFO BlockManagerInfo: Added broadcast_6_piece0 in memory on sandbox.hortonworks.com:43599 (size: 41.9 KB, free: 265.2 MB)
15/08/14 09:25:06 INFO BlockManagerMaster: Updated info of block broadcast_6_piece0
15/08/14 09:25:06 INFO DefaultExecutionContext: Created broadcast 6 from hadoopRDD at OrcTableOperations.scala:228
15/08/14 09:25:06 INFO PerfLogger: <PERFLOG method=OrcGetSplits from=org.apache.hadoop.hive.ql.io.orc.ReaderImpl>
15/08/14 09:25:06 INFO deprecation: mapred.input.dir is deprecated. Instead, use mapreduce.input.fileinputformat.inputdir
15/08/14 09:25:06 INFO OrcInputFormat: FooterCacheHitRatio: 0/2
15/08/14 09:25:06 INFO PerfLogger: </PERFLOG method=OrcGetSplits start=1439544306469 end=1439544306486 duration=17 from=org.apache.hadoop.hive.ql.io.orc.ReaderImpl>
15/08/14 09:25:06 INFO DefaultExecutionContext: Starting job: collect at SparkPlan.scala:84
15/08/14 09:25:06 INFO DAGScheduler: Got job 3 (collect at SparkPlan.scala:84) with 2 output partitions (allowLocal=false)
15/08/14 09:25:06 INFO DAGScheduler: Final stage: Stage 3(collect at SparkPlan.scala:84)
15/08/14 09:25:06 INFO DAGScheduler: Parents of final stage: List()
15/08/14 09:25:06 INFO DAGScheduler: Missing parents: List()
15/08/14 09:25:06 INFO DAGScheduler: Submitting Stage 3 (MappedRDD[32] at map at SparkPlan.scala:84), which has no missing parents
15/08/14 09:25:06 INFO MemoryStore: ensureFreeSpace(72088) called with curMem=1286285, maxMem=278302556
15/08/14 09:25:06 INFO MemoryStore: Block broadcast_7 stored as values in memory (estimated size 70.4 KB, free 264.1 MB)
15/08/14 09:25:06 INFO MemoryStore: ensureFreeSpace(46036) called with curMem=1358373, maxMem=278302556
15/08/14 09:25:06 INFO MemoryStore: Block broadcast_7_piece0 stored as bytes in memory (estimated size 45.0 KB, free 264.1 MB)
15/08/14 09:25:06 INFO BlockManagerInfo: Added broadcast_7_piece0 in memory on sandbox.hortonworks.com:43599 (size: 45.0 KB, free: 265.2 MB)
15/08/14 09:25:06 INFO BlockManagerMaster: Updated info of block broadcast_7_piece0
15/08/14 09:25:06 INFO DefaultExecutionContext: Created broadcast 7 from broadcast at DAGScheduler.scala:838
15/08/14 09:25:06 INFO DAGScheduler: Submitting 2 missing tasks from Stage 3 (MappedRDD[32] at map at SparkPlan.scala:84)
15/08/14 09:25:06 INFO YarnClientClusterScheduler: Adding task set 3.0 with 2 tasks
15/08/14 09:25:06 INFO TaskSetManager: Starting task 0.0 in stage 3.0 (TID 6, sandbox.hortonworks.com, NODE_LOCAL, 1366 bytes)
15/08/14 09:25:06 INFO BlockManagerInfo: Added broadcast_7_piece0 in memory on sandbox.hortonworks.com:59036 (size: 45.0 KB, free: 265.3 MB)
15/08/14 09:25:06 INFO BlockManagerInfo: Added broadcast_6_piece0 in memory on sandbox.hortonworks.com:59036 (size: 41.9 KB, free: 265.3 MB)
15/08/14 09:25:06 INFO TaskSetManager: Starting task 1.0 in stage 3.0 (TID 7, sandbox.hortonworks.com, NODE_LOCAL, 1366 bytes)
15/08/14 09:25:06 INFO TaskSetManager: Finished task 0.0 in stage 3.0 (TID 6) in 311 ms on sandbox.hortonworks.com (1/2)
15/08/14 09:25:07 INFO TaskSetManager: Finished task 1.0 in stage 3.0 (TID 7) in 119 ms on sandbox.hortonworks.com (2/2)
15/08/14 09:25:07 INFO YarnClientClusterScheduler: Removed TaskSet 3.0, whose tasks have all completed, from pool 
[Michael,29]
[Andy,30]
[Justin,19]

scala> 15/08/14 09:25:07 INFO DAGScheduler: Stage 3 (collect at SparkPlan.scala:84) finished in 0.427 s
15/08/14 09:25:07 INFO DAGScheduler: Job 3 finished: collect at SparkPlan.scala:84, took 0.504132 s

Saving to the ORC table also went smoothly:

scala> peopleSchemaRDD.saveAsOrcFile("person_orc_table") 
15/08/14 09:28:20 INFO DefaultExecutionContext: Starting job: runJob at OrcTableOperations.scala:154
15/08/14 09:28:20 INFO DAGScheduler: Got job 4 (runJob at OrcTableOperations.scala:154) with 2 output partitions (allowLocal=false)
15/08/14 09:28:20 INFO DAGScheduler: Final stage: Stage 4(runJob at OrcTableOperations.scala:154)
15/08/14 09:28:20 INFO DAGScheduler: Parents of final stage: List()
15/08/14 09:28:20 INFO DAGScheduler: Missing parents: List()
15/08/14 09:28:20 INFO DAGScheduler: Submitting Stage 4 (MapPartitionsRDD[35] at mapPartitions at OrcTableOperations.scala:70), which has no missing parents
15/08/14 09:28:20 INFO MemoryStore: ensureFreeSpace(72048) called with curMem=965233, maxMem=278302556
15/08/14 09:28:20 INFO MemoryStore: Block broadcast_8 stored as values in memory (estimated size 70.4 KB, free 264.4 MB)
15/08/14 09:28:20 INFO MemoryStore: ensureFreeSpace(46093) called with curMem=1037281, maxMem=278302556
15/08/14 09:28:20 INFO MemoryStore: Block broadcast_8_piece0 stored as bytes in memory (estimated size 45.0 KB, free 264.4 MB)
15/08/14 09:28:20 INFO BlockManagerInfo: Added broadcast_8_piece0 in memory on sandbox.hortonworks.com:43599 (size: 45.0 KB, free: 265.2 MB)
15/08/14 09:28:20 INFO BlockManagerMaster: Updated info of block broadcast_8_piece0
15/08/14 09:28:20 INFO DefaultExecutionContext: Created broadcast 8 from broadcast at DAGScheduler.scala:838
15/08/14 09:28:20 INFO DAGScheduler: Submitting 2 missing tasks from Stage 4 (MapPartitionsRDD[35] at mapPartitions at OrcTableOperations.scala:70)
15/08/14 09:28:20 INFO YarnClientClusterScheduler: Adding task set 4.0 with 2 tasks
15/08/14 09:28:20 INFO TaskSetManager: Starting task 0.0 in stage 4.0 (TID 8, sandbox.hortonworks.com, NODE_LOCAL, 1314 bytes)
15/08/14 09:28:20 INFO BlockManagerInfo: Added broadcast_8_piece0 in memory on sandbox.hortonworks.com:59036 (size: 45.0 KB, free: 265.3 MB)
15/08/14 09:28:21 INFO TaskSetManager: Starting task 1.0 in stage 4.0 (TID 9, sandbox.hortonworks.com, NODE_LOCAL, 1314 bytes)
15/08/14 09:28:21 INFO TaskSetManager: Finished task 0.0 in stage 4.0 (TID 8) in 503 ms on sandbox.hortonworks.com (1/2)
15/08/14 09:28:21 INFO TaskSetManager: Finished task 1.0 in stage 4.0 (TID 9) in 69 ms on sandbox.hortonworks.com (2/2)
15/08/14 09:28:21 INFO DAGScheduler: Stage 4 (runJob at OrcTableOperations.scala:154) finished in 0.570 s
15/08/14 09:28:21 INFO YarnClientClusterScheduler: Removed TaskSet 4.0, whose tasks have all completed, from pool 
15/08/14 09:28:21 INFO DAGScheduler: Job 4 finished: runJob at OrcTableOperations.scala:154, took 0.615483 s

scala> 15/08/14 09:28:35 INFO BlockManager: Removing broadcast 8
15/08/14 09:28:35 INFO BlockManager: Removing block broadcast_8
15/08/14 09:28:35 INFO MemoryStore: Block broadcast_8 of size 72048 dropped from memory (free 277291230)
15/08/14 09:28:35 INFO BlockManager: Removing block broadcast_8_piece0
15/08/14 09:28:35 INFO MemoryStore: Block broadcast_8_piece0 of size 46093 dropped from memory (free 277337323)
15/08/14 09:28:35 INFO BlockManagerInfo: Removed broadcast_8_piece0 on sandbox.hortonworks.com:43599 in memory (size: 45.0 KB, free: 265.3 MB)
15/08/14 09:28:35 INFO BlockManagerMaster: Updated info of block broadcast_8_piece0
15/08/14 09:28:35 INFO BlockManagerInfo: Removed broadcast_8_piece0 on sandbox.hortonworks.com:59036 in memory (size: 45.0 KB, free: 265.4 MB)
15/08/14 09:28:35 INFO ContextCleaner: Cleaned broadcast 8

I can even read the ORC table back as follows:

val morePeople = hiveContext.orcFile("person_orc_table")
morePeople.registerTempTable("morePeople")

scala> hiveContext.sql("SELECT * from morePeople").collect.foreach(println)
15/08/14 09:33:32 INFO ParseDriver: Parsing command: SELECT * from morePeople
15/08/14 09:33:32 INFO ParseDriver: Parse Completed
15/08/14 09:33:32 INFO OrcFileOperator: Qualified file list: 
15/08/14 09:33:32 INFO OrcFileOperator: hdfs://sandbox.hortonworks.com:8020/user/root/people.orc/part-r-0-1439544199994.orc
15/08/14 09:33:32 INFO OrcFileOperator: hdfs://sandbox.hortonworks.com:8020/user/root/people.orc/part-r-1-1439544200299.orc
15/08/14 09:33:32 INFO MemoryStore: ensureFreeSpace(278167) called with curMem=965233, maxMem=278302556
15/08/14 09:33:32 INFO MemoryStore: Block broadcast_11 stored as values in memory (estimated size 271.6 KB, free 264.2 MB)
15/08/14 09:33:32 INFO MemoryStore: ensureFreeSpace(42885) called with curMem=1243400, maxMem=278302556
15/08/14 09:33:32 INFO MemoryStore: Block broadcast_11_piece0 stored as bytes in memory (estimated size 41.9 KB, free 264.2 MB)
15/08/14 09:33:32 INFO BlockManagerInfo: Added broadcast_11_piece0 in memory on sandbox.hortonworks.com:43599 (size: 41.9 KB, free: 265.2 MB)
15/08/14 09:33:32 INFO BlockManagerMaster: Updated info of block broadcast_11_piece0
15/08/14 09:33:32 INFO DefaultExecutionContext: Created broadcast 11 from hadoopRDD at OrcTableOperations.scala:228
15/08/14 09:33:32 INFO PerfLogger: <PERFLOG method=OrcGetSplits from=org.apache.hadoop.hive.ql.io.orc.ReaderImpl>
15/08/14 09:33:32 INFO OrcInputFormat: FooterCacheHitRatio: 0/2
15/08/14 09:33:32 INFO PerfLogger: </PERFLOG method=OrcGetSplits start=1439544812311 end=1439544812318 duration=7 from=org.apache.hadoop.hive.ql.io.orc.ReaderImpl>
15/08/14 09:33:32 INFO DefaultExecutionContext: Starting job: collect at SparkPlan.scala:84
15/08/14 09:33:32 INFO DAGScheduler: Got job 6 (collect at SparkPlan.scala:84) with 2 output partitions (allowLocal=false)
15/08/14 09:33:32 INFO DAGScheduler: Final stage: Stage 6(collect at SparkPlan.scala:84)
15/08/14 09:33:32 INFO DAGScheduler: Parents of final stage: List()
15/08/14 09:33:32 INFO DAGScheduler: Missing parents: List()
15/08/14 09:33:32 INFO DAGScheduler: Submitting Stage 6 (MappedRDD[48] at map at SparkPlan.scala:84), which has no missing parents
15/08/14 09:33:32 INFO MemoryStore: ensureFreeSpace(72088) called with curMem=1286285, maxMem=278302556
15/08/14 09:33:32 INFO MemoryStore: Block broadcast_12 stored as values in memory (estimated size 70.4 KB, free 264.1 MB)
15/08/14 09:33:32 INFO MemoryStore: ensureFreeSpace(46036) called with curMem=1358373, maxMem=278302556
15/08/14 09:33:32 INFO MemoryStore: Block broadcast_12_piece0 stored as bytes in memory (estimated size 45.0 KB, free 264.1 MB)
15/08/14 09:33:32 INFO BlockManagerInfo: Added broadcast_12_piece0 in memory on sandbox.hortonworks.com:43599 (size: 45.0 KB, free: 265.2 MB)
15/08/14 09:33:32 INFO BlockManagerMaster: Updated info of block broadcast_12_piece0
15/08/14 09:33:32 INFO DefaultExecutionContext: Created broadcast 12 from broadcast at DAGScheduler.scala:838
15/08/14 09:33:32 INFO DAGScheduler: Submitting 2 missing tasks from Stage 6 (MappedRDD[48] at map at SparkPlan.scala:84)
15/08/14 09:33:32 INFO YarnClientClusterScheduler: Adding task set 6.0 with 2 tasks
15/08/14 09:33:32 INFO TaskSetManager: Starting task 0.0 in stage 6.0 (TID 12, sandbox.hortonworks.com, NODE_LOCAL, 1366 bytes)
15/08/14 09:33:32 INFO BlockManagerInfo: Added broadcast_12_piece0 in memory on sandbox.hortonworks.com:59036 (size: 45.0 KB, free: 265.3 MB)
15/08/14 09:33:32 INFO BlockManagerInfo: Added broadcast_11_piece0 in memory on sandbox.hortonworks.com:59036 (size: 41.9 KB, free: 265.3 MB)
15/08/14 09:33:32 INFO TaskSetManager: Starting task 1.0 in stage 6.0 (TID 13, sandbox.hortonworks.com, NODE_LOCAL, 1366 bytes)
15/08/14 09:33:32 INFO TaskSetManager: Finished task 0.0 in stage 6.0 (TID 12) in 153 ms on sandbox.hortonworks.com (1/2)
15/08/14 09:33:32 INFO TaskSetManager: Finished task 1.0 in stage 6.0 (TID 13) in 106 ms on sandbox.hortonworks.com (2/2)
15/08/14 09:33:32 INFO YarnClientClusterScheduler: Removed TaskSet 6.0, whose tasks have all completed, from pool 
15/08/14 09:33:32 INFO DAGScheduler: Stage 6 (collect at SparkPlan.scala:84) finished in 0.255 s
[Michael,29]
[Andy,30]
[Justin,19]

But when I run a query in the Hive shell to display the records, I don't see any:

hive> select * from person_orc_table;
OK
Time taken: 0.097 seconds
hive> 

I expected the data/records to be in the Hive table, but they are not there. What am I missing here?
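
One thing that may be worth checking, offered only as an untested sketch: the OrcFileOperator lines in the logs above show orcFile resolving to files under hdfs://sandbox.hortonworks.com:8020/user/root/, which suggests the ORC calls work on HDFS paths under my home directory rather than on the Hive table's own storage location. If so, writing through Hive with an INSERT statement might put the rows where the hive shell looks for them. The snippet assumes the same spark-shell session (hiveContext and peopleSchemaRDD already defined), assumes the SchemaRDD columns are called name and age, and uses people_tmp as a made-up temp-table name.

```scala
// Sketch for the same spark-shell session; `hiveContext` and `peopleSchemaRDD`
// are assumed to exist already, as in the steps above.
// "people_tmp" is a hypothetical temp-table name chosen for illustration.
peopleSchemaRDD.registerTempTable("people_tmp")

// Write through Hive so the rows go to the table's own storage location
// (assumes the SchemaRDD columns are named `name` and `age`).
hiveContext.sql("INSERT INTO TABLE person_orc_table SELECT name, age FROM people_tmp")

// Query the Hive table itself, not an ORC path, to check the result.
hiveContext.sql("SELECT * FROM person_orc_table").collect.foreach(println)
```

If this works, select * from person_orc_table in the hive shell should presumably return the same rows.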

0 answers:

No answers yet