I am trying to write a DataFrame from PySpark to Hive:
> new_hub_df.printSchema()
root
|-- ClientId: string (nullable = true)
|-- HUB_ID: string (nullable = true)
|-- publicID: string (nullable = true)
|-- Version: long (nullable = true)
> new_hub_df.show(2)
+--------+--------------------+--------------------+-------+
|ClientId| HUB_ID| publicID|Version|
+--------+--------------------+--------------------+-------+
| OPNF|49eff2084ecea86e9...|54102364-6251-4bd...| 1|
| OPNF|bab2e3fae1183ea69...|1f98cca0-316e-4ed...| 1|
+--------+--------------------+--------------------+-------+
only showing top 2 rows
> new_hub_df.write.saveAsTable("sb_party_hub_dev.party_hub", mode='overwrite', format="orc", partitionBy='ClientId')
I can see my table in Hive, but its schema is wrong:
party_hub
col (array<string>)
and a select returns an error:
> select * from party_Hub
java.io.IOException: java.io.IOException: adl://home/hive/warehouse/sb_party_hub_dev.db/party_hub/ClientId=OPNF/part-r-00003-a82a83a9-da1a-41af-97d0-f449bf0e1e69.snappy.orc not a SequenceFile
How can I fix this?
Answer 0 (score: 0)
How about registering the DataFrame as a temporary table and then saving it as ORC with a CREATE TABLE ... AS SELECT? Add partitioning as needed.
new_hub_df.registerTempTable("newhub")
sqlContext.sql("CREATE TABLE mytbl STORED AS ORC AS select * from newhub")
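If the table should keep the ClientId partitioning from the original write, note that Hive does not allow PARTITIONED BY inside a CREATE TABLE ... AS SELECT. A sketch of the partitioned variant, using the table and column names from the question (the target table name mytbl is the hypothetical one from the answer above; each statement can be passed to sqlContext.sql):

```sql
-- Create the partitioned ORC table first; Hive CTAS cannot declare partitions.
CREATE TABLE mytbl (
  HUB_ID   string,
  publicID string,
  Version  bigint
)
PARTITIONED BY (ClientId string)
STORED AS ORC;

-- Enable dynamic partitioning so the partition values come from the data.
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

-- Load from the temp table; the partition column must be last in the SELECT.
INSERT OVERWRITE TABLE mytbl PARTITION (ClientId)
SELECT HUB_ID, publicID, Version, ClientId FROM newhub;
```

Because the table is created through Hive DDL, its metadata is correct and it remains readable from both Hive and Spark, avoiding the "not a SequenceFile" mismatch from the direct saveAsTable call.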