Question

We are writing files from Spark and reading them from Athena / Hive. We ran into a timestamp problem when reading them through Hive.

scala> val someDF = Seq((8, "2018-06-06 11:42:43")).toDF("number", "word")
someDF: org.apache.spark.sql.DataFrame = [number: int, word: string]

scala> someDF.coalesce(1).write.mode("overwrite").option("delimiter", "\u0001").save("s3://test/")

This creates a Parquet file, and I created a table on top of it:

CREATE EXTERNAL TABLE `test5`(
  `number` int, 
  `word` timestamp)
ROW FORMAT DELIMITED 
  FIELDS TERMINATED BY '\u0001' 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  's3://test/'

The SELECT query fails with: HIVE_BAD_DATA: Field word's type BINARY in parquet is incompatible with type timestamp defined in table schema

The same thing works when we test with a plain CSV file.

scala> someDF.coalesce(1).write.format("com.databricks.spark.csv").mode("overwrite").option("delimiter", "\u0001").save("s3://test")

Table:
CREATE EXTERNAL TABLE `test7`(
  `number` int, 
  `word` timestamp)
ROW FORMAT DELIMITED 
  FIELDS TERMINATED BY '\u0001' 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.mapred.TextInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  's3://test/'

Can you help us figure out what goes wrong when we write the data as a Parquet file?

Answer 1

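I think this is a well-known bug: Hive stores Parquet timestamps in a way that is incompatible with other tools. I ran into a similar problem when using Impala to read Hive data that I had written with Spark. I believe this was resolved in Spark 2.3.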
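
Separately from the Hive issue above, the error message in the question says the Parquet field `word` has type BINARY, which is what you would expect when the column is still a string at write time. A minimal sketch of one possible workaround, assuming that is the cause: cast the column to a timestamp before writing. The object name `WriteTimestampParquet` and the helper `withTimestamp` are hypothetical names chosen for illustration; the DataFrame and the `s3://test/` path are taken from the question.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object WriteTimestampParquet {
  // Hypothetical helper: cast the string column "word" to a real timestamp
  // so that Parquet stores it as a timestamp instead of BINARY (string) data.
  def withTimestamp(df: DataFrame): DataFrame =
    df.withColumn("word", col("word").cast("timestamp"))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("timestamp-parquet").getOrCreate()
    import spark.implicits._

    // Same sample row as in the question; "word" starts out as a string.
    val someDF = Seq((8, "2018-06-06 11:42:43")).toDF("number", "word")

    val typedDF = withTimestamp(someDF)
    typedDF.printSchema() // "word" should now be reported as a timestamp column

    // Placeholder bucket from the question; replace with a real location.
    typedDF.coalesce(1).write.mode("overwrite").parquet("s3://test/")

    spark.stop()
  }
}
```

Note that the `delimiter` option used in the question has no effect on Parquet output, since Parquet is a binary columnar format rather than delimited text. Whether Athena then reads the timestamp values correctly may still depend on the Spark version and on how it encodes timestamps in Parquet, so treat this as a starting point rather than a confirmed fix.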

Athena / Hive timestamp in Parquet files written by Spark