Question

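I have a Spark DataFrame that contains a timestamp field. I write the DataFrame to an HDFS location on which a Hive external table is created. The Hive table declares that field with the timestamp type. However, when Hive reads the data from the external location, the timestamp field comes back as blank values in the table. My Spark DataFrame query: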

df.select(
  $"ipAddress", $"clientIdentd", $"userId",
  to_timestamp(unix_timestamp($"dateTime", "dd/MMM/yyyy:HH:mm:ss Z").cast("timestamp")).as("dateTime"),
  $"method", $"endpoint", $"protocol", $"responseCode", $"contentSize", $"referrerURL", $"browserInfo")

Hive create table statement:

CREATE EXTERNAL TABLE `finalweblogs3`(
   `ipAddress` string,
   `clientIdentd` string,
   `userId` string,
   `dateTime` timestamp,
   `method` string,
   `endpoint` string,
   `protocol` string,
   `responseCode` string,
   `contentSize` string,
   `referrerURL` string,
   `browserInfo` string)
 ROW FORMAT SERDE
   'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
 WITH SERDEPROPERTIES (
   'field.delim'=',',
   'serialization.format'=',')
 STORED AS INPUTFORMAT
   'org.apache.hadoop.mapred.TextInputFormat'
 OUTPUTFORMAT
   'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
   'hdfs://localhost:9000/streaming/spark/finalweblogs3'

I can't understand why this is happening.

Answer 1

I solved it by changing the storage format to "Parquet". I still don't know why it doesn't work with the CSV format.
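For anyone landing here, a minimal sketch of both routes follows. The first write is the Parquet fix described above (the Hive table would then need to be declared STORED AS PARQUET rather than with the text SerDe). The second is an assumption on my part, not something confirmed in this thread: Hive's text SerDe expects timestamps as yyyy-MM-dd HH:mm:ss[.fffffffff], while Spark's CSV writer defaults to an ISO 8601 style with a "T" separator, so the values probably fail to parse on the Hive side. Setting the CSV writer's timestampFormat option may be enough to keep the text table. The object name, the two-column sample DataFrame, and the /tmp paths are all hypothetical.

```scala
import java.sql.Timestamp
import org.apache.spark.sql.SparkSession

object WriteWeblogsForHive {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("weblogs-for-hive")
      .master("local[*]") // assumption: local run, for illustration only
      .getOrCreate()
    import spark.implicits._

    // Hypothetical stand-in for the parsed weblog DataFrame from the question,
    // reduced to two columns.
    val parsed = Seq(
      ("127.0.0.1", Timestamp.valueOf("2019-03-01 10:15:30"))
    ).toDF("ipAddress", "dateTime")

    // Option 1 (the fix from the answer): store the data as Parquet.
    // The Hive external table must then be declared STORED AS PARQUET.
    parsed.write.mode("overwrite").parquet("/tmp/finalweblogs3_parquet")

    // Option 2 (assumption, not confirmed in this thread): keep CSV, but write
    // timestamps in the layout Hive's text SerDe expects.
    parsed.write.mode("overwrite")
      .option("timestampFormat", "yyyy-MM-dd HH:mm:ss")
      .csv("/tmp/finalweblogs3_csv")

    spark.stop()
  }
}
```

Looking at one of the CSV part files produced without the timestampFormat option should show whether the "T" separator is what Hive is rejecting in your setup.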

Hive timestamp does not accept Spark timestamp type