I have an external Hive table pointing at Parquet files written by a Spark job on S3. It has date and timestamp fields, and when I query it through Hive I get the correct dates:
CREATE EXTERNAL TABLE events(
  event_date date,
  event_timestamp timestamp,
  event_name string,
  event_category string
)
PARTITIONED BY (
  dateid int
)
STORED AS PARQUET
LOCATION 's3a://somebucket/events';
hive> SELECT event_timestamp, event_date from events limit 10;
2017-01-02 13:40:23 2017-01-02
2017-01-02 13:40:23.013 2017-01-02
2017-01-02 13:40:23.419 2017-01-02
2017-01-02 18:51:57.637 2017-01-02
2017-01-02 18:52:03.512 2017-01-02
2017-01-02 18:52:03.769 2017-01-02
2017-01-02 18:52:30.945 2017-01-02
2017-01-02 18:52:32.757 2017-01-02
2017-01-02 18:52:37.083 2017-01-02
2017-01-02 18:52:38.099 2017-01-02
However, when I run the same query through Presto (version 0.170) on an EMR cluster (emr-5.6.0), every date comes back as 1970-01-01:
presto-cli --catalog hive --schema default
presto:default> SELECT event_timestamp, event_date from events limit 10;
1970-01-01 00:00:17.197 | 1970-01-01
1970-01-01 00:00:17.197 | 1970-01-01
1970-01-01 00:00:17.197 | 1970-01-01
1970-01-01 00:00:17.197 | 1970-01-01
1970-01-01 00:00:17.197 | 1970-01-01
1970-01-01 00:00:17.197 | 1970-01-01
1970-01-01 00:00:17.197 | 1970-01-01
1970-01-01 00:00:17.197 | 1970-01-01
1970-01-01 00:00:17.197 | 1970-01-01
1970-01-01 00:00:17.197 | 1970-01-01
Are there any known issues with timestamp fields in Hive tables when querying Parquet through Presto?
Answer 0 (score: 0)
After all my online research led nowhere, I compared the field order in the Parquet files against the Hive DDL statement, and it turned out the order of the fields had changed during development of the Spark job. While Hive resolves Parquet columns by name, Presto resolves them by position. So a silly mistake led to a wild goose chase. Anyway, shamelessly answering my own question here to close out the thread.
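To make the by-name vs. by-position difference concrete, here is a minimal sketch (with hypothetical column values, not taken from my actual data) of how the two resolution strategies diverge once the Parquet field order no longer matches the DDL order:

```python
# Field order as declared in the Hive DDL:
ddl_columns = ["event_date", "event_timestamp", "event_name", "event_category"]

# Field order actually written by the (reordered) Spark job,
# plus one hypothetical Parquet row in that physical order:
parquet_fields = ["event_name", "event_category", "event_date", "event_timestamp"]
row = ("click", "ui", "2017-01-02", "2017-01-02 13:40:23")

# Hive-style resolution: look the column up by its name -> correct value.
by_name = {field: value for field, value in zip(parquet_fields, row)}
print(by_name["event_date"])                  # -> 2017-01-02

# Presto-style resolution: use the DDL ordinal position -> wrong value,
# because position 0 in the file is actually event_name.
by_position = row[ddl_columns.index("event_date")]
print(by_position)                            # -> click
```

A garbage value interpreted as a date/timestamp then renders as something near the epoch, which is consistent with the 1970-01-01 output above. (Later Presto/Trino releases added a Hive connector setting, `hive.parquet.use-column-names`, to switch to by-name resolution, but I have not verified its availability on 0.170.)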