Spark: comparing dates and timestamps gives a nonsensical result, `2018-01-01` is less than `2018-01-01 00:00:00`

Time: 2018-10-01 23:13:35

Tags: apache-spark pyspark apache-spark-sql

I have run into a problem in Spark when comparing Dates with Timestamps, and I simply cannot work out what is happening.

Here is the code to reproduce it (pyspark):

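from pyspark.sql import SparkSession

# The pyspark shell already provides `spark`; getOrCreate() reuses it there.
spark = SparkSession.builder.getOrCreate()

query = '''with data as (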
    select date('2018-01-01') as d
        , timestamp('2018-01-01') as t
)
select d < t as natural_lt
    , d = t as natural_eq
    , d > t as natural_gt
    , d < date(t) as cast_date_lt
    , d = date(t) as cast_date_eq
    , d > date(t) as cast_date_gt
    , timestamp(d) < t as cast_timestamp_lt
    , timestamp(d) = t as cast_timestamp_eq
    , timestamp(d) > t as cast_timestamp_gt
from data
'''
spark.sql(query).show()

Result:

+----------+----------+----------+------------+------------+------------+-----------------+-----------------+-----------------+
|natural_lt|natural_eq|natural_gt|cast_date_lt|cast_date_eq|cast_date_gt|cast_timestamp_lt|cast_timestamp_eq|cast_timestamp_gt|
+----------+----------+----------+------------+------------+------------+-----------------+-----------------+-----------------+
|      true|     false|     false|       false|        true|       false|            false|             true|            false|
+----------+----------+----------+------------+------------+------------+-----------------+-----------------+-----------------+

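This completely violates my expectations. Spark reports that "2018-01-01" is less than "2018-01-01 00:00:00". Nothing on that date comes before 00:00:00, so I find this counterintuitive.

I would expect either an exception (because comparing a date with a timestamp is ambiguous) or a comparison in which both sides are cast to timestamps (so that 2018-01-01 is compared as 2018-01-01 00:00:00).

Can anyone explain why the comparison behaves this way? More importantly, can I make Spark behave as expected? Can I make Spark throw an exception?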

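1 Answer:

Answer 0: (score: 1)

This happens because both the timestamp and the date are downcast to strings, so the "natural" comparisons are string comparisons, which produces the unexpected result.

Here is the analyzed logical plan for your query: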

 +- Project [(cast(d#46 as string) < cast(t#47 as string)) AS natural_lt#37, (cast(d#46 as string) = cast(t#47 as string)) AS natural_eq#38, (cast(d#46 as string) > cast(t#47 as string)) AS natural_gt#39, (d#46 < cast(t#47 as date)) AS cast_date_lt#40, (d#46 = cast(t#47 as date)) AS cast_date_eq#41, (d#46 > cast(t#47 as date)) AS cast_date_gt#42, (cast(d#46 as timestamp) < t#47) AS cast_timestamp_lt#43, (cast(d#46 as timestamp) = t#47) AS cast_timestamp_eq#44, (cast(d#46 as timestamp) > t#47) AS cast_timestamp_gt#45]
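
To see why a string comparison gives `natural_lt = true`, here is a minimal plain-Python sketch of the same lexicographic comparison. It does not use Spark; it assumes the two casts to string yield `2018-01-01` and `2018-01-01 00:00:00`, which are the values written in the question.

```python
# Plain-Python illustration of a lexicographic string comparison.
# Assumption: these are the strings obtained by casting the date and the
# timestamp from the question to string.
d_str = "2018-01-01"
t_str = "2018-01-01 00:00:00"

# A string that is a strict prefix of another sorts before it,
# so the shorter date string compares as "less than" the timestamp string.
natural_lt = d_str < t_str
natural_eq = d_str == t_str
natural_gt = d_str > t_str

print(natural_lt, natural_eq, natural_gt)
```

The three values line up with the `natural_lt`, `natural_eq` and `natural_gt` columns in the question's output.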

Jira: https://issues.apache.org/jira/browse/SPARK-23549 (Fix Version: 2.4.0)
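
On versions without that fix, the explicit casts already shown in the question (`timestamp(d)` or `date(t)`) avoid the string comparison. As a rough analogy only, the sketch below uses Python's `datetime` module, not Spark, to show what comparing after promoting the date to a midnight timestamp looks like; the variable names mirror the question's `cast_timestamp_*` columns.

```python
from datetime import date, datetime, time

d = date(2018, 1, 1)
t = datetime(2018, 1, 1, 0, 0, 0)

# Promote the date to a timestamp at midnight before comparing,
# analogous to timestamp(d) in the question's query.
d_as_ts = datetime.combine(d, time.min)

cast_timestamp_lt = d_as_ts < t
cast_timestamp_eq = d_as_ts == t
cast_timestamp_gt = d_as_ts > t

print(cast_timestamp_lt, cast_timestamp_eq, cast_timestamp_gt)
```

With both sides of the same type, the date at midnight and the timestamp compare as equal, which is the behaviour the question expected.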