Comparing a datetime object with an ISO 8601 string gives the wrong result, so why is it allowed?

Asked: 2016-07-01 17:15:32

Tags: python datetime apache-spark pyspark apache-spark-sql

Maybe I'm missing something obvious, but this date comparison does not behave the way I expect:

from pyspark import SparkContext
from pyspark.sql import SQLContext

import pandas as pd
import datetime

sc = SparkContext()
sqlContext = SQLContext(sc)

action_pandas_df = pd.DataFrame({"customerId": ["Cat", "Hat", "Bat"],
                                 "timeStamp": ["2016-06-29T09:11:26Z",
                                               "2016-07-30T09:11:26Z",
                                               "2016-06-29T23:11:26Z"]})

action_df = sqlContext.createDataFrame(action_pandas_df)
action_df.show()

cut_off = datetime.datetime(2016, 6, 29, 15)

# print cut_off in the same ISO 8601 layout as the timeStamp column (ANSI blue for visibility)
print "\033[0;34m{}\033[0m".format(cut_off.strftime(format='%Y-%m-%dT%H:%M:%SZ'))

new_df = action_df.filter(action_df.timeStamp > cut_off)
new_df.show()

This is what I get:

+----------+--------------------+
|customerId|           timeStamp|
+----------+--------------------+
|       Cat|2016-06-29T09:11:26Z|
|       Hat|2016-07-30T09:11:26Z|
|       Bat|2016-06-29T23:11:26Z|
+----------+--------------------+

I don't understand why the date on the Cat row, 2016-06-29T09:11:26Z, is considered greater than the cut_off date, 2016-06-29T15:00:00Z.

I know I can use cut_off.strftime(format='%Y-%m-%dT%H:%M:%SZ') instead of cut_off and I get the expected result.
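For reference, a minimal sketch of that workaround against the same DataFrame (only the filter line changes; the format string mirrors the layout of the timeStamp column):

cut_off_str = cut_off.strftime('%Y-%m-%dT%H:%M:%SZ')
new_df = action_df.filter(action_df.timeStamp > cut_off_str)  # string vs. string, same layout on both sides
new_df.show()  # only the Hat and Bat rows remain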

As a side note:

For a later cut_off date:

cut_off = datetime.datetime(2016, 7, 10, 15)

my code works as expected.

Why is comparing a datetime object with an ISO 8601 string allowed at all?

What am I missing?

Edit:

I am using Spark 1.5.

Edit 2:

Spark 1.6.1 shows the same behavior.

1 Answer:

Answer 0 (score: 3):

Because you are not comparing dates. Since the types do not match and the column is a string, the query casts the datetime to a string as well. The SQL string representation of cut_off is 2016-06-29 15:00:00:

from pyspark.sql.functions import lit

cut_off = datetime.datetime(2016, 6, 29, 15)

action_df.select(lit(cut_off).cast("string")).limit(1).show()
## +--------------------------------+
## |cast(1467205200000000 as string)|
## +--------------------------------+
## |             2016-06-29 15:00:00|
## +--------------------------------+
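The effect is easy to reproduce outside Spark. A minimal plain-Python sketch of the same lexicographic comparison, using the literal values from the example above:

iso_row = "2016-06-29T09:11:26Z"      # the Cat row's timeStamp string
cut_off_str = "2016-06-29 15:00:00"   # cut_off after the implicit cast to string

print(iso_row > cut_off_str)  # True
print("first differing characters: %r vs %r" % (iso_row[10], cut_off_str[10]))  # 'T' vs ' '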

The strings are then compared lexicographically, and 'T' > ' ', so any timestamp on 2016-06-29 that uses the ISO 'T' separator sorts after 2016-06-29 15:00:00. You can compare against a formatted string first:

cut_off_str = cut_off.strftime(format='%Y-%m-%dT%H:%M:%SZ')

action_df.where(action_df.timeStamp > cut_off_str).show()
## +----------+--------------------+
## |customerId|           timeStamp|
## +----------+--------------------+
## |       Hat|2016-07-30T09:11:26Z|
## |       Bat|2016-06-29T23:11:26Z|
## +----------+--------------------+

Or parse the column:

from pyspark.sql.functions import unix_timestamp

timestamp_parsed = (unix_timestamp(action_df.timeStamp, "yyyy-MM-dd'T'kk:mm:ss")
    .cast("double")      # Required only for Spark 1.5
    .cast("timestamp"))

action_df.where(timestamp_parsed > cut_off).show()
## +----------+--------------------+
## |customerId|           timeStamp|
## +----------+--------------------+
## |       Hat|2016-07-30T09:11:26Z|
## |       Bat|2016-06-29T23:11:26Z|
## +----------+--------------------+
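As a quick sanity check on the parse (a sketch that reuses the timestamp_parsed expression defined above), the parsed value can be displayed next to the raw column; rows where the pattern did not match would show null instead of a timestamp:

action_df.select(action_df.timeStamp, timestamp_parsed.alias("parsed")).show()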