Maybe I'm missing something obvious, but the date comparison here doesn't seem to behave the way I expect:
from pyspark import SparkContext
from pyspark.sql import SQLContext
import pandas as pd
import datetime
sc = SparkContext()
sqlContext = SQLContext(sc)
action_pandas_df = pd.DataFrame({"customerId": ["Cat", "Hat", "Bat"],
                                 "timeStamp": ["2016-06-29T09:11:26Z",
                                               "2016-07-30T09:11:26Z",
                                               "2016-06-29T23:11:26Z"]})
action_df = sqlContext.createDataFrame(action_pandas_df)
action_df.show()
cut_off = datetime.datetime(2016, 6, 29, 15)
print "\033[0;34m{}\033[0m".format(cut_off.strftime(format='%Y-%m-%dT%H:%M:%SZ'))
new_df = action_df.filter(action_df.timeStamp > cut_off)
new_df.show()
I get:
+----------+--------------------+
|customerId| timeStamp|
+----------+--------------------+
| Cat|2016-06-29T09:11:26Z|
| Hat|2016-07-30T09:11:26Z|
| Bat|2016-06-29T23:11:26Z|
+----------+--------------------+
I don't understand why the date for Cat, 2016-06-29T09:11:26Z, is considered greater than the cut_off date, 2016-06-29T15:00:00Z.
I know that if I use cut_off.strftime(format='%Y-%m-%dT%H:%M:%SZ') instead of cut_off, I get the expected result (see the sketch below).
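For reference, a minimal sketch of that workaround (cut_off_str is my own name for the formatted value):
# Format the cut-off in the same ISO-8601 layout as the column, so the
# string-vs-string comparison orders as intended.
cut_off_str = cut_off.strftime('%Y-%m-%dT%H:%M:%SZ')
new_df = action_df.filter(action_df.timeStamp > cut_off_str)
new_df.show()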
As a side note, for later cut_off dates I get the expected result:
cut_off = datetime.datetime(2016, 7, 10, 15)
and my code works as intended.
Why is comparing a datetime object with an ISO 8601 string allowed at all? What am I missing?
Edit:
I'm using Spark 1.5.
Edit 2:
Spark 1.6.1 shows the same behavior.
Answer 0 (score: 3)
Because you're not comparing dates. Since the types don't match and the column is a string, the query casts cut_off to a string as well. The SQL string representation of cut_off is 2016-06-29 15:00:00:
from pyspark.sql.functions import lit
cut_off = datetime.datetime(2016, 6, 29, 15)
action_df.select(lit(cut_off).cast("string")).limit(1).show()
## +--------------------------------+
## |cast(1467205200000000 as string)|
## +--------------------------------+
## | 2016-06-29 15:00:00|
## +--------------------------------+
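Because the comparison then happens between strings, ordinary lexicographic ordering applies. A quick plain-Python check (string literals taken from the data above) shows why every row dated 2016-06-29 passes the filter:
# 'T' (0x54) sorts after ' ' (0x20), so at position 10 the ISO-8601 strings
# in the column compare greater than the stringified cut_off.
print("2016-06-29T09:11:26Z" > "2016-06-29 15:00:00")  # True
print('T' > ' ')                                        # True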
In short, the strings are compared lexicographically and 'T' > ' '. You can either compare against a formatted string:
cut_off_str = cut_off.strftime(format='%Y-%m-%dT%H:%M:%SZ')
action_df.where(action_df.timeStamp > cut_off_str).show()
## +----------+--------------------+
## |customerId| timeStamp|
## +----------+--------------------+
## | Hat|2016-07-30T09:11:26Z|
## | Bat|2016-06-29T23:11:26Z|
## +----------+--------------------+
or parse the column first:
from pyspark.sql.functions import unix_timestamp

timestamp_parsed = (unix_timestamp(action_df.timeStamp, "yyyy-MM-dd'T'kk:mm:ss")
                    .cast("double")  # Required only for Spark 1.5
                    .cast("timestamp"))
action_df.where(timestamp_parsed > cut_off).show()
## +----------+--------------------+
## |customerId| timeStamp|
## +----------+--------------------+
## | Hat|2016-07-30T09:11:26Z|
## | Bat|2016-06-29T23:11:26Z|
## +----------+--------------------+
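As a side note, on newer Spark versions (2.2+) the parsing step can be written with to_timestamp instead of unix_timestamp; a minimal sketch, assuming a Spark 2.x+ session and that the format string below matches your data:
from pyspark.sql.functions import to_timestamp

# Parse the ISO-8601 strings into a real timestamp column, then filter
# against the Python datetime directly.
parsed_df = action_df.withColumn(
    "ts", to_timestamp(action_df.timeStamp, "yyyy-MM-dd'T'HH:mm:ss'Z'"))
parsed_df.where(parsed_df.ts > cut_off).show()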