Pyspark 2.1:
I created a dataframe with a timestamp column, which I convert to a unix timestamp. However, the column derived from the unix timestamp is incorrect. As the timestamp increases, the unix_timestamp should also increase, but that is not the case. You can see an example in the code below. Note that sorting by the timestamp variable and sorting by the unix_ts variable give different orderings.
from pyspark.sql import functions as F
df = sqlContext.createDataFrame([
("a", "1", "2018-01-08 23:03:23.325359"),
("a", "2", "2018-01-09 00:03:23.325359"),
("a", "3", "2018-01-09 00:03:25.025240"),
("a", "4", "2018-01-09 00:03:27.025240"),
("a", "5", "2018-01-09 00:08:27.021240"),
("a", "6", "2018-01-09 03:03:27.025240"),
("a", "7", "2018-01-09 05:03:27.025240"),
], ["person_id", "session_id", "timestamp"])
df = df.withColumn("unix_ts",F.unix_timestamp(F.col("timestamp"), "yyyy-MM-dd HH:mm:ss.SSSSSS"))
df.orderBy("timestamp").show(10,False)
df.orderBy("unix_ts").show(10,False)
Output:
+---------+----------+--------------------------+----------+
|person_id|session_id|timestamp |unix_ts |
+---------+----------+--------------------------+----------+
|a |1 |2018-01-08 23:03:23.325359|1515474528|
|a |2 |2018-01-09 00:03:23.325359|1515478128|
|a |3 |2018-01-09 00:03:25.025240|1515477830|
|a |4 |2018-01-09 00:03:27.025240|1515477832|
|a |5 |2018-01-09 00:08:27.021240|1515478128|
|a |6 |2018-01-09 03:03:27.025240|1515488632|
|a |7 |2018-01-09 05:03:27.025240|1515495832|
+---------+----------+--------------------------+----------+
+---------+----------+--------------------------+----------+
|person_id|session_id|timestamp |unix_ts |
+---------+----------+--------------------------+----------+
|a |1 |2018-01-08 23:03:23.325359|1515474528|
|a |3 |2018-01-09 00:03:25.025240|1515477830|
|a |4 |2018-01-09 00:03:27.025240|1515477832|
|a |5 |2018-01-09 00:08:27.021240|1515478128|
|a |2 |2018-01-09 00:03:23.325359|1515478128|
|a |6 |2018-01-09 03:03:27.025240|1515488632|
|a |7 |2018-01-09 05:03:27.025240|1515495832|
+---------+----------+--------------------------+----------+
Is this a bug, or am I doing/implementing something wrong?
Also, you can see that 2018-01-09 00:03:23.325359 and 2018-01-09 00:08:27.021240 produce the same unix_timestamp of 1515478128.
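The collision can also be confirmed directly by grouping on the derived column and looking for duplicates (a quick sketch against the same dataframe; the alias n is just for illustration):
df.groupBy("unix_ts").agg(F.count("*").alias("n")).filter(F.col("n") > 1).show()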
Answer 0 (score: 0)
The problem seems to be that Spark's unix_timestamp internally uses Java's SimpleDateFormat to parse the date, and SimpleDateFormat does not support microseconds (see, for example, here). In addition, unix_timestamp returns a long, so its granularity is only whole seconds.
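Concretely, SimpleDateFormat reads the digits matched by S as a millisecond count rather than a fraction of a second, which would explain the odd values above. A minimal sketch in plain Python (the base epoch value below is an assumption derived from the question's output, which appears to come from a UTC-6 session timezone):
base = 1515477803            # assumed epoch seconds for 2018-01-09 00:03:23 (UTC-6)
fraction_digits = 325359     # the ".325359" part of the timestamp string
# interpreting the six digits as milliseconds adds ~325 extra seconds
wrong_unix_ts = base + fraction_digits // 1000
print(wrong_unix_ts)         # 1515478128, the "wrong" value shown in the question
This would also explain the collision noted in the question: 00:08:27 is 304 seconds after 00:03:23, and 304 + 21 (from ".021240" read as 21 ms-seconds) = 325, so both rows land on the same second.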
One workaround is to parse the timestamp without the microsecond information and then add the microseconds back in separately:
from pyspark.sql import functions as f
df = spark.createDataFrame([
("a", "1", "2018-01-08 23:03:23.325359"),
("a", "2", "2018-01-09 00:03:23.325359"),
("a", "3", "2018-01-09 00:03:25.025240"),
("a", "4", "2018-01-09 00:03:27.025240"),
("a", "5", "2018-01-09 00:08:27.021240"),
("a", "6", "2018-01-09 03:03:27.025240"),
("a", "7", "2018-01-09 05:03:27.025240"),
], ["person_id", "session_id", "timestamp"])
# parse the timestamp up to the seconds place
df = df.withColumn("unix_ts_sec",f.unix_timestamp(f.substring(f.col("timestamp"), 1, 19), "yyyy-MM-dd HH:mm:ss"))
# extract the microseconds
df = df.withColumn("microsec", f.substring(f.col("timestamp"), 21, 6).cast('int'))
# add to get full epoch time accurate to the microsecond
df = df.withColumn("unix_ts", f.col("unix_ts_sec") + 1e-6 * f.col("microsec"))
Side note: I don't have easy access to Spark 2.1, but with Spark 2.2 the code as originally written gives null values for unix_ts. It seems you are hitting some Spark 2.1 bug that produces those nonsensical timestamps.
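Another option worth trying (a sketch, not verified on Spark 2.1 specifically; ts and unix_ts_alt are just illustrative column names) is to skip unix_timestamp entirely: Spark's string-to-timestamp cast keeps microsecond precision, and casting a timestamp to double yields fractional epoch seconds:
# cast the string to a proper timestamp, then to fractional epoch seconds
df = df.withColumn("ts", f.col("timestamp").cast("timestamp"))
df = df.withColumn("unix_ts_alt", f.col("ts").cast("double"))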