The Spark DataFrame below has start_t and end_t as Unix epoch times, but they are stored as doubles with an exponent (E).
+------+----------------+------------------+--------+----------+----------+-------+-----------+-----------+-----------+-------------+-------+---------------+----------------+
| alt_t| end_t|engine_fuel_rate_t| lat_t|left_max_t|left_min_t| lon_t|plm3_incl_t|right_max_t|right_min_t|road_class_u8|speed_t|sprung_weight_t| start_t|
+------+----------------+------------------+--------+----------+----------+-------+-----------+-----------+-----------+-------------+-------+---------------+----------------+
|1237.5|1.521956985733E9| 0|-27.7314| 0.0| 0.0|22.9552| 1.5| 0.0| 0.0| 0| 17.4| 198.0| 1.52195698056E9|
|1236.5|1.521956989922E9| 0|-27.7316| 0.0| 0.0|22.9552| -3.3| 0.0| 0.0| 0| 17.6| 156.1|1.521956985733E9|
|1234.5|1.521956995378E9| 0|-27.7318| 0.0| 0.0|22.9552| -2.7| 0.0| 0.0| 0| 11.9| 148.6|1.521956989922E9|
|1230.5|1.521957001498E9| 0| -27.732| 0.0| 0.0|22.9551| 2.3| 0.0| 0.0| 0| 13.2| 169.1|1.521956995378E9|
Because the column is a double, it cannot be converted to a timestamp directly; the attempt throws an error stating that the input needs to be a string.
+------+----------------+------------------+--------+----------+----------+-------+-----------+-----------+-----------+-------------+-------+---------------+-------+
| alt_t| end_t|engine_fuel_rate_t| lat_t|left_max_t|left_min_t| lon_t|plm3_incl_t|right_max_t|right_min_t|road_class_u8|speed_t|sprung_weight_t|start_t|
+------+----------------+------------------+--------+----------+----------+-------+-----------+-----------+-----------+-------------+-------+---------------+-------+
|1237.5|1.521956985733E9| 0|-27.7314| 0.0| 0.0|22.9552| 1.5| 0.0| 0.0| 0| 17.4| 198.0| null|
|1236.5|1.521956989922E9| 0|-27.7316| 0.0| 0.0|22.9552| -3.3| 0.0| 0.0| 0| 17.6| 156.1| null|
|1234.5|1.521956995378E9| 0|-27.7318| 0.0| 0.0|22.9552| -2.7| 0.0| 0.0| 0| 11.9| 148.6| null|
The null values shown above were produced by the following code:
%scala
val df2 = df.withColumn("start_t", df("start_t").cast("string"))
val df3 = df2.withColumn("end_t", df("end_t").cast("string"))
val filteredDF = df3.withColumn("start_t", unix_timestamp($"start_t", "yyyyMMddHHmmss").cast("timestamp"))
filteredDF.show()
I get null in start_t and believe it is due to the E (scientific notation). I have tested the same values in pandas (Python); the dates are valid and convert correctly. I understand there is a way to change this using precision. I am trying to convert the column to a timestamp in yyyy-MM-dd HH:mm:ss format, with separate columns for the date and the time.
Note: a similar question was asked but never answered: Scala Spark : Convert Double Column to Date Time Column in dataframe
Answer 0 (score: 0)
You should be able to cast the double directly to a timestamp, as shown below:
scala> import org.apache.spark.sql.functions._
import org.apache.spark.sql.functions._
scala> val df = Seq((1237.5,1.521956985733E9),
     | (1236.5,1.521956989922E9),
     | (1234.5,1.521956995378E9),
     | (1230.5,1.521957001498E9)).toDF("alt_t","end_t")
df: org.apache.spark.sql.DataFrame = [alt_t: double, end_t: double]
scala> df.printSchema
root
|-- alt_t: double (nullable = false)
|-- end_t: double (nullable = false)
scala> df.withColumn("end_t",$"end_t".cast("timestamp")).show
+------+--------------------+
| alt_t| end_t|
+------+--------------------+
|1237.5|2018-03-25 05:49:...|
|1236.5|2018-03-25 05:49:...|
|1234.5|2018-03-25 05:49:...|
|1230.5|2018-03-25 05:50:...|
+------+--------------------+
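The cast works because Spark interprets a double cast to timestamp as seconds since the Unix epoch, with the fractional part becoming sub-second precision; the E9 scientific notation only affects how the double is printed, not its value. Note that show() renders timestamps in the session timezone, which is why this answer and the one below display different wall-clock hours for the same epoch values. A minimal sketch applying the same cast to both columns, assuming df is the question's DataFrame with double-typed start_t and end_t:

// Minimal sketch, assuming `df` is the question's DataFrame and `spark`
// is the active SparkSession (as in spark-shell or a Databricks notebook).
// cast("timestamp") on a double treats the value as seconds since the
// Unix epoch, so no string conversion or unix_timestamp call is needed.
import spark.implicits._

val withTs = df
  .withColumn("start_t", $"start_t".cast("timestamp"))
  .withColumn("end_t", $"end_t".cast("timestamp"))

withTs.select("start_t", "end_t").show(false)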
Answer 1 (score: 0)
Convert via String -> Double -> Timestamp: casting the scientific-notation string straight to a timestamp yields null, so cast it to double first. The following works:
scala> val df = Seq(("1237.5","1.521956985733E9"),("1236.5","1.521956989922E9"),("1234.5","1.521956995378E9"),("1230.5","1.521957001498E9")).toDF("alt_t","end_t")
df: org.apache.spark.sql.DataFrame = [alt_t: string, end_t: string]
scala> df.withColumn("end_t",'end_t.cast("double").cast("timestamp")).show(false)
+------+-----------------------+
|alt_t |end_t |
+------+-----------------------+
|1237.5|2018-03-25 01:49:45.733|
|1236.5|2018-03-25 01:49:49.922|
|1234.5|2018-03-25 01:49:55.378|
|1230.5|2018-03-25 01:50:01.498|
+------+-----------------------+
UPDATE1
scala> val df = Seq(("1237.5","1.521956985733E9"),("1236.5","1.521956989922E9"),("1234.5","1.521956995378E9"),("1230.5","1.521957001498E9")).toDF("alt_t","end_t").withColumn("end_t",'end_t.cast("double").cast("timestamp"))
df: org.apache.spark.sql.DataFrame = [alt_t: string, end_t: timestamp]
scala> df.printSchema
root
|-- alt_t: string (nullable = true)
|-- end_t: timestamp (nullable = true)
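To finish what the question asks for (a yyyy-MM-dd HH:mm:ss timestamp plus separate date and time columns), to_date and date_format can be layered on top of the timestamp column. A sketch, assuming df is the DataFrame from UPDATE1 above; the end_date and end_time column names are illustrative, not from the original post:

import org.apache.spark.sql.functions.{date_format, to_date}
import spark.implicits._ // already in scope inside spark-shell

// Sketch only: `end_date` and `end_time` are illustrative names.
// A timestamp column already displays as yyyy-MM-dd HH:mm:ss in show();
// to_date truncates it to a calendar date, and date_format renders the
// time-of-day portion as a string.
val split = df
  .withColumn("end_date", to_date($"end_t"))
  .withColumn("end_time", date_format($"end_t", "HH:mm:ss"))

split.printSchema
split.show(false)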