Question

I have a CSV of the following form:

t,value
2012-01-12 12:30:00,4
2012-01-12 12:45:00,3
2012-01-12 12:00:00,12
2012-01-12 12:15:00,13
2012-01-12 13:00:00,7

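I use spark-csv to load it into a DataFrame (so t is of String type and value is of Integer type). What is the appropriate Spark Scala way to sort the output by time?

I was thinking of converting t to a specific type that would allow sorting the DataFrame, but I am not familiar with which timestamp type allows sorting a DataFrame by time.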

Answer 1

Given the format, you can either cast the column to a timestamp

import org.apache.spark.sql.types.TimestampType

df.select($"t".cast(TimestampType)) // or df.select($"t".cast("timestamp"))

to get a proper datetime, or use the unix_timestamp function (Spark 1.5+; in Spark < 1.5 you can use the Hive UDF of the same name):

import org.apache.spark.sql.functions.unix_timestamp

df.select(unix_timestamp($"t"))

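to get a numeric representation (Unix timestamp in seconds).

On a side note, there is no reason why you couldn't orderBy($"t") directly. Lexicographic order should work just fine here, because the strings are fixed-width, zero-padded, and ordered from the most significant field (year) to the least significant (second).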
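To illustrate the point about lexicographic order, here is a minimal sketch in plain Scala (no Spark required) that sorts the sample rows once by the raw string and once by the parsed time, then checks that both orders agree. The object name LexicographicOrderCheck and the use of UTC for the epoch conversion are assumptions made for this illustration only; they are not part of the original question.

```scala
import java.time.{LocalDateTime, ZoneOffset}
import java.time.format.DateTimeFormatter

// Hypothetical helper object, introduced only for this illustration.
object LexicographicOrderCheck {
  // Pattern matching the sample rows, e.g. "2012-01-12 12:30:00".
  private val fmt = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss")

  // The sample rows from the question, in their original (unsorted) order.
  val rows: Seq[(String, Int)] = Seq(
    ("2012-01-12 12:30:00", 4),
    ("2012-01-12 12:45:00", 3),
    ("2012-01-12 12:00:00", 12),
    ("2012-01-12 12:15:00", 13),
    ("2012-01-12 13:00:00", 7)
  )

  // Sort by the raw string, i.e. lexicographic order.
  val byString: Seq[(String, Int)] = rows.sortBy(_._1)

  // Sort by the parsed time, expressed as epoch seconds.
  // UTC is an arbitrary choice here; any fixed offset gives the same order.
  val byTime: Seq[(String, Int)] = rows.sortBy { case (t, _) =>
    LocalDateTime.parse(t, fmt).toEpochSecond(ZoneOffset.UTC)
  }

  def main(args: Array[String]): Unit = {
    byString.foreach { case (t, v) => println(s"$t,$v") }
    println(s"Same order as chronological sort: ${byString == byTime}")
  }
}
```

This only holds for formats like the one in the question. A format such as "12/1/2012 1:05 PM" would not sort correctly as a string and would need a real timestamp conversion first.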

Answer 2

In addition to @zero323's answer, if you are writing pure SQL you can use the CAST operator as follows:

df.registerTempTable("myTable")    
sqlContext.sql("SELECT CAST(t as timestamp) FROM myTable")

Answer 3

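If you cast with 'df.select', you may end up with only the selected column. To change the type of a given column and keep the other columns, apply 'df.withColumn' and pass the original column name.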

import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types._

val df1 = df.withColumn("t", col("t").cast(TimestampType))

df1.printSchema
root
 |-- t: timestamp (nullable = true)
 |-- value: integer (nullable = true)

Only the data type of the column named "t" is changed. The rest are kept as they were.
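Putting the pieces from this thread together, the cast can be followed directly by orderBy to get the time-sorted output the question asks for. The following is only a sketch under some assumptions: it uses the SparkSession entry point from Spark 2.x and later (the thread itself uses the older sqlContext), it builds the sample data in memory instead of reading it through spark-csv, and the names SortByTimestamp and sortByTime are hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.TimestampType

// Hypothetical object, introduced only for this illustration.
object SortByTimestamp {
  // Cast the string column "t" to a timestamp, keep all other columns,
  // and sort the rows by the converted column in ascending order.
  def sortByTime(df: DataFrame): DataFrame =
    df.withColumn("t", col("t").cast(TimestampType)).orderBy(col("t"))

  def main(args: Array[String]): Unit = {
    // Assumption: Spark 2.x+ running locally.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("sort-by-timestamp")
      .getOrCreate()
    import spark.implicits._

    // In-memory stand-in for the CSV from the question.
    val df = Seq(
      ("2012-01-12 12:30:00", 4),
      ("2012-01-12 12:45:00", 3),
      ("2012-01-12 12:00:00", 12),
      ("2012-01-12 12:15:00", 13),
      ("2012-01-12 13:00:00", 7)
    ).toDF("t", "value")

    val sorted = sortByTime(df)
    sorted.printSchema()
    sorted.show(truncate = false)

    spark.stop()
  }
}
```

Strings that do not parse as timestamps become null after the cast, so it may be worth checking for nulls in "t" before relying on the sorted result.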

Spark Scala: DataFrame timestamp conversion and sorting?
