Spark - Convert a date string in d-MMM-yy format to a timestamp

Asked: 2018-01-15 12:59:46

Tags: apache-spark spark-dataframe

I am trying to parse a CSV file with the following contents into a DataFrame:

+------+---------+----------+
|Symbol|     Date|ClosePrice|
+------+---------+----------+
| SREEL| 1-Jan-14|     298.0|
| SREEL| 2-Jan-14|     299.9|
+------+---------+----------+

However, the code snippet below fails to convert the Date field into a Timestamp field; it gives me incorrect results.

Can anyone help me understand why?

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.{DataFrame, SQLContext}
    import org.apache.spark.sql.functions.unix_timestamp
    import org.apache.spark.sql.types.{DoubleType, TimestampType}

    val sparkConf = new SparkConf().setAppName("TimeSeriesForecast").setMaster("local")
    sparkConf.set("spark.sql.shuffle.partitions", "4")
    val sparkContext = new SparkContext(sparkConf)
    val sqlContext = new SQLContext(sparkContext)

    // Read the CSV with a header row and let Spark infer the column types
    val stockDF: DataFrame = sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .option("inferSchema", "true")
      .load("data/Sreeleathers_Share_Price.csv")

    val priceDF: DataFrame = stockDF.select(stockDF("Symbol").as("Symbol"),
      stockDF("Date").as("Date"),
      stockDF("Close Price").as("ClosePrice"))

    //priceDF.printSchema
    //priceDF.show

    import sqlContext.implicits._

    // Parse "1-Jan-14"-style strings with pattern d-MMM-yy, then cast to TimestampType
    val finalDf: DataFrame = priceDF
      .withColumn("Price", priceDF("ClosePrice").cast(DoubleType))
      .withColumn("TimeStamp", unix_timestamp($"Date", "d-MMM-yy").cast(TimestampType))
      .drop("Date").drop("ClosePrice")
      .sort("TimeStamp")
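
One way to narrow down "incorrect results" (a debugging sketch, not part of the original post): unix_timestamp returns null for any row it cannot parse, so filtering for nulls shows whether the d-MMM-yy pattern actually matches the raw strings, e.g. stray whitespace in the CSV or a non-English JVM locale failing to recognize "Jan".

    // Sketch: rows where unix_timestamp fails to parse come back as null,
    // so any rows shown here indicate a pattern/data/locale mismatch.
    priceDF
      .withColumn("parsed", unix_timestamp($"Date", "d-MMM-yy"))
      .filter($"parsed".isNull)
      .show()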

1 Answer:

Answer 0 (score: 0)

I tried the following on Spark 1.6 and it seems to work. Posting as an answer because it is too long for a comment.

import sqlContext.implicits._
import org.apache.spark.sql.functions.unix_timestamp
import org.apache.spark.sql.types.TimestampType

val myDF = Seq(("1-Jan-14", 2, 1L), ("2-Jan-14", 1, 2L)).toDF("Date", "col2", "col3")
myDF.show()
+--------+----+----+
|    Date|col2|col3|
+--------+----+----+
|1-Jan-14|   2|   1|
|2-Jan-14|   1|   2|
+--------+----+----+

myDF.withColumn("TimeStamp", unix_timestamp($"Date","d-MMM-yy").cast(TimestampType)).show()
+--------+----+----+--------------------+
|    Date|col2|col3|           TimeStamp|
+--------+----+----+--------------------+
|1-Jan-14|   2|   1|2014-01-01 00:00:...|
|2-Jan-14|   1|   2|2014-01-02 00:00:...|
+--------+----+----+--------------------+
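
For anyone on Spark 2.2 or later, the same conversion can be written with to_timestamp, which parses directly to TimestampType without the unix_timestamp/cast round trip. A minimal sketch, assuming a local SparkSession rather than the SQLContext used above:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.to_timestamp

val spark = SparkSession.builder()
  .appName("TimeSeriesForecast")
  .master("local")
  .getOrCreate()
import spark.implicits._

val df = Seq("1-Jan-14", "2-Jan-14").toDF("Date")
// to_timestamp (available since Spark 2.2) returns a TimestampType column directly
df.withColumn("TimeStamp", to_timestamp($"Date", "d-MMM-yy")).show()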