Convert a week-based string to a date in Spark

Time: 2017-05-04 14:59:29

Tags: scala apache-spark apache-spark-sql

I have a string in the format

"5/02/2016" // d/ww/yyyy

which I want to convert to the format

yyyy-MM-dd

I tried the following:

val df = Seq((1L, "5/02/2016"), (2L, "aaa")).toDF("id", "date")
val ts = unix_timestamp($"date", "d/ww/yyyy").cast("timestamp")
df.withColumn("ts", ts).show(2, false)

This gives me

//output
+---+---------+-----------+
|id |date     |ts         |
+---+---------+-----------+
|1  |5/02/2016|2016-01-05 |
|2  |aaa      |null       |
+---+---------+-----------+

whereas what I want is

//expected
+---+---------+-----------+
|id |date     |ts         |
+---+---------+-----------+
|1  |5/02/2016|2016-01-19 |
|2  |aaa      |null       |
+---+---------+-----------+

2 answers:

Answer 0: (score: 1)

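Converting dates is tricky. In this case, leap years prevent us from mapping the day number directly to a fixed month and day of the month: the same week and day can land on different dates depending on the year.

In Scala, we can use java.util.GregorianCalendar: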

def weekToDate(weekStr: String) = {
  val (day, week, year) = {
    val arr = weekStr.split("/").map(_.toInt)
    (arr(0), arr(1), arr(2))
  }
  val cal = new java.util.GregorianCalendar()
  cal.set(java.util.Calendar.YEAR, year)
  cal.set(java.util.Calendar.DAY_OF_YEAR, 7 * week + day)
  new java.text.SimpleDateFormat("yyyy-MM-dd").format(cal.getTime)
}
weekToDate("5/02/2016") // res0: String = 2016-01-19

// Leap year example
weekToDate("4/08/2016") // res1: String = 2016-02-29
weekToDate("4/08/2017") // res2: String = 2017-03-01
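
The same arithmetic can also be sketched with java.time (Java 8+), which avoids the mutable calendar. The name weekToDateJavaTime is hypothetical and not part of the original answer; like the code above, it assumes the input means "day number 7 * week + day of the year" rather than a calendar week.

```scala
import java.time.LocalDate

// Hypothetical java.time variant of weekToDate (not from the original answer).
// Interprets "d/ww/yyyy" as day number 7 * week + day of the given year.
def weekToDateJavaTime(weekStr: String): String = {
  val Array(day, week, year) = weekStr.split("/").map(_.toInt)
  // Start at January 1 (day 1) and move forward to the requested day number.
  LocalDate.of(year, 1, 1).plusDays(7L * week + day - 1).toString
}

weekToDateJavaTime("5/02/2016") // 2016-01-19
weekToDateJavaTime("4/08/2016") // 2016-02-29
weekToDateJavaTime("4/08/2017") // 2017-03-01
```

LocalDate.toString already produces the ISO yyyy-MM-dd form, so no separate formatter is needed in this sketch.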

Putting it all together:

import spark.implicits._
import org.apache.spark.sql.functions.udf

def weekToDate(weekStr: String) = {
  val (day, week, year) = {
    val arr = weekStr.split("/").map(_.toInt)
    (arr(0), arr(1), arr(2))
  }
  val cal = new java.util.GregorianCalendar()
  cal.set(java.util.Calendar.YEAR, year)
  cal.set(java.util.Calendar.DAY_OF_YEAR, 7 * week + day)
  new java.text.SimpleDateFormat("yyyy-MM-dd").format(cal.getTime)
}

val df = Seq((1L, "5/02/2016"), (2L, "4/8/2016")).toDF("id", "date").select("date")

val wfn: String => String = weekToDate(_)
val tsUDF=udf(wfn)
df.withColumn("ts", tsUDF('date)).show(2, false)

+---------+----------+
|date     |ts        |
+---------+----------+
|5/02/2016|2016-01-19|
|4/8/2016 |2016-02-29|
+---------+----------+
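
Note that the function above throws on input that is not three slash-separated integers, such as the "aaa" row in the question. A minimal sketch of a null-safe variant follows, assuming it is acceptable to map every malformed input to a missing value; weekToDateOpt is a hypothetical name, not part of the original answer.

```scala
import java.text.SimpleDateFormat
import java.util.{Calendar, GregorianCalendar}
import scala.util.Try

// Hypothetical null-safe variant of weekToDate: returns None instead of
// throwing when the input is not three slash-separated integers.
def weekToDateOpt(weekStr: String): Option[String] = Try {
  val Array(day, week, year) = weekStr.split("/").map(_.toInt)
  val cal = new GregorianCalendar()
  cal.set(Calendar.YEAR, year)
  cal.set(Calendar.DAY_OF_YEAR, 7 * week + day)
  new SimpleDateFormat("yyyy-MM-dd").format(cal.getTime)
}.toOption

weekToDateOpt("5/02/2016") // Some(2016-01-19)
weekToDateOpt("aaa")       // None
```

Wrapped as udf(weekToDateOpt _), the None case should surface as null in the resulting column, since Spark treats a Scala Option result as a nullable value.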

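Answer 1: (score: 1)

As @puhlen pointed out, the day of the week should be u, not d (see SimpleDateFormat). In that pattern language, d is the day of the month, while u is the day number of the week (1 = Monday, ..., 7 = Sunday):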

val df = Seq((1L, "5/02/2016"), (2L, "aaa")).toDF("id", "date")
val ts = unix_timestamp($"date", "u/ww/yyyy").cast("timestamp")
df.withColumn("ts", ts).show(2, false)

+---+---------+---------------------+
|id |date     |ts                   |
+---+---------+---------------------+
|1  |5/02/2016|2016-01-08 00:00:00.0|
|2  |aaa      |null                 |
+---+---------+---------------------+

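Also note that you should not expect 5/02/2016 to equal the result of the 2 x 7 + 5 day arithmetic. If you check a 2016 calendar, the Friday of the second week is actually January 8.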
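
The difference between the two interpretations can be reproduced outside Spark with a small sketch. It assumes Locale.US week rules (weeks start on Sunday, and week 1 is the week containing January 1), which is consistent with the output shown above; the value names are illustrative only.

```scala
import java.text.SimpleDateFormat
import java.time.LocalDate
import java.util.Locale

// Calendar-based reading: u = day number of week (5 = Friday),
// ww = week in year, yyyy = year. Locale.US is an assumption made here.
val parser = new SimpleDateFormat("u/ww/yyyy", Locale.US)
val printer = new SimpleDateFormat("yyyy-MM-dd", Locale.US)
val calendarBased = printer.format(parser.parse("5/02/2016"))

// Arithmetic reading for comparison: day number 2 * 7 + 5 = 19 of 2016.
val arithmetic = LocalDate.ofYearDay(2016, 2 * 7 + 5).toString

println(s"calendar-based: $calendarBased, arithmetic: $arithmetic")
// calendar-based: 2016-01-08, arithmetic: 2016-01-19
```

January 1, 2016 was a Friday, so under these week rules week 1 ends on January 2 and the Friday of week 2 is January 8, eleven days before the date the arithmetic produces.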