I have a string in the following format:
"5/02/2016" // d/ww/yyyy
I want to convert it to the format
yyyy-mm-dd
I tried the following:
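// Assumes a SparkSession named spark, as in spark-shell.
import spark.implicits._
import org.apache.spark.sql.functions.unix_timestamp
val df = Seq((1L, "5/02/2016"), (2L, "aaa")).toDF("id", "date")
val ts = unix_timestamp($"date", "d/ww/yyyy").cast("timestamp")
df.withColumn("ts", ts).show(2, false)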
I got:
//output
+---+---------+-----------+
|id |date     |ts         |
+---+---------+-----------+
|1  |5/02/2016|2016-01-05 |
|2  |aaa      |null       |
+---+---------+-----------+
whereas I wanted:
//expected
+---+---------+-----------+
|id |date     |ts         |
+---+---------+-----------+
|1  |5/02/2016|2016-01-19 |
|2  |aaa      |null       |
+---+---------+-----------+
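Answer 0: (score: 1)
Converting dates is tricky. In this case, leap years prevent us from mapping the input directly to a fixed month and day of the month, so it is easier to let a calendar class do the arithmetic.
In Scala, we can use java.util.GregorianCalendar: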
def weekToDate(weekStr: String) = {
  // Split "d/ww/yyyy" into its three integer fields.
  val (day, week, year) = {
    val arr = weekStr.split("/").map(_.toInt)
    (arr(0), arr(1), arr(2))
  }
  val cal = new java.util.GregorianCalendar()
  cal.set(java.util.Calendar.YEAR, year)
  // Treat the input as the (7 * week + day)-th day of the year.
  cal.set(java.util.Calendar.DAY_OF_YEAR, 7 * week + day)
  new java.text.SimpleDateFormat("yyyy-MM-dd").format(cal.getTime)
}
weekToDate("5/02/2016") // res0: String = 2016-01-19
// Leap year example
weekToDate("4/08/2016") // res1: String = 2016-02-29
weekToDate("4/08/2017") // res2: String = 2017-03-01
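For comparison, here is a minimal sketch of the same arithmetic using java.time instead of GregorianCalendar. The name weekToDateJavaTime is made up for this illustration and is not part of the answer. It keeps the same assumption that the date is simply day number 7 * week + day of the year. One difference worth knowing: LocalDate.ofYearDay throws a DateTimeException when the day number exceeds the length of the year, whereas the lenient GregorianCalendar rolls over into the next year.

```scala
import java.time.LocalDate

// Same arithmetic as weekToDate: day-of-year = 7 * week + day.
// This is plain day counting, not a lookup of real calendar weeks.
def weekToDateJavaTime(weekStr: String): String = {
  val Array(day, week, year) = weekStr.split("/").map(_.trim.toInt)
  // LocalDate.toString produces ISO-8601 output, i.e. yyyy-MM-dd.
  LocalDate.ofYearDay(year, 7 * week + day).toString
}

weekToDateJavaTime("5/02/2016") // 2016-01-19
weekToDateJavaTime("4/08/2016") // 2016-02-29 (leap year)
weekToDateJavaTime("4/08/2017") // 2017-03-01
```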
Putting it all together:
import spark.implicits._
import org.apache.spark.sql.functions.udf
def weekToDate(weekStr: String) = {
  val (day, week, year) = {
    val arr = weekStr.split("/").map(_.toInt)
    (arr(0), arr(1), arr(2))
  }
  val cal = new java.util.GregorianCalendar()
  cal.set(java.util.Calendar.YEAR, year)
  cal.set(java.util.Calendar.DAY_OF_YEAR, 7 * week + day)
  new java.text.SimpleDateFormat("yyyy-MM-dd").format(cal.getTime)
}
val df = Seq((1L, "5/02/2016"), (2L, "4/8/2016")).toDF("id", "date").select("date")
val wfn: String => String = weekToDate(_)
val tsUDF=udf(wfn)
df.withColumn("ts", tsUDF('date)).show(2, false)
+---------+----------+
|date     |ts        |
+---------+----------+
|5/02/2016|2016-01-19|
|4/8/2016 |2016-02-29|
+---------+----------+
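Note that the question's sample data contained a malformed value ("aaa"), and a conversion that calls toInt would throw on it. A possible way to get null for such rows is to wrap the conversion so that it returns an Option. The sketch below uses a hypothetical helper called nullSafe and a stand-in conversion (parseDemo) so that it runs without Spark; neither name comes from the answer.

```scala
import scala.util.Try

// Hypothetical helper (not part of the answer): turn a throwing
// String => String conversion into one that yields None on any
// non-fatal failure (bad number, wrong field count, null input).
def nullSafe(convert: String => String): String => Option[String] =
  input => Try(convert(input)).toOption

// Stand-in conversion for demonstration only; in the answer this
// role would be played by weekToDate.
val parseDemo: String => String =
  s => s.split("/").map(_.toInt).mkString("-")

nullSafe(parseDemo)("5/02/2016") // Some(5-2-2016)
nullSafe(parseDemo)("aaa")       // None
nullSafe(parseDemo)(null)        // None
```

With Spark on the classpath, something like udf(nullSafe(weekToDate)) should then produce null in the ts column for unparseable rows, since Spark maps None to null.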
Answer 1: (score: 1)
As @puhlen pointed out, the pattern letter for day of the week is u, not d, which means day of the month (see SimpleDateFormat):
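// Assumes a SparkSession named spark, as in spark-shell.
import spark.implicits._
import org.apache.spark.sql.functions.unix_timestamp
val df = Seq((1L, "5/02/2016"), (2L, "aaa")).toDF("id", "date")
val ts = unix_timestamp($"date", "u/ww/yyyy").cast("timestamp")
df.withColumn("ts", ts).show(2, false)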
+---+---------+---------------------+
|id |date     |ts                   |
+---+---------+---------------------+
|1  |5/02/2016|2016-01-08 00:00:00.0|
|2  |aaa      |null                 |
+---+---------+---------------------+
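Also note that you should not expect 5/02/2016 to be the same as the result of the 2 x 7 + 5 day-of-year arithmetic. If you check a 2016 calendar, the Friday of the second week is actually January 8.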
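To check this without a paper calendar, here is a small sketch using java.time. It assumes US-style week rules (weeks start on Sunday and week 1 is the week containing January 1), which is what WeekFields.SUNDAY_START encodes; under ISO-8601 week rules the week numbers would differ. The helper name describe is made up for this illustration.

```scala
import java.time.LocalDate
import java.time.temporal.WeekFields

// Assumption: weeks start on Sunday and week 1 contains January 1.
val usWeeks = WeekFields.SUNDAY_START

// Report the day of week (ISO numbering, 1 = Monday) and week of year.
def describe(date: LocalDate): String = {
  val week = date.get(usWeeks.weekOfYear())
  val dow = date.getDayOfWeek
  s"$date is $dow (ISO day ${dow.getValue}) of week $week"
}

describe(LocalDate.of(2016, 1, 8))  // 2016-01-08 is FRIDAY (ISO day 5) of week 2
describe(LocalDate.of(2016, 1, 19)) // 2016-01-19 is TUESDAY (ISO day 2) of week 4
```

So under these week rules, 2016-01-19 would correspond to 2/04/2016 rather than 5/02/2016.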