Consider the following DataFrame:
val df = Seq("20140101", "20170619")
.toDF("date")
.withColumn("date", to_date($"date", "yyyyMMdd"))
.withColumn("week", date_format($"date", "Y-ww"))
The code produces:

date: date
week: string

date        week
2014-01-01  2014-01
2017-06-19  2017-25
What I'd like to do is pad out the DataFrame, so that I have one row for every week in the interval between 2014-01 and 2017-25. The date column isn't important, so it can be discarded. This needs to be done across a huge number of customer/product ID combinations, so I'm looking for an efficient solution, preferably one that doesn't use java.sql.Date and the built-in date functionality in Spark.
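For reference, one way to do this padding without any driver-side loop is with Spark's built-in array functions. This is a minimal sketch, assuming Spark 2.4+ (for `sequence` with a day-interval step) and a spark-shell/`spark` session; the `weeks` name is just illustrative:

```scala
import org.apache.spark.sql.functions._

// Sketch: enumerate every date between the two bounds with the built-in
// `sequence`, keep one row per week (Sundays here), and derive the "Y-ww" label.
val weeks = spark.range(1)
  .select(explode(sequence(
    to_date(lit("2014-01-01")),
    to_date(lit("2017-06-19")),
    expr("interval 1 day"))).as("dt"))
  .filter(date_format(col("dt"), "E") === "Sun")
  .withColumn("wk", date_format(col("dt"), "Y-ww"))
```

Because everything happens inside Spark SQL, this version can be cross-joined against the customer/product ID combinations without collecting dates to the driver.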
Answer 0 (score: 1)
Check this out. I've used the default "Sunday" as the start-of-week day for numbering.
scala> import java.time._
import java.time._
scala> import java.time.format._
import java.time.format._
scala> val a = java.sql.Date.valueOf("2014-01-01")
a: java.sql.Date = 2014-01-01
scala> val b = java.sql.Date.valueOf("2017-12-31")
b: java.sql.Date = 2017-12-31
scala> val a1 = a.toLocalDate.toEpochDay.toInt
a1: Int = 16071
scala> val b1 = b.toLocalDate.toEpochDay.toInt
b1: Int = 17531
scala> val c1 = (a1 until b1).map(LocalDate.ofEpochDay(_)).map(x => (x,x.format(DateTimeFormatter.ofPattern("Y-ww")),x.format(DateTimeFormatter.ofPattern("E")) ) ).filter( x=> x._3 =="Sun" ).map(x => (java.sql.Date.valueOf(x._1),x._2) ).toMap
c1: scala.collection.immutable.Map[java.sql.Date,String] = Map(2014-06-01 -> 2014-23, 2014-11-02 -> 2014-45, 2017-11-05 -> 2017-45, 2016-10-23 -> 2016-44, 2014-11-16 -> 2014-47, 2014-12-28 -> 2015-01, 2017-04-30 -> 2017-18, 2015-01-04 -> 2015-02, 2015-10-11 -> 2015-42, 2014-09-07 -> 2014-37, 2017-09-17 -> 2017-38, 2014-04-13 -> 2014-16, 2014-10-19 -> 2014-43, 2014-01-05 -> 2014-02, 2016-07-17 -> 2016-30, 2015-07-26 -> 2015-31, 2016-09-18 -> 2016-39, 2015-11-22 -> 2015-48, 2015-10-04 -> 2015-41, 2015-11-15 -> 2015-47, 2015-01-11 -> 2015-03, 2016-12-11 -> 2016-51, 2017-02-05 -> 2017-06, 2016-03-27 -> 2016-14, 2015-11-01 -> 2015-45, 2017-07-16 -> 2017-29, 2015-05-24 -> 2015-22, 2017-06-18 -> 2017-25, 2016-03-13 -> 2016-12, 2014-11-09 -> 2014-46, 2014-09-21 -> 2014-39, 2014-01-26 -> 2014-05...
scala> val df = Seq( (c1) ).toDF("a")
df: org.apache.spark.sql.DataFrame = [a: map<date,string>]
scala> val df2 = df.select(explode('a).as(Seq("dt","wk")) )
df2: org.apache.spark.sql.DataFrame = [dt: date, wk: string]
scala> df2.orderBy('dt).show(false)
+----------+-------+
|dt |wk |
+----------+-------+
|2014-01-05|2014-02|
|2014-01-12|2014-03|
|2014-01-19|2014-04|
|2014-01-26|2014-05|
|2014-02-02|2014-06|
|2014-02-09|2014-07|
|2014-02-16|2014-08|
|2014-02-23|2014-09|
|2014-03-02|2014-10|
|2014-03-09|2014-11|
|2014-03-16|2014-12|
|2014-03-23|2014-13|
|2014-03-30|2014-14|
|2014-04-06|2014-15|
|2014-04-13|2014-16|
|2014-04-20|2014-17|
|2014-04-27|2014-18|
|2014-05-04|2014-19|
|2014-05-11|2014-20|
|2014-05-18|2014-21|
+----------+-------+
only showing top 20 rows
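The long REPL one-liner that builds `c1` can be unrolled into a plain function for readability. This is just a restatement of the same steps (epoch-day range, keep Sundays, pair each date with its "Y-ww" label); the only change is filtering on `DayOfWeek.SUNDAY` instead of the locale-dependent formatted string "Sun", and the `sundayWeeks` name is mine:

```scala
import java.time.{DayOfWeek, LocalDate}
import java.time.format.DateTimeFormatter

// Walk the epoch-day range [start, end), keep only Sundays,
// and pair each Sunday with its "Y-ww" week label.
def sundayWeeks(start: LocalDate, end: LocalDate): Seq[(java.sql.Date, String)] = {
  val weekFmt = DateTimeFormatter.ofPattern("Y-ww")
  (start.toEpochDay until end.toEpochDay)
    .map(LocalDate.ofEpochDay(_))
    .filter(_.getDayOfWeek == DayOfWeek.SUNDAY)
    .map(d => (java.sql.Date.valueOf(d), d.format(weekFmt)))
}
```

From there, `sundayWeeks(LocalDate.parse("2014-01-01"), LocalDate.parse("2017-12-31")).toDF("dt", "wk")` yields the same two-column frame directly, skipping the intermediate `Map` (which would silently collapse any duplicate keys).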