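Spark: finding the range of year-weeks between two weeks

Time: 2019-09-19 16:47:32

Tags: scala date dataframe apache-spark hadoop

I need to find all the year-weeks between two given weeks.

201824 is an example of a year-week. It denotes week 24 of 2018.

Assuming a year has 52 weeks, the year-weeks of 2018 run from 201801 to 201852. After that, the sequence continues with 201901.

If the start week and end week fall in the same year, I can find the range of year-weeks between them with: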

val range = udf((i: Int, j: Int) => (i to j).toArray)

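The code above only works when the start week and end week are in the same year, for example 201912 - 201917.

How can I make it work when the start week and end week belong to different years?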

Example: 201849 - 201903

The above weeks should give the output as: 
201849,201850,201851,201852,201901,201902,201903
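
To make the failure mode concrete, here is a minimal sketch in plain Scala (no Spark needed). It shows that the plain integer range walks through values that are not valid year-weeks, and that filtering on the week part is enough under the simplifying assumption stated above that every year has exactly 52 weeks. The names `naiveRange` and `weeksAssuming52` are hypothetical and used only for illustration.

```scala
// Plain integer range, as in the UDF above: crosses 201853..201900 as well.
def naiveRange(i: Int, j: Int): Seq[Int] = i to j

// Assumption (from the question, not a calendar fact): weeks run 01..52 in every year.
// Keep only the values whose last two digits form a valid week number.
def weeksAssuming52(i: Int, j: Int): Seq[Int] =
  (i to j).filter { yw =>
    val week = yw % 100
    week >= 1 && week <= 52
  }

println(naiveRange(201849, 201903).size)                // 55 values, most of them invalid
println(weeksAssuming52(201849, 201903).mkString(","))  // 201849,201850,201851,201852,201901,201902,201903
```

This does not handle years that have a week 53 in the ISO calendar; it only reflects the 52-week assumption.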

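2 answers:

Answer 0 (score: 1)

There is still a lot of room for optimization, but as a general direction you can use the following.
I use org.joda.time.format here, but java.time should work as well.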

import org.joda.time.Weeks
import org.joda.time.format.DateTimeFormat

def rangeOfYearWeeks(weeksRange: String): Array[String] = {
  try {
    val left  = weeksRange.split("-")(0).trim
    val right = weeksRange.split("-")(1).trim

    // "201849" -> "2018-49" so that year and week are parsed as separate fields
    val leftPattern  = s"${left.substring(0, 4)}-${left.substring(4)}"
    val rightPattern = s"${right.substring(0, 4)}-${right.substring(4)}"

    // xxxx = week-based year, ww = zero-padded week of that year
    val fmt = DateTimeFormat.forPattern("xxxx-ww")

    val leftDate  = fmt.parseDateTime(leftPattern)
    val rightDate = fmt.parseDateTime(rightPattern)
    // If leftDate is after rightDate, weeksBetween is negative and the range below is empty
    val weeksBetween = Weeks.weeksBetween(leftDate, rightDate).getWeeks
    val dates = for (one <- 0 to weeksBetween) yield {
      leftDate.plusWeeks(one)
    }

    val result: Array[String] = dates.map(date => fmt.print(date)).map(_.replaceAll("-", "")).toArray
    result
  } catch {
    // Malformed input (missing "-", non-numeric text, invalid week) yields an empty array
    case e: Exception => Array.empty
  }
}

Example:

val dates = Seq("201849 - 201903", "201912 - 201917").toDF("col")

val weeks = udf((d: String) => rangeOfYearWeeks(d))

dates.select(weeks($"col")).show(false)

+--------------------------------------------------------+
|UDF(col)                                                |
+--------------------------------------------------------+
|[201849, 201850, 201851, 201852, 201901, 201902, 201903]|
|[201912, 201913, 201914, 201915, 201916, 201917]        |
+--------------------------------------------------------+

Answer 1 (score: 1)

Here is a UDF-based solution using the java.time API:

def weeksBetween = udf{ (startWk: Int, endWk: Int) =>
  import java.time.LocalDate
  import java.time.format.DateTimeFormatter
  import scala.util.{Try, Success, Failure}

  def formatYW(yw: Int): String = {
    val pattern = "(\\d{4})(\\d+)".r
    s"$yw" match { case pattern(y, w) => s"$y-$w-1"}
  }

  val formatter = DateTimeFormatter.ofPattern("YYYY-w-e")  // week-based year

  Try(
    Iterator.iterate(LocalDate.parse(formatYW(startWk), formatter))(_.plusWeeks(1)).
      takeWhile(_.isBefore(LocalDate.parse(formatYW(endWk), formatter))).
      map{ s =>
        val a = s.format(formatter).split("-")
        (a(0) + f"${a(1).toInt}%02d").toInt
      }.
      toList.tail  // drop the start week; takeWhile(isBefore) has already excluded the end week
  ) match {
    case Success(ls) => ls
    case Failure(_) => List.empty[Int]  // return an empty list
  }
}

Testing the UDF:

val df = Seq(
  (1, 201849, 201903), (2, 201908, 201916), (3, 201950, 201955)
).toDF("id", "start_wk", "end_wk")

df.withColumn("weeks_between", weeksBetween($"start_wk", $"end_wk")).show(false)
// +---+--------+------+--------------------------------------------------------+
// |id |start_wk|end_wk|weeks_between                                           |
// +---+--------+------+--------------------------------------------------------+
// |1  |201849  |201903|[201850, 201851, 201852, 201901, 201902]                |
// |2  |201908  |201916|[201909, 201910, 201911, 201912, 201913, 201914, 201915]|
// |3  |201950  |201955|[]                                                      |
// +---+--------+------+--------------------------------------------------------+
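
As the output shows, the UDF above excludes both the start and end week, and the pattern letters `Y`, `w` and `e` follow the week definition of the formatter's locale. If the inclusive range from the question is wanted with fixed ISO-8601 week rules, a sketch along the following lines may help. It assumes ISO weeks are the intended numbering; `mondayOf`, `yearWeekOf` and `isoWeeksInclusive` are hypothetical helper names, not part of either answer.

```scala
import java.time.LocalDate
import java.time.temporal.{ChronoField, IsoFields}

// Monday of the given ISO year-week, e.g. 201849 -> 2018-12-03.
// Throws DateTimeException if the week part is outside 1..53.
def mondayOf(yearWeek: Int): LocalDate = {
  val (year, week) = (yearWeek / 100, yearWeek % 100)
  LocalDate.of(year, 1, 4) // 4 January always lies in ISO week 1
    .`with`(IsoFields.WEEK_OF_WEEK_BASED_YEAR, week.toLong)
    .`with`(ChronoField.DAY_OF_WEEK, 1L)
}

// ISO year-week of a date as an Int, e.g. 2018-12-31 -> 201901.
def yearWeekOf(d: LocalDate): Int =
  d.get(IsoFields.WEEK_BASED_YEAR) * 100 + d.get(IsoFields.WEEK_OF_WEEK_BASED_YEAR)

// All year-weeks from startWk to endWk, both endpoints included.
// Returns an empty list when startWk is after endWk.
def isoWeeksInclusive(startWk: Int, endWk: Int): List[Int] = {
  val end = mondayOf(endWk)
  Iterator.iterate(mondayOf(startWk))(_.plusWeeks(1))
    .takeWhile(!_.isAfter(end))
    .map(yearWeekOf)
    .toList
}

println(isoWeeksInclusive(201849, 201903))
// List(201849, 201850, 201851, 201852, 201901, 201902, 201903)
println(isoWeeksInclusive(202052, 202101))
// List(202052, 202053, 202101)  (2020 has 53 ISO weeks)
```

In Spark this could be wrapped as `udf(isoWeeksInclusive _)`, ideally with a `Try` around the body if invalid week numbers such as 201955 can appear in the data.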