Get a series of dates from a start date and end date in Spark SQL

Time: 2017-06-16 06:39:01

Tags: apache-spark apache-spark-sql

I have to convert date tuples containing a start date and an end date into a series of dates.

+-----------------------------------------+
|dateRange                                |
+-----------------------------------------+
|[2017-04-06 00:00:00,2017-04-05 00:00:00]|
|[2017-04-05 00:00:00,2017-04-04 00:00:00]|
|[2017-04-04 00:00:00,2017-04-03 00:00:00]|
|[2017-04-03 00:00:00,2017-03-31 00:00:00]|
|[2017-03-31 00:00:00,2017-03-30 00:00:00]|
|[2017-03-30 00:00:00,2017-03-29 00:00:00]|
|[2017-03-29 00:00:00,2017-03-28 00:00:00]|
|[2017-03-28 00:00:00,2017-03-27 00:00:00]|
|[2017-04-06 00:00:00,2017-04-05 00:00:00]|
|[2017-04-05 00:00:00,2017-04-04 00:00:00]|
|[2017-04-04 00:00:00,2017-04-03 00:00:00]|
|[2017-04-03 00:00:00,2017-03-31 00:00:00]|
|[2017-03-31 00:00:00,2017-03-30 00:00:00]|
|[2017-03-30 00:00:00,2017-03-29 00:00:00]|
|[2017-03-29 00:00:00,2017-03-28 00:00:00]|
|[2017-03-28 00:00:00,2017-03-27 00:00:00]|
|[2017-04-06 00:00:00,2017-04-05 00:00:00]|
+-----------------------------------------+

How can I convert these tuples into a series containing every date between the 'from' and 'to' dates?

|[2017-04-03 00:00:00,2017-03-31 00:00:00]|

should be converted to

|[2017-04-03 00:00:00,2017-04-02 00:00:00,2017-04-01 00:00:00,2017-03-31 00:00:00]|
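
For a single pair, the expansion I am after can be sketched in plain Scala with java.time (just an illustration of the expected result; the actual column types in the DataFrame may differ):

  import java.time.LocalDate
  import java.time.temporal.ChronoUnit

  // expand a (from, to) pair into every date from 'from' down to 'to', inclusive of both ends
  // (assumes from >= to, as in the sample data above)
  def expand(from: LocalDate, to: LocalDate): Seq[String] = {
    val days = ChronoUnit.DAYS.between(to, from)
    (0L to days).map(d => from.minusDays(d).toString + " 00:00:00")
  }

  // expand(LocalDate.parse("2017-04-03"), LocalDate.parse("2017-03-31"))
  // returns: 2017-04-03 00:00:00, 2017-04-02 00:00:00, 2017-04-01 00:00:00, 2017-03-31 00:00:00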

2 answers:

Answer 0 (score: 2)

I tried the snippet below and it worked for me.

  import org.apache.spark.sql.functions._
  import org.joda.time.LocalDate

  // iterate day by day from 'start' (inclusive) up to 'end' (exclusive)
  def dayIterator(start: LocalDate, end: LocalDate) =
    Iterator.iterate(start)(_ plusDays 1) takeWhile (_ isBefore end)

  // build the series of dates between the two timestamp strings
  def dateSeries(date1: String, date2: String): Array[String] = {
    val fromDate = new LocalDate(date1.split(" ")(0))   // keep only the date part
    val toDate = new LocalDate(date2.split(" ")(0))
    val series = dayIterator(fromDate, toDate).toArray
    series.map(a => a.toString() + " 00:00:00.0")       // re-append the time part
  }

  val DateSeries = udf(dateSeries(_: String, _: String))


scala> dateSeries("2017-03-31 00:00:00.0","2017-04-03 00:00:00.0")
res53: Array[String] = Array(2017-03-31, 2017-04-01, 2017-04-02)

One thing I could not figure out: even after appending "00:00:00.0" in the map operation of the dateSeries method, the returned array does not contain the appended string.
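
A rough usage sketch (not shown above), assuming the dateRange tuple has already been split into two string columns, here called "from" and "to":

  import org.apache.spark.sql.functions.col

  // hypothetical DataFrame 'df' with string columns "from" and "to"
  val expanded = df.withColumn("dateSeries", DateSeries(col("from"), col("to")))
  expanded.show(false)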

Answer 1 (score: 1)

Creating a UDF and calculating the dates between fromDate and toDate should solve the problem. For simplicity I have used the Joda-Time API. You need to add that dependency as follows.

For SBT:

libraryDependencies += "joda-time" % "joda-time" % "2.8.1"

Below is an example for your problem:

    import org.apache.spark.sql.functions.udf
    import org.joda.time.format.DateTimeFormat
    import spark.implicits._

    val data = spark.sparkContext.parallelize(Seq(
      ("2017-04-03 00:00:00,2017-03-31 00:00:00"),
      ("2017-03-31 00:00:00,2017-03-30 00:00:00"),
      ("2017-03-30 00:00:00,2017-03-29 00:00:00"),
      ("2017-03-29 00:00:00,2017-03-28 00:00:00"),
      ("2017-03-28 00:00:00,2017-03-27 00:00:00"),
      ("2017-04-03 00:00:00,2017-03-31 00:00:00"),
      ("2017-04-06 00:00:00,2017-04-05 00:00:00")
    )).toDF("dateRanges")

    val calculateDate = udf((date: String) => {
      val dtf = DateTimeFormat.forPattern("yyyy-MM-dd HH:mm:ss")

      val from = dtf.parseDateTime(date.split(",")(0)).toDateTime()
      val to   = dtf.parseDateTime(date.split(",")(1)).toDateTime()
      val dates = scala.collection.mutable.MutableList[String]()

      // walk from the 'from' date towards the 'to' date one day at a time,
      // so the loop works whether 'to' is before or after 'from'
      var current = from
      while (current.getMillis != to.getMillis) {
        dates += current.toString(dtf)
        if (current.getMillis > to.getMillis) current = current.minusDays(1)
        else current = current.plusDays(1)
      }
      dates += to.toString(dtf)   // include the 'to' boundary as well
      dates.toList
    })

    data.withColumn("newDate", calculateDate(data("dateRanges"))).show(false)

This works for both cases, whether toDate is before or after fromDate.
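
As a side note, the while loop could also be replaced by counting the days between the two boundaries with Joda's Days helper; a rough sketch assuming the same comma-separated input format:

    import org.joda.time.Days
    import org.joda.time.format.DateTimeFormat

    val dtf = DateTimeFormat.forPattern("yyyy-MM-dd HH:mm:ss")

    // expands "from,to" into every date from 'from' towards 'to', inclusive of both ends
    def expandRange(range: String): Seq[String] = {
      val from = dtf.parseDateTime(range.split(",")(0))
      val to   = dtf.parseDateTime(range.split(",")(1))
      val days = Days.daysBetween(to, from).getDays   // negative when 'to' is after 'from'
      val step = if (days >= 0) -1 else 1             // direction to walk from 'from' to 'to'
      (0 to math.abs(days)).map(d => from.plusDays(d * step).toString(dtf))
    }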

Hope this helps!