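How to apply window functions in an in-memory transformation with new columns in Scala

Time: 2020-10-16 21:48:20

Tags: scala apache-spark apache-spark-sql

I have a dataframe that I want to transform into the output below, where each row's start_duration and end_duration are derived from the previous row's start_duration and end_duration. Please let me know how to achieve this in Spark using Scala.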

Below is the formula to calculate start_duration and end_duration:

start_duration = max(previous end_duration + 1, current row's date)
end_duration = min(prescription end_date, start_duration + duration - 1)

Below is my input dataframe:

+----------------+-----------+---------+-----------+----------------+----------+--------+----------+----------+
|prescription_uid|patient_uid|ndc      |label      |dispensation_uid|date      |duration|start_date|end_date  |
+----------------+-----------+---------+-----------+----------------+----------+--------+----------+----------+
|0               |0          |16714-128|sinvastatin|0               |2015-06-10|30      |2015-06-01|2015-12-01|
|0               |0          |16714-128|sinvastatin|1               |2015-07-15|30      |2015-06-01|2015-12-01|
|0               |0          |16714-128|sinvastatin|2               |2015-08-01|30      |2015-06-01|2015-12-01|
|0               |0          |16714-128|sinvastatin|3               |2015-10-01|30      |2015-06-01|2015-12-01|
+----------------+-----------+---------+-----------+----------------+----------+--------+----------+----------+

EXPECTED RESULT:

        +----------------+-----------+---------+-----------+----------------+----------+--------+----------+----------+--------------------+------------------+--------------+------------+
        |prescription_uid|patient_uid|ndc      |label      |dispensation_uid|date      |duration|start_date|end_date  |first_start_duration|first_end_duration|start_duration|end_duration|
        +----------------+-----------+---------+-----------+----------------+----------+--------+----------+----------+--------------------+------------------+--------------+------------+
        |0               |0          |16714-128|sinvastatin|0               |2015-06-10|30      |2015-06-01|2015-12-01|2015-06-10          |2015-07-09        |2015-06-10    |2015-07-09  |
        |0               |0          |16714-128|sinvastatin|1               |2015-07-15|30      |2015-06-01|2015-12-01|2015-06-10          |2015-07-09        |2015-07-15    |2015-08-13  |
        |0               |0          |16714-128|sinvastatin|2               |2015-08-01|30      |2015-06-01|2015-12-01|2015-06-10          |2015-07-09        |2015-08-14    |2015-09-13  |
        |0               |0          |16714-128|sinvastatin|3               |2015-10-01|30      |2015-06-01|2015-12-01|2015-06-10          |2015-07-09        |2015-10-01    |2015-10-30  |
        +----------------+-----------+---------+-----------+----------------+----------+--------+----------+----------+--------------------+------------------+--------------+------------+
    
Code tried:
    val windowByPatient = Window.partitionBy($"patient_uid").orderBy($"date")
    val windowByPatientBeforeCurrentRow = windowByPatient.rowsBetween(Window.unboundedPreceding, -1)
    joinedPrDF = joinedPrDF
      .withColumn("first_start_duration", firstStartDuration(first($"date").over(windowByPatient), $"start_date"))
      .withColumn("first_end_duration", firstEndDuration($"first_start_duration", $"end_date", $"duration"))
      .withColumn("start_duration", when(count("*").over(windowByPatient) === 1, $"first_start_duration")
        .otherwise(startDurationCalc($"first_start_duration", $"date", $"start_date", coalesce(sum($"duration").over(windowByPatientBeforeCurrentRow), lit("0")))))
      .withColumn("end_duration", when(count("*").over(windowByPatient) === 1, $"first_end_duration")
        .otherwise(endDurationCalc($"end_date", $"start_duration", $"duration")))

2 answers:

Answer 0 (score: 0)

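You should not expect a window function to operate on data that does not exist in the dataframe but is only computed during execution (what you call "in-memory rows"). That is not possible.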
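To make the dependency concrete: each start_duration needs the previous row's end_duration, which is only known once the previous row has been processed, so the rows have to be walked in order. The following is a minimal sketch of the question's formula in plain Scala (no Spark), assuming one patient's rows fit in memory. The names `Dispensation`, `Covered`, and `coverage` are hypothetical and do not come from the question.

```scala
import java.time.LocalDate

object DurationRecurrence {
  // Hypothetical row types for illustration; not part of the question's schema.
  final case class Dispensation(date: LocalDate, duration: Int)
  final case class Covered(start: LocalDate, end: LocalDate)

  private def later(a: LocalDate, b: LocalDate): LocalDate = if (a.isAfter(b)) a else b
  private def earlier(a: LocalDate, b: LocalDate): LocalDate = if (a.isBefore(b)) a else b

  /** Walks one patient's dispensations in date order, carrying the previous end_duration. */
  def coverage(rows: Seq[Dispensation], prescriptionEnd: LocalDate): Vector[Covered] =
    rows.sortBy(_.date.toEpochDay).foldLeft(Vector.empty[Covered]) { (acc, row) =>
      // start_duration = max(previous end_duration + 1, current row's date)
      val start = acc.lastOption.fold(row.date)(prev => later(prev.end.plusDays(1), row.date))
      // end_duration = min(prescription end_date, start_duration + duration - 1)
      val end = earlier(prescriptionEnd, start.plusDays(row.duration - 1L))
      acc :+ Covered(start, end)
    }

  def main(args: Array[String]): Unit = {
    // The four dispensations from the question, each with a 30-day duration.
    val rows = Seq("2015-06-10", "2015-07-15", "2015-08-01", "2015-10-01")
      .map(d => Dispensation(LocalDate.parse(d), 30))
    coverage(rows, LocalDate.parse("2015-12-01")).foreach(println)
  }
}
```

Applied literally to the question's data, the formula gives an end_duration of 2015-09-12 for the third row (2015-08-14 plus 29 days), one day earlier than the 2015-09-13 shown in the question's table; the other values match. In Spark, a function of this shape could be applied per patient through `Dataset.groupByKey(...).flatMapGroups(...)`, at the cost of materializing one patient's rows at a time.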

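You can try a different approach: based on duration, compute each start_duration as an offset in days from the patient's first date (this lets you account for possible gaps).

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions._

    val windowByPatient = Window.partitionBy("patient_uid").orderBy("date")
    val windowByPatientBeforeCurrentRow = windowByPatient.rowsBetween(Window.unboundedPreceding, -1)

    data
      // Date of the previous dispensation for the same patient (null on the first row).
      .withColumn("previous_date", lag("date", 1).over(windowByPatient))
      // Days since the previous dispensation; 0 on the first row.
      .withColumn("diff_from_prev", datediff(col("date"), coalesce(col("previous_date"), col("date"))))
      // Keep the gap when it covers the previous row's duration, otherwise fall back to duration.
      .withColumn("diff_with_duration", when(col("diff_from_prev") >= lag("duration", 1, 0).over(windowByPatient), col("diff_from_prev")).otherwise(col("duration")))
      .withColumn("first_date_by_patient", first("date").over(windowByPatient))
      // Offset from the first date: this row's adjusted gap plus the raw gaps of all preceding rows.
      // lit(0) replaces the original string literal lit("0") so the sum stays numeric.
      .withColumn("duration_from_first_with_gaps", col("diff_with_duration") + coalesce(sum("diff_from_prev").over(windowByPatientBeforeCurrentRow), lit(0)))
      .withColumn("start_duration", expr("date_add(first_date_by_patient, duration_from_first_with_gaps)"))
      .withColumn("end_duration", expr("date_add(start_duration, duration - 1)"))
      .select((data.columns ++ Seq("start_duration", "end_duration")).map(col): _*)
      .show()

date_add is wrapped in expr because it takes an Int as its second argument, but it can be used with columns in a SQL context.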
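The cumulative-offset arithmetic in this answer can be traced without Spark. The sketch below mirrors the intermediate columns (diff_from_prev, diff_with_duration, and the running sum over preceding rows) using plain Scala collections. It assumes the dates are already sorted and belong to a single patient; `OffsetTrace`, `offsets`, and `startDates` are hypothetical names introduced only for this illustration.

```scala
import java.time.LocalDate
import java.time.temporal.ChronoUnit

object OffsetTrace {
  /** Offset in days from the first date for each row (duration_from_first_with_gaps). */
  def offsets(dates: Seq[LocalDate], durations: Seq[Int]): Seq[Int] = {
    require(dates.nonEmpty && dates.size == durations.size, "need one duration per date")
    // diff_from_prev: days since the previous row, 0 for the first row.
    val diffFromPrev = 0 +: dates.zip(dates.drop(1)).map { case (a, b) => ChronoUnit.DAYS.between(a, b).toInt }
    // lag(duration, 1, 0): the previous row's duration, 0 for the first row.
    val prevDuration = 0 +: durations.dropRight(1)
    // diff_with_duration: keep the gap when it covers the previous duration, else use duration.
    val diffWithDuration = diffFromPrev.indices.map { i =>
      if (diffFromPrev(i) >= prevDuration(i)) diffFromPrev(i) else durations(i)
    }
    // Sum of diff_from_prev over all rows strictly before the current one.
    val sumBefore = diffFromPrev.scanLeft(0)(_ + _).dropRight(1)
    diffWithDuration.zip(sumBefore).map { case (d, s) => d + s }
  }

  /** start_duration for each row: the first date shifted by the row's offset. */
  def startDates(dates: Seq[LocalDate], durations: Seq[Int]): Seq[LocalDate] =
    offsets(dates, durations).map(o => dates.head.plusDays(o.toLong))
}
```

For the question's four dispensations (2015-06-10, 2015-07-15, 2015-08-01, 2015-10-01, each with duration 30) the offsets come out as 0, 35, 65, and 113 days, which correspond to the start dates 2015-06-10, 2015-07-15, 2015-08-14, and 2015-10-01.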

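Answer 1 (score: 0)

Below is the final start-duration calculator, which uses lag window functions over the previous duration and the previous dispensation date: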

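import org.apache.spark.sql.functions.udf

val startDurationCalc = udf((currentDsDate: java.sql.Date, prevDsDate: java.sql.Date, prevDuration: Int,
                             prsEndDate: java.sql.Date, firstStrtDur: java.sql.Date, acDuration: Int) => {
  // Debug trace of the inputs for each row.
  println("startDurationCalc===currentDsDate===" + currentDsDate + "===prevDsDate===" + prevDsDate +
    "===prevDuration===" + prevDuration + "===prsEndDate===" + prsEndDate + "===firstStrtDur==" + firstStrtDur + "===acDuration===" + acDuration)
  // Last day covered by the previous dispensation. prevDsDate must not be null here.
  val prevDurStartDate = prevDsDate.toLocalDate.plusDays(prevDuration - 1)
  // Candidate start: the day after the previous coverage ends.
  var derivedDsDate = java.sql.Date.valueOf(prevDurStartDate.plusDays(1))
  // Candidate start implied by the accumulated duration, counted from the first start_duration.
  val accumulatedDSDate = java.sql.Date.valueOf(firstStrtDur.toLocalDate.plusDays(acDuration))

  // Use the later of the two candidates.
  if (derivedDsDate.before(accumulatedDSDate)) {
    derivedDsDate = accumulatedDSDate
  }

  if (derivedDsDate.after(prsEndDate)) {
    // Past the prescription end: return the day after the prescription end date.
    val derPrsEndDate = java.sql.Date.valueOf(prsEndDate.toLocalDate.plusDays(1))
    derPrsEndDate
  } else {
    // Otherwise start on the current dispensation date or the derived date, whichever is later.
    if (currentDsDate.after(derivedDsDate)) {
      currentDsDate
    } else {
      derivedDsDate
    }
  }
}: java.sql.Date).asNondeterministic()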
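The answer does not show how the UDF is called, so the following is only a sketch for checking its branching outside Spark. It restates the same logic on `java.time.LocalDate` values and assumes that prevDsDate and prevDuration come from `lag` over the patient window and that acDuration is the sum of the durations of all preceding rows. Those call-site assumptions, and the names `StartDurationTrace` and `startDuration`, are not from the answer.

```scala
import java.time.LocalDate

object StartDurationTrace {
  private def later(a: LocalDate, b: LocalDate): LocalDate = if (a.isAfter(b)) a else b

  /** Same branching as the UDF above, on LocalDate values so it can be exercised without Spark. */
  def startDuration(currentDsDate: LocalDate, prevDsDate: LocalDate, prevDuration: Int,
                    prsEndDate: LocalDate, firstStrtDur: LocalDate, acDuration: Int): LocalDate = {
    // Day after the previous dispensation's coverage ends.
    val afterPrev = prevDsDate.plusDays(prevDuration - 1L).plusDays(1)
    // Start implied by the accumulated duration counted from the first start_duration.
    val accumulated = firstStrtDur.plusDays(acDuration.toLong)
    val derived = later(afterPrev, accumulated)
    // Past the prescription end, the UDF returns the day after the prescription end date.
    if (derived.isAfter(prsEndDate)) prsEndDate.plusDays(1)
    else later(currentDsDate, derived)
  }
}
```

Under those assumptions, the third row of the question's data (current date 2015-08-01, previous date 2015-07-15, previous duration 30, accumulated duration 60) yields 2015-08-14, and the fourth row (current date 2015-10-01, previous date 2015-08-01, accumulated duration 90) yields 2015-10-01.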