How to apply a Spark window function on a column computed during execution

Time: 2020-10-19 22:44:51

Tags: scala apache-spark apache-spark-sql

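I have a DataFrame that I want to transform into the output below, where each row's start_duration and end_duration are derived from the previous row's start_duration and end_duration. Please let me know how to achieve this in Spark using Scala.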

These are the formulas for computing start_duration and end_duration:

start_duration = max(previous end_duration + 1, current date)
end_duration = min(prescription end_date, start_duration + duration - 1)

Input DataFrame:

+----------------+-----------+---------+-----------+----------------+----------+--------+----------+----------+
|prescription_uid|patient_uid|ndc      |label      |dispensation_uid|date      |duration|start_date|end_date  |
+----------------+-----------+---------+-----------+----------------+----------+--------+----------+----------+
|0               |0          |16714-128|sinvastatin|0               |2015-06-10|30      |2015-06-01|2015-12-01|
|0               |0          |16714-128|sinvastatin|1               |2015-07-15|30      |2015-06-01|2015-12-01|
|0               |0          |16714-128|sinvastatin|2               |2015-08-01|30      |2015-06-01|2015-12-01|
|0               |0          |16714-128|sinvastatin|3               |2015-10-01|30      |2015-06-01|2015-12-01|
+----------------+-----------+---------+-----------+----------------+----------+--------+----------+----------+

Expected result:
+----------------+-----------+---------+-----------+----------------+----------+--------+----------+----------+--------------------+------------------+--------------+------------+
|prescription_uid|patient_uid|ndc      |label      |dispensation_uid|date      |duration|start_date|end_date  |first_start_duration|first_end_duration|start_duration|end_duration|
+----------------+-----------+---------+-----------+----------------+----------+--------+----------+----------+--------------------+------------------+--------------+------------+
|0               |0          |16714-128|sinvastatin|0               |2015-06-10|30      |2015-06-01|2015-12-01|2015-06-10          |2015-07-09        |2015-06-10    |2015-07-09  |
|0               |0          |16714-128|sinvastatin|1               |2015-07-15|30      |2015-06-01|2015-12-01|2015-06-10          |2015-07-09        |2015-07-15    |2015-08-13  |
|0               |0          |16714-128|sinvastatin|2               |2015-08-01|30      |2015-06-01|2015-12-01|2015-06-10          |2015-07-09        |2015-08-14    |2015-09-13  |
|0               |0          |16714-128|sinvastatin|3               |2015-10-01|30      |2015-06-01|2015-12-01|2015-06-10          |2015-07-09        |2015-10-01    |2015-10-30  |
+----------------+-----------+---------+-----------+----------------+----------+--------+----------+----------+--------------------+------------------+--------------+------------+
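As a check on the formulas above, here is a worked sketch of the recurrence on the four sample rows. The symbols are introduced here for illustration only: $d_i$ is the row's `date`, $n_i$ its `duration`, $E$ the prescription `end_date`, and $s_i$, $e_i$ are `start_duration` and `end_duration`. It assumes the first row has no predecessor and therefore starts on its own date, which agrees with the first row of the expected table.

```latex
\begin{aligned}
s_1 &= d_1, \qquad s_i = \max(e_{i-1} + 1,\ d_i) \quad (i > 1), \qquad e_i = \min(E,\ s_i + n_i - 1) \\
s_1 &= \text{2015-06-10}, & e_1 &= \min(\text{2015-12-01},\ \text{2015-06-10} + 29) = \text{2015-07-09} \\
s_2 &= \max(\text{2015-07-10},\ \text{2015-07-15}) = \text{2015-07-15}, & e_2 &= \min(\text{2015-12-01},\ \text{2015-07-15} + 29) = \text{2015-08-13} \\
s_3 &= \max(\text{2015-08-14},\ \text{2015-08-01}) = \text{2015-08-14}, & e_3 &= \min(\text{2015-12-01},\ \text{2015-08-14} + 29) = \text{2015-09-12} \\
s_4 &= \max(\text{2015-09-13},\ \text{2015-10-01}) = \text{2015-10-01}, & e_4 &= \min(\text{2015-12-01},\ \text{2015-10-01} + 29) = \text{2015-10-30}
\end{aligned}
```

Read literally, the formula gives an end_duration of 2015-09-12 for the third row, one day earlier than the 2015-09-13 shown in the expected table; the other values match the table.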

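This is a follow-up to an earlier question. The solution from that thread does not work when there is a gap between rows, such as the gap between the third and the last row in the example above: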

https://stackoverflow.com/questions/64396803/how-to-apply-window-function-in-memory-transformation-with-new-column-scala/64405160#64405160

2 answers:

Answer 0: (score: 0)

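Split your problem into two parts.

1. Use lag and lead to fetch the previous and next row's values (sample) and create new columns from them.

2. Use least (for the end time) and greatest (for the start time) to compute the result. (sample link)

I can help with this in SQL.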
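The two steps above might look roughly like the following sketch in the DataFrame API. It assumes Spark 2.x or later, that `date` and `end_date` are DateType and `duration` is an integer, and that rows are grouped by `prescription_uid`; `LagSketch`, `oneStep`, `naive_end` and `prev_end` are hypothetical names introduced here. Because `lag` can only read a column that already exists, the sketch looks back at the previous row's `date + duration - 1` rather than at its true end_duration, so it covers a single overlap but not a chain of overlapping rows.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, date_add, expr, greatest, lag, least, to_date}

object LagSketch {
  /** One look-back step of the recurrence; not the full recursive computation. */
  def oneStep(df: DataFrame): DataFrame = {
    // Assumption: one sequence of dispensations per prescription, ordered by date.
    val w = Window.partitionBy("prescription_uid").orderBy("date")
    df
      // End date the row would have if it started on its own date.
      .withColumn("naive_end", expr("date_add(`date`, duration - 1)"))
      // Step 1: fetch the previous row's value with lag (null on the first row).
      .withColumn("prev_end", lag(col("naive_end"), 1).over(w))
      // Step 2: greatest skips nulls, so the first row falls back to its own date.
      .withColumn("start_duration", greatest(date_add(col("prev_end"), 1), col("date")))
      .withColumn("end_duration",
        least(col("end_date"), expr("date_add(start_duration, duration - 1)")))
      .drop("naive_end", "prev_end")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("lag-sketch").getOrCreate()
    import spark.implicits._

    // The four sample dispensations, reduced to the columns the formula needs.
    val df = Seq(
      (0, 0, "2015-06-10", 30, "2015-12-01"),
      (0, 1, "2015-07-15", 30, "2015-12-01"),
      (0, 2, "2015-08-01", 30, "2015-12-01"),
      (0, 3, "2015-10-01", 30, "2015-12-01")
    ).toDF("prescription_uid", "dispensation_uid", "date", "duration", "end_date")
      .withColumn("date", to_date($"date"))
      .withColumn("end_date", to_date($"end_date"))

    oneStep(df).orderBy("dispensation_uid").show(false)
    spark.stop()
  }
}
```

If several consecutive rows each push the next one back, the previous row's true end_duration is later than its `naive_end`, and a sequential pass over each prescription's rows is needed instead.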

Answer 1: (score: 0)

    prescription_uid,patient_uid,ndc,label,dispensation_uid,date,duration,start_date,end_date
    0,0 ,16714-128,sinvastatin,0,2015-06-10,30,2015-06-01,2015-12-01
    0,0 ,16714-128,sinvastatin,1,2015-07-15,30,2015-06-01,2015-12-01
    0,0 ,16714-128,sinvastatin,2,2015-08-01,30,2015-06-01,2015-12-01
    0,0 ,16714-128,sinvastatin,3,2015-10-01,30,2015-06-01,2015-12-01
    
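    import java.time.LocalDate
    import org.apache.spark.sql.functions.{lit, udf}
    import sqlContext.implicits._

    // No schema is inferred, so every column is read as a string.
    var df = sqlContext.read.format("com.databricks.spark.csv").option("header","true").load("file:///home/xxxx/Ram/sample_stack.csv")

    // The date of the first row becomes first_start_duration for every row.
    val dt = df.select("date").first().getString(0)
    df = df.withColumn("first_start_duration", lit(dt))

    // Returns x + (y - 1) days as a yyyy-MM-dd string (requires Java 8+).
    // LocalDate avoids the daylight-saving drift of adding milliseconds to a java.util.Date.
    val date_add = udf((x: String, y: Int) => LocalDate.parse(x).plusDays(y - 1).toString)

    val duration = $"duration".cast("int")
    df = df.withColumn("first_end_duration", date_add($"first_start_duration", duration))
    // start_duration is copied from date; the previous row's end_duration is not consulted.
    df = df.withColumn("start_duration", df("date"))
    df = df.withColumn("end_duration", date_add($"start_duration", duration))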

Result:

+----------------+-----------+---------+-----------+----------------+----------+--------+----------+----------+--------------------+------------------+--------------+------------+
|prescription_uid|patient_uid|      ndc|      label|dispensation_uid|      date|duration|start_date|  end_date|first_start_duration|first_end_duration|start_duration|end_duration|
+----------------+-----------+---------+-----------+----------------+----------+--------+----------+----------+--------------------+------------------+--------------+------------+
|               0|          0|16714-128|sinvastatin|               0|2015-06-10|      30|2015-06-01|2015-12-01|          2015-06-10|        2015-07-09|    2015-06-10|  2015-07-09|
|               0|          0|16714-128|sinvastatin|               1|2015-07-15|      30|2015-06-01|2015-12-01|          2015-06-10|        2015-07-09|    2015-07-15|  2015-08-13|
|               0|          0|16714-128|sinvastatin|               2|2015-08-01|      30|2015-06-01|2015-12-01|          2015-06-10|        2015-07-09|    2015-08-01|  2015-08-30|
|               0|          0|16714-128|sinvastatin|               3|2015-10-01|      30|2015-06-01|2015-12-01|          2015-06-10|        2015-07-09|    2015-10-01|  2015-10-30|
+----------------+-----------+---------+-----------+----------------+----------+--------+----------+----------+--------------------+------------------+--------------+------------+
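In the result above, the third row has start_duration 2015-08-01, whereas the question's expected table shows 2015-08-14, because each start_duration is copied from `date` without consulting the previous row. One way to carry the previous end_duration forward is to process each prescription's rows sequentially, as in the sketch below. It assumes Spark 2.x or later, typed columns, and that one prescription's rows fit in memory on a single executor; `Disp`, `DispOut`, `SequentialDurations`, `fill` and `compute` are hypothetical names introduced here, not part of the original answer.

```scala
import java.sql.Date
import org.apache.spark.sql.{Dataset, SparkSession}

// Hypothetical row types; only the columns the formula needs are kept.
final case class Disp(prescription_uid: Int, dispensation_uid: Int,
                      date: Date, duration: Int, end_date: Date)
final case class DispOut(prescription_uid: Int, dispensation_uid: Int,
                         start_duration: Date, end_duration: Date)

object SequentialDurations {
  /** Applies the recurrence to the rows of one prescription, in date order. */
  def fill(rows: Seq[Disp]): Seq[DispOut] =
    rows.sortBy(_.date.getTime).foldLeft(Vector.empty[DispOut]) { (acc, r) =>
      val date = r.date.toLocalDate
      // start_duration = max(previous end_duration + 1, current date);
      // the first row has no predecessor and starts on its own date.
      val start = acc.lastOption
        .map(_.end_duration.toLocalDate.plusDays(1))
        .filter(_.isAfter(date))
        .getOrElse(date)
      // end_duration = min(end_date, start_duration + duration - 1)
      val computedEnd = start.plusDays(r.duration - 1L)
      val limit = r.end_date.toLocalDate
      val end = if (computedEnd.isAfter(limit)) limit else computedEnd
      acc :+ DispOut(r.prescription_uid, r.dispensation_uid, Date.valueOf(start), Date.valueOf(end))
    }

  /** Groups by prescription and runs the sequential pass inside each group. */
  def compute(spark: SparkSession, ds: Dataset[Disp]): Dataset[DispOut] = {
    import spark.implicits._
    ds.groupByKey(_.prescription_uid)
      .flatMapGroups((_, rows) => fill(rows.toVector))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("sequential-durations").getOrCreate()
    import spark.implicits._

    // The four sample dispensations from the question.
    val ds = Seq("2015-06-10", "2015-07-15", "2015-08-01", "2015-10-01").zipWithIndex
      .map { case (d, i) => Disp(0, i, Date.valueOf(d), 30, Date.valueOf("2015-12-01")) }
      .toDS()

    compute(spark, ds).orderBy("dispensation_uid").show(false)
    spark.stop()
  }
}
```

The output can be joined back to the original DataFrame on `prescription_uid` and `dispensation_uid` if the remaining columns are needed. Keeping `fill` as a plain function over a sequence makes the recurrence testable without a Spark session.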