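I have a DataFrame that I want to convert to the output shown below, where each row's start_duration and end_duration are derived from the previous row's start_duration and end_duration. Please let me know how to implement this in Spark using Scala.
The formulas for start_duration and end_duration are:
start_duration = max(previous end_duration + 1, current date)
end_duration = min(prescription end_date, start_duration + duration - 1)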
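To make the recurrence concrete, here is a minimal plain-Scala sketch (no Spark) that folds over the dispensations of a single prescription in date order, using the four dates from the example input. The `Dispensation` and `Coverage` case classes and the `computeDurations`, `maxDate` and `minDate` helpers are hypothetical names introduced only for this illustration. It assumes the first row has no previous end_duration, so its start_duration is simply its own date.

```scala
import java.time.LocalDate

// Hypothetical row types, introduced only for this illustration.
final case class Dispensation(date: LocalDate, duration: Int, prescriptionEnd: LocalDate)
final case class Coverage(startDuration: LocalDate, endDuration: LocalDate)

// Later of two dates / earlier of two dates.
def maxDate(a: LocalDate, b: LocalDate): LocalDate = if (a.isAfter(b)) a else b
def minDate(a: LocalDate, b: LocalDate): LocalDate = if (a.isBefore(b)) a else b

// Applies the two formulas row by row; rows must already be sorted by date.
def computeDurations(rows: Seq[Dispensation]): Seq[Coverage] =
  rows.foldLeft(Vector.empty[Coverage]) { (acc, row) =>
    // Assumption: the first row has no previous end_duration, so it starts on its own date.
    val start = acc.lastOption
      .map(prev => maxDate(prev.endDuration.plusDays(1), row.date))
      .getOrElse(row.date)
    val end = minDate(row.prescriptionEnd, start.plusDays(row.duration - 1L))
    acc :+ Coverage(start, end)
  }

// The four dispensations of the example: 30 days each, prescription ends 2015-12-01.
val prescriptionEnd = LocalDate.parse("2015-12-01")
val sample = Seq("2015-06-10", "2015-07-15", "2015-08-01", "2015-10-01")
  .map(d => Dispensation(LocalDate.parse(d), 30, prescriptionEnd))
val coverage = computeDurations(sample)
coverage.foreach(c => println(s"${c.startDuration} ${c.endDuration}"))
```

Note that for the third dispensation the formula as written yields an end_duration of 2015-09-12 (2015-08-14 plus 29 days), which is one day earlier than the 2015-09-13 shown in the expected result; the other rows match.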
Input DataFrame:
+----------------+-----------+---------+-----------+----------------+----------+--------+----------+----------+
|prescription_uid|patient_uid|ndc |label |dispensation_uid|date |duration|start_date|end_date |
+----------------+-----------+---------+-----------+----------------+----------+--------+----------+----------+
|0 |0 |16714-128|sinvastatin|0 |2015-06-10|30 |2015-06-01|2015-12-01|
|0 |0 |16714-128|sinvastatin|1 |2015-07-15|30 |2015-06-01|2015-12-01|
|0 |0 |16714-128|sinvastatin|2 |2015-08-01|30 |2015-06-01|2015-12-01|
|0 |0 |16714-128|sinvastatin|3 |2015-10-01|30 |2015-06-01|2015-12-01|
+----------------+-----------+---------+-----------+----------------+----------+--------+----------+----------+
EXPECTED RESULT:
+----------------+-----------+---------+-----------+----------------+----------+--------+----------+----------+--------------------+------------------+--------------+------------+
|prescription_uid|patient_uid|ndc |label |dispensation_uid|date |duration|start_date|end_date |first_start_duration|first_end_duration|start_duration|end_duration|
+----------------+-----------+---------+-----------+----------------+----------+--------+----------+----------+--------------------+------------------+--------------+------------+
|0 |0 |16714-128|sinvastatin|0 |2015-06-10|30 |2015-06-01|2015-12-01|2015-06-10 |2015-07-09 |2015-06-10 |2015-07-09 |
|0 |0 |16714-128|sinvastatin|1 |2015-07-15|30 |2015-06-01|2015-12-01|2015-06-10 |2015-07-09 |2015-07-15 |2015-08-13 |
|0 |0 |16714-128|sinvastatin|2 |2015-08-01|30 |2015-06-01|2015-12-01|2015-06-10 |2015-07-09 |2015-08-14 |2015-09-13 |
|0 |0 |16714-128|sinvastatin|3 |2015-10-01|30 |2015-06-01|2015-12-01|2015-06-10 |2015-07-09 |2015-10-01 |2015-10-30 |
+----------------+-----------+---------+-----------+----------------+----------+--------+----------+----------+--------------------+------------------+--------------+------------+
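This is a follow-up to an earlier question. The solution given there does not work when there is a gap between rows, such as the gap between the third and last rows in the example above: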
https://stackoverflow.com/questions/64396803/how-to-apply-window-function-in-memory-transformation-with-new-column-scala/64405160#64405160
Answer 0 (score: 0)
Answer 1 (score: 0)
prescription_uid,patient_uid,ndc,label,dispensation_uid,date,duration,start_date,end_date
0,0 ,16714-128,sinvastatin,0,2015-06-10,30,2015-06-01,2015-12-01
0,0 ,16714-128,sinvastatin,1,2015-07-15,30,2015-06-01,2015-12-01
0,0 ,16714-128,sinvastatin,2,2015-08-01,30,2015-06-01,2015-12-01
0,0 ,16714-128,sinvastatin,3,2015-10-01,30,2015-06-01,2015-12-01
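import java.text.SimpleDateFormat
import java.util.{Date, TimeZone}
import java.util.concurrent.TimeUnit
import org.apache.spark.sql.functions.{lit, udf}
import sqlContext.implicits._

// Read the sample CSV; without schema inference every column is loaded as a string
var df = sqlContext.read.format("com.databricks.spark.csv").option("header","true").load("file:///home/xxxx/Ram/sample_stack.csv")
// Use the date of the first row as first_start_duration for every row
val dt = df.select("date").first().getString(0)
df = df.withColumn("first_start_duration", lit(dt))
// Returns x plus (y - 1) days, formatted as yyyy-MM-dd
val date_add = udf((x: String, y: Int) => {
  val sdf = new SimpleDateFormat("yyyy-MM-dd")
  // Work in UTC so that a daylight-saving change cannot shift the result by a day
  sdf.setTimeZone(TimeZone.getTimeZone("UTC"))
  val z = y - 1
  val result = new Date(sdf.parse(x).getTime() + TimeUnit.DAYS.toMillis(z))
  sdf.format(result)
})
df = df.withColumn("first_end_duration", date_add($"first_start_duration", $"duration".cast("int")))
// start_duration is copied from date, so the previous row's end_duration is not taken into account
df = df.withColumn("start_duration", df("date"))
df = df.withColumn("end_duration", date_add($"start_duration", $"duration".cast("int")))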
Result:
+----------------+-----------+---------+-----------+----------------+----------+--------+----------+----------+--------------------+------------------+--------------+------------+
|prescription_uid|patient_uid| ndc| label|dispensation_uid| date|duration|start_date| end_date|first_start_duration|first_end_duration|start_duration|end_duration|
+----------------+-----------+---------+-----------+----------------+----------+--------+----------+----------+--------------------+------------------+--------------+------------+
| 0| 0|16714-128|sinvastatin| 0|2015-06-10| 30|2015-06-01|2015-12-01| 2015-06-10| 2015-07-09| 2015-06-10| 2015-07-09|
| 0| 0|16714-128|sinvastatin| 1|2015-07-15| 30|2015-06-01|2015-12-01| 2015-06-10| 2015-07-09| 2015-07-15| 2015-08-13|
| 0| 0|16714-128|sinvastatin| 2|2015-08-01| 30|2015-06-01|2015-12-01| 2015-06-10| 2015-07-09| 2015-08-01| 2015-08-30|
| 0| 0|16714-128|sinvastatin| 3|2015-10-01| 30|2015-06-01|2015-12-01| 2015-06-10| 2015-07-09| 2015-10-01| 2015-10-30|
+----------------+-----------+---------+-----------+----------------+----------+--------+----------+----------+--------------------+------------------+--------------+------------+
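In this result the row with dispensation_uid 2 has start_duration 2015-08-01, while the expected result in the question has 2015-08-14, because start_duration is copied from date rather than compared with the previous row's end_duration. One possible way to apply the question's recurrence in Spark is to group the rows by prescription and process each group sequentially. The sketch below is only an illustration under several assumptions: it targets Spark 2.x or later (SparkSession and the typed Dataset API, unlike the sqlContext used above), it keeps only the columns the formulas need, the first row of each prescription is assumed to start on its own date, and the names `Dispensation`, `WithDurations`, `DurationSketch` and `fill` are hypothetical.

```scala
import java.time.LocalDate
import org.apache.spark.sql.SparkSession

// Hypothetical row types for this sketch; dates are kept as yyyy-MM-dd strings.
case class Dispensation(prescription_uid: Int, dispensation_uid: Int,
                        date: String, duration: Int, end_date: String)
case class WithDurations(prescription_uid: Int, dispensation_uid: Int, date: String,
                         start_duration: String, end_duration: String)

object DurationSketch {
  // Applies the question's two formulas to one prescription's rows, in date order.
  def fill(rows: Seq[Dispensation]): Seq[WithDurations] =
    rows.sortBy(_.date) // ISO-8601 date strings sort chronologically
      .foldLeft(Vector.empty[WithDurations]) { (acc, r) =>
        val date = LocalDate.parse(r.date)
        // start_duration = max(previous end_duration + 1, current date);
        // assumption: the first row of a prescription starts on its own date.
        val start = acc.lastOption
          .map(p => LocalDate.parse(p.end_duration).plusDays(1))
          .filter(_.isAfter(date))
          .getOrElse(date)
        // end_duration = min(prescription end_date, start_duration + duration - 1)
        val candidate = start.plusDays(r.duration - 1L)
        val limit = LocalDate.parse(r.end_date)
        val end = if (candidate.isAfter(limit)) limit else candidate
        acc :+ WithDurations(r.prescription_uid, r.dispensation_uid, r.date,
          start.toString, end.toString)
      }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("duration-sketch").getOrCreate()
    import spark.implicits._

    val input = Seq(
      Dispensation(0, 0, "2015-06-10", 30, "2015-12-01"),
      Dispensation(0, 1, "2015-07-15", 30, "2015-12-01"),
      Dispensation(0, 2, "2015-08-01", 30, "2015-12-01"),
      Dispensation(0, 3, "2015-10-01", 30, "2015-12-01")
    ).toDS()

    // Each prescription is processed sequentially inside a single task,
    // so all rows of one prescription must fit in that task's memory.
    val result = input
      .groupByKey(_.prescription_uid)
      .flatMapGroups((_: Int, rows: Iterator[Dispensation]) => fill(rows.toVector))

    result.orderBy("prescription_uid", "date").show(false)
    spark.stop()
  }
}
```

A plain `lag` window function is not enough here, because each start_duration depends on the previous row's computed end_duration rather than on an input column; that is why the sketch folds over each group instead.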