I need suggestions on how to get the next 3 incidents for each event; please see the input and expected output below.
Input:
+-------+-----+--------------------+--------------------+
|eventId|incId| eventDate| incDate|
+-------+-----+--------------------+--------------------+
| 1| 123|2018-02-09 10:01:...|2018-02-09 10:02:...|
| 2| 0|2018-02-09 10:02:...| null|
| 3| 124|2018-02-09 10:03:...|2018-02-09 10:03:...|
| 4| 0|2018-02-09 10:04:...| null|
| 5| 125|2018-02-09 10:05:...|2018-02-10 11:03:...|
| 6| 0|2018-02-09 10:06:...| null|
| 7| 126|2018-02-09 10:07:...|2018-02-10 11:04:...|
| 8| 127|2018-02-09 10:08:...|2018-02-10 09:05:...|
| 9| 0|2018-02-09 10:09:...| null|
| 10| 0|2018-02-10 11:30:...| null|
| 11| 0|2018-02-10 11:40:...| null|
+-------+-----+--------------------+--------------------+
The input can be created with the code below; the expected output table follows it.
Explanation of the expected output: for each eventId, take the next 3 incIds whose incDate is later than that event's eventDate (eventDate < incDate).
PS: I have already tried Window.partitionBy and similar approaches, but could not get a correct result.

import org.apache.spark.sql.functions._

val df = sc.parallelize(Seq(
  (1, 123, "2/9/2018 10:01:00", "2/9/2018 10:02:00"),
  (2, 0, "2/9/2018 10:02:00", ""),
  (3, 124, "2/9/2018 10:03:00", "2/9/2018 10:03:00"),
  (4, 0, "2/9/2018 10:04:00", ""),
  (5, 125, "2/9/2018 10:05:00", "2/10/2018 11:03:00"),
  (6, 0, "2/9/2018 10:06:00", ""),
  (7, 126, "2/9/2018 10:07:00", "2/10/2018 11:04:00"),
  (8, 127, "2/9/2018 10:08:00", "2/10/2018 09:05:00"),
  (9, 0, "2/9/2018 10:09:00", ""),
  (10, 0, "2/10/2018 11:30:00", ""),
  (11, 0, "2/10/2018 11:40:00", "")
)).toDF("eventId", "incId", "eventDate1", "incDate1")
  .withColumn("eventDate", from_unixtime(unix_timestamp(col("eventDate1"), "MM/dd/yyyy HH:mm:ss")).cast("timestamp"))
  .withColumn("incDate", from_unixtime(unix_timestamp(col("incDate1"), "MM/dd/yyyy HH:mm:ss")).cast("timestamp"))
  .drop("eventDate1", "incDate1")

Expected output:
+-------+--------------------+-----+--------------------+----+----+----+
|eventId| eventDate|incId| incDate|inc1|inc2|inc3|
+-------+--------------------+-----+--------------------+----+----+----+
| 1|2018-02-09 10:01:...| 123|2018-02-09 10:02:...| 124| 127| 125|
| 2|2018-02-09 10:02:...| 0| null| 124| 127| 125|
| 3|2018-02-09 10:03:...| 124|2018-02-09 10:03:...| 127| 125| 126|
| 4|2018-02-09 10:04:...| 0| null| 125| 126|null|
| 5|2018-02-09 10:05:...| 125|2018-02-10 11:03:...| 125| 126|null|
| 6|2018-02-09 10:06:...| 0| null| 125| 126|null|
| 7|2018-02-09 10:07:...| 126|2018-02-10 11:04:...| 125| 126|null|
| 8|2018-02-09 10:08:...| 127|2018-02-10 09:05:...| 125| 126|null|
| 9|2018-02-09 10:09:...| 0| null| 125| 126|null|
+-------+--------------------+-----+--------------------+----+----+----+
Answer 0 (score: 0):
I think I came up with something, but it uses a cross join, and I don't think that will work for large datasets. If anyone has a better idea, please reply.
Answer:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Incidents only (incId != 0), columns renamed so they survive the cross join
val df2 = df.filter(col("incId").notEqual("0")).toDF("eventId_1", "incId_1", "eventDate_1", "incDate_1")

// Cross join every event with every incident
val df3 = df.crossJoin(df2)

// Per event, order the joined incidents by incident date
val w2 = Window.partitionBy(col("eventId")).orderBy("incDate_1")

// Keep incidents on or after the event date, then look ahead 1..3 rows for the following incIds/incDates
val df4 = df3.where(col("eventDate").cast("long") <= col("incDate_1").cast("long"))
  .withColumn("inc1", lead(col("incId_1"), 1).over(w2))
  .withColumn("inc2", lead(col("incId_1"), 2).over(w2))
  .withColumn("inc3", lead(col("incId_1"), 3).over(w2))
  .withColumn("incDate1", lead(col("incDate_1"), 1).over(w2))
  .withColumn("incDate2", lead(col("incDate_1"), 2).over(w2))
  .withColumn("incDate3", lead(col("incDate_1"), 3).over(w2))

// Keep one row per event (the earliest matching incident) and display the result
val df5 = df4.withColumn("rn", row_number().over(w2)).where(col("rn") === 1)
  .select("eventId", "eventDate", "incId", "incDate", "inc1", "inc2", "inc3", "incDate1", "incDate2", "incDate3")
  .orderBy("eventId")
df5.show
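One possible way to avoid the full cross join, assuming the list of non-zero incidents is small enough to collect to the driver, is to broadcast the sorted incidents and look up the following ones with a UDF. This is only a sketch (the names incidents, bIncidents, next3 and result are illustrative), and it mirrors the same "skip the first matching incident, take the next 3" behaviour as the lead(1..3) columns above:

import org.apache.spark.sql.functions._

// All incidents as (incDate in epoch seconds, incId), sorted by incident time.
// Assumes this list fits comfortably in driver memory.
val incidents = df.filter(col("incId").notEqual(0))
  .select(col("incDate").cast("long"), col("incId"))
  .collect()
  .map(r => (r.getLong(0), r.getInt(1)))
  .sortBy(_._1)
val bIncidents = sc.broadcast(incidents)

// For a given event timestamp, skip the first incident at/after it and return the next 3 incIds
val next3 = udf { (eventTs: Long) =>
  bIncidents.value.filter(_._1 >= eventTs).drop(1).take(3).map(_._2)
}

val result = df
  .withColumn("nxt", next3(col("eventDate").cast("long")))
  .withColumn("inc1", col("nxt").getItem(0))
  .withColumn("inc2", col("nxt").getItem(1))
  .withColumn("inc3", col("nxt").getItem(2))
  .drop("nxt")
result.orderBy("eventId").show()

Note that, unlike the cross-join version, this keeps events that have no later incident (their inc1/inc2/inc3 columns are simply null) instead of dropping those rows.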