我有一个包含以下架构的Dataframe:
root
|-- eventTimestamp: long
|-- trackingId: string
|-- voyageStatus: string
以下是一些示例行:
+--------------+----------+------------+
|eventTimestamp|trackingId|voyageStatus|
+--------------+----------+------------+
| 504 |78911c81 |COMPLETE |
| 504 |3b77a150 |ACTIVE |
| 390 |ece6c8d0 |ACTIVE |
| 390 |78911c81 |ACTIVE |
| 349 |3b77a150 |ACTIVE |
| 349 |ece6c8d0 |ACTIVE |
| 349 |78911c81 |ACTIVE |
| 350 |3b77a150 |ACTIVE |
| 350 |ece6c8d0 |ACTIVE |
| 350 |78911c81 |ACTIVE |
| 351 |3b77a150 |ACTIVE |
| 351 |ece6c8d0 |ACTIVE |
| 351 |78911c81 |ACTIVE |
| 352 |3b77a150 |ACTIVE |
| 352 |ece6c8d0 |ACTIVE |
| 352 |78911c81 |ACTIVE |
| 507 |3b77a150 |COMPLETE |
| 349 |ece6c8d0 |ACTIVE |
| 349 |78911c81 |ACTIVE |
| 349 |3b77a150 |ACTIVE |
+--------------+----------+------------+
我想添加一个名为completionEventTimestamp
的long类型的新列。对于每一行,此列将具有以下值:
trackingId
与当前行的voyageStatus
相同,且值"COMPLETE"
等于eventTimestamp
,那么该值将为+--------------+----------+------------+------------------------+
|eventTimestamp|trackingId|voyageStatus|completionEventTimestamp|
+--------------+----------+------------+------------------------+
| 504 |78911c81 |COMPLETE | 504|
| 504 |3b77a150 |ACTIVE | 507|
| 390 |ece6c8d0 |ACTIVE | -1|
| 390 |78911c81 |ACTIVE | 504|
| 349 |3b77a150 |ACTIVE | 507|
| 349 |ece6c8d0 |ACTIVE | -1|
| 349 |78911c81 |ACTIVE | 504|
| 350 |3b77a150 |ACTIVE | 507|
| 350 |ece6c8d0 |ACTIVE | -1|
| 350 |78911c81 |ACTIVE | 504|
| 351 |3b77a150 |ACTIVE | 507|
| 351 |ece6c8d0 |ACTIVE | -1|
| 351 |78911c81 |ACTIVE | 504|
| 352 |3b77a150 |ACTIVE | 507|
| 352 |ece6c8d0 |ACTIVE | -1|
| 352 |78911c81 |ACTIVE | 504|
| 507 |3b77a150 |COMPLETE | 507|
| 349 |ece6c8d0 |ACTIVE | -1|
| 349 |78911c81 |ACTIVE | 504|
| 349 |3b77a150 |ACTIVE | 507|
+--------------+----------+------------+------------------------+
记录。以下是上述示例的转换:
trackingId
如果这有帮助,如果给定voyageStatus
的记录的"COMPLETE"
trackingId
,那么它将是eventTimestamp
的最后一条记录(如果你要按setState
)订购,那么只会有一个这样的记录。
答案 0 :(得分:0)
val completedVoyagesDF = training3.filter(training3("voyageStatus") === "COMPLETED").select("trackingID", "statusTimestamp")
val completedVoyagesArray = completedVoyagesDF.collect().map({
row: Row => row.getString(0) -> row.getLong(1)
})
val trackingIDToActualArrivalTime = completedVoyagesArray.toMap
val arrivalTime: (String => Long) = (trackingId: String) => {
trackingIDToActualArrivalTime.getOrElse(trackingId, -1)
}
val arrivalTimeFunc = udf(arrivalTime)
val withActualArrivalTimeDF = training3.withColumn(LABEL_COL_NAME, arrivalTimeFunc(col("trackingId")))
val training4 = withActualArrivalTimeDF.filter(withActualArrivalTimeDF(LABEL_COL_NAME) =!= -1)
答案 1 :(得分:0)
您可以使用collect_list
window
分区来保存每trackingId
和UDF
的状态列表,以有条件地将值分配给completionEventTimestamp
,如下所示:
val df = Seq(
(504L, 10, "ACTIVE"),
(506L, 10, "ACTIVE"),
(510L, 10, "COMPLETE"),
(390L, 11, "ACTIVE"),
(395L, 11, "ACTIVE"),
(398L, 11, "ACTIVE"),
(352L, 12, "ACTIVE"),
(360L, 12, "COMPLETE")
).toDF("eventTimestamp", "trackingId", "voyageStatus")
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
// Save "completeTimestamp" for every row with COMPLETE status
val df2 = df.select(
$"eventTimestamp".as("completeTimestamp"), $"trackingId"
).where(df("voyageStatus") === "COMPLETE")
// Create a "statusList" per "trackingId" for each row using collect_list over window partitions
val window = Window.partitionBy("trackingId")
val df3 = df.withColumn("statusList", collect_list("voyageStatus").over(window))
// A UDF to check whether statusList contains "COMPLETE"
val checkComplete = udf(
(l: Seq[String]) => l.contains("COMPLETE")
)
// Join df3 with df2 and apply the UDF to assemble "completionEventTimestamp"
val df4 = df3.join(df2, Seq("trackingId"), "left_outer").
withColumn(
"completionEventTimestamp",
when(checkComplete($"statusList"), $"completeTimestamp").otherwise(-1L)
).select(
"eventTimestamp", "trackingId", "voyageStatus", "completionEventTimestamp"
)
df4.show
+--------------+----------+------------+------------------------+
|eventTimestamp|trackingId|voyageStatus|completionEventTimestamp|
+--------------+----------+------------+------------------------+
| 352| 12| ACTIVE| 360|
| 360| 12| COMPLETE| 360|
| 504| 10| ACTIVE| 510|
| 506| 10| ACTIVE| 510|
| 510| 10| COMPLETE| 510|
| 390| 11| ACTIVE| -1|
| 395| 11| ACTIVE| -1|
| 398| 11| ACTIVE| -1|
+--------------+----------+------------+------------------------+