I have data in the following format, sorted by timestamp, where each row represents one event:
+----------+--------+---------+
|event_type| data |timestamp|
+----------+--------+---------+
| A | d1 | 1 |
| B | d2 | 2 |
| C | d3 | 3 |
| C | d4 | 4 |
| C | d5 | 5 |
| A | d6 | 6 |
| A | d7 | 7 |
| B | d8 | 8 |
| C | d9 | 9 |
| B | d10 | 12 |
| C | d11 | 20 |
+----------+--------+---------+
I need to collect these events into series, such that:
1. An event of type C marks the end of a series.
2. If there are several consecutive type-C events, they belong to the same series, and the last of them marks the end of that series.
3. Each series lasts at most 7 days, even if there is no closing C event.
Note that there can be more than one series within a single day. In reality the timestamp column holds standard UNIX timestamps; for simplicity, the numbers here stand for days.
So the desired output would look like this:
+---------------------+--------------------------------------------------------------------+
|first_event_timestamp| events: List[(event_type, data, timestamp)] |
+---------------------+--------------------------------------------------------------------+
| 1 | List((A, d1, 1), (B, d2, 2), (C, d3, 3), (C, d4, 4), (C, d5, 5)) |
| 6 | List((A, d6, 6), (A, d7, 7), (B, d8, 8), (C, d9, 9)) |
| 12 | List((B, d10, 12)) |
| 20 | List((C, d11, 20)) |
+---------------------+--------------------------------------------------------------------+
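To make the grouping rules concrete, here is a minimal plain-Scala sketch (an illustration only, independent of Spark) that folds the timestamp-ordered sample rows into series according to rules 1–3:

// Illustration only: apply rules 1-3 to the sample rows with a plain fold.
val events = Seq(
  ("A", "d1", 1), ("B", "d2", 2), ("C", "d3", 3), ("C", "d4", 4), ("C", "d5", 5),
  ("A", "d6", 6), ("A", "d7", 7), ("B", "d8", 8), ("C", "d9", 9),
  ("B", "d10", 12), ("C", "d11", 20)
)

// A new series starts when the current one is closed: the previous event was a C
// and this one is not (rules 1-2), or more than 7 days have passed since the
// series' first event (rule 3). Series are built newest-first and reversed at the end.
val series = events.foldLeft(List.empty[List[(String, String, Int)]]) {
  case (Nil, e) => List(List(e))
  case (current :: done, e @ (tpe, _, ts)) =>
    val closedByC  = current.head._1 == "C" && tpe != "C"
    val closedBy7d = ts - current.last._3 > 7
    if (closedByC || closedBy7d) List(e) :: current :: done
    else (e :: current) :: done
}.map(_.reverse).reverse

// series.map(s => (s.head._3, s)) yields series starting at timestamps 1, 6, 12 and 20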
I tried to solve this with a window function, adding two columns along the way:
1. Seed: events that come directly after a type-C event are marked with some unique id.
2. SeriesId: filled from the Seed column using last(), so that all events of one series carry the same id.
3. The data can then be grouped by SeriesId.
Unfortunately, this does not seem to be possible:
+----------+--------+---------+------+-----------+
|event_type| data |timestamp| seed | series_id |
+----------+--------+---------+------+-----------+
| A | d1 | 1 | null | null |
| B | d2 | 2 | null | null |
| C | d3 | 3 | null | null |
| C | d4 | 4 | 0 | 0 |
| C | d5 | 5 | 1 | 1 |
| A | d6 | 6 | 2 | 2 |
| A | d7 | 7 | null | 2 |
| B | d8 | 8 | null | 2 |
| C | d9 | 9 | null | 2 |
| B | d10 | 12 | 3 | 3 |
| C | d11 | 20 | null | 3 |
+----------+--------+---------+------+-----------+
df.withColumn(
  "seed",
  when(
    (lag($"eventType", 1) === EventType.Conversion).over(w),
    typedLit(DigestUtils.sha256Hex("some fields").substring(0, 32))
  )
)
throws
org.apache.spark.sql.AnalysisException: Expression '(lag(eventType#76, 1, null) = C)' not supported within a window function.
I'm somewhat stuck here and would appreciate any help (preferably using the DataFrame/Dataset API).
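As a side note on the exception: the error seems to come from applying .over(w) to the whole comparison rather than to lag itself. A minimal sketch of steps 1–2 along those lines (illustration only; it assumes the column is named event_type as in the sample data, uses monotonically_increasing_id as the unique id, and does not handle the 7-day rule):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Illustration only: seed / series_id along the lines of steps 1-2 above.
val w = Window.orderBy("timestamp")

val seeded = df.withColumn(
  "seed",
  // apply the window to lag itself, then compare its result
  when(lag($"event_type", 1).over(w) === "C", monotonically_increasing_id())
)

val withSeries = seeded.withColumn(
  "series_id",
  // fill forward: the last non-null seed seen so far
  last($"seed", ignoreNulls = true)
    .over(w.rowsBetween(Window.unboundedPreceding, Window.currentRow))
)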
Answer (score: 1)
Here is an approach.
First, a UDF that tags a record as the "start" of a series:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._ // for the $"..." column syntax

// Tag an event as the "start" of a series, based on the conditions above
def tagStartEvent: (String, String, Int, Int) => String =
  (prevEvent: String, currEvent: String, prevTimeStamp: Int, currTimeStamp: Int) => {
    // the very first event (no predecessor) is tagged as "start"
    if (prevEvent == "start")
      "start"
    // more than 7 days since the previous event -> new series
    else if ((currTimeStamp - prevTimeStamp) > 7)
      "start"
    else {
      prevEvent match {
        case "C" =>
          if (currEvent == "A")
            "start"
          else if (currEvent == "B")
            "start"
          else // the current event is another C: same series
            ""
        case _ => ""
      }
    }
  }

val tagStartEventUdf = udf(tagStartEvent)
data.csv
event_type,data,timestamp
A,d1,1
B,d2,2
C,d3,3
C,d4,4
C,d5,5
A,d6,6
A,d7,7
B,d8,8
C,d9,9
B,d10,12
C,d11,20
val df = spark.read.format("csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("data.csv")
// "all" is a constant column (added below), so this window spans the whole dataset
val window = Window.partitionBy("all").orderBy("timestamp")
// tag the starting events
val dfStart = df
  .withColumn("all", lit(1))
  .withColumn("series_start",
    tagStartEventUdf(
      lag($"event_type", 1, "start").over(window), df("event_type"),
      lag($"timestamp", 1, 1).over(window), df("timestamp")))
val dfStartSeries = dfStart.filter($"series_start" === "start")
  .select($"timestamp".as("series_start_time"), $"all")
val window2 = Window.partitionBy("all").orderBy($"series_start_time".desc)
// get the series end times: each series ends where the next one starts
val dfSeriesTimes = dfStartSeries
  .withColumn("series_end_time", lag($"series_start_time", 1, null).over(window2))
  .drop($"all")
val dfSeries = df.join(dfSeriesTimes)
  .withColumn("timestamp_series",
    // if series_end_time is null and timestamp >= series_start_time, take series_start_time
    when(col("series_end_time").isNull && col("timestamp") >= col("series_start_time"),
      col("series_start_time"))
      // if the record is >= series_start_time and < series_end_time, take series_start_time
      .otherwise(
        when(col("timestamp") >= col("series_start_time") && col("timestamp") < col("series_end_time"),
          col("series_start_time"))
          .otherwise(null)))
  .filter($"timestamp_series".isNotNull)
dfSeries.show()
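The steps above stop at tagging each event with its series start time; to get the requested output, the events of each series still need to be collected into a list. A minimal sketch of that final aggregation, assuming the dfSeries and column names produced above (sort_array is used because collect_list does not guarantee ordering; structs sort by their first field, so putting timestamp first keeps the events in time order, though the tuple layout then differs slightly from the desired output):

// Sketch only: group the tagged events into one row per series.
val result = dfSeries
  .groupBy("timestamp_series")
  .agg(sort_array(collect_list(struct($"timestamp", $"event_type", $"data"))).as("events"))
  .withColumnRenamed("timestamp_series", "first_event_timestamp")
  .orderBy("first_event_timestamp")

result.show(false)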