Note: each of my groups can contain 5-10K rows to aggregate, so efficient code is highly desirable.
My data
val df1 = sc.parallelize(Seq(
("user2", "iphone", "2017-12-23 16:58:08", "Success"),
("user2", "iphone", "2017-12-23 16:58:12", "Success"),
("user2", "iphone", "2017-12-23 16:58:20", "Success"),
("user2", "iphone", "2017-12-23 16:58:25", "Success"),
("user2", "iphone", "2017-12-23 16:58:35", "Success"),
("user2", "iphone", "2017-12-23 16:58:45", "Success")
)).toDF("username", "device", "attempt_at", "stat")
+--------+------+-------------------+-------+
|username|device|         attempt_at|   stat|
+--------+------+-------------------+-------+
|   user2|iphone|2017-12-23 16:58:08|Success|
|   user2|iphone|2017-12-23 16:58:12|Success|
|   user2|iphone|2017-12-23 16:58:20|Success|
|   user2|iphone|2017-12-23 16:58:25|Success|
|   user2|iphone|2017-12-23 16:58:35|Success|
|   user2|iphone|2017-12-23 16:58:45|Success|
+--------+------+-------------------+-------+
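What I want
A grouping by (username, device) for the latest time an event occurred.
+--------+------+-------------------+-------+-------------------+
|username|device|         attempt_at|   stat|previous_attempt_at|
+--------+------+-------------------+-------+-------------------+
|   user2|iphone|2017-12-23 16:58:45|Success|2017-12-23 16:58:35|
+--------+------+-------------------+-------+-------------------+
Exceptions in the desired output:
As mentioned, the previous attempt has to fall within a specific time window. In the input dataset below, the last row has the latest timestamp, on December 23.
If I want a time window that looks back 1 day and gives me the last attempt, the 'previous_attempt_at' column will be null, because there are no events on the previous day, which would be December 22. It all depends on the range of the input timestamps.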
//Initial Data
+--------+------+-------------------+-------+
|username|device|         attempt_at|   stat|
+--------+------+-------------------+-------+
|   user2|iphone|2017-12-20 16:58:08|Success|
|   user2|iphone|2017-12-20 16:58:12|Success|
|   user2|iphone|2017-12-20 16:58:20|Success|
|   user2|iphone|2017-12-20 16:58:25|Success|
|   user2|iphone|2017-12-20 16:58:35|Success|
|   user2|iphone|2017-12-23 16:58:45|Success|
+--------+------+-------------------+-------+
// Desired Output
A grouping by (username,device) for the latest time an event occurred.
+--------+------+-------------------+-------+-------------------+
|username|device|         attempt_at|   stat|previous_attempt_at|
+--------+------+-------------------+-------+-------------------+
|   user2|iphone|2017-12-23 16:58:45|Success|               null|
+--------+------+-------------------+-------+-------------------+
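What I have.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, last}
// Frame: rows from 3600 seconds to 1 second before the current row (current row excluded)
val w = (Window.partitionBy("username", "device")
  .orderBy(col("attempt_at").cast("timestamp").cast("long"))
  .rangeBetween(-3600, -1)
)
val df2 = df1.withColumn("previous_attempt_at", last("attempt_at").over(w))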
+--------+------+-------------------+-------+-------------------+
|username|device|         attempt_at|   stat|previous_attempt_at|
+--------+------+-------------------+-------+-------------------+
|   user2|iphone|2017-12-23 16:58:08|Success|               null|
|   user2|iphone|2017-12-23 16:58:12|Success|2017-12-23 16:58:08|
|   user2|iphone|2017-12-23 16:58:20|Success|2017-12-23 16:58:12|
|   user2|iphone|2017-12-23 16:58:25|Success|2017-12-23 16:58:20|
|   user2|iphone|2017-12-23 16:58:35|Success|2017-12-23 16:58:25|
|   user2|iphone|2017-12-23 16:58:45|Success|2017-12-23 16:58:35|
+--------+------+-------------------+-------+-------------------+
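Notes. My code applies the window to every row in each user's group. That is very inefficient on large-scale data, and it does not give me just the latest attempt. I don't need any row except the last one.
Answer 0 (score: 2)
You just need an additional groupBy and aggregation. Before that, you need the collect_list function to cumulatively collect the previous dates, a udf function to check whether the previous attempt_at is within the time limit, and a struct built from the three columns ("attempt_at", "stat", "previous_attempt_at") so that the last one can be selected, as follows: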
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions._
import java.time._
import java.time.temporal._
import java.time.format._

// Returns the most recent earlier timestamp that is within the time limit, or null if there is none.
// `timestamps` is the cumulative list up to and including the current row, so `init` drops the current row.
def durationUdf = udf((actualtimestamp: String, timestamps: Seq[String]) => {
  val formatter = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss")
  val actualDateTime = LocalDateTime.parse(actualtimestamp, formatter)
  // until(..., DAYS) counts whole elapsed days, so "<= 1" keeps anything less than two full days old
  val diffDates = timestamps.init.filter(x => LocalDateTime.parse(x, formatter).until(actualDateTime, ChronoUnit.DAYS) <= 1)
  if (diffDates.nonEmpty) diffDates.last else null
})

// Ordered window without an explicit frame: collect_list accumulates from the start of the partition to the current row
val w = Window.partitionBy("username", "device").orderBy(col("attempt_at").cast("timestamp").cast("long"))

val df2 = df1.withColumn("previous_attempt_at", durationUdf(col("attempt_at"), collect_list("attempt_at").over(w)))
  // attempt_at comes first in the struct, so max("struct") picks the row with the latest attempt
  .withColumn("struct", struct(col("attempt_at").cast("timestamp").as("attempt_at"), col("stat"), col("previous_attempt_at")))
  .groupBy("username", "device").agg(max("struct").as("struct"))
  .select(col("username"), col("device"), col("struct.attempt_at"), col("struct.stat"), col("struct.previous_attempt_at"))
For the latter example (the one with a gap of several days), this should give you
+--------+------+---------------------+-------+-------------------+
|username|device|attempt_at           |stat   |previous_attempt_at|
+--------+------+---------------------+-------+-------------------+
|user2   |iphone|2017-12-23 16:58:45.0|Success|null               |
+--------+------+---------------------+-------+-------------------+
and for the earlier input data
+--------+------+---------------------+-------+-------------------+
|username|device|attempt_at           |stat   |previous_attempt_at|
+--------+------+---------------------+-------+-------------------+
|user2   |iphone|2017-12-23 16:58:45.0|Success|2017-12-23 16:58:35|
+--------+------+---------------------+-------+-------------------+
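Since the question stresses efficiency with 5-10K rows per group, here is a sketch of a possible variation that is not part of the original answer. It swaps collect_list plus the udf for the built-in lag function. The reasoning is that rows are ordered by time, so if the immediately preceding row is outside the limit, every earlier row is too; only the direct predecessor needs checking. The names limitSeconds, ts, prev_ts and latest are made up for illustration. Unlike the udf, the limit here is an exact number of seconds and previous_attempt_at comes back as a timestamp rather than a string.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Local session for illustration only; in spark-shell a `spark` session already exists.
val spark = SparkSession.builder().master("local[*]").appName("latest-attempt-sketch").getOrCreate()
import spark.implicits._

// The second sample dataset from the question (gap of three days before the last row)
val df1 = Seq(
  ("user2", "iphone", "2017-12-20 16:58:08", "Success"),
  ("user2", "iphone", "2017-12-20 16:58:12", "Success"),
  ("user2", "iphone", "2017-12-20 16:58:20", "Success"),
  ("user2", "iphone", "2017-12-20 16:58:25", "Success"),
  ("user2", "iphone", "2017-12-20 16:58:35", "Success"),
  ("user2", "iphone", "2017-12-23 16:58:45", "Success")
).toDF("username", "device", "attempt_at", "stat")

// Hypothetical parameter: look back exactly one day, expressed in seconds
val limitSeconds = 24 * 60 * 60L

val w = Window.partitionBy("username", "device").orderBy(col("ts"))

val result = df1
  .withColumn("ts", col("attempt_at").cast("timestamp"))
  // Direct predecessor within the same (username, device) partition; null for the first row
  .withColumn("prev_ts", lag(col("ts"), 1).over(w))
  // Keep the predecessor only if it is within the limit; `when` without `otherwise` yields null
  .withColumn("previous_attempt_at",
    when(col("ts").cast("long") - col("prev_ts").cast("long") <= limitSeconds, col("prev_ts")))
  // Same struct trick as in the answer: the timestamp comes first, so max picks the latest row
  .withColumn("latest", struct(col("ts").as("attempt_at"), col("stat"), col("previous_attempt_at")))
  .groupBy("username", "device")
  .agg(max("latest").as("latest"))
  .select(col("username"), col("device"), col("latest.attempt_at"), col("latest.stat"), col("latest.previous_attempt_at"))
```

Calling result.show(false) prints one row per (username, device) pair. Whether this is actually faster on your data is something to measure rather than assume.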
You can change the logic to hours by changing ChronoUnit.DAYS in the udf function to ChronoUnit.HOURS, and so on.
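One thing worth knowing when picking the unit: until with a ChronoUnit counts only whole elapsed units. A minimal sketch using plain java.time (no Spark needed) shows the effect; the helper withinLimit and the sample timestamp 2017-12-21 17:00:00 are made up for illustration and are not part of the answer.

```scala
import java.time.LocalDateTime
import java.time.format.DateTimeFormatter
import java.time.temporal.ChronoUnit

val formatter = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss")

// Hypothetical helper mirroring the check inside the udf: is `previous` at most `limit` whole units before `current`?
def withinLimit(previous: String, current: String, unit: ChronoUnit, limit: Long): Boolean =
  LocalDateTime.parse(previous, formatter)
    .until(LocalDateTime.parse(current, formatter), unit) <= limit

val latest  = "2017-12-23 16:58:45"
val earlier = "2017-12-21 17:00:00" // 1 day, 23 hours, 58 minutes and 45 seconds before `latest`

// Whole units only: the partial second day is dropped, the partial 48th hour is dropped
val wholeDays  = LocalDateTime.parse(earlier, formatter).until(LocalDateTime.parse(latest, formatter), ChronoUnit.DAYS)
val wholeHours = LocalDateTime.parse(earlier, formatter).until(LocalDateTime.parse(latest, formatter), ChronoUnit.HOURS)

val passesDayCheck  = withinLimit(earlier, latest, ChronoUnit.DAYS, 1)   // 1 <= 1
val passesHourCheck = withinLimit(earlier, latest, ChronoUnit.HOURS, 24) // 47 <= 24
```

So a check of `<= 1` on DAYS accepts anything under 48 hours old, while `<= 24` on HOURS is closer to a strict one-day look-back. Choose whichever matches what "within 1 day" should mean for your data.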