Question

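Sorry for the newbie question.

My log files currently contain the fields userId, event, and timestamp, but no sessionId. My goal is to create a sessionId for each record based on the timestamp and a predefined TIMEOUT value.

With a TIMEOUT value of 10, a sample DataFrame looks like this: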

scala> eventSequence.show(false)

  +----------+------------+----------+ 
  |userId    |event       |timestamp |
  +----------+------------+----------+ 
  |U1        |A           |1         | 
  |U2        |B           |2         |
  |U1        |C           |5         |
  |U3        |A           |8         |
  |U1        |D           |20        |
  |U2        |B           |23        |
  +----------+------------+----------+

The goal is:

  +----------+------------+----------+----------+
  |userId    |event       |timestamp |sessionId |
  +----------+------------+----------+----------+
  |U1        |A           |1         |S1        |
  |U2        |B           |2         |S2        |
  |U1        |C           |5         |S1        |
  |U3        |A           |8         |S3        |
  |U1        |D           |20        |S4        |
  |U2        |B           |23        |S5        |
  +----------+------------+----------+----------+

I found a solution in R (Create a "sessionID" based on "userID" and differences in "timeStamp"), but I have not been able to work out the equivalent in Spark.

Thanks for any suggestions on this problem.

Answer 1

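Shawn's answer addresses "how to create a new column", whereas my goal is "how to create a sessionId column based on timestamp". After struggling for a few days, I found that Window functions give a simple solution for this scenario.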

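Window functions have been available since Spark 1.4. They provide the functionality needed for operations that:

both operate on a group of rows while still returning a single value for every input row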

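To create a sessionId based on timestamp, I first need the time difference between two consecutive actions of the same user. windowDef defines a Window partitioned by "userId" and ordered by timestamp. diff is then a column that returns, for each row, the timestamp of the row 1 position after the current row in its partition (group), or null if the current row is the last row in that partition.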

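import org.apache.spark.sql.Column
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, lead, udf}

// Append ";" to a timestamp when the gap to the next event exceeds timeOut, marking the end of a session.
def handleDiff(timeOut: Int) = {
  udf {(timeDiff: Int, timestamp: Int) => if (timeDiff > timeOut) s"$timestamp;" else timestamp.toString}
}
val windowDef = Window.partitionBy("userId").orderBy("timestamp")
// Timestamp of the next row in the same partition (null on the last row).
val diff: Column = lead(eventSequence("timestamp"), 1).over(windowDef)
// GroupConcat is a custom aggregate (see the link below), not a Spark built-in.
val dfTSDiff = eventSequence.
  withColumn("time_diff", diff - eventSequence("timestamp")).
  withColumn("event_seq", handleDiff(TIME_OUT)(col("time_diff"), col("timestamp"))).
  groupBy("userId").agg(GroupConcat(col("event_seq")).alias("event_seqs"))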
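To make the lead step concrete, here is a minimal plain-Scala sketch (no Spark) that mimics it on the sample data from the question. The object LeadSketch and the function leadDiffs are hypothetical names introduced only for illustration: events are grouped by user, sorted by timestamp, and each timestamp is paired with the gap to the next one, or None for the last event of that user.

```scala
object LeadSketch {
  // Hypothetical helper: for each user, pair every timestamp with the gap
  // to the next timestamp of the same user (None for the user's last event).
  // This imitates lead(timestamp, 1) over a window partitioned by userId
  // and ordered by timestamp, followed by subtracting the current timestamp.
  def leadDiffs(events: Seq[(String, Int)]): Map[String, Seq[(Int, Option[Int])]] =
    events.groupBy(_._1).map { case (user, rows) =>
      val ts = rows.map(_._2).sorted
      // Shift the sorted timestamps by one position; the last row has no successor.
      val next: Seq[Option[Int]] = ts.drop(1).map(Option(_)) :+ None
      user -> ts.zip(next).map { case (t, n) => (t, n.map(_ - t)) }
    }

  def main(args: Array[String]): Unit = {
    // (userId, timestamp) pairs taken from the question's sample DataFrame.
    val sample = Seq(("U1", 1), ("U2", 2), ("U1", 5), ("U3", 8), ("U1", 20), ("U2", 23))
    leadDiffs(sample).toSeq.sortBy(_._1).foreach(println)
  }
}
```

For U1 the gaps are 4, 15, and none, so with TIMEOUT = 10 only the gap after timestamp 5 ends a session.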

Update: then use a Window function to apply a "cumsum"-like operation (as provided in Pandas):

// Define a Window partitioned by userId (partitionBy) and ordered by timestamp (orderBy), whose frame (rowsBetween) covers every row from the start of the partition up to and including the current row
val windowSpec = Window.partitionBy("userId").orderBy("timestamp").rowsBetween(Long.MinValue, 0)
// genTSFlag and genSessionId are helper UDFs whose definitions are not shown in this post
val sessionDf = dfTSDiff.
  withColumn("ts_diff_flag", genTSFlag(TIME_OUT)(col("time_diff"))).
  select(col("userId"), col("eventSeq"), col("timestamp"), sum("ts_diff_flag").over(windowSpec).alias("sessionInteger")).
  withColumn("sessionId", genSessionId(col("userId"), col("sessionInteger")))
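Since the definitions of genTSFlag and genSessionId are not included in the post, here is a self-contained sketch of the cumulative-sum idea under my own assumptions. It uses lag (the gap to the previous event) instead of lead, so that the flag lands on the first row of a new session, and it replaces the two UDFs with the built-in when and concat functions. The object name SessionizeSketch, the function sessionize, and the "userId-N" id format are all hypothetical. Note that the resulting ids are unique per user rather than numbered globally (S1 to S5) as in the question.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, concat, lag, lit, sum, when}

object SessionizeSketch {
  // Assumes columns userId (string) and timestamp (numeric); timeOut plays the role of TIMEOUT.
  def sessionize(events: DataFrame, timeOut: Int): DataFrame = {
    // Ordered window per user, used to look at the previous event.
    val byUser = Window.partitionBy("userId").orderBy("timestamp")
    // Same window with an explicit frame: partition start up to and including the current row.
    val running = byUser.rowsBetween(Window.unboundedPreceding, Window.currentRow)

    events
      // Gap to the previous event of the same user (null for the user's first event).
      .withColumn("time_diff", col("timestamp") - lag(col("timestamp"), 1).over(byUser))
      // 1 when the gap exceeds timeOut, otherwise 0 (a null gap also yields 0).
      .withColumn("ts_diff_flag", when(col("time_diff") > timeOut, 1).otherwise(0))
      // Running sum of the flags: a per-user session counter starting at 0.
      .withColumn("sessionInteger", sum("ts_diff_flag").over(running))
      // Hypothetical id format: "<userId>-<sessionInteger>".
      .withColumn("sessionId", concat(col("userId"), lit("-"), col("sessionInteger").cast("string")))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("sessionize-sketch").getOrCreate()
    import spark.implicits._

    // Sample data from the question.
    val eventSequence = Seq(
      ("U1", "A", 1), ("U2", "B", 2), ("U1", "C", 5),
      ("U3", "A", 8), ("U1", "D", 20), ("U2", "B", 23)
    ).toDF("userId", "event", "timestamp")

    sessionize(eventSequence, 10).orderBy("timestamp").show(false)
    spark.stop()
  }
}
```

With the sample data and a timeout of 10, U1's events at timestamps 1 and 5 share the id U1-0, while its event at 20 gets U1-1. Window.unboundedPreceding and Window.currentRow require Spark 2.1 or later; on older versions, rowsBetween(Long.MinValue, 0) as in the answer describes the same frame.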

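Previously: then split by ";" to get each session and create a sessionId; then split by "," and explode into the final result. So the sessionId was created with the help of string operations. (This part should be replaced by a cumulative sum operation, but I had not found a good solution.)

Any ideas or thoughts on this question are welcome.

GroupConcat can be found here: SPARK SQL replacement for mysql GROUP_CONCAT aggregate function

Reference: databricks introduction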

Answer 2

dt.withColumn('sessionId', expression for the new column sessionId)
For example:
dt.timestamp + predefined value TIMEOUT
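In Scala, the suggestion above could be written roughly as follows. This is only a sketch: the object WithColumnSketch and the function addTimeoutColumn are hypothetical names, and dt is assumed to have a numeric timestamp column. It shows how a derived column is added with withColumn; the value timestamp + TIMEOUT is not by itself a session identifier.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, lit}

object WithColumnSketch {
  // Adds a column named sessionId whose value is timestamp + timeout,
  // mirroring the example expression given in this answer.
  def addTimeoutColumn(dt: DataFrame, timeout: Int): DataFrame =
    dt.withColumn("sessionId", col("timestamp") + lit(timeout))
}
```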

Spark: How to create a sessionId based on userId and timestamp
