Spark: create a sessionId based on timestamp

Time: 2017-08-18 06:32:13

Tags: scala apache-spark spark-dataframe

I would like to do the following transformation: given a DataFrame recording whether a user is logged, my aim is to create a sessionId for each record, based on the timestamp and a predefined value TIMEOUT = 20.

A session period is defined as: [first record -> first record + TIMEOUT]

For example, the original DataFrame looks like this:

scala> val df = sc.parallelize(List(
  ("user1",0),
  ("user1",3),
  ("user1",15),
  ("user1",22),
  ("user1",28),
  ("user1",41),
  ("user1",45),
  ("user1",85),
  ("user1",90)
)).toDF("user_id","timestamp")

df: org.apache.spark.sql.DataFrame = [user_id: string, timestamp: int]

+-------+---------+
|user_id|timestamp|
+-------+---------+
|user1  |0        |
|user1  |3        |
|user1  |15       |
|user1  |22       |
|user1  |28       |
|user1  |41       |
|user1  |45       |
|user1  |85       |
|user1  |90       |
+-------+---------+

The goal is:

+-------+---------+----------+
|user_id|timestamp|session_id|
+-------+---------+----------+
|user1  |0        |   0      |-> first record (session 0: period [0 -> 20])
|user1  |3        |   0      |
|user1  |15       |   0      |
|user1  |22       |   1      |-> 22 not in [0 -> 20] -> new session (period [22 -> 42])
|user1  |28       |   1      |
|user1  |41       |   1      |
|user1  |45       |   2      |-> 45 not in [22 -> 42] -> new session (period [45 -> 65])
|user1  |85       |   3      |-> 85 not in [45 -> 65] -> new session (period [85 -> 105])
|user1  |90       |   3      |
+-------+---------+----------+
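
To make the rule concrete, here is a plain-Scala sketch (illustration only, not a Spark solution) that reproduces the session_id column by scanning the sorted timestamps with TIMEOUT = 20:

val TIMEOUT = 20
val ts = List(0, 3, 15, 22, 28, 41, 45, 85, 90)

// Walk the sorted timestamps; a timestamp outside the current window
// [start, start + TIMEOUT] opens a new session.
val sessionIds = ts.tail.scanLeft((ts.head, 0)) { case ((start, id), t) =>
  if (t > start + TIMEOUT) (t, id + 1) else (start, id)
}.map(_._2)
// sessionIds == List(0, 0, 0, 1, 1, 1, 2, 3, 3)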

Is there any elegant solution to this problem, preferably in Scala?

Thanks in advance!

2 answers:

Answer 0 (score: 0)

This may not be an elegant solution, but it works for the given data format:

sc.parallelize(List(
      ("user1", 0),
      ("user1", 3),
      ("user1", 15),
      ("user1", 22),
      ("user1", 28),
      ("user1", 41),
      ("user1", 45),
      ("user1", 85),
      ("user1", 90))).toDF("user_id", "timestamp").map { x =>
      val userId = x.getAs[String]("user_id")
      val timestamp = x.getAs[Int]("timestamp")
      val session = timestamp / 20
      (userId, timestamp, session)
    }.toDF("user_id", "timestamp", "session").show()

Result

+-------+---------+-------+
|user_id|timestamp|session|
+-------+---------+-------+
|  user1|        0|      0|
|  user1|        3|      0|
|  user1|       15|      0|
|  user1|       22|      1|
|  user1|       28|      1|
|  user1|       41|      2|
|  user1|       45|      2|
|  user1|       85|      4|
|  user1|       90|      4|
+-------+---------+-------+

You can change timestamp / 20 as needed.
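
For reference, the same fixed-bucket idea can also be written with the DataFrame API instead of a typed map (a minimal sketch, assuming spark.implicits._ is in scope and df is the DataFrame from the question):

import org.apache.spark.sql.functions.floor

// floor(timestamp / 20) assigns each record to a fixed 20-unit bucket,
// equivalent to the integer division above for non-negative timestamps.
df.withColumn("session", floor($"timestamp" / 20)).show()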

Answer 1 (score: 0)

Please see my code below. There are two issues here:

1. I think the performance is not good.
2. I join on "user_id"; if that does not fit your requirements, you can add a new column with the same constant value to both timeSetFrame and newSessionSec instead (see the sketch after the code).

import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.udf
import scala.util.control.Breaks

import ss.implicits._  // ss is the SparkSession; provides $"..." and toDF

var newSession = ss.sparkContext.parallelize(List(
  ("user1", 0), ("user1", 3), ("user1", 15), ("user1", 22),
  ("user1", 28), ("user1", 41), ("user1", 45), ("user1", 85),
  ("user1", 90))).zipWithIndex().toDF("tmp", "index")

// Unpack the (user_id, timestamp) tuple stored in the struct column "tmp".
val getUser_id = udf( ( s : Row) => {
  s.getString(0)
})

val gettimestamp = udf( ( s : Row) => {
  s.getInt(1)
})
val newSessionSec = newSession.withColumn( "user_id", getUser_id($"tmp"))
  .withColumn( "timestamp", gettimestamp($"tmp")).drop( "tmp")  //.show()

// Collect every timestamp into one array; the UDF below assumes this
// array is in ascending order.
val timeSet : Array[Int] = newSessionSec.select("timestamp").collect().map( s => s.getInt(0))
val timeSetFrame = ss.sparkContext.parallelize( Seq(( "user1", timeSet))).toDF( "user_id", "tset")
// Attach the full timestamp array to every row via a join on user_id.
val newSessionThird = newSessionSec.join( timeSetFrame, Seq("user_id"), "outer")  // .show

// For a timestamp ts, replay the scan over the full sorted timestamp list:
// count how many sessions have opened by the time ts is reached.
val getSessionID = udf( ( ts: Int, aa: Seq[Int]) => {
  var result = 0
  var begin = 0
  val loop = new Breaks
  loop.breakable {
    for (time <- aa) {
      // A timestamp outside [begin, begin + 20] opens a new session.
      if (time > (begin + 20)) {
        begin = time
        result += 1
      }
      if (time == ts) {
        loop.break
      }
    }
  }
  result
})
newSessionThird.withColumn( "sessionID", getSessionID( $"timestamp", $"tset")).drop("tset", "index").show()
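
If joining on user_id does not fit your data, here is a minimal sketch of the constant-key variant mentioned above (assuming org.apache.spark.sql.functions.lit is imported; the column name "k" is arbitrary):

import org.apache.spark.sql.functions.lit

// Join on a constant column so the timestamp array is attached to every
// row, independent of the user_id values.
val joinedOnConst = newSessionSec.withColumn("k", lit(1))
  .join(timeSetFrame.withColumn("k", lit(1)), Seq("k"), "outer")
  .drop("k")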