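I would like to perform the following transformation. Given a DataFrame that records whether a user is logged in, my goal is to create a sessionId for each record based on the timestamp and a predefined value TIMEOUT = 20.
A session period is defined as: [first record -> first record + timeout]
For example, the original DataFrame looks like this: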
scala> val df = sc.parallelize(List(
("user1",0),
("user1",3),
("user1",15),
("user1",22),
("user1",28),
("user1",41),
("user1",45),
("user1",85),
("user1",90)
)).toDF("user_id","timestamp")
df: org.apache.spark.sql.DataFrame = [user_id: string, timestamp: int]
+-------+---------+
|user_id|timestamp|
+-------+---------+
|user1 |0 |
|user1 |3 |
|user1 |15 |
|user1 |22 |
|user1 |28 |
|user1 |41 |
|user1 |45 |
|user1 |85 |
|user1 |90 |
+-------+---------+
The goal is:
+-------+---------+----------+
|user_id|timestamp|session_id|
+-------+---------+----------+
|user1 |0 | 0 |-> first record (session 0: period [0->20])
|user1 |3 | 0 |
|user1 |15 | 0 |
|user1 |22 | 1 |-> 22 not in [0->20] -> new session (period [22->42])
|user1 |28 | 1 |
|user1 |41 | 1 |
|user1 |45 | 2 |-> 45 not in [22->42] -> new session (period [45->65])
|user1 |85 | 3 |-> 85 not in [45->65] -> new session (period [85->105])
|user1 |90 | 3 |
+-------+---------+----------+
Is there an elegant solution to this problem, preferably in Scala?
Thanks in advance!
Answer 0 (score: 0)
This may not be an elegant solution, but it works for the given data format.
sc.parallelize(List(
  ("user1", 0),
  ("user1", 3),
  ("user1", 15),
  ("user1", 22),
  ("user1", 28),
  ("user1", 41),
  ("user1", 45),
  ("user1", 85),
  ("user1", 90))).toDF("user_id", "timestamp").map { x =>
    val userId = x.getAs[String]("user_id")
    val timestamp = x.getAs[Int]("timestamp")
    // Integer division puts each timestamp into a fixed 20-unit bucket
    val session = timestamp / 20
    (userId, timestamp, session)
  }.toDF("user_id", "timestamp", "session").show()
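Result
You can change timestamp / 20 as needed. Note that integer division assigns fixed 20-unit buckets (0-19, 20-39, ...), so on the sample data 41 gets session 2 and 85 gets session 4, rather than the 1 and 3 shown in the question.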
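To make the difference between fixed buckets and the sessions described in the question concrete, here is a minimal plain-Scala sketch (no Spark) that computes both on the sample timestamps. The names SessionCompare, fixedBuckets and anchoredSessions are hypothetical, and the sketch assumes the input is sorted ascending and that a window is inclusive at start + timeout, as the [0->20] example in the question suggests.

```scala
object SessionCompare {
  val Timeout = 20

  // Fixed buckets: what `timestamp / 20` computes for each record.
  def fixedBuckets(ts: Seq[Int]): Seq[Int] = ts.map(_ / Timeout)

  // Anchored sessions: a new session starts at the first timestamp that
  // falls outside [start, start + Timeout] of the current session.
  // Assumption: `ts` is sorted ascending.
  def anchoredSessions(ts: Seq[Int]): Seq[Int] =
    if (ts.isEmpty) Seq.empty
    else
      ts.tail
        .scanLeft((ts.head, 0)) { case ((start, id), t) =>
          if (t > start + Timeout) (t, id + 1) else (start, id)
        }
        .map(_._2)

  def main(args: Array[String]): Unit = {
    val ts = Seq(0, 3, 15, 22, 28, 41, 45, 85, 90)
    println(fixedBuckets(ts).mkString(","))     // prints 0,0,0,1,1,2,2,4,4
    println(anchoredSessions(ts).mkString(",")) // prints 0,0,0,1,1,1,2,3,3
  }
}
```

The second line of output matches the session_id column in the question. The state carried through scanLeft is the pair (start of current session, current id), which is why the computation is sequential per user and cannot be expressed as a per-row formula.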
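Answer 1 (score: 0)
Please see my code. There are two issues here: 1. I think the performance is poor. 2. I join on "user_id"; if that does not fit your requirements, you can add a new column with the same value to both timeSetFrame and newSessionSec and join on that instead.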
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.udf
import scala.util.control.Breaks
// ss is an existing SparkSession
import ss.implicits._

val newSession = ss.sparkContext.parallelize(List(
  ("user1", 0), ("user1", 3), ("user1", 15), ("user1", 22),
  ("user1", 28), ("user1", 41), ("user1", 45), ("user1", 85),
  ("user1", 90))).zipWithIndex().toDF("tmp", "index")

// Unpack the (user_id, timestamp) struct held in the "tmp" column
val getUser_id = udf((s: Row) => s.getString(0))
val gettimestamp = udf((s: Row) => s.getInt(1))
val newSessionSec = newSession.withColumn("user_id", getUser_id($"tmp"))
  .withColumn("timestamp", gettimestamp($"tmp")).drop("tmp")

// Collect every timestamp to the driver (sorted) and attach the whole array to each row
val timeSet: Array[Int] = newSessionSec.select("timestamp").collect().map(s => s.getInt(0)).sorted
val timeSetFrame = ss.sparkContext.parallelize(Seq(("user1", timeSet))).toDF("user_id", "tset")
val newSessionThird = newSessionSec.join(timeSetFrame, Seq("user_id"), "outer")

// Walk the timestamps in order, starting a new session whenever one falls
// outside [begin, begin + 20], and stop once the current row's timestamp is reached
val getSessionID = udf((ts: Int, aa: Seq[Int]) => {
  var result = 0
  // The first session starts at the earliest timestamp (0 in the sample data)
  var begin = aa.headOption.getOrElse(0)
  val loop = new Breaks
  loop.breakable {
    for (time <- aa) {
      if (time > (begin + 20)) {
        begin = time
        result += 1
      }
      if (time == ts) {
        loop.break()
      }
    }
  }
  result
})

newSessionThird.withColumn("sessionID", getSessionID($"timestamp", $"tset")).drop("tset", "index").show()
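As a possible alternative that avoids collecting all timestamps to the driver and hard-coding "user1", the sketch below groups rows by user and assigns ids inside each group with the typed Dataset API (groupByKey and flatMapGroups, available since Spark 2.0). This is only a sketch under stated assumptions: each user's rows are assumed to fit in memory on one executor, the window is treated as inclusive at start + timeout, and the names SessionIds and assign are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

object SessionIds {
  val Timeout = 20

  // Assigns a session id to each timestamp of ONE user.
  // Sorts first, because the order of rows inside a group is not guaranteed.
  def assign(timestamps: Seq[Int]): Seq[(Int, Int)] = {
    val sorted = timestamps.sorted
    if (sorted.isEmpty) Seq.empty
    else {
      // State is (start of current session, current session id).
      val ids = sorted.tail
        .scanLeft((sorted.head, 0)) { case ((start, id), t) =>
          if (t > start + Timeout) (t, id + 1) else (start, id)
        }
        .map(_._2)
      sorted.zip(ids)
    }
  }

  def main(args: Array[String]): Unit = {
    // Local session for illustration only; in a real job reuse the existing one.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("session-ids")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(
      ("user1", 0), ("user1", 3), ("user1", 15), ("user1", 22),
      ("user1", 28), ("user1", 41), ("user1", 45), ("user1", 85),
      ("user1", 90)).toDF("user_id", "timestamp")

    val result = df.as[(String, Int)]   // tuple columns are mapped by position
      .groupByKey(_._1)                 // one group per user_id
      .flatMapGroups { (user, rows) =>
        assign(rows.map(_._2).toVector).map { case (ts, id) => (user, ts, id) }
      }
      .toDF("user_id", "timestamp", "session_id")

    result.orderBy("user_id", "timestamp").show()
    spark.stop()
  }
}
```

Keeping the session logic in a plain function makes it easy to check without a cluster; for example, assign(Seq(0, 3, 15, 22, 28, 41, 45, 85, 90)) pairs the timestamps with the ids 0, 0, 0, 1, 1, 1, 2, 3, 3 from the question.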