I want to compute the time difference between session events using Scala.
- GIVEN: the source is a CSV file that looks like this:
HEADER
"session","events","timestamp","Records"
DATA
"session_1","event_1","2015-01-01 10:10:00",100
"session_1","event_2","2015-01-01 11:00:00",500
"session_1","event_3","2015-01-01 11:30:00",300
"session_1","event_4","2015-01-01 11:45:00",300
"session_2","event_1","2015-01-01 10:10:00",100
"session_2","event_2","2015-01-01 11:00:00",500
- REQUIRED output:
HEADER
"session","events","time_spent_in_minutes","total_records"
DATA
"session_1","event_1","50",100
"session_1","event_2","30",600
"session_1","event_3","15",900
"session_1","event_4","0",1200
"session_2","event_1","50",100
"session_2","event_2","0",600
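Here time_spent_in_minutes is the difference between the current event and the next event within a given session, and total_records is the running total of Records within that session. A header is not required in the target, but it would be nice to have.
I am new to Scala, so this is as far as I have got: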
$ cat test.csv
"session_1","event_1","2015-01-01 10:10:00",100
"session_1","event_2","2015-01-01 11:00:00",500
"session_1","event_3","2015-01-01 11:30:00",300
"session_1","event_4","2015-01-01 11:45:00",300
"session_2","event_1","2015-01-01 10:10:00",100
"session_2","event_2","2015-01-01 11:00:00",500
scala> val sessionFile = sc.textFile("test.csv").
         map(_.split(',')).
         map(e => (e(1).trim, Sessions(e(0).trim,e(1).trim,e(2).trim,e(3).trim.toInt))).
         foreach(println)
("event_1",Sessions("session_2","event_1","2015-01-01 10:10:00",100))
("event_1",Sessions("session_1","event_1","2015-01-01 10:10:00",100))
("event_2",Sessions("session_2","event_2","2015-01-01 11:00:00",500))
("event_2",Sessions("session_1","event_2","2015-01-01 11:00:00",500))
("event_3",Sessions("session_1","event_3","2015-01-01 11:30:00",300))
("event_4",Sessions("session_1","event_4","2015-01-01 11:45:00",300))
sessionFile: Unit = ()
scala>
Answer 0 (score: 3)
Here is a solution that uses the Joda-Time library.
val input =
""""session_1","event_1","2015-01-01 10:10:00",100
"session_1","event_2","2015-01-01 11:00:00",500
"session_1","event_3","2015-01-01 11:30:00",300
"session_1","event_4","2015-01-01 11:45:00",300
"session_2","event_1","2015-01-01 10:10:00",100
"session_2","event_2","2015-01-01 11:00:00",500"""
Create an RDD from the text input; sc.textFile could be used instead when reading from a file.
import org.joda.time.format._
import org.joda.time._
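// The pattern contains literal double quotes because the CSV fields
// keep their surrounding quotes after split(",").
def strToTime(s: String): Long = {
  DateTimeFormat.forPattern(""""yyyy-MM-dd HH:mm:ss"""")
    .parseDateTime(s).getMillis() / 1000
}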
val r1 = sc.parallelize(input.split("\n"))
  .map(_.split(","))
  .map(x => (x(0), (x(1), x(2), x(3))))
  .groupBy(_._1)
  .map(_._2.map{ case(s, (e, timestr, r)) =>
      (s, (e, strToTime(timestr), r))}
    .toArray
    .sortBy( z => z match {
      case (session, (event, time, records)) => time}))
This converts each timestamp such as "2015-01-01 10:10:00" to seconds since the epoch and sorts the events of each session by time.
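The conversion step can also be tried outside Spark. The following is a minimal sketch that swaps Joda-Time for java.time (Java 8+); the object name EpochSketch and the helper toEpochSeconds are made up for illustration, and it assumes the timestamps can be treated as UTC.

```scala
import java.time.{LocalDateTime, ZoneOffset}
import java.time.format.DateTimeFormatter

object EpochSketch {
  // Pattern for the bare timestamp, without the surrounding CSV quotes.
  private val fmt = DateTimeFormatter.ofPattern("uuuu-MM-dd HH:mm:ss")

  // Strips the CSV double quotes, then converts to seconds since the epoch.
  // Assumption: the timestamp is interpreted as UTC.
  def toEpochSeconds(quoted: String): Long =
    LocalDateTime.parse(quoted.replace("\"", ""), fmt).toEpochSecond(ZoneOffset.UTC)
}
```

Because only differences between timestamps are used later, the choice of time zone should not affect the result as long as it is applied consistently.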
val r2 = r1.map(x => x :+ { val y = x.last;
  y match {
    case (session, (event, time, records)) =>
      (session, (event, time, "0")) }})
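This appends one extra event to each session. It is identical to the session's last event in every field except the record count. Since it carries the same timestamp as the last real event, the duration calculation yields 0 for that last event.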
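The same trick can be sketched on a plain array, without Spark. The names SentinelSketch, Event, and withSentinel are illustrative only, and the sketch assumes the array is non-empty (groups produced by groupBy always contain at least one element).

```scala
object SentinelSketch {
  // (session, (event, epochSeconds, records)), mirroring the tuples in r1.
  type Event = (String, (String, Long, String))

  // Appends a copy of the last event with its record count set to "0",
  // so the last real event gets a successor with the same timestamp.
  // Assumption: events is non-empty; events.last throws otherwise.
  def withSentinel(events: Array[Event]): Array[Event] = {
    val (session, (event, time, _)) = events.last
    events :+ ((session, (event, time, "0")))
  }
}
```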
Use sliding to get pairs of consecutive events.
val r3 = r2.map(x => x.sliding(2).toArray)
val r4 = r3.map(x => x.map{
  case Array((s1, (e1, t1, c1)), (s2, (e2, t2, c2))) =>
    (s1, (e1, (t2 - t1)/60, c1)) } )
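To see what sliding(2) contributes, here is a small sketch on bare timestamps. SlidingSketch and gapsInMinutes are hypothetical names; the input is assumed to be epoch seconds that are already sorted, with the sentinel timestamp appended at the end.

```scala
object SlidingSketch {
  // Minutes between each timestamp (epoch seconds) and its successor.
  // Windows with fewer than two elements are skipped by collect.
  def gapsInMinutes(times: Seq[Long]): List[Long] =
    times.sliding(2).collect { case Seq(t1, t2) => (t2 - t1) / 60 }.toList
}
```

For the offsets 0, 3000, 4800, 5700, 5700 (the session_1 events relative to the first one, plus the sentinel), this gives List(50, 30, 15, 0).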
Use scan to accumulate the record counts incrementally.
val r5 = r4.map(x => x.zip(x.map{ case (s, (e, t, r)) => r.toInt}
  .scan(0)(_+_)
  .drop(1)))
val r6 = r5.map(x => x.map{ case ((s, (e, t, r)), recordstillnow) =>
  s"${s},${e},${t},${recordstillnow}" })
val r7 = r6.flatMap(x => x)
r7.collect.mkString("\n")
//"session_2","event_1",50,100
//"session_2","event_2",0,600
//"session_1","event_1",50,100
//"session_1","event_2",30,600
//"session_1","event_3",15,900
//"session_1","event_4",0,1200
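The running total produced by scan in r5 can be checked in isolation. This is a minimal sketch on a plain Seq; ScanSketch and runningTotals are names chosen here for illustration.

```scala
object ScanSketch {
  // scan emits the seed (0) as its first element, so drop it
  // to keep one running total per input record count.
  def runningTotals(records: Seq[Int]): Seq[Int] =
    records.scan(0)(_ + _).drop(1)
}
```

With the session_1 record counts 100, 500, 300, 300 this yields 100, 600, 900, 1200, matching the total_records column above.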
Answer 1 (score: 0)
Try something like this:
import org.joda.time.format._
import org.joda.time._
val d1 = DateTime.parse("2015-03-03", DateTimeFormat.forPattern("yyyy-MM-dd"))
val d2 = DateTime.parse("2015-03-04", DateTimeFormat.forPattern("yyyy-MM-dd"))
d1.getMillis() - d2.getMillis()
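The subtraction above yields milliseconds, and it is negative here because d1 is earlier than d2. If the goal is whole minutes between two timestamps, one possible variant (a sketch that uses java.time instead of Joda-Time, with the hypothetical names MinutesSketch and minutesBetween) is:

```scala
import java.time.{Duration, LocalDateTime}
import java.time.format.DateTimeFormatter

object MinutesSketch {
  private val fmt = DateTimeFormatter.ofPattern("uuuu-MM-dd HH:mm:ss")

  // Whole minutes from start to end; negative if end is before start.
  def minutesBetween(start: String, end: String): Long =
    Duration.between(LocalDateTime.parse(start, fmt), LocalDateTime.parse(end, fmt)).toMinutes
}
```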