How do I calculate the time difference between two records using Scala?

Time: 2015-04-09 21:29:15

Tags: scala apache-spark

I want to calculate the time difference between the events of a session using Scala.

- GIVEN: the source is a CSV file that looks like this:

HEADER 
"session","events","timestamp","Records"
DATA
"session_1","event_1","2015-01-01 10:10:00",100
"session_1","event_2","2015-01-01 11:00:00",500
"session_1","event_3","2015-01-01 11:30:00",300
"session_1","event_4","2015-01-01 11:45:00",300
"session_2","event_1","2015-01-01 10:10:00",100
"session_2","event_2","2015-01-01 11:00:00",500

Required output

HEADER 
"session","events","time_spent_in_minutes","total_records"
DATA
"session_1","event_1","50",100
"session_1","event_2","30",600
"session_1","event_3","15",900
"session_1","event_4","0",1200
"session_2","event_1","50",100
"session_2","event_2","0",600

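Here time_spent_in_minutes is the difference between the current event and the next event of the same session. A header is not required in the target, but it would be nice to have.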

I am new to Scala, so this is as far as I have got:

$ cat test.csv
"session_1","event_1","2015-01-01 10:10:00",100
"session_1","event_2","2015-01-01 11:00:00",500
"session_1","event_3","2015-01-01 11:30:00",300
"session_1","event_4","2015-01-01 11:45:00",300
"session_2","event_1","2015-01-01 10:10:00",100
"session_2","event_2","2015-01-01 11:00:00",500


scala> val sessionFile = sc.textFile("test.csv").
map(_.split(',')).
map(e => (e(1).trim, Sessions(e(0).trim,e(1).trim,e(2).trim,e(3).trim.toInt))).
foreach(println)

("event_1",Sessions("session_2","event_1","2015-01-01 10:10:00",100))
("event_1",Sessions("session_1","event_1","2015-01-01 10:10:00",100))
("event_2",Sessions("session_2","event_2","2015-01-01 11:00:00",500))
("event_2",Sessions("session_1","event_2","2015-01-01 11:00:00",500))
("event_3",Sessions("session_1","event_3","2015-01-01 11:30:00",300))
("event_4",Sessions("session_1","event_4","2015-01-01 11:45:00",300))
sessionFile: Unit = ()

scala>

2 answers:

Answer 0 (score: 3)

Here is a solution that uses the Joda-Time library.

val input = 
""""session_1","event_1","2015-01-01 10:10:00",100
   "session_1","event_2","2015-01-01 11:00:00",500
   "session_1","event_3","2015-01-01 11:30:00",300
   "session_1","event_4","2015-01-01 11:45:00",300
   "session_2","event_1","2015-01-01 10:10:00",100
   "session_2","event_2","2015-01-01 11:00:00",500"""

Create an RDD from the text input. To read from a file instead, use sc.textFile.

import org.joda.time.format._
import org.joda.time._

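// The pattern is wrapped in literal double quotes because the CSV timestamp
// field still carries its quotes, e.g. "2015-01-01 10:10:00".
def strToTime(s: String): Long = { 
    DateTimeFormat.forPattern(""""yyyy-MM-dd HH:mm:ss"""")
                  .parseDateTime(s).getMillis() / 1000  // milliseconds to seconds
}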

val r1 = sc.parallelize(input.split("\n"))
           .map(_.trim.split(","))  // trim drops the indentation of the multi-line literal
           .map(x => (x(0), (x(1), x(2), x(3))))
           .groupBy(_._1)           // group the events by session
           .map(_._2.map{ case(s, (e, timestr, r)) => 
                              (s, (e, strToTime(timestr), r))}
                    .toArray
                    .sortBy{ case (session, (event, time, records)) => time })

This converts each timestamp such as "2015-01-01 10:10:00" to seconds since the epoch and sorts the events of every session by time.
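
On Java 8 or later, the same conversion can be sketched with java.time instead of Joda-Time. This is a swapped-in alternative, not what the code above uses. The names `timestampFormat`, `toEpochSeconds` and the sample `events` are illustrative, and treating the timestamps as UTC is an assumption. It runs on plain Scala collections, without Spark.

```scala
import java.time.{LocalDateTime, ZoneOffset}
import java.time.format.DateTimeFormatter

// Pattern for the timestamp text once the CSV quotes are removed.
val timestampFormat = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss")

// Hypothetical helper: strips the surrounding CSV quotes and returns
// seconds since the epoch. Assumption: timestamps are in UTC.
def toEpochSeconds(field: String): Long =
  LocalDateTime
    .parse(field.trim.stripPrefix("\"").stripSuffix("\""), timestampFormat)
    .toEpochSecond(ZoneOffset.UTC)

// Two events given out of order, to show the sort.
val events = Seq(
  ("event_2", "\"2015-01-01 11:00:00\""),
  ("event_1", "\"2015-01-01 10:10:00\"")
)

// Convert every timestamp, then order the events by time.
val sortedByTime = events
  .map { case (event, ts) => (event, toEpochSeconds(ts)) }
  .sortBy { case (_, seconds) => seconds }
```

Stripping the quotes before parsing keeps the pattern free of literal quote characters, at the cost of one extra string operation per field.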

val r2 = r1.map(x => x :+ { val y = x.last; 
                            y match { 
                            case (session, (event, time, records)) => 
                                 (session, (event, time, "0")) }})

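This appends one extra event to each session. All of its fields are the same as those of the session's last event, except for the record count. As a result, the duration calculation yields "0" for the last event.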

Use sliding to get pairs of consecutive events.

val r3 = r2.map(x => x.sliding(2).toArray)

val r4 = r3.map(x => x.map{ 
        case Array((s1, (e1, t1, c1)), (s2, (e2, t2, c2)))  => 
                   (s1, (e1, (t2 - t1)/60, c1)) } )
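
The padding and pairing steps can be tried on a local collection, without Spark. A minimal sketch, assuming the events of one session are already sorted and the session is not empty. The names `sessionEvents`, `padded` and `minutesSpent` are made up for illustration, and the times are seconds since midnight to keep the numbers small.

```scala
// (event, time in seconds) for one session, already sorted by time.
val sessionEvents = Vector(
  ("event_1", 36600L), // 10:10:00
  ("event_2", 39600L), // 11:00:00
  ("event_3", 41400L), // 11:30:00
  ("event_4", 42300L)  // 11:45:00
)

// Repeat the last event so that it also has a successor.
// Assumption: the session has at least one event, otherwise .last throws.
val padded = sessionEvents :+ sessionEvents.last

// Each window holds an event and its successor; the difference is the time spent.
val minutesSpent = padded
  .sliding(2)
  .collect { case Seq((event, start), (_, next)) => (event, (next - start) / 60) }
  .toVector
```

Because the last event is paired with a copy of itself, its difference is 0, which matches the required output.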

Use scan to add up the record counts incrementally.

val r5 = r4.map(x => x.zip(x.map{ case (s, (e, t, r)) => r.toInt}
                            .scan(0)(_+_)
                            .drop(1)))
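
The running total can be checked in isolation on a plain collection. A small sketch using the record counts of session_1 from the question; `records` and `runningTotals` are illustrative names.

```scala
// Record counts of session_1, in event order.
val records = Vector(100, 500, 300, 300)

// scanLeft emits the seed (0) first, followed by every partial sum,
// so the seed is dropped with tail.
val runningTotals = records.scanLeft(0)(_ + _).tail

// Pair each count with the total accumulated up to and including it.
val withTotals = records.zip(runningTotals)
```

Dropping the seed keeps the totals aligned with the events, so the zip pairs every event with its own cumulative count.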

val r6 = r5.map(x => x.map{ case ((s, (e, t, r)), recordstillnow) =>
                             s"${s},${e},${t},${recordstillnow}" })

val r7 = r6.flatMap(x => x)

r7.collect.mkString("\n")
//"session_2","event_1",50,100
//"session_2","event_2",0,600
//"session_1","event_1",50,100
//"session_1","event_2",30,600
//"session_1","event_3",15,900
//"session_1","event_4",0,1200

Answer 1 (score: 0)

Try something like this:

import org.joda.time.format._
import org.joda.time._
val d1 = DateTime.parse("2015-03-03", DateTimeFormat.forPattern("yyyy-MM-dd"))
val d2 = DateTime.parse("2015-03-04", DateTimeFormat.forPattern("yyyy-MM-dd"))
d2.getMillis() - d1.getMillis()  // difference in milliseconds; d2 is the later date
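
For a difference in minutes, as the question asks, a similar idea can be sketched with java.time (Java 8 or later) rather than Joda-Time. This is a swapped-in alternative under that assumption; `fmt`, `start`, `end` and `minutesBetween` are illustrative names, and the two timestamps are taken from the question's sample data.

```scala
import java.time.{Duration, LocalDateTime}
import java.time.format.DateTimeFormatter

// Pattern matching the question's timestamps (without the CSV quotes).
val fmt = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss")

val start = LocalDateTime.parse("2015-01-01 10:10:00", fmt)
val end   = LocalDateTime.parse("2015-01-01 11:00:00", fmt)

// Duration.between(earlier, later) is positive; toMinutes truncates to whole minutes.
val minutesBetween = Duration.between(start, end).toMinutes
```

Using `LocalDateTime` sidesteps time zones entirely, which is reasonable only if all timestamps in the file come from the same zone.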