I want to compute, for each id, the time difference between its in and its out. The data format is:
String,Long,String,List[String]
======================================
in, time0, door1, [id1, id2, id3, id4]
out, time1, door1, [id1, id2, id3]
out, time2, door1, [id4, id5]
In the end it should produce key-value pairs:
{(id1, #time1-time0), (id2, #time1-time0), (id3, #time1-time0), (id4, #time2-time0), (id5, N/A)}
What is a good way to solve this?
Edit: I tried the following:

case class Data(direction: String, time: Long, door: String, ids: List[String])

val data = sc.parallelize(Seq(
  Data("in", 5, "d1", List("id1", "id2", "id3", "id4")),
  Data("out", 20, "d1", List("id1", "id2", "id3")),
  Data("out", 50, "d1", List("id4", "id5"))))

data.flatMap(x => (x.ids, x))
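As a side note, the `flatMap` above does not typecheck: `(x.ids, x)` is a tuple, not a collection, so there is nothing for `flatMap` to flatten. A minimal sketch of the intended "one `(id, record)` pair per id" expansion, shown on a plain Scala `List` (which shares `flatMap` with the RDD API, so the same lambda works on the RDD):

```scala
case class Data(direction: String, time: Long, door: String, ids: List[String])

val data = List(
  Data("in", 5, "d1", List("id1", "id2", "id3", "id4")),
  Data("out", 20, "d1", List("id1", "id2", "id3")),
  Data("out", 50, "d1", List("id4", "id5")))

// For each record, emit one (id, record) pair per id, then flatten.
val pairs = data.flatMap(x => x.ids.map(id => (id, x)))
```

With the sample data this yields 4 + 3 + 2 = 9 pairs, one per id occurrence.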
Answer 0 (score: 0)
scala> case class Data( direction: String, time: Long, door: String, ids: List[ String ] )
defined class Data
scala> val data = sc.parallelize( Seq( Data( "in", 5, "d1", List( "id1", "id2", "id3", "id4" ) ), Data( "out", 20, "d1", List( "id1", "id2", "id3" ) ), Data( "out",50, "d1", List( "id4", "id5" ) ) ) )
data: org.apache.spark.rdd.RDD[Data] = ParallelCollectionRDD[0] at parallelize at <console>:14
// Get an RDD entry for each ( id, data ) pair
scala> data.flatMap( x => x.ids.map( id => ( id, x ) ) )
res0: org.apache.spark.rdd.RDD[(String, Data)] = FlatMappedRDD[1] at flatMap at <console>:17
// group by id to collect the records that share an id
scala> res0.groupBy( { case ( id, data ) => id } )
res1: org.apache.spark.rdd.RDD[(String, Iterable[(String, Data)])] = ShuffledRDD[3] at groupBy at <console>:19
// convert Iterable[(String, Data)] to List[Data]
scala> res1.map( { case ( id, iter ) => ( id, iter.toList.map( { case ( i, d ) => d } ) ) } )
res2: org.apache.spark.rdd.RDD[(String, List[Data])] = MappedRDD[4] at map at <console>:21
// sort each id's list of records by time
scala> res2.map( { case ( id, list ) => ( id, list.sortBy( d => d.time ) ) } )
res3: org.apache.spark.rdd.RDD[(String, List[Data])] = MappedRDD[5] at map at <console>:23
// get the time diff by doing lastData.time - firstData.time for each id
scala> :paste
// Entering paste mode (ctrl-D to finish)
res3.map( { case ( id, list ) => {
list match {
case d :: Nil => ( id, None )
case d :: tail => ( id, Some( list.last.time - d.time ) )
case _ => ( id, None )
}
} } )
// Exiting paste mode, now interpreting.
res6: org.apache.spark.rdd.RDD[(String, Option[Long])] = MappedRDD[7] at map at <console>:25
Now res6 holds the data you want.
Also... I'm not sure how you intend to use direction, so I didn't use it. Tweak the code a bit to get exactly what you want (I think only the last step on res3 needs a small change), or explain it here and maybe I'll update the answer. If you have any other questions... feel free to ask.
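If direction does matter (for example, an id could pass the same door through several in/out cycles), one hedged sketch is to pair each id's first "in" time with the first "out" that follows it, rather than just taking last minus first. Shown on plain Scala collections; the name `inOutDiff` is illustrative, and the same combinators exist on RDDs:

```scala
case class Data(direction: String, time: Long, door: String, ids: List[String])

val records = List(
  Data("in", 5, "d1", List("id1", "id2", "id3", "id4")),
  Data("out", 20, "d1", List("id1", "id2", "id3")),
  Data("out", 50, "d1", List("id4", "id5")))

// For each id: sort its records by time, then subtract the first "in"
// time from the first "out" that comes after it. Ids with no matching
// in/out pair (like id5, which only has an "out") map to None.
val inOutDiff: Map[String, Option[Long]] =
  records
    .flatMap(r => r.ids.map(id => (id, r)))
    .groupBy(_._1)
    .map { case (id, rs) =>
      val sorted = rs.map(_._2).sortBy(_.time)
      val diff = for {
        in  <- sorted.find(_.direction == "in")
        out <- sorted.find(r => r.direction == "out" && r.time > in.time)
      } yield out.time - in.time
      (id, diff)
    }
```

On the sample data this gives 15 for id1..id3, 45 for id4, and None for id5.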
It could also be written more concisely... but that would be harder to follow, which is why I gave the longer, simpler version.
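For reference, here is one more compact sketch of the same pipeline, fused into a single flatMap/groupBy/map chain. It is shown on plain Scala collections; the same shape carries over to the RDD (with `groupBy` returning a pair RDD instead of a `Map`):

```scala
case class Data(direction: String, time: Long, door: String, ids: List[String])

val records = List(
  Data("in", 5, "d1", List("id1", "id2", "id3", "id4")),
  Data("out", 20, "d1", List("id1", "id2", "id3")),
  Data("out", 50, "d1", List("id4", "id5")))

// Explode ids into (id, time) pairs, group by id, then take the
// difference between the latest and earliest timestamps per id.
// An id seen only once (like id5) gets None.
val diffs: Map[String, Option[Long]] =
  records
    .flatMap(r => r.ids.map(id => (id, r.time)))
    .groupBy(_._1)
    .map { case (id, ts) =>
      val times = ts.map(_._2).sorted
      (id, if (times.size > 1) Some(times.last - times.head) else None)
    }
```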