Unpack the lists, pair up and compute the differences

Date: 2015-02-18 19:26:21

Tags: scala functional-programming apache-spark

I want to compute the time difference between the in and out events for each id. The data has the format:

String,Long,String,List[String]
======================================
 in, time0, door1, [id1, id2, id3, id4]
out, time1, door1, [id1, id2, id3]
out, time2, door1, [id4, id5]

In the end it should produce key-value pairs like:

{(id1, #time1-time0), (id2, #time1-time0), (id3, #time1-time0), (id4, #time2-time0), (id5, N/A)}

What is a good way to solve this?

Edit: I tried the following.

case class Data(direction: String, time: Long, door: String, ids: List[String])
val data = sc.parallelize(Seq(Data("in", 5, "d1", List("id1", "id2", "id3", "id4")), Data("out", 20, "d1", List("id1", "id2", "id3")), Data("out", 50, "d1", List("id4", "id5"))))
data.flatMap(x => (x.ids, x)) // does not compile: flatMap expects a collection of results, not a tuple

1 Answer:

Answer 0 (score: 0)

scala> case class Data( direction: String, time: Long, door: String, ids: List[ String ] )
defined class Data

scala> val data = sc.parallelize( Seq( Data( "in", 5, "d1", List( "id1", "id2", "id3", "id4" ) ), Data( "out", 20, "d1", List( "id1", "id2", "id3" ) ), Data( "out",50, "d1", List( "id4", "id5" ) ) ) )
data: org.apache.spark.rdd.RDD[Data] = ParallelCollectionRDD[0] at parallelize at <console>:14

// Get an RDD entry for each ( id, data ) pair
scala> data.flatMap( x => x.ids.map( id => ( id, x ) ) )
res0: org.apache.spark.rdd.RDD[(String, Data)] = FlatMappedRDD[1] at flatMap at <console>:17

// group by id to collect the Data records that share an id
scala> res0.groupBy( { case ( id, data ) => id } )
res1: org.apache.spark.rdd.RDD[(String, Iterable[(String, Data)])] = ShuffledRDD[3] at groupBy at <console>:19

// convert Iterable[(String, Data)] to List[Data]
scala> res1.map( { case ( id, iter ) => ( id, iter.toList.map( { case ( i, d ) => d } ) ) } )
res2: org.apache.spark.rdd.RDD[(String, List[Data])] = MappedRDD[4] at map at <console>:21

// sort each id's list of Data records by time
scala> res2.map( { case ( id, list ) => ( id, list.sortBy( d => d.time ) ) } )
res3: org.apache.spark.rdd.RDD[(String, List[Data])] = MappedRDD[5] at map at <console>:23

// get the time diff by doing lastData.time - firstData.time for each id
scala> :paste
// Entering paste mode (ctrl-D to finish)

res3.map( { case ( id, list ) => {
    list match {
        case d :: Nil => ( id, None )
        case d :: tail => ( id, Some( list.last.time - d.time ) )
        case _ => ( id, None )
    }
} } )

// Exiting paste mode, now interpreting.

res6: org.apache.spark.rdd.RDD[(String, Option[Long])] = MappedRDD[7] at map at <console>:25

Now res6 holds the data you want.
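
To check it against the sample input, you can collect the result on the driver. With the data above this should print something like the following (the order of collect's output is not guaranteed):

scala> res6.collect().foreach( println )
(id1,Some(15))
(id2,Some(15))
(id3,Some(15))
(id4,Some(45))
(id5,None)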

Also, I am not sure how you intend to use direction, so I did not use it. Modify the code a bit to get what you want (I think only the last res3 step needs to change slightly), or explain it here and maybe I will update the answer. If you have any more questions, feel free to ask.
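
For instance, if the difference should run from the first "in" to the last "out" per id, the res3 step might be changed to something like this (a sketch only, assuming at most one in/out cycle per id):

res3.map( { case ( id, list ) => {
    val firstIn = list.find( _.direction == "in" )           // list is already sorted by time
    val lastOut = list.reverse.find( _.direction == "out" )  // latest "out", if any
    ( firstIn, lastOut ) match {
        case ( Some( i ), Some( o ) ) if o.time >= i.time => ( id, Some( o.time - i.time ) )
        case _ => ( id, None )                               // no matching in/out pair
    }
} } )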

It could also be done in a more concise way, but that would be harder to follow, which is why I provided verbose and simple code.
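
For the record, a more compact version could look like this (a sketch that keeps only the timestamps per id and takes max - min, which gives the same result as long as the earliest event per id is the one you want to subtract from):

data.flatMap( x => x.ids.map( id => ( id, x.time ) ) )   // (id, time) pairs
    .groupByKey()                                        // id -> all its timestamps
    .mapValues( times => if ( times.size < 2 ) None else Some( times.max - times.min ) )

This avoids sorting whole lists of Data records, since only the earliest and latest timestamps per id matter.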