Question

I have a stream of <id, action, timestamp, data> to process.

For example (for simplicity, assume there is only one id):

id     event         timestamp         
-------------------------------
1      A             1                 
1      B             2                 
1      C             4                 
1      D             7                 
1      E             15
1      F             16

Let TIMEOUT = 5. Because more than 5 seconds pass after D without any further event, I want to map this to a JavaPairDStream with two key:value pairs.

id1_1:
A             1                 
B             2                 
C             4                 
D             7

and

id1_2:
E             15
F             16

However, inside the anonymous function object, the PairFunction that I pass to the mapToPair() method,

incomingMessages.mapToPair(new PairFunction<String, String, RequestData>() {
private static final long serialVersionUID = 1L;

@Override
public Tuple2<String, RequestData> call(String s) {

I cannot reference the data in the next entry. In other words, when I am processing the entry for event D, I cannot look at the data for E.

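If this were not Spark, I could simply create an array timeDifferences that stores the differences between adjacent timestamps, and split the events into parts wherever a difference in timeDifferences is greater than TIMEOUT. (Although, actually, there is no need to create the array explicitly.)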
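The non-Spark approach described above can be sketched in plain Java as follows. This is only a minimal sketch: the names GapSplitter, Event and splitByGap are hypothetical and do not come from the original code, and it assumes the events belong to a single id and are already sorted by timestamp.

```java
import java.util.ArrayList;
import java.util.List;

final class GapSplitter {

    /** Hypothetical event type: an event name plus its timestamp in seconds. */
    static final class Event {
        final String name;
        final long timestamp;

        Event(String name, long timestamp) {
            this.name = name;
            this.timestamp = timestamp;
        }
    }

    /**
     * Splits events (assumed sorted by timestamp) into groups, starting a new
     * group whenever the gap to the previous event is greater than timeout.
     */
    static List<List<Event>> splitByGap(List<Event> sorted, long timeout) {
        List<List<Event>> groups = new ArrayList<>();
        List<Event> current = new ArrayList<>();
        Event previous = null;
        for (Event e : sorted) {
            // A gap larger than the timeout closes the current group.
            if (previous != null && e.timestamp - previous.timestamp > timeout) {
                groups.add(current);
                current = new ArrayList<>();
            }
            current.add(e);
            previous = e;
        }
        if (!current.isEmpty()) {
            groups.add(current);
        }
        return groups;
    }
}
```

With the six example events and a timeout of 5, this produces two groups: A to D, and E to F.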

How can I do this in Spark?

Answer 1

I am still struggling to understand your question, but based on what you have written, I think you can do it this way:

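  // Pair each record with its position: (index, (id, event, timestamp))
  val A = sc.parallelize(List((1,"A",1.0),(1,"B",2.0),(1,"C",15.0))).zipWithIndex.map(x=>(x._2,x._1))
  // Shift every index down by one, so that entry i+1 is keyed by i
  val B = A.map(x=>(x._1-1,x._2))
  // Join each entry with its successor; the gap is the next timestamp minus the current one
  val C = A.leftOuterJoin(B).map(x=>(x._2._1, x._2._2 match{
    case Some(next) => next._3 - x._2._1._3
    case _ => 0.0 // the last entry has no successor
  }))
  // Entries followed by a gap within the timeout, and entries followed by a longer gap
  val group1 = C.filter(x=>(x._2 <= 5))
  val group2 = C.filter(x=>(x._2 > 5))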

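So the concept is that you zip with index to create val A (which assigns a sequential Long number to each entry of your RDD), and duplicate the RDD with each index shifted onto the neighbouring entry to create val B (by subtracting 1 from the index), then use a join to work out the time gap between consecutive entries. Then use filter to compare that gap with TIMEOUT. This method uses RDDs throughout. An easier way would be to collect the entries to the master and use a map or a zipped mapping, but I don't think that would really be Spark.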
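To see what the shifted-index join computes, the same idea can be sketched without Spark in plain Java, with two in-memory maps standing in for the RDDs A and B. The names ShiftedIndexJoin and followingGaps are hypothetical, and the sketch assumes the gap is taken as the next timestamp minus the current one, with 0 for the last entry.

```java
import java.util.HashMap;
import java.util.Map;

final class ShiftedIndexJoin {

    /** For each position i, returns timestamps[i + 1] - timestamps[i], or 0 for the last entry. */
    static long[] followingGaps(long[] timestamps) {
        Map<Long, Long> byIndex = new HashMap<>(); // plays the role of A: index -> timestamp
        Map<Long, Long> shifted = new HashMap<>(); // plays the role of B: (index - 1) -> timestamp
        for (int i = 0; i < timestamps.length; i++) {
            byIndex.put((long) i, timestamps[i]);
            shifted.put((long) i - 1, timestamps[i]);
        }
        long[] gaps = new long[timestamps.length];
        for (int i = 0; i < timestamps.length; i++) {
            // "Left outer join" on the index: the successor may be missing for the last entry.
            Long next = shifted.get((long) i);
            gaps[i] = (next == null) ? 0L : next - byIndex.get((long) i);
        }
        return gaps;
    }
}
```

In Spark the two maps are distributed and the lookup becomes a leftOuterJoin, but the key trick is the same: after the shift, an entry and its successor share a key.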

Answer 2

I believe this does what you need:

import org.apache.spark.rdd.RDD

def splitToTimeWindows(input: RDD[Event], timeoutBetweenWindows: Long): RDD[Iterable[Event]] = {
    val withIndex: RDD[(Long, Event)] = input.sortBy(_.timestamp).zipWithIndex().map(_.swap).cache()
    val withIndexDrop1: RDD[(Long, Event)] = withIndex.map({ case (i, e) => (i-1, e)})

    // joining the two to attach a "followingGap" to each event
    val extendedEvents: RDD[ExtendedEvent] = withIndex.leftOuterJoin(withIndexDrop1).map({
       case (i, (current, Some(next))) => ExtendedEvent(current, next.timestamp - current.timestamp)
       case (i, (current, None)) => ExtendedEvent(current, 0) // last event has no following gap
    })

    // collecting (to driver memory!) cutoff points - timestamp of events that are *last* in their window
    // if this collection is very large, another join might be needed
    val cutoffPoints = extendedEvents.collect({ case e: ExtendedEvent if e.followingGap > timeoutBetweenWindows => e.event.timestamp }).distinct().collect()

    // going back to the original input, grouping by each event's nearest preceding cutoff point
    // (i.e. the last event of the previous window, or 0 for the first window)
    input.groupBy(e => cutoffPoints.filter(_ < e.timestamp).sortWith(_ > _).headOption.getOrElse(0L)).values
}

case class Event(timestamp: Long, data: String)

case class ExtendedEvent(event: Event, followingGap: Long)

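The first part builds on GameOfThrows's answer: joining the input with itself at an offset of 1 to compute the "followingGap" for each record. Then we collect the "breaks" or "cutoff points" between the windows, and perform another transformation on the input using these points to group it by window.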
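The final grouping step, choosing a key for each event from the collected cutoff points, can also be illustrated without Spark. The following is a rough sketch in plain Java; the names CutoffGrouping, windowKey and groupByWindow are hypothetical, and a TreeSet lookup is swapped in for the filter-and-sort used in the Scala code above. As in that code, events with no smaller cutoff point get the key 0.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

final class CutoffGrouping {

    /** Key of an event's window: the greatest cutoff point strictly below its timestamp, else 0. */
    static long windowKey(TreeSet<Long> cutoffPoints, long timestamp) {
        Long cutoff = cutoffPoints.lower(timestamp);
        return cutoff == null ? 0L : cutoff;
    }

    /** Groups timestamps by window key, keeping the keys in ascending order. */
    static Map<Long, List<Long>> groupByWindow(List<Long> timestamps, TreeSet<Long> cutoffPoints) {
        Map<Long, List<Long>> groups = new TreeMap<>();
        for (long t : timestamps) {
            groups.computeIfAbsent(windowKey(cutoffPoints, t), k -> new ArrayList<>()).add(t);
        }
        return groups;
    }
}
```

For the example in the question, the only cutoff point is 7 (event D is followed by a gap of 8), so timestamps 1, 2, 4 and 7 fall under key 0 and timestamps 15 and 16 fall under key 7.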

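NOTE: there may be more efficient ways to perform some of these transformations, depending on the characteristics of your input. For example, if you have many "sessions", this code may be slow or run out of memory.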

Referencing the next entry in an RDD within a map function
