How to add an increasing integer ID to the items in a Spark DStream

Asked: 2015-01-22 02:17:53

Tags: apache-spark spark-streaming

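I am developing a Spark Streaming application and would like every item in my data stream to have a global numeric ID. Getting an ID that is local to an interval/RDD is trivial: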

dstream.transform(_.zipWithIndex).map(_.swap)

This results in a DStream like:

// key:  0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8  ||  0 | 1 | 2 | 3 | 4  ||  0
// val:  a | b | c | d | e | f | g | h | i  ||  j | k | l | m | n  ||  o

(The double bar || marks the beginning of a new RDD.)
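To make the restart of the numbering concrete, here is a minimal sketch in plain Scala, with no Spark involved. Each inner Seq stands in for one interval's RDD; the names LocalIndexDemo and indexBatch are made up for illustration and are not part of any Spark API.

```scala
// Plain-Scala stand-in for dstream.transform(_.zipWithIndex).map(_.swap).
// Assumption: one Seq per interval plays the role of one RDD.
object LocalIndexDemo {
  // Index one batch on its own, so the numbering restarts at 0 every time.
  // Seq.zipWithIndex yields Int indices; they are widened to Long here
  // to mirror the Long indices of RDD.zipWithIndex.
  def indexBatch[A](batch: Seq[A]): Seq[(Long, A)] =
    batch.zipWithIndex.map { case (item, i) => (i.toLong, item) }

  def main(args: Array[String]): Unit = {
    val batches = Seq("abcdefghi".toSeq, "jklmn".toSeq, "o".toSeq)
    // Print one line per simulated interval.
    batches.map(indexBatch).foreach(b => println(b.mkString(" | ")))
  }
}
```

Every call to indexBatch knows nothing about earlier batches, which is exactly why the keys start at 0 again after each ||.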

What I ultimately want is:

// key:  0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8  ||  9 | 10 | 11 | 12 | 13  ||  14
// val:  a | b | c | d | e | f | g | h | i  ||  j |  k |  l |  m |  n  ||   o
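The target numbering amounts to shifting each interval's local index by the number of items in all earlier intervals. The following is only a plain-Scala sketch of that arithmetic on in-memory collections (GlobalIndexDemo and indexGlobally are hypothetical names); it says nothing about how to do this efficiently in Spark.

```scala
// Assumption: all batches are available at once as local Seqs, which is
// not the case in a real stream; this only illustrates the intended result.
object GlobalIndexDemo {
  def indexGlobally[A](batches: Seq[Seq[A]]): Seq[Seq[(Long, A)]] = {
    // Running totals of the earlier batch sizes: 0, size0, size0 + size1, ...
    val offsets = batches.map(_.size.toLong).scanLeft(0L)(_ + _)
    // zip drops the final, unused total.
    batches.zip(offsets).map { case (batch, offset) =>
      batch.zipWithIndex.map { case (item, i) => (offset + i, item) }
    }
  }
}
```

With batch sizes 9, 5 and 1 the offsets are 0, 9 and 14, which matches the keys in the diagram above.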

How can I do this in a safe and performant way?

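This seems like a trivial task, but I find it hard to keep state between RDDs (state = "number of items seen so far"). Below are the two approaches I tried. Both use updateStateByKey with a dummy key to keep track of the number of items seen so far (plus the number of items in the current interval):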

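import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.dstream.DStream
// On older Spark versions this import may also be needed for updateStateByKey:
// import org.apache.spark.streaming.StreamingContext._

// (items seen before this interval, items seen including this interval).
// The alias is not shown in the original snippet; it is inferred from the usage below.
type ItemCount = (Long, Long)

// inputStream is assumed to be the locally indexed DStream[(Long, Char)] from above.
val intervalItemCounts = inputStream.count().map((1, _))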
// intervalItemCounts looks like:
// K:  1                                  ||  1                  ||  1
// V:  9                                  ||  5                  ||  1

val updateCountState: (Seq[Long], Option[ItemCount]) => Option[ItemCount] =
  (itemCounts, maybePreviousState) => {
    val previousState = maybePreviousState.getOrElse((0L, 0L))
    val previousItemCount = previousState._2
    Some((previousItemCount, previousItemCount + itemCounts.head))
  }
val totalNumSeenItems: DStream[ItemCount] = intervalItemCounts.
  updateStateByKey(updateCountState).map(_._2)
// totalNumSeenItems looks like:
// V:  (0,9)                              ||  (9,14)             ||  (14,15)

// The first approach uses a cartesian product with the
// 1-element state DStream. (Is this performant?)
val increaseRDDIndex1: (RDD[(Long, Char)], RDD[ItemCount]) =>
  RDD[(Long, Char)] =
  (streamData, totalCount) => {
    val product = streamData.cartesian(totalCount)
    product.map(dataAndOffset => {
      val ((localIndex: Long, data: Char),
           (offset: Long, _)) = dataAndOffset
      (localIndex + offset, data)
    })
  }
val globallyIndexedItems1: DStream[(Long, Char)] = inputStream.
  transformWith(totalNumSeenItems, increaseRDDIndex1)

// The second approach uses a take() output operation on the
// 1-element state DStream beforehand. (Is this valid?? Will
// the closure be serialized and shipped in every interval?)
val increaseRDDIndex2: (RDD[(Long, Char)], RDD[ItemCount]) =>
  RDD[(Long, Char)] = (streamData, totalCount) => {
    val offset = totalCount.take(1).head._1
    streamData.map(keyValue => (keyValue._1 + offset, keyValue._2))
  }
val globallyIndexedItems2: DStream[(Long, Char)] = inputStream.
  transformWith(totalNumSeenItems, increaseRDDIndex2)
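To check the state logic in isolation, here is a small plain-Scala sketch that feeds the per-interval counts 9, 5 and 1 through the same update function and then shifts a batch by the resulting offset. No Spark is involved; StateDemo, runStates and shift are hypothetical helper names, and using itemCounts.sum instead of itemCounts.head is my own substitution so that an empty Seq does not throw.

```scala
object StateDemo {
  // (items seen before this interval, items seen including this interval)
  type ItemCount = (Long, Long)

  // Same idea as updateCountState above, with .sum in place of .head.
  val updateCountState: (Seq[Long], Option[ItemCount]) => Option[ItemCount] =
    (itemCounts, maybePreviousState) => {
      val previousTotal = maybePreviousState.map(_._2).getOrElse(0L)
      Some((previousTotal, previousTotal + itemCounts.sum))
    }

  // Feed one count per interval through the update function, carrying the
  // state along, roughly as updateStateByKey does for the single dummy key.
  def runStates(counts: Seq[Long]): Seq[ItemCount] =
    counts
      .scanLeft(Option.empty[ItemCount])((state, c) => updateCountState(Seq(c), state))
      .flatten // drops the initial None

  // Shift a locally indexed batch by the offset stored in its state.
  def shift[A](batch: Seq[(Long, A)], state: ItemCount): Seq[(Long, A)] =
    batch.map { case (localIndex, item) => (localIndex + state._1, item) }
}
```

Both approaches above end up applying the equivalent of shift to each interval; they differ only in how the one-element state reaches the data (a cartesian product versus take(1) inside the transform function).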

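Both approaches give the correct result (using a local[*] master), but I am wondering about performance (shuffles etc.), whether they work in a truly distributed environment, and whether this shouldn't be a lot easier than that...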

0 answers:

No answers