How can I batch process only a subset of the input records, i.e. just the first second out of every 3 seconds of batch time?

Time: 2016-01-22 01:16:58

Tags: apache-spark spark-streaming

If I set the batch duration of my StreamingContext to Seconds(1), like this:

val ssc = new StreamingContext(sc, Seconds(1))

then over 3 seconds I will receive 3 seconds of data, but I only need the data from the first second and can discard the following 2 seconds. So can I take the full 3 seconds to process just that first second of data?

1 Answer:

Answer 0: (score: 2)

You can do this with updateStateByKey if you keep a counter in the state, for example:

import org.apache.spark.SparkContext
import org.apache.spark.streaming.dstream.ConstantInputDStream
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamEveryThirdApp {

  def main(args: Array[String]) {
    val sc = new SparkContext("local[*]", "Streaming Test")
    implicit val ssc = new StreamingContext(sc, Seconds(1))
    ssc.checkpoint("./checkpoint")

    // generate stream
    val inputDStream = createConstantStream

    // increase seconds counter
    val accStream = inputDStream.updateStateByKey(updateState)

    // keep only 1st second records
    val firstOfThree = accStream.filter { case (key, (value, counter)) => counter == 1}

    firstOfThree.print()

    ssc.start()
    ssc.awaitTermination()

  }

  def updateState: (Seq[Int], Option[(Option[Int], Int)]) => Option[(Option[Int], Int)] = {
    case(values, state) =>
      state match {
        // If no previous state, i.e. set first Second
        case None => Some(Some(values.sum), 1)
        // If this is 3rd second - remove state
        case Some((prevValue, 3)) => None
        // If this is not the first second - increase seconds counter, but don't calculate values
        case Some((prevValue, counter)) => Some((None, counter + 1))
    }
  }

  def createConstantStream(implicit ssc: StreamingContext): ConstantInputDStream[(String, Int)] = {
    val seq = Seq(
      ("key1", 1),
      ("key2", 3),
      ("key1", 2),
      ("key1", 2)
    )
    val rdd = ssc.sparkContext.parallelize(seq)
    val inputDStream = new ConstantInputDStream(ssc, rdd)
    inputDStream
  }
}

If your data carries time information, you can also use a 3-second window, stream.window(Seconds(3), Seconds(3)), and filter records based on the time information inside the data; this is usually the preferred approach.
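
As a rough sketch of that windowed alternative (assuming each record carries its own event timestamp; the socketTextStream source, the "key,timestamp,value" line format and the modulo-based time filter are illustrative assumptions, not part of the original answer):

import org.apache.spark.SparkContext
import org.apache.spark.streaming.dstream.DStream
import org.apache.spark.streaming.{Seconds, StreamingContext}

object WindowFirstSecondApp {

  def main(args: Array[String]) {
    val sc = new SparkContext("local[*]", "Window Test")
    val ssc = new StreamingContext(sc, Seconds(1))

    // Hypothetical source: lines of "key,eventTimeMillis,value" -- replace with your real input
    val parsed: DStream[(String, (Long, Int))] =
      ssc.socketTextStream("localhost", 9999).map { line =>
        val Array(key, ts, value) = line.split(",")
        (key, (ts.toLong, value.toInt))
      }

    // Collect 3 seconds of data at a time (window length 3s, slide 3s),
    // then keep only records whose event time falls in the first second of each window
    val firstSecondOnly = parsed
      .window(Seconds(3), Seconds(3))
      .filter { case (_, (eventTimeMillis, _)) =>
        // Assumed rule: event-time windows are aligned to multiples of 3 seconds
        eventTimeMillis % 3000 < 1000
      }

    firstSecondOnly.print()

    ssc.start()
    ssc.awaitTermination()
  }
}

Note that the window length and slide interval must both be multiples of the batch interval, which Seconds(3) satisfies here with a 1-second batch duration.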