Serialization of transform functions in checkpointing

Time: 2017-06-16 19:46:23

Tags: apache-spark spark-streaming

I'm trying to understand how Spark Streaming's RDD transformations and checkpointing interact with serialization. Consider the following example Spark Streaming app:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Driver-side state that the streaming transformation below closes over.
private val helperObject = HelperObject()

private def createStreamingContext(): StreamingContext = {
  val conf = new SparkConf()
    .setAppName(Constants.SparkAppName)
    .setIfMissing("spark.master", Constants.SparkMasterDefault)

  implicit val streamingContext = new StreamingContext(
    new SparkContext(conf),
    Seconds(Constants.SparkStreamingBatchSizeDefault))

  // Checkpointing must be enabled here for StreamingContext.getOrCreate to recover the context.
  streamingContext.checkpoint(Settings.progressDir)

  val myStream = StreamUtils.createStream()

  myStream.transform(transformTest(_)).print()

  streamingContext
}

def transformTest(rdd: RDD[String]): RDD[String] = {
  rdd.map(str => helperObject.doSomething(str))
}

val ssc = StreamingContext.getOrCreate(Settings.progressDir,
  createStreamingContext)

ssc.start()

// Meanwhile, the driver keeps updating helperObject.
while (true) {
  helperObject.setData(...)
}

From what I've read in other SO posts, transformTest will be invoked on the driver program once per batch after streaming starts. Assuming createStreamingContext is invoked (i.e. no checkpoint is available), I would expect the instance of helperObject defined above to be serialized out to the workers once per batch, and therefore to pick up the changes applied to it via helperObject.setData(...). Is that the case?

Now, if createStreamingContext is not invoked (because a checkpoint is available), then I would expect that the helperObject instance cannot possibly be picked up for each batch, since it is never captured if createStreamingContext is not executed. Spark Streaming must have serialized helperObject as part of the checkpoint, correct?

Is it possible to update helperObject from the driver program throughout execution when using checkpoints? If so, what's the best approach?

1 answer:

Answer 0 (score: 0)

Will helperObject be serialized out to each executor?

Answer: Yes.

// helperObject is captured by the map closure, so it is serialized out
// with the tasks and deserialized on each executor.
val helperObject = Instantiate_SomeHow()
rdd.map { _.SomeFunctionUsing(helperObject) }

Spark Streaming must have serialized helperObject as part of the checkpoint, correct?

Answer: Yes.

If you want helperObject's behaviour to be refreshed for each RDD operation, that is still possible: make the pattern a bit smarter and, instead of shipping helperObject itself, ship a factory function with the signature () => helperObject_Class, because a function like that is serializable. This is a very common design pattern for shipping non-serializable objects such as a database connection object, and it also covers your use case.
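
Here is a minimal sketch of that pattern applied to the question's transformTest. HelperObject and doSomething come from the question; the factory parameter createHelper and the use of mapPartitions are assumptions made for illustration:

  import org.apache.spark.rdd.RDD

  // Instead of capturing helperObject directly, capture a serializable factory.
  def transformTest(rdd: RDD[String], createHelper: () => HelperObject): RDD[String] = {
    rdd.mapPartitions { iter =>
      // Only the factory function travels with the closure; each executor
      // builds its own HelperObject when the partition is processed.
      val helper = createHelper()
      iter.map(str => helper.doSomething(str))
    }
  }

  // Wiring it up inside createStreamingContext (hypothetical usage):
  // myStream.transform(rdd => transformTest(rdd, () => HelperObject())).print()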

The example below is taken from Kafka Exactly once semantics using database:

  package example

  import kafka.serializer.StringDecoder
  import kafka.common.TopicAndPartition
  import kafka.message.MessageAndMetadata
  import scalikejdbc._
  import com.typesafe.config.ConfigFactory

  import org.apache.spark.{SparkContext, SparkConf, TaskContext}
  import org.apache.spark.SparkContext._
  import org.apache.spark.streaming._
  import org.apache.spark.streaming.dstream.InputDStream
  import org.apache.spark.streaming.kafka.{KafkaUtils, HasOffsetRanges, OffsetRange}

  /** exactly-once semantics from kafka, by storing offsets in the same transaction as the results
    Offsets and results will be stored per-batch, on the driver
    */
  object TransactionalPerBatch {
    def main(args: Array[String]): Unit = {
      val conf = ConfigFactory.load
      val kafkaParams = Map(
        "metadata.broker.list" -> conf.getString("kafka.brokers")
      )
      val jdbcDriver = conf.getString("jdbc.driver")
      val jdbcUrl = conf.getString("jdbc.url")
      val jdbcUser = conf.getString("jdbc.user")
      val jdbcPassword = conf.getString("jdbc.password")

      val ssc = setupSsc(kafkaParams, jdbcDriver, jdbcUrl, jdbcUser, jdbcPassword)()
      ssc.start()
      ssc.awaitTermination()

    }

    def setupSsc(
      kafkaParams: Map[String, String],
      jdbcDriver: String,
      jdbcUrl: String,
      jdbcUser: String,
      jdbcPassword: String
    )(): StreamingContext = {
      val ssc = new StreamingContext(new SparkConf, Seconds(60))

      SetupJdbc(jdbcDriver, jdbcUrl, jdbcUser, jdbcPassword)

      // begin from the offsets committed to the database
      val fromOffsets = DB.readOnly { implicit session =>
        sql"select topic, part, off from txn_offsets".
          map { resultSet =>
            TopicAndPartition(resultSet.string(1), resultSet.int(2)) -> resultSet.long(3)
          }.list.apply().toMap
      }

      val stream: InputDStream[(String,Long)] = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder, (String, Long)](
        ssc, kafkaParams, fromOffsets,
        // we're just going to count messages per topic, don't care about the contents, so convert each message to (topic, 1)
        (mmd: MessageAndMetadata[String, String]) => (mmd.topic, 1L))

      stream.foreachRDD { rdd =>
        // Note this block is running on the driver

        // Cast the rdd to an interface that lets us get an array of OffsetRange
        val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

        // simplest possible "metric", namely a count of messages per topic
        // Notice the aggregation is done using spark methods, and results collected back to driver
        val results = rdd.reduceByKey {
          // This is the only block of code running on the executors.
          // reduceByKey did a shuffle, but that's fine, we're not relying on anything special about partitioning here
          _+_
        }.collect

        // Back to running on the driver

        // localTx is transactional, if metric update or offset update fails, neither will be committed
        DB.localTx { implicit session =>
          // store metric results
          results.foreach { pair =>
            val (topic, metric) = pair
            val metricRows = sql"""
  update txn_data set metric = metric + ${metric}
    where topic = ${topic}
  """.update.apply()
            if (metricRows != 1) {
              throw new Exception(s"""
  Got $metricRows rows affected instead of 1 when attempting to update metrics for $topic
  """)
            }
          }

          // store offsets
          offsetRanges.foreach { osr =>
            val offsetRows = sql"""
  update txn_offsets set off = ${osr.untilOffset}
    where topic = ${osr.topic} and part = ${osr.partition} and off = ${osr.fromOffset}
  """.update.apply()
            if (offsetRows != 1) {
              throw new Exception(s"""
  Got $offsetRows rows affected instead of 1 when attempting to update offsets for
   ${osr.topic} ${osr.partition} ${osr.fromOffset} -> ${osr.untilOffset}
  Was a partition repeated after a worker failure?
  """)
            }
          }
        }
      }
      ssc
    }
  }