我试图在序列化的上下文中理解Spark Streaming的RDD转换和检查点。请考虑以下示例Spark Streaming app:
private val helperObject = HelperObject()
private def createStreamingContext(): StreamingContext = {
val conf = new SparkConf()
.setAppName(Constants.SparkAppName)
.setIfMissing("spark.master", Constants.SparkMasterDefault)
implicit val streamingContext = new StreamingContext(
new SparkContext(conf),
Seconds(Constants.SparkStreamingBatchSizeDefault))
val myStream = StreamUtils.createStream()
myStream.transform(transformTest(_)).print()
streamingContext
}
def transformTest(rdd: RDD[String]): RDD[String] = {
rdd.map(str => helperObject.doSomething(str))
}
val ssc = StreamingContext.getOrCreate(Settings.progressDir,
createStreamingContext)
ssc.start()
while (true) {
helperObject.setData(...)
}
根据我在其他SO帖子中阅读的内容,在驱动程序上将会在流式启动后的每个批次中调用transformTest
一次。假设调用了createStreamingContext
(没有可用的检查点),我希望每批定义的helperObject
实例将被序列化为工作者一次,因此通过{获取应用于它的更改{1}}。 是这种情况吗?
现在,如果调用helperObject.setData(...)
而不是(检查点 可用),那么我希望createStreamingContext
的实例不可能因为如果helperObject
未执行,就无法捕获,因此每个批次都会被选中。 Spark Streaming必须序列化createStreamingContext
作为检查点的一部分,对吗?
使用检查点时,是否可以在驱动程序执行期间更新helperObject
?如果是这样,最好的方法是什么?
答案 0 :(得分:0)
如果要将helperObject序列化到每个执行者?
答:是的。
val helperObject = Instantiate_SomeHow()
rdd.map{_.SomeFunctionUsing(helperObject)}
Spark Streaming必须将序列化的helperObject作为检查点的一部分,对吗?
答案是。
如果您希望刷新每个RDD操作的helperObject
行为,您仍然可以通过使helperObject
更加智能化而不是直接发送helperObject
但通过具有以下签名() => helperObject_Class
。
因为它是一个可序列化的函数。这是一种非常常见的设计模式,用于发送不可序列化的对象,例如database connection object
或您有趣的用例。
示例来自Kafka Exactly once semantics using database
package example
import kafka.serializer.StringDecoder
import kafka.common.TopicAndPartition
import kafka.message.MessageAndMetadata
import scalikejdbc._
import com.typesafe.config.ConfigFactory
import org.apache.spark.{SparkContext, SparkConf, TaskContext}
import org.apache.spark.SparkContext._
import org.apache.spark.streaming._
import org.apache.spark.streaming.dstream.InputDStream
import org.apache.spark.streaming.kafka.{KafkaUtils, HasOffsetRanges, OffsetRange}
/** exactly-once semantics from kafka, by storing offsets in the same transaction as the results
Offsets and results will be stored per-batch, on the driver
*/
object TransactionalPerBatch {
def main(args: Array[String]): Unit = {
val conf = ConfigFactory.load
val kafkaParams = Map(
"metadata.broker.list" -> conf.getString("kafka.brokers")
)
val jdbcDriver = conf.getString("jdbc.driver")
val jdbcUrl = conf.getString("jdbc.url")
val jdbcUser = conf.getString("jdbc.user")
val jdbcPassword = conf.getString("jdbc.password")
val ssc = setupSsc(kafkaParams, jdbcDriver, jdbcUrl, jdbcUser, jdbcPassword)()
ssc.start()
ssc.awaitTermination()
}
def setupSsc(
kafkaParams: Map[String, String],
jdbcDriver: String,
jdbcUrl: String,
jdbcUser: String,
jdbcPassword: String
)(): StreamingContext = {
val ssc = new StreamingContext(new SparkConf, Seconds(60))
SetupJdbc(jdbcDriver, jdbcUrl, jdbcUser, jdbcPassword)
// begin from the the offsets committed to the database
val fromOffsets = DB.readOnly { implicit session =>
sql"select topic, part, off from txn_offsets".
map { resultSet =>
TopicAndPartition(resultSet.string(1), resultSet.int(2)) -> resultSet.long(3)
}.list.apply().toMap
}
val stream: InputDStream[(String,Long)] = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder, (String, Long)](
ssc, kafkaParams, fromOffsets,
// we're just going to count messages per topic, don't care about the contents, so convert each message to (topic, 1)
(mmd: MessageAndMetadata[String, String]) => (mmd.topic, 1L))
stream.foreachRDD { rdd =>
// Note this block is running on the driver
// Cast the rdd to an interface that lets us get an array of OffsetRange
val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
// simplest possible "metric", namely a count of messages per topic
// Notice the aggregation is done using spark methods, and results collected back to driver
val results = rdd.reduceByKey {
// This is the only block of code running on the executors.
// reduceByKey did a shuffle, but that's fine, we're not relying on anything special about partitioning here
_+_
}.collect
// Back to running on the driver
// localTx is transactional, if metric update or offset update fails, neither will be committed
DB.localTx { implicit session =>
// store metric results
results.foreach { pair =>
val (topic, metric) = pair
val metricRows = sql"""
update txn_data set metric = metric + ${metric}
where topic = ${topic}
""".update.apply()
if (metricRows != 1) {
throw new Exception(s"""
Got $metricRows rows affected instead of 1 when attempting to update metrics for $topic
""")
}
}
// store offsets
offsetRanges.foreach { osr =>
val offsetRows = sql"""
update txn_offsets set off = ${osr.untilOffset}
where topic = ${osr.topic} and part = ${osr.partition} and off = ${osr.fromOffset}
""".update.apply()
if (offsetRows != 1) {
throw new Exception(s"""
Got $offsetRows rows affected instead of 1 when attempting to update offsets for
${osr.topic} ${osr.partition} ${osr.fromOffset} -> ${osr.untilOffset}
Was a partition repeated after a worker failure?
""")
}
}
}
}
ssc
}
}