I use the DirectKafkaStream API¹ to read data from Kafka, do some transformations, update a count, and then write the data back to Kafka. This is the piece of code actually under test:
kafkaStream[Key, Value]("test")
  .map(record => (record.key(), 1))
  .updateStateByKey[Int](
    (numbers: Seq[Int], state: Option[Int]) =>
      state match {
        case Some(s) => Some(s + numbers.length)
        case _ => Some(numbers.length)
      }
  )
  .checkpoint(this)("count") {
    case (save: (Key, Int), current: (Key, Int)) =>
      (save._1, save._2 + current._2)
  }
  .map(_._2)
  .reduce(_ + _)
  .map(count => (new Key, new Result[Long](count.toLong)))
  .toKafka(Key.Serializer.getClass.getName, Result.longKafkaSerializer.getClass.getName)
The checkpoint operator is an enrichment of the DStream API that I created. It is supposed to save one RDD of the given DStream, for one given Time, to HDFS using saveAsObjectFile. In practice, it saves the result of every 60th micro-batch (RDD) to HDFS.

Checkpoint does the following:
def checkpoint(processor: Streaming)(name: String)(
    mergeStates: (T, T) => T): DStream[T] = {
  val path = processor.configuration.get[String](
    "processing.spark.streaming.checkpoint-directory-prefix") + "/" +
    Reflection.canonical(processor.getClass) + "/" + name + "/"
  logInfo(s"Checkpoint base path is [$path].")

  processor.registerOperator(name)

  if (processor.fromCheckpoint && processor.restorationPoint.isDefined) {
    val restorePath = path + processor.restorationPoint.get.ID.stringify
    logInfo(s"Restoring from path [$restorePath].")
    checkpointData = context.objectFile[T](restorePath).cache()

    stream
      .transform((rdd: RDD[T], time: Time) => {
        val merged = rdd
          .union(checkpointData)
          .map[(Boolean, T)](record => (true, record))
          .reduceByKey(mergeStates)
          .map[T](_._2)

        processor.maybeCheckpoint(name, merged, time)

        merged
      })
  } else {
    stream
      .transform((rdd: RDD[T], time: Time) => {
        processor.maybeCheckpoint(name, rdd, time)

        rdd
      })
  }
}
The effective piece of code is the following:
dstream.transform((rdd: RDD[T], time: Time) => {
  processor.maybeCheckpoint(name, rdd, time)

  rdd
})
The dstream variable in the code above is the result of the previous operator, updateStateByKey, so transform is called right after updateStateByKey.
def maybeCheckpoint(name: String, rdd: RDD[_], time: Time) = {
  if (doCheckpoint(time)) {
    logInfo(s"Checkpointing for operator [$name] with RDD ID of [${rdd.id}].")
    val newPath = configuration.get[String](
      "processing.spark.streaming.checkpoint-directory-prefix") + "/" +
      Reflection.canonical(this.getClass) + "/" + name + "/" + checkpointBarcode
    logInfo(s"Saving new checkpoint to [$newPath].")
    rdd.saveAsObjectFile(newPath)
    registerCheckpoint(name, Operator(name), time)
    logInfo(s"Checkpoint completed for operator [$name].")
  }
}
As you can see, most of the code is just bookkeeping; effectively, it calls saveAsObjectFile.
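The doCheckpoint and checkpointBarcode helpers are not shown above. A minimal sketch of how a counter-based cadence gate could look, assuming hypothetical names (batchesSeen, checkpointEvery) that are not part of the original code:

import org.apache.spark.streaming.Time

// Sketch only: fires on every 60th call, matching the "every 60th
// micro-batch" cadence described above. The field names are assumptions,
// not the fields used by the real Streaming class.
trait CheckpointCadence {
  private var batchesSeen: Long = 0L
  private val checkpointEvery: Long = 60

  // `time` is unused in this simple counter-based variant.
  def doCheckpoint(time: Time): Boolean = {
    batchesSeen += 1
    batchesSeen % checkpointEvery == 0
  }
}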
The problem is that, even though the RDDs produced by updateStateByKey are supposed to be persisted automatically, when saveAsObjectFile is called on every Xth micro-batch, Spark recomputes everything from scratch, starting the streaming job from the beginning and reading everything from Kafka again. I have tried to force cache or persist, with different storage levels, on the DStreams as well as on the RDDs.
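For reference, a sketch of the kind of explicit persistence that can be forced on both the DStream and the per-batch RDDs; the helper name, call sites, and storage level below are illustrative assumptions, not the exact code that was tried:

import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.Time
import org.apache.spark.streaming.dstream.DStream

// `stateStream` stands for the DStream returned by updateStateByKey above.
def forcePersistence(stateStream: DStream[(Key, Int)]): DStream[(Key, Int)] = {
  stateStream.persist(StorageLevel.MEMORY_AND_DISK) // persist the whole DStream

  stateStream.transform((rdd: RDD[(Key, Int)], time: Time) => {
    rdd.persist(StorageLevel.MEMORY_AND_DISK)        // and each micro-batch RDD
    rdd
  })
}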
Micro-batches:

DAG of job 22:

DAG of the job that runs saveAsObjectFile:
What could be the problem?

Thank you!
¹ Using Spark 2.1.0.
Answer 0 (score: 3)
I believe that using transform to checkpoint periodically causes unexpected caching behaviour.

Instead, performing the periodic checkpoint with foreachRDD will keep the DAG stable enough to cache the RDDs effectively.

I'm almost positive that this was the solution to a similar issue we had a while ago.
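A minimal sketch of the foreachRDD variant described above, reusing the names processor, maybeCheckpoint, and Streaming from the question; the exact wiring is an assumption, not the asker's final code:

import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.Time
import org.apache.spark.streaming.dstream.DStream

// Keep the transformation chain side-effect free and do the periodic
// saveAsObjectFile (via maybeCheckpoint) as an output operation instead.
def checkpointWithForeachRDD[T](stream: DStream[T],
                                processor: Streaming,
                                name: String): DStream[T] = {
  stream.cache() // a stable, cached lineage so the RDDs can actually be reused

  stream.foreachRDD((rdd: RDD[T], time: Time) => {
    processor.maybeCheckpoint(name, rdd, time) // side effect only
  })

  stream
}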