Question

I am developing a Spark non-streaming (batch) application in which I have to consume a specific set of offsets. I am reading the data as follows:

val startingOffsets = """{"topic_name": { "0": 33490, "1": 557900, "2": -2} }"""
val endingOffsets =  """{"topic_name": { "0": 33495, "1": 557905, "2": -1} }"""

val df = sparkSession
        .read
        .format("org.apache.spark.sql.kafka010.KafkaSourceProvider")
        .option("kafka.bootstrap.servers", "kafka.brokers".getConfigValue) 
        .option("subscribe", "kafka.devicelocationdatatopic".getConfigValue) 
        .option("startingOffsets", "kafka.startingOffsets".getConfigValue)
        .option("endingOffsets", "kafka.endingOffsets".getConfigValue)
        .option("failOnDataLoss", "false") // any failure regarding data loss in topic or else, not supposed to fail, it has to continue...
        .option("maxOffsetsPerTrigger", "3") // any change please remove the checkpoint folder
        .load()

and writing it to Cassandra as follows:

df
.write
.cassandraFormat(
"tbl_name",
"cassandra.keyspace".getConfigValue,
"cassandra.clustername".getConfigValue
 ).mode(SaveMode.Append).option("checkpointLocation", checkpointDirectory).save()

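Now, my first question is how to control the number of offsets read per batch. I set "maxOffsetsPerTrigger" for this, but it has no effect. Second, if I want to stop my job partway through, how can I stop it gracefully at that specific batch?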

I tried using

sc.stop()

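but it only takes effect once the whole application has finished executing. I want to stop the application cleanly in the middle of a run. Say I have 10 batches: if I want to stop after 5 of them have executed, it should stop right there. How can I generate this stop trigger during execution?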
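To make the two requirements concrete, here is a minimal sketch of one possible approach, not a confirmed solution. It assumes that "maxOffsetsPerTrigger" is a rate limit for streaming queries only, so a batch read would have to be split by hand into smaller startingOffsets/endingOffsets ranges, with a stop check between ranges. The names `OffsetChunk`, `ChunkedKafkaBatch`, `split`, `runUntilStopped` and `markerFileExists` are hypothetical, and the sketch assumes concrete offsets (the -1 and -2 sentinels would need to be resolved first).

```scala
import java.nio.file.{Files, Paths}

// Offsets for one chunk of a single topic: partition -> offset.
final case class OffsetChunk(starting: Map[Int, Long], ending: Map[Int, Long])

object ChunkedKafkaBatch {

  // Render a partition -> offset map in the JSON shape used by the
  // startingOffsets / endingOffsets options in the question.
  def toOffsetJson(topic: String, offsets: Map[Int, Long]): String = {
    val body = offsets.toSeq.sortBy(_._1)
      .map { case (partition, offset) => "\"" + partition + "\":" + offset }
      .mkString(",")
    "{\"" + topic + "\":{" + body + "}}"
  }

  // Split [start, end) into consecutive chunks holding at most
  // maxPerPartition offsets per partition. Assumes concrete offsets only.
  def split(start: Map[Int, Long], end: Map[Int, Long], maxPerPartition: Long): Seq[OffsetChunk] = {
    require(maxPerPartition > 0, "maxPerPartition must be positive")
    require(start.keySet == end.keySet, "start and end must cover the same partitions")
    require(start.forall { case (p, s) => s >= 0 && end(p) >= s }, "offsets must be concrete and ordered")

    // The widest partition range decides how many chunks are needed.
    val widest = start.map { case (p, s) => end(p) - s }.foldLeft(0L)((a, b) => math.max(a, b))
    val chunkCount = ((widest + maxPerPartition - 1) / maxPerPartition).toInt

    (0 until chunkCount).map { i =>
      // Partitions that are already exhausted get an empty range (from == to).
      val from = start.map { case (p, s) => p -> math.min(s + i * maxPerPartition, end(p)) }
      val to = start.map { case (p, s) => p -> math.min(s + (i + 1) * maxPerPartition, end(p)) }
      OffsetChunk(from, to)
    }
  }

  // Process chunks in order, checking for a stop request before each one.
  // A chunk that has started always finishes. Returns the number of chunks run.
  def runUntilStopped(chunks: Seq[OffsetChunk], stopRequested: () => Boolean)(process: OffsetChunk => Unit): Int = {
    var done = 0
    val it = chunks.iterator
    while (it.hasNext && !stopRequested()) {
      process(it.next())
      done += 1
    }
    done
  }

  // Hypothetical stop signal: the presence of a marker file on the driver's local disk.
  def markerFileExists(path: String): Boolean = Files.exists(Paths.get(path))
}
```

Inside `process`, the read from the question would be issued once per chunk, passing `toOffsetJson(topic, chunk.starting)` and `toOffsetJson(topic, chunk.ending)` as the startingOffsets and endingOffsets options, followed by the Cassandra write. The stop check could be something like `() => ChunkedKafkaBatch.markerFileExists("/tmp/stop_kafka_batch")`, where the path is only an example; on YARN the driver may run on any node, so a check against a shared location such as HDFS would probably be needed instead.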

Gracefully stopping a Spark batch job on YARN

0 answers: