Question

I want to load all records from a Kafka topic using Spark, but every example I have seen uses Spark Streaming. How can I load the messages from Kafka just once?

Answer 1

The exact steps are listed in the official documentation, for example:

val df = spark
  .read
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
  .option("subscribePattern", "topic.*")
  .option("startingOffsets", "earliest")
  .option("endingOffsets", "latest")
  .load()

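However, if the source is a continuous stream, "all records" is rather ill-defined, because the result depends on the point in time at which the query is executed.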
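One way to make the result reproducible is to pin the upper bound to explicit per-partition offsets instead of "latest". The Kafka source accepts a JSON string for endingOffsets; the sketch below only builds such a string. The helper name offsetsJson, the topic name and the offset values are hypothetical and would have to come from your own bookkeeping.

```scala
object OffsetsJson {
  // Builds the JSON form accepted by the startingOffsets / endingOffsets options,
  // e.g. {"topicA":{"0":100,"1":250}}.
  // Outer key: topic name; inner key: partition number; value: offset.
  def offsetsJson(offsets: Map[String, Map[Int, Long]]): String =
    offsets.toSeq.sortBy(_._1).map { case (topic, partitions) =>
      val body = partitions.toSeq.sortBy(_._1)
        .map { case (partition, offset) => "\"" + partition + "\":" + offset }
        .mkString(",")
      "\"" + topic + "\":{" + body + "}"
    }.mkString("{", ",", "}")

  def main(args: Array[String]): Unit = {
    // Hypothetical offsets for a two-partition topic.
    val json = offsetsJson(Map("topicA" -> Map(0 -> 100L, 1 -> 250L)))
    println(json)
  }
}
```

The resulting string can be passed as .option("endingOffsets", json) in place of "latest". Every partition being read needs an entry, and within endingOffsets an offset of -1 stands for "latest".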

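Also keep in mind that read parallelism is limited by the number of partitions of the Kafka topic, so you have to be careful not to overwhelm the cluster.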
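A minimal sketch of one way to deal with this, assuming the spark-sql-kafka-0-10 package is on the classpath: read the topic in batch mode, then repartition before any heavy downstream work so that later stages are not tied to the Kafka partition count. The topic name, the partition count of 64 and the output path are hypothetical placeholders, not recommendations.

```scala
import org.apache.spark.sql.SparkSession

object KafkaBatchLoad {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("kafka-batch-load").getOrCreate()

    val df = spark.read
      .format("kafka")
      .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
      .option("subscribe", "topic1") // hypothetical topic name
      .option("startingOffsets", "earliest")
      .option("endingOffsets", "latest")
      .load()

    // By default the read uses one Spark partition per Kafka topic partition.
    println(s"partitions after load: ${df.rdd.getNumPartitions}")

    val records = df
      // key and value arrive as binary columns; cast them for readability
      .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
      // spread the data out for downstream stages (64 is an arbitrary example)
      .repartition(64)

    records.write.mode("overwrite").parquet("/tmp/kafka-dump") // hypothetical path

    spark.stop()
  }
}
```

Repartitioning only helps the stages after the read; the read itself is still bounded by the topic's partitions. Newer Spark versions also document a minPartitions option for the Kafka source, so it is worth checking the integration guide for the version you run.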
