Spark Structured Streaming with Kafka - how to read from specific partitions of a topic and manage offsets

Date: 2019-05-29 15:48:18

Tags: apache-spark apache-kafka spark-streaming

I am new to Spark Structured Streaming with Kafka and to offset management, and I am using spark-streaming-kafka-0-10_2.11. As a consumer, how can I read from specific partitions of a topic?

comapany_df = sparkSession
                      .readStream()
                      .format("kafka")
                      .option("kafka.bootstrap.servers", applicationProperties.getProperty(BOOTSTRAP_SERVERS_CONFIG))
                      .option("subscribe", topicName)
                      .load();

I am using something like the above. How do I specify which partitions to read from?

1 Answer:

Answer 0 (score: 0)

You can read from specific Kafka partitions with the code below. The key is the ConsumerStrategies.Assign strategy, which takes an explicit list of TopicPartition objects instead of subscribing to whole topics.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka010.ConsumerStrategies;
import org.apache.spark.streaming.kafka010.KafkaUtils;
import org.apache.spark.streaming.kafka010.LocationStrategies;

public void processKafka() throws InterruptedException {
    LOG.info("************ SparkStreamingKafka.processKafka start");

    //Configure the Spark application
    SparkConf sparkConf = new SparkConf();
    sparkConf.set("spark.executor.cores", "5");

    //To express any Spark Streaming computation, a StreamingContext object needs to be created.
    //This object serves as the main entry point for all Spark Streaming functionality.
    //This creates the streaming context with a batch interval of 'sparkBatchInterval' seconds
    //(jssc is assumed to be a field of the enclosing class)
    jssc = new JavaStreamingContext(sparkConf, Durations.seconds(sparkBatchInterval));


    //Kafka consumer parameters; enable.auto.commit=false means offsets are not
    //committed automatically (see the manual-commit sketch further down)
    Map<String, Object> kafkaParams = new HashMap<>();
    kafkaParams.put("bootstrap.servers", this.getBrokerList());
    kafkaParams.put("client.id", "SpliceSpark");
    kafkaParams.put("group.id", "mynewgroup");
    kafkaParams.put("auto.offset.reset", "earliest");
    kafkaParams.put("enable.auto.commit", false);
    kafkaParams.put("key.deserializer", StringDeserializer.class);
    kafkaParams.put("value.deserializer", StringDeserializer.class);

    //The specific partitions to read: partitions 0-4 of "mytopic"
    List<TopicPartition> topicPartitions = new ArrayList<>();
    for (int i = 0; i < 5; i++) {
        topicPartitions.add(new TopicPartition("mytopic", i));
    }


    //Full list of Kafka topics (only needed by the Subscribe alternative below)
    Collection<String> topics = Arrays.asList(this.getTopicList().split(","));


    //Assign consumes only the partitions listed in 'topicPartitions',
    //which is what lets you read from specific partitions of a topic
    JavaInputDStream<ConsumerRecord<String, String>> messages = KafkaUtils.createDirectStream(
            jssc,
            LocationStrategies.PreferConsistent(),
            ConsumerStrategies.<String, String>Assign(topicPartitions, kafkaParams)
          );

    //Alternative: Subscribe consumes ALL partitions of the listed topics,
    //so it cannot be restricted to specific partitions
    /*
    JavaInputDStream<ConsumerRecord<String, String>> messages = KafkaUtils.createDirectStream(
            jssc,
            LocationStrategies.PreferConsistent(),
            ConsumerStrategies.<String, String>Subscribe(topics, kafkaParams)
          );
    */

    //PrintRDDDetails is a user-defined handler that processes each micro-batch RDD
    messages.foreachRDD(new PrintRDDDetails());
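    //Offset-management sketch (not part of the original answer, based on the
    //spark-streaming-kafka-0-10 integration guide's pattern): with
    //enable.auto.commit=false you can commit offsets back to Kafka yourself
    //once each batch has been processed. HasOffsetRanges, OffsetRange and
    //CanCommitOffsets all live in org.apache.spark.streaming.kafka010.
    /*
    messages.foreachRDD(rdd -> {
        OffsetRange[] offsetRanges = ((HasOffsetRanges) rdd.rdd()).offsetRanges();
        //... process the records in rdd here ...
        ((CanCommitOffsets) messages.inputDStream()).commitAsync(offsetRanges);
    });
    */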


    // Start running the job to receive and transform the data 
    jssc.start();

    //Allows the current thread to wait for the termination of the context by stop() or by an exception
    jssc.awaitTermination();
}
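
Note that the snippet in the question uses Structured Streaming (sparkSession.readStream()), not the DStream API above. The Structured Streaming Kafka source supports reading specific partitions directly through the "assign" option, which takes a JSON string mapping each topic to a list of partitions ("assign", "subscribe" and "subscribePattern" are mutually exclusive). A minimal sketch, where "mytopic" and the partition numbers 0 and 1 are placeholders to replace with your own:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

Dataset<Row> comapany_df = sparkSession
        .readStream()
        .format("kafka")
        .option("kafka.bootstrap.servers", applicationProperties.getProperty(BOOTSTRAP_SERVERS_CONFIG))
        //Read only partitions 0 and 1 of "mytopic"
        .option("assign", "{\"mytopic\":[0,1]}")
        //Optional per-partition starting offsets: -2 means earliest, -1 means latest
        .option("startingOffsets", "{\"mytopic\":{\"0\":-2,\"1\":-2}}")
        .load();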