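I have a standalone Spark cluster that reads data from a Kafka queue. The Kafka queue has 5 partitions, but Spark only processes data from one of them. My setup is as follows.
Here are my Maven dependencies: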
<dependencies>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming-kafka-0-10_2.11</artifactId>
<version>2.0.2</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming_2.11</artifactId>
<version>2.0.2</version>
</dependency>
<dependency>
<groupId>kafka-custom</groupId>
<artifactId>kafka-clients</artifactId>
<version>0.10.1.1</version>
</dependency>
</dependencies>
My Kafka producer is a simple one that just puts 100 messages on the queue:
public void generateMessages() {
// Define the properties for the Kafka Connection
Properties props = new Properties();
props.put("bootstrap.servers", kafkaBrokerServer); // kafka server
props.put("acks", "all");
props.put("retries", 0);
props.put("batch.size", 16384);
props.put("linger.ms", 1);
props.put("buffer.memory", 33554432);
props.put("key.serializer",
"org.apache.kafka.common.serialization.StringSerializer");
props.put("value.serializer",
"org.apache.kafka.common.serialization.StringSerializer");
// Create a KafkaProducer using the Kafka Connection properties
KafkaProducer<String, String> producer = new KafkaProducer<String, String>(
props);
for (int i = 0; i < 100; i++) {
ProducerRecord<String, String> record = new ProducerRecord<>(kafkaTopic, "value-" + i);
producer.send(record);
}
producer.close();
}
Here is the main code of my Spark Streaming job:
public void processKafka() throws InterruptedException {
LOG.info("************ SparkStreamingKafka.processKafka start");
// Create the spark application
SparkConf sparkConf = new SparkConf();
sparkConf.set("spark.executor.cores", "5");
//To express any Spark Streaming computation, a StreamingContext object needs to be created.
//This object serves as the main entry point for all Spark Streaming functionality.
//This creates the Spark streaming context with a batch interval of 'sparkBatchInterval' seconds
jssc = new JavaStreamingContext(sparkConf, Durations.seconds(sparkBatchInterval));
//List of parameters
Map<String, Object> kafkaParams = new HashMap<>();
kafkaParams.put("bootstrap.servers", this.getBrokerList());
kafkaParams.put("client.id", "SpliceSpark");
kafkaParams.put("group.id", "mynewgroup");
kafkaParams.put("auto.offset.reset", "earliest");
kafkaParams.put("enable.auto.commit", false);
kafkaParams.put("key.deserializer", StringDeserializer.class);
kafkaParams.put("value.deserializer", StringDeserializer.class);
List<TopicPartition> topicPartitions= new ArrayList<TopicPartition>();
for(int i=0; i<5; i++) {
topicPartitions.add(new TopicPartition("mytopic", i));
}
//List of kafka topics to process
Collection<String> topics = Arrays.asList(this.getTopicList().split(","));
JavaInputDStream<ConsumerRecord<String, String>> messages = KafkaUtils.createDirectStream(
jssc,
LocationStrategies.PreferConsistent(),
ConsumerStrategies.<String, String>Subscribe(topics, kafkaParams)
);
//Another version of an attempt
/*
JavaInputDStream<ConsumerRecord<String, String>> messages = KafkaUtils.createDirectStream(
jssc,
LocationStrategies.PreferConsistent(),
ConsumerStrategies.<String, String>Assign(topicPartitions, kafkaParams)
);
*/
messages.foreachRDD(new PrintRDDDetails());
// Start running the job to receive and transform the data
jssc.start();
//Allows the current thread to wait for the termination of the context by stop() or by an exception
jssc.awaitTermination();
}
The call method of PrintRDDDetails contains the following:
public void call(JavaRDD<ConsumerRecord<String, String>> rdd)
throws Exception {
LOG.error("--- New RDD with " + rdd.partitions().size()
+ " partitions and " + rdd.count() + " records");
}
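What appears to happen is that it only gets data from one partition. I have confirmed in Kafka that there are 5 partitions. When the call method executes, it prints the correct number of partitions, but only the record count of 1 partition. The further processing that I removed from this simplified code also shows that only 1 partition is being processed.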
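One way to narrow this down is to look at the offset range each Kafka partition received in a batch. In the `kafka010` integration the RDD can be cast to `HasOffsetRanges`, and each `OffsetRange` exposes `partition()`, `fromOffset()` and `untilOffset()`. The sketch below is a plain-Java stand-in for that check so it can run without a cluster. `PartitionDiagnostics` and `Range` are hypothetical names that are not part of the job above, and the sample values are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of the original job.
final class PartitionDiagnostics {

    // Stand-in for the (partition, fromOffset, untilOffset) triple of an offset range.
    static final class Range {
        final int partition;
        final long fromOffset;  // inclusive
        final long untilOffset; // exclusive

        Range(int partition, long fromOffset, long untilOffset) {
            this.partition = partition;
            this.fromOffset = fromOffset;
            this.untilOffset = untilOffset;
        }

        // Number of records covered by this range.
        long count() {
            return untilOffset - fromOffset;
        }
    }

    // Returns the partitions whose range contains no records in this batch.
    static List<Integer> emptyPartitions(List<Range> ranges) {
        List<Integer> empty = new ArrayList<>();
        for (Range r : ranges) {
            if (r.count() == 0) {
                empty.add(r.partition);
            }
        }
        return empty;
    }

    // Sums the record counts of all ranges in this batch.
    static long totalRecords(List<Range> ranges) {
        long total = 0;
        for (Range r : ranges) {
            total += r.count();
        }
        return total;
    }

    public static void main(String[] args) {
        // Made-up batch: 5 partitions, only partition 2 delivers records.
        List<Range> batch = Arrays.asList(
                new Range(0, 0, 0),
                new Range(1, 0, 0),
                new Range(2, 0, 100),
                new Range(3, 0, 0),
                new Range(4, 0, 0));
        System.out.println("records in batch: " + totalRecords(batch));
        System.out.println("partitions with no records: " + emptyPartitions(batch));
    }
}
```

In the real job the ranges would come from `((HasOffsetRanges) rdd.rdd()).offsetRanges()` inside `call`. If four of the five ranges stay empty in every batch while the topic holds data in all partitions, the consumer side is the likely culprit rather than the producer.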
Answer 0 (score: 4)
This appears to be an issue with Spark 2.1.0, because it uses v0.10.1 of the kafka-clients library (according to the following pull request):
https://github.com/apache/spark/pull/16278
I worked around it by using a newer version of the kafka-clients library:
libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-core" % sparkVersion,
"org.apache.spark" %% "spark-streaming" % sparkVersion,
"org.apache.spark" %% "spark-sql" % sparkVersion,
"org.apache.spark" %% "spark-streaming-kinesis-asl" % sparkVersion,
"org.apache.spark" % "spark-streaming-kafka-0-10_2.11" % sparkVersion
).map(_.exclude("org.apache.kafka", "kafka-clients"))
libraryDependencies += "org.apache.kafka" % "kafka-clients" % "0.10.2.0"
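Since the question uses Maven rather than sbt, the same idea could be expressed roughly as follows. This is an untested sketch: it keeps the Spark version 2.0.2 from the question's pom and assumes the upstream `org.apache.kafka` coordinates, not the `kafka-custom` group ID used there.

```xml
<!-- Exclude the kafka-clients version pulled in transitively by the Spark Kafka integration -->
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-streaming-kafka-0-10_2.11</artifactId>
  <version>2.0.2</version>
  <exclusions>
    <exclusion>
      <groupId>org.apache.kafka</groupId>
      <artifactId>kafka-clients</artifactId>
    </exclusion>
  </exclusions>
</dependency>
<!-- Pin the newer client explicitly (version taken from the sbt snippet above) -->
<dependency>
  <groupId>org.apache.kafka</groupId>
  <artifactId>kafka-clients</artifactId>
  <version>0.10.2.0</version>
</dependency>
```

Running `mvn dependency:tree` afterwards should show which kafka-clients version actually ends up on the classpath.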