I am trying to create an Apache Spark job that consumes Kafka messages submitted to a topic. I submit messages to the topic with kafka-console-producer, as shown below.
./kafka-console-producer.sh --broker-list kafka1:9092 --topic my-own-topic
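To read the messages, I am using the spark-streaming-kafka-0-10_2.11 library. With it, I can read the total number of messages received on the topic. However, I cannot read the ConsumerRecord objects in the stream: as soon as I try to read one, the whole application blocks and nothing is printed to the console. Note that I am running Kafka, Zookeeper, and Spark in Docker containers. Any help is much appreciated.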
import java.util.Arrays;
import java.util.Collection;
import java.util.HashMap;
import java.util.Map;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.spark.SparkConf;
import org.apache.spark.TaskContext;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka010.ConsumerStrategies;
import org.apache.spark.streaming.kafka010.HasOffsetRanges;
import org.apache.spark.streaming.kafka010.KafkaUtils;
import org.apache.spark.streaming.kafka010.LocationStrategies;
import org.apache.spark.streaming.kafka010.OffsetRange;
public class SparkKafkaStreamingJDBCExample {
  public static void main(String[] args) {
    // Start a spark instance and get a context
    SparkConf conf =
        new SparkConf().setAppName("Study Spark").setMaster("spark://spark-master:7077");
    // Setup a streaming context.
    JavaStreamingContext streamingContext = new JavaStreamingContext(conf, Durations.seconds(3));
    // Create a map of Kafka params
    Map<String, Object> kafkaParams = new HashMap<String, Object>();
    // List of Kafka brokers to listen to.
    kafkaParams.put("bootstrap.servers", "kafka1:9092");
    kafkaParams.put("key.deserializer", StringDeserializer.class);
    kafkaParams.put("value.deserializer", StringDeserializer.class);
    kafkaParams.put("group.id", "use_a_separate_group_id_for_each_stream");
    // Do you want to start from the earliest record or the latest?
    kafkaParams.put("auto.offset.reset", "earliest");
    kafkaParams.put("enable.auto.commit", true);
    // List of topics to listen to.
    Collection<String> topics = Arrays.asList("my-own-topic");
    // Create a Spark DStream with the kafka topics.
    final JavaInputDStream<ConsumerRecord<String, String>> stream =
        KafkaUtils.createDirectStream(streamingContext, LocationStrategies.PreferConsistent(),
            ConsumerStrategies.<String, String>Subscribe(topics, kafkaParams));
    System.out.println("Study Spark Example Starting ....");
    stream.foreachRDD(rdd -> {
      if (rdd.isEmpty()) {
        System.out.println("RDD Empty " + rdd.count());
        return;
      } else {
        System.out.println("RDD not empty " + rdd.count());
        OffsetRange[] offsetRanges = ((HasOffsetRanges) rdd.rdd()).offsetRanges();
        System.out.println("Partition Id " + TaskContext.getPartitionId());
        OffsetRange o = offsetRanges[TaskContext.getPartitionId()];
        System.out.println("Topic " + o.topic());
        System.out.println("Creating RDD !!!");
        JavaRDD<ConsumerRecord<String, String>> r =
            KafkaUtils.createRDD(streamingContext.sparkContext(), kafkaParams, offsetRanges,
                LocationStrategies.PreferConsistent());
        System.out.println("Count " + r.count());
        // Application stuck from here onwards ...
        ConsumerRecord<String, String> first = r.first();
        System.out.println("First taken");
        System.out.println("First value " + first.value());
      }
    });
    System.out.println("Stream context starting ...");
    // Start streaming.
    streamingContext.start();
    System.out.println("Stream context started ...");
    try {
      System.out.println("Stream context await termination ...");
      streamingContext.awaitTermination();
    } catch (InterruptedException e) {
      e.printStackTrace();
    }
  }
}
Sample output is given below.
Study Spark Example Starting ....
Stream context starting ...
Stream context started ...
Stream context await termination ...
RDD Empty 0
RDD Empty 0
RDD Empty 0
RDD Empty 0
RDD not empty 3
Partition Id 0
Topic my-own-topic
Creating RDD !!!
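One possible cause I have not been able to rule out: Kafka's ConsumerRecord does not implement java.io.Serializable, so r.first(), which has to ship a record from an executor back to the driver, may fail or stall there, whereas count() only returns a number. The sketch below uses only the JDK to illustrate that difference. FakeRecord and SerializationCheck are hypothetical names made up for this illustration (FakeRecord is a stand-in, not the Kafka class), and the sketch assumes Spark's default Java serializer is in use.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.NotSerializableException;
import java.io.ObjectOutputStream;

final class SerializationCheck {

    // Hypothetical stand-in for a record type that, like Kafka's
    // ConsumerRecord, does not implement java.io.Serializable.
    static final class FakeRecord {
        private final String key;
        private final String value;

        FakeRecord(String key, String value) {
            this.key = key;
            this.value = value;
        }

        String key() { return key; }
        String value() { return value; }
    }

    // Returns true if Java serialization accepts the object, and false if it
    // rejects it with NotSerializableException.
    static boolean canSerialize(Object obj) {
        try (ObjectOutputStream out = new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(obj);
            return true;
        } catch (NotSerializableException e) {
            return false;
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        FakeRecord record = new FakeRecord("k", "hello");
        // The whole record cannot be serialized ...
        System.out.println("record serializable: " + canSerialize(record));        // false
        // ... but the String extracted from it can.
        System.out.println("value serializable: " + canSerialize(record.value())); // true
    }
}
```

If this is indeed the problem, extracting the value on the executors before calling an action that returns data to the driver, for example rdd.map(record -> record.value()).first() on the RDD handed to foreachRDD, might avoid it. I have not verified this in my Docker setup.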