So, I have a producer that reads a log file from HDFS and sends it to Kafka line by line. On the other side, I consume the messages in Spark Streaming via the KafkaUtils API. My goal is to do real-time analysis of the logs. The problem is that the consumer only receives messages while the producer is running. As soon as the producer finishes sending (i.e. it successfully exits the while loop in the producer code below), the receiver stops showing me messages, even though it has not consumed everything yet. However, if I use Kafka's out-of-the-box console consumer, it shows me all the messages when I run this script:
/usr/lib/kafka/bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic topic2 --from-beginning
I have a topic named topic2 whose configuration looks like this:
Topic:topic2 PartitionCount:1 ReplicationFactor:2 Configs:
Topic: topic2 Partition: 0 Leader: 1 Replicas: 1,0 Isr: 0,1
The brokers are at localhost:9092 (id: 0), localhost:9093 (id: 1), and localhost:9094 (id: 2).
Also, I'm running everything in Eclipse Luna on a Cloudera Quickstart VM, in case that makes a difference.
SparkConsumer.java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

import kafka.serializer.StringDecoder;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaPairInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka.KafkaUtils;
import scala.Tuple2;

public class SparkConsumer {

    private static Function2<Long, Long, Long> SUM_REDUCER = (a, b) -> a + b;

    public static void main(String[] args) throws Exception {
        if (args.length < 2) {
            System.err.println("Usage: SparkConsumer <brokers> <topics>\n" +
                "  <brokers> is a list of one or more Kafka brokers\n" +
                "  <topics> is a list of one or more kafka topics to consume from\n\n");
            System.exit(1);
        }

        String brokers = args[0]; // This receives localhost:9092,localhost:9093,localhost:9094
        String topics = args[1];  // This receives *topic2*

        SparkConf sparkConf = new SparkConf()
            .setMaster("local[2]")
            .setAppName("SparkConsumer")
            .set("spark.driver.host", "localhost");
        JavaSparkContext sc = new JavaSparkContext(sparkConf);
        //JavaSQLContext sqlContext = new JavaSQLContext(sc);

        // Create a StreamingContext with a 1 second batch size
        JavaStreamingContext jssc = new JavaStreamingContext(sc, Durations.seconds(1));

        Set<String> topicsSet = new HashSet<>(Arrays.asList(topics.split(",")));
        Map<String, String> kafkaParams = new HashMap<>();
        kafkaParams.put("metadata.broker.list", brokers);
        kafkaParams.put("auto.offset.reset", "largest");

        // Create direct kafka stream with brokers and topics
        JavaPairInputDStream<String, String> messages = KafkaUtils.createDirectStream(
            jssc,
            String.class,
            String.class,
            StringDecoder.class,
            StringDecoder.class,
            kafkaParams,
            topicsSet
        );

        // Pull the message value out of each (key, message) pair and print it
        JavaDStream<String> lines = messages.map(new Function<Tuple2<String, String>, String>() {
            @Override
            public String call(Tuple2<String, String> tuple2) {
                return tuple2._2();
            }
        });
        lines.print();

        // Here's where I plan to do the processing using Spark when I'm done with this basic thing.

        jssc.start();
        jssc.awaitTermination();
    }
}
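For completeness, this is roughly the kind of processing I plan to plug in at that last comment once the consumption itself works. It is only a sketch of the usual streaming word count, reusing the SUM_REDUCER defined at the top, and assuming the Spark 1.x Java API that the rest of the code uses (it would also need org.apache.spark.streaming.api.java.JavaPairDStream in the imports):

        // Sketch only: split each line into words, map to (word, 1) pairs,
        // and sum the counts per word with SUM_REDUCER.
        JavaDStream<String> words = lines.flatMap(line -> Arrays.asList(line.split(" ")));
        JavaPairDStream<String, Long> wordCounts = words
            .mapToPair(word -> new Tuple2<>(word, 1L))
            .reduceByKey(SUM_REDUCER);
        wordCounts.print();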
HdfsProducer.java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URI;
import java.util.Properties;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class HdfsProducer {

    public static void readFromHdfs(Producer<String, String> producer, String topicName) {
        try {
            // 1. Get the instance of Configuration
            Configuration configuration = new Configuration();
            // 2. URI of the file to be read
            URI uri = new URI("hdfs://0.0.0.0:8022/data/apache-access-log.txt");
            // 3. Get the instance of the HDFS
            FileSystem hdfs = FileSystem.get(uri, configuration);
            Path pt = new Path(uri);
            BufferedReader br = new BufferedReader(new InputStreamReader(hdfs.open(pt)));

            String line = br.readLine();
            int count = 0;
            //while (line != null && count < 5){
            while (line != null) {
                System.out.println("Sending batch " + count);
                producer.send(new ProducerRecord<String, String>(topicName, line));
                line = br.readLine();
                count = count + 1;
            }
            producer.close();
        } catch (Exception e) {
            // Don't swallow errors silently; at least log them
            e.printStackTrace();
        }
    }

    public static Producer<String, String> getProducer(String topicName) throws Exception {
        // Create instance for properties to access producer configs
        Properties props = new Properties();
        // Assign the broker to bootstrap from
        props.put("bootstrap.servers", "localhost:9093");
        // Set acknowledgements for producer requests
        props.put("acks", "all");
        // If a request fails the producer can retry automatically; 0 disables retries
        props.put("retries", 0);
        // Specify batch size in config
        props.put("batch.size", 16384);
        // Wait up to 5 ms so records can be grouped into fewer requests
        props.put("linger.ms", 5);
        // buffer.memory controls the total amount of memory available to the producer for buffering
        props.put("buffer.memory", 33554432);
        props.put("key.serializer",
            "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
            "org.apache.kafka.common.serialization.StringSerializer");

        Producer<String, String> producer = new KafkaProducer<String, String>(props);
        return producer;
    }

    public static void main(String[] args) throws Exception {
        // Check arguments length value
        if (args.length == 0) {
            System.out.println("Enter topic name");
            return;
        }
        // Assign topicName to string variable
        String topicName = args[0]; // This receives *topic2*
        // Get a producer and then use it to read the logs from HDFS
        readFromHdfs(getProducer(topicName), topicName);
    }
}
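In case it matters, I'm aware the producer currently gives no feedback on whether each record actually reached the broker. A minimal sketch of what I could add, using the callback overload of the standard KafkaProducer send method (same producer, topicName and line variables as inside readFromHdfs above; I haven't tested this yet):

            // Sketch only: same send as in readFromHdfs, but with a delivery callback
            // so failed sends are printed instead of disappearing silently.
            producer.send(new ProducerRecord<String, String>(topicName, line),
                (metadata, exception) -> {
                    if (exception != null) {
                        System.err.println("Send failed: " + exception);
                    } else {
                        System.out.println("Sent to partition " + metadata.partition()
                                + " at offset " + metadata.offset());
                    }
                });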
Also, I tried running it with auto.offset.reset = smallest. In that case I receive a few messages that I sent at the very beginning, and after that I get nothing. Shouldn't the consumer see all the messages when the offset reset is smallest, regardless of when they were sent (assuming all the messages are persisted on the brokers, which seems to be the case since the simple consumer works fine)? I did read the documentation for Kafka, the Kafka integration with Spark, and Spark Streaming, but couldn't get any further. Not being familiar with Java doesn't help either :) I'd really appreciate it if someone could point out what I'm missing here. Thanks in advance!
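By the way, to double-check that the messages really are persisted on the brokers, I assume the earliest and latest offsets of topic2 can be read with the GetOffsetShell tool that ships with Kafka, something like the following (same install path as the console consumer above; --time -1 should be the latest offset and --time -2 the earliest):

/usr/lib/kafka/bin/kafka-run-class.sh kafka.tools.GetOffsetShell --broker-list localhost:9092,localhost:9093,localhost:9094 --topic topic2 --time -1
/usr/lib/kafka/bin/kafka-run-class.sh kafka.tools.GetOffsetShell --broker-list localhost:9092,localhost:9093,localhost:9094 --topic topic2 --time -2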