Spark Streaming with Kafka (direct approach) only consumes messages while the producer is running

Date: 2017-05-30 22:41:36

Tags: java apache-spark apache-kafka spark-streaming

So, I have a producer that reads a log file from HDFS and sends it to Kafka line by line. On the other side, I consume the messages in Spark Streaming using the KafkaUtils API. My goal is to do real-time analysis of the logs. The problem is that the consumer only receives messages while the producer is running. As soon as the producer finishes sending (i.e. the while loop in the producer code below completes successfully), the receiver stops showing me messages, even though it has not consumed everything yet. However, if I use Kafka's out-of-the-box console consumer, it shows me all the messages whenever I run the script:

/usr/lib/kafka/bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic topic2 --from-beginning

I have a topic named topic2, described as follows:

Topic:topic2    PartitionCount:1    ReplicationFactor:2 Configs:
Topic: topic2   Partition: 0    Leader: 1   Replicas: 1,0   Isr: 0,1

The brokers are at localhost:9092 (id: 0), localhost:9093 (id: 1) and localhost:9094 (id: 2).

Also, I'm running everything in Eclipse Luna on a Cloudera Quickstart VM (in case that makes a difference).

SparkConsumer.java

import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

import scala.Tuple2;
import kafka.serializer.StringDecoder;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaPairInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka.KafkaUtils;

public class SparkConsumer {

private static Function2<Long, Long, Long> SUM_REDUCER = (a, b) -> a + b;

public static void main(String[] args) throws Exception {
    if (args.length < 2) {
      System.err.println("Usage: JavaDirectKafkaWordCount <brokers> <topics>\n" +
          "  <brokers> is a list of one or more Kafka brokers\n" +
          "  <topics> is a list of one or more kafka topics to consume from\n\n");
      System.exit(1);
    }

    String brokers = args[0]; //This receives localhost:9092,localhost:9093,localhost:9094
    String topics = args[1]; //This receives *topic2*

    SparkConf sparkConf = new SparkConf().setMaster("local[2]").setAppName("SparkConsumer").set("spark.driver.host", "localhost");

    JavaSparkContext sc = new JavaSparkContext(sparkConf);
   //JavaSQLContext sqlContext = new JavaSQLContext(sc);

    // Create a StreamingContext with a 1 second batch size
    JavaStreamingContext jssc = new JavaStreamingContext(sc, Durations.seconds(1)); 

    Set<String> topicsSet = new HashSet<>(Arrays.asList(topics.split(",")));
    Map<String, String> kafkaParams = new HashMap<>();
    kafkaParams.put("metadata.broker.list", brokers);
    kafkaParams.put("auto.offset.reset", "largest");
    // Create direct kafka stream with brokers and topics
    JavaPairInputDStream<String, String> messages = KafkaUtils.createDirectStream(
        jssc,
        String.class,
        String.class,
        StringDecoder.class,
        StringDecoder.class,
        kafkaParams,
        topicsSet
    );

    // Get the lines, split them into words, count the words and print
    JavaDStream<String> lines = messages.map(new Function<Tuple2<String, String>, String>() {
      @Override
      public String call(Tuple2<String, String> tuple2) {
        return tuple2._2();
      }
    });

    lines.print(); 
    //Here's where I plan to do the processing using Spark when I'm done with this basic thing.
    jssc.start();
    jssc.awaitTermination();
}
}
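
For context, this is roughly the per-batch processing I plan to plug in at the placeholder above, and what the otherwise unused SUM_REDUCER is meant for (just a sketch; it would also need org.apache.spark.streaming.api.java.JavaPairDStream imported):

// Count how many times each line occurs in the current batch
JavaPairDStream<String, Long> counts = lines
        .mapToPair(line -> new Tuple2<>(line, 1L))
        .reduceByKey(SUM_REDUCER);
counts.print();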

HdfsProducer.java

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URI;
import java.util.Properties;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class HdfsProducer {
public static void readFromHdfs(Producer<String, String> producer, String topicName) {

    try{
        //1. Get the instance of Configuration
        Configuration configuration = new Configuration();
        //2. URI of the file to be read
        URI uri = new URI("hdfs://0.0.0.0:8022/data/apache-access-log.txt");
        //3. Get the instance of the HDFS 
        FileSystem hdfs = FileSystem.get(uri, configuration);
        Path pt = new Path(uri);
        BufferedReader br=new BufferedReader(new InputStreamReader(hdfs.open(pt)));
        String line;
        //System.out.println("Hello World");
        //System.out.println("\n Successfully read hadoop file");
        line=br.readLine();
        int count = 0;
        //while (line != null && count < 5){
        while (line != null){
            System.out.println("Sending batch" + count);
            producer.send(new ProducerRecord<String, String>(topicName, new String(line)));
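            // send() is asynchronous: records are buffered and sent in batches
            // (per batch.size / linger.ms below); producer.close() at the end
            // flushes anything still buffered.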
            line=br.readLine();
            count = count+1;
        }

        producer.close();
    }catch(Exception e){
        // Don't swallow errors silently; at least log them
        e.printStackTrace();
    }
}

public static Producer<String, String> getProducer(String topicName) throws Exception {

    // Create an instance of Properties to hold the producer configs
    Properties props = new Properties();

    // Broker to bootstrap from
    props.put("bootstrap.servers", "localhost:9093");

    // Require acknowledgements from all in-sync replicas
    props.put("acks", "all");

    // If a request fails, the producer can retry automatically; 0 disables retries
    props.put("retries", 0);

    // Batch size per partition, in bytes
    props.put("batch.size", 16384);

    // Small delay so that sends can be batched together
    props.put("linger.ms", 5);

    // buffer.memory controls the total memory available to the producer for buffering
    props.put("buffer.memory", 33554432);

    props.put("key.serializer",
            "org.apache.kafka.common.serialization.StringSerializer");
    props.put("value.serializer",
            "org.apache.kafka.common.serialization.StringSerializer");

    Producer<String, String> producer = new KafkaProducer<String, String>(props);

    return producer;
}

public static void main(String[] args) throws Exception{


    // Check arguments length value
    if(args.length == 0){
        System.out.println("Enter topic name");
        return;
    }

    //Assign topicName to string variable
    String topicName = args[0].toString(); //This receives *topic2*

    //Get a producer and then use it to read the logs from HDFS
    readFromHdfs(getProducer(topicName), topicName);


}
}

Also, I tried running this with auto.offset.reset = smallest. In that case I receive a handful of messages that I had sent at the very beginning, and after that I get nothing (the exact one-line change is shown after this paragraph). Shouldn't the consumer see all messages when the offset reset is smallest, regardless of when they were sent? (Assuming all the messages are still persisted on the brokers, which they must be, since the console consumer works fine.) I have read the Kafka docs, the Kafka integration guide for Spark, and the Spark Streaming docs, but couldn't get any further. Not being familiar with Java doesn't help either :) I'd really appreciate it if someone could point out what I'm missing here. Thanks in advance!
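
For reference, the only thing I change between the two runs is this single kafkaParams setting; everything else stays exactly as in SparkConsumer.java above:

// What I use in the code above (start from the latest offsets):
kafkaParams.put("auto.offset.reset", "largest");

// What I tried for the second case instead (start from the earliest retained offsets):
// kafkaParams.put("auto.offset.reset", "smallest");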

0 Answers:

No answers yet