Spark Streaming does not read all Kafka records

Time: 2017-08-09 10:32:09

Tags: apache-spark apache-kafka spark-streaming

We are sending 15 records from Kafka to Spark Streaming, but Spark receives only 11 of them. I am using Spark 2.1.0 and kafka_2.12-0.10.2.0.

CODE

import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.streaming.Duration;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaPairReceiverInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka.KafkaUtils;

import scala.Tuple2;

public class KafkaToSparkData {
 public static void main(String[] args) throws InterruptedException {
    int timeDuration = 100;
    int consumerNumberOfThreads = 1;
    String consumerTopic = "InputDataTopic";
    String zookeeperUrl = "localhost:2181";
    String consumerTopicGroup =  "testgroup";
    String producerKafkaUrl = "localhost:9092";
    String producerTopic =  "OutputDataTopic";
    String sparkMasterUrl = "local[2]";

    Map<String, Integer> topicMap = new HashMap<String, Integer>();
    topicMap.put(consumerTopic, consumerNumberOfThreads);

    SparkSession sparkSession = SparkSession.builder().master(sparkMasterUrl).appName("Kafka-Spark").getOrCreate();

    JavaSparkContext javaSparkContext = new JavaSparkContext(sparkSession.sparkContext());

    JavaStreamingContext javaStreamingContext = new JavaStreamingContext(javaSparkContext, new Duration(timeDuration));

    JavaPairReceiverInputDStream<String, String> messages = KafkaUtils.createStream(javaStreamingContext, zookeeperUrl, consumerTopicGroup, topicMap);

    JavaDStream<String> NewRecord = messages.map(new Function<Tuple2<String, String>, String>() {
        private static final long serialVersionUID = 1L;

        public String call(Tuple2<String, String> line) throws Exception {

            String responseToKafka = "";
            System.out.println(" Data IS " + line);

            String ValueData = line._2;
            responseToKafka = ValueData + "|" + "0";

            Properties configProperties = new Properties();
            configProperties.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, producerKafkaUrl);
            configProperties.put("key.serializer", org.apache.kafka.common.serialization.StringSerializer.class);
            configProperties.put("value.serializer", org.apache.kafka.common.serialization.StringSerializer.class);

            KafkaProducer<String, String> producer = new KafkaProducer<String, String>(configProperties);

            ProducerRecord<String, String> topicMessage = new ProducerRecord<String, String>(producerTopic,responseToKafka);
            producer.send(topicMessage);
            producer.close();

            return responseToKafka;
        }
    });

    System.out.println(" Printing Record" );
    NewRecord.print();

    javaStreamingContext.start();
    javaStreamingContext.awaitTermination();
    javaStreamingContext.close();

    }
}
Kafka producer

bin/kafka-console-producer.sh --broker-list localhost:9092 --topic InputDataTopic
# 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18

Kafka consumer

bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic OutputDataTopic --from-beginning
# 1|0 2|0 3|0 4|0 5|0 6|0 7|0 8|0 9|0 10|0 11|0

Can somebody help me with this?

1 answer:

Answer 0: (score: 2)

What we are seeing here is an effect of how lazy operations work in Spark. A map operation is being used to cause a side effect, namely sending some data to Kafka.

The stream is then materialized using print. By default, print shows the first 10 elements of the stream, but it takes n+1 elements in order to show "..." and indicate when there are more.

That take(11) forces the materialization of only the first 11 elements, so only those are taken from the original stream and processed with the map function. This is why only a partial set of records gets published to Kafka.
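A quick way to see this effect (a hypothetical tweak to the question's code, not a fix) is to ask print for more elements: print(num) takes up to num + 1 elements per batch, so correspondingly more records would be pushed through the map and reach Kafka.

// Hypothetical: materializes up to 21 elements per batch instead of 11,
// so more of the side-effect sends in map() would reach OutputDataTopic.
NewRecord.print(20);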

How do we solve this? Well, the hint is already above: do not use side effects in the map function. In this case, the correct output operation to consume the stream and send its data to Kafka is foreachRDD.

Furthermore, to avoid creating a Kafka producer instance per element, we process the data of each RDD partition with foreachPartition.

A code skeleton of this process looks like this:

messages.foreachRDD{rdd => 
  rdd.foreachPartition{partitionIter => 
       producer = // create producer
       partitionIter.foreach{elem =>
           record = createRecord(elem)
           producer.send(record)
       }
       producer.flush()  
       producer.close()
   }
}
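Since the question's code is in Java, a rough Java 8 equivalent of that skeleton could look like the sketch below. It assumes the same messages stream, the producerKafkaUrl and producerTopic variables, and the imports from the question's class, and it would replace both the map(...) and the print() calls. It is only a sketch of the pattern, not a tested drop-in implementation.

messages.foreachRDD(rdd -> {
    rdd.foreachPartition(partitionIter -> {
        // One producer per partition instead of one per record
        Properties configProperties = new Properties();
        configProperties.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, producerKafkaUrl);
        configProperties.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");
        configProperties.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");
        KafkaProducer<String, String> producer = new KafkaProducer<>(configProperties);

        // Send every record of the partition, not just the ones print() happens to take
        while (partitionIter.hasNext()) {
            Tuple2<String, String> line = partitionIter.next();
            String responseToKafka = line._2() + "|" + "0";
            producer.send(new ProducerRecord<>(producerTopic, responseToKafka));
        }

        producer.flush();
        producer.close();
    });
});

Because foreachRDD is itself an output operation, no print() is needed to trigger the processing, and all 15 (or 18) input records should now reach OutputDataTopic.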