Spark Streaming sending to Kafka (0.10) produces too many connections

Time: 2018-09-14 14:02:32

Tags: apache-spark apache-kafka spark-streaming

I am using Spark Streaming (2.1) to send some data to Kafka (a 0.10.x version, with a thin wrapper of my own).

I wrapped my own Kafka producer like this:

import java.util.Properties

import org.apache.kafka.clients.producer.{Callback, KafkaProducer, ProducerRecord}

// Serializable wrapper so the producer config can be shipped to executors;
// the lazy val defers creating the real KafkaProducer until first use.
case class MyKafkaProducer[KT, VT](props: Properties) extends Serializable {
  private lazy val producer = new KafkaProducer[KT, VT](props)

  def send(record: ProducerRecord[KT, VT], callback: Callback = null) =
    producer.send(record, callback)

  def flush(): Unit = producer.flush()

  def close(): Unit = producer.close()
}
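(`getProperties(conf)` used below is my own helper that builds the producer `Properties`; its exact contents are omitted here, but a typical version looks roughly like the following sketch, with placeholder broker addresses and config values:)

import java.util.Properties

// Rough sketch of the omitted helper; broker list and settings are placeholders.
def getProperties(conf: Map[String, String]): Properties = {
  val props = new Properties()
  props.put("bootstrap.servers", conf.getOrElse("kafka.brokers", "localhost:9092"))
  props.put("key.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer")
  props.put("value.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer")
  props
}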

I have tried the following three ways to send data to Kafka, and each has a problem. The most common problem (cases 1 and 2) is that far too many connections to Kafka show up when inspecting the Kafka cluster machines themselves; nothing suspicious appears in the job's own logs.

In the third case I did not notice the too-many-connections problem (I don't know whether it still happens; can anyone help me analyze it?), but it can no longer send once a producer has been closed.

1. Broadcast

//too many connections

val producer = ssc.sparkContext.broadcast(MyKafkaProducer[Array[Byte], Array[Byte]](getProperties(conf))).value

dataStream.foreachRDD {
  _.foreachPartition { it =>
    it.foreach { row =>
      val value = row.toString().getBytes("UTF-8")

      def run(deep : Int = 0): Unit = {
        if(deep < retryTimes){
          val record = new ProducerRecord[Array[Byte], Array[Byte]](topic, key, value)
          producer.send(record, new Callback {
            override def onCompletion(recordMetadata: RecordMetadata, e: Exception) = if(e != null) run(deep + 1) })
        }
      }

      run()

    }
    producer.flush
  }
}
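For reference, a variant that keeps the `Broadcast` handle and only calls `.value` inside each partition (closer in shape to the KafkaWriter linked in question 2 below; retry logic stripped for brevity) would be roughly:

// Variant for comparison: keep the Broadcast handle, dereference it on the executor side.
val producerBc = ssc.sparkContext.broadcast(
  MyKafkaProducer[Array[Byte], Array[Byte]](getProperties(conf)))

dataStream.foreachRDD {
  _.foreachPartition { it =>
    val producer = producerBc.value  // resolved on the executor, cached per executor
    it.foreach { row =>
      val value = row.toString().getBytes("UTF-8")
      producer.send(new ProducerRecord[Array[Byte], Array[Byte]](topic, key, value))
    }
    producer.flush()
  }
}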

2. A producer per record

//too many connections too
//blocked(send too slow), maybe producer has too much overhead?

dataStream.foreachRDD {
  _.foreach { row =>
    val producer = MyKafkaProducer[Array[Byte], Array[Byte]](getProperties(conf))

    val value = row.toString().getBytes("UTF-8")

    def run(deep : Int = 0): Unit ={
      if(deep < retryTimes){
        val record = new ProducerRecord[Array[Byte], Array[Byte]](topic, key, value)
        producer.send(record, new Callback {
          override def onCompletion(recordMetadata: RecordMetadata, e: Exception) = if(e != null) run(deep + 1)
        })
      }
    }

    run()

    producer.flush
    producer.close
  }
}

3. A producer per task (partition)

//do not know whether too many connections
//cannot send after the producer is closed

dataStream.foreachRDD {
  _.foreachPartition { it =>
    Thread.sleep(1000)
    val producer = MyKafkaProducer[Array[Byte], Array[Byte]](getProperties(conf))

    it.foreach { row =>
      val value = row.toString().getBytes("UTF-8")

      def run(deep : Int = 0): Unit ={
        if(deep < retryTimes){
          val record = new ProducerRecord[Array[Byte], Array[Byte]](topic, key, value)
          producer.send(record, new Callback {
            override def onCompletion(recordMetadata: RecordMetadata, e: Exception) = if(e != null) run(deep + 1)
          })
        }
      }

      run()
    }
    producer.flush
    producer.close
  }
}
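Since `send()` is asynchronous, my retry inside the callback can end up calling `send()` after `close()` has already run, which may be where "cannot send after the producer is closed" comes from. One way I could rule that out (at the cost of throughput) is to block on the `Future` that `send()` returns, roughly like this sketch (reusing `topic`, `key`, `retryTimes` and `getProperties` from the examples above):

// Sketch: block on the Future from send() so every retry finishes before close().
dataStream.foreachRDD {
  _.foreachPartition { it =>
    val producer = MyKafkaProducer[Array[Byte], Array[Byte]](getProperties(conf))
    it.foreach { row =>
      val value = row.toString().getBytes("UTF-8")
      var sent = false
      var attempt = 0
      while (!sent && attempt < retryTimes) {
        attempt += 1
        try {
          producer.send(new ProducerRecord[Array[Byte], Array[Byte]](topic, key, value)).get()
          sent = true
        } catch {
          case _: Exception => () // swallow and retry synchronously
        }
      }
    }
    producer.flush()
    producer.close()
  }
}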

I am new to this area and don't know how to analyze it. I have a few questions:

  1. Why do all three approaches fail (too many connections / cannot produce after the producer is closed)?
  2. Why does the broadcast case in this KafkaWriter (https://github.com/harishreedharan/spark-streaming-kafka-output/blob/master/src/main/scala/org/cloudera/spark/streaming/kafka/KafkaWriter.scala) work? What is the difference between my code and that one?
  3. Is there a standard way to write Spark data to Kafka? (one pattern I am considering is sketched after this list)
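For question 3, one pattern I have seen suggested is to keep a single producer per executor JVM in a lazily initialized object and not close it inside `foreachRDD`. A sketch of that idea (the object name `ExecutorKafkaProducer` is mine, not from any library):

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

// Sketch: one producer per executor JVM, created lazily on first use and
// closed by a shutdown hook, so all tasks on that executor share its connections.
object ExecutorKafkaProducer {
  private var producer: KafkaProducer[Array[Byte], Array[Byte]] = _

  def get(props: Properties): KafkaProducer[Array[Byte], Array[Byte]] = synchronized {
    if (producer == null) {
      producer = new KafkaProducer[Array[Byte], Array[Byte]](props)
      sys.addShutdownHook(producer.close())
    }
    producer
  }
}

// usage inside the streaming job
dataStream.foreachRDD {
  _.foreachPartition { it =>
    val producer = ExecutorKafkaProducer.get(getProperties(conf))
    it.foreach { row =>
      val value = row.toString().getBytes("UTF-8")
      producer.send(new ProducerRecord[Array[Byte], Array[Byte]](topic, key, value))
    }
    producer.flush()
  }
}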

0 Answers:

There are no answers yet.