Unable to evaluate an ML model on Structured Streaming, because RDD transformations and actions are invoked inside other transformations

Date: 2018-05-03 20:16:46

Tags: apache-spark apache-spark-ml spark-structured-streaming

This is a well-known limitation of Structured Streaming [1], and I am trying to work around it using a custom sink.

In what follows, modelsMap is a map from string keys to org.apache.spark.mllib.stat.KernelDensity models, and streamingData is a streaming DataFrame: org.apache.spark.sql.DataFrame = [id1: string, id2: string ... 6 more fields].

I am trying to score each row of streamingData against the corresponding model in modelsMap, augment the row with the prediction, and then write the result to Kafka.

An obvious approach is .withColumn, using a UDF to compute the prediction and the Kafka sink to write the output, roughly as sketched below.
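
A minimal sketch of that approach, assuming the key columns are named id1/id2/id3 and the time column time_0 (names inferred from the sink code further down), with placeholder broker, topic, and checkpoint values:

    import org.apache.spark.sql.functions.{col, concat_ws, udf}

    // UDF that captures modelsMap; calling model.estimate here is what ultimately fails,
    // because estimate() runs an RDD aggregate from inside executor code.
    val predict = udf { (key: String, time_0: Double) =>
      modelsMap.get(key)
        .map(_.estimate(Array(time_0))(0))
        .getOrElse(Double.NaN)
    }

    streamingData
      .withColumn("prediction",
        predict(concat_ws("/", col("id1"), col("id2"), col("id3")), col("time_0")))
      .selectExpr("id1 AS key", "to_json(struct(*)) AS value")
      .writeStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "<broker>")
      .option("topic", "<topic>")
      .option("checkpointLocation", "/tmp/checkpoint")   // placeholder path
      .outputMode("update")
      .start()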

But this is not allowed, because:

    org.apache.spark.SparkException: This RDD lacks a SparkContext. It
    could happen in the following cases: (1) RDD transformations and
    actions are NOT invoked by the driver, but inside of other
    transformations; for example, rdd1.map(x => rdd2.values.count() * x) is
    invalid because the values transformation and count action cannot be
    performed inside of the rdd1.map transformation. For more information,
    see SPARK-5063.

Somewhat unexpectedly, I get the same error with a custom sink that implements ForeachWriter:

    import org.apache.spark.sql.ForeachWriter
    import org.apache.spark.sql.streaming.Trigger
    import java.util.Properties
    import scala.concurrent.duration._
    import kafkashaded.org.apache.kafka.clients.producer._

    class customSink(topic: String, servers: String) extends ForeachWriter[org.apache.spark.sql.Row] {
      val kafkaProperties = new Properties()
      kafkaProperties.put("bootstrap.servers", servers)
      kafkaProperties.put("key.serializer", "kafkashaded.org.apache.kafka.common.serialization.StringSerializer")
      kafkaProperties.put("value.serializer", "kafkashaded.org.apache.kafka.common.serialization.StringSerializer")
      var producer: KafkaProducer[String, String] = _

      def open(partitionId: Long, version: Long): Boolean = {
        producer = new KafkaProducer(kafkaProperties)
        true
      }

      def process(value: org.apache.spark.sql.Row): Unit = {
        // Score the row against the model for its id1/id2/id3 key;
        // fall back to NaN when no model exists for that key.
        var prediction = Double.NaN
        try {
          val id1 = value(0)
          val id2 = value(3)
          val id3 = value(5)
          val time_0 = value(6).asInstanceOf[Double]
          val key = f"$id1/$id2/$id3"
          println("Looking up key: " + key)
          val model = modelsMap(key)
          // This is the call that fails: estimate() runs an RDD aggregate on the executor.
          prediction = model.estimate(Array[Double](time_0))(0)
          println(prediction)
        } catch {
          case e: NoSuchElementException =>
            println(prediction)
        }
        producer.send(new ProducerRecord(topic, value.mkString(",") + "," + prediction.toString))
      }

      def close(errorOrNull: Throwable): Unit = {
        producer.close()
      }
    }

    val writer = new customSink("<topic>", "<broker>")

    val query = streamingData
      .writeStream
      .foreach(writer)
      .outputMode("update")
      .trigger(Trigger.ProcessingTime(10.seconds))
      .start()

model.estimate is implemented in mllib.stat using aggregate, and there is no way to get around that.
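
A minimal, streaming-free reproduction of that limitation, as a sketch assuming a SparkSession named spark and toy sample data (not from the question):

    import org.apache.spark.mllib.stat.KernelDensity

    // Each KernelDensity wraps an RDD of samples, so estimate() is really an RDD
    // aggregate that must be triggered from the driver.
    val kd = new KernelDensity()
      .setSample(spark.sparkContext.parallelize(Seq(1.0, 2.0, 3.0)))
      .setBandwidth(3.0)

    kd.estimate(Array(2.0))                      // fine: called on the driver

    spark.sparkContext.parallelize(Seq(2.0))
      .map(x => kd.estimate(Array(x))(0))        // fails with the SPARK-5063 error above
      .collect()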

What changes can I make? (I could collect each batch and run a for loop on the driver, but then I am not using Spark the way it is intended.)
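
For reference, a sketch of that collect-and-loop fallback, assuming a Spark version with foreachBatch (2.4+, which is newer than this question); everything here other than modelsMap and streamingData is illustrative:

    import org.apache.spark.sql.DataFrame

    // Collect each micro-batch to the driver and score it there, so estimate()'s
    // RDD aggregate stays legal; the scoring itself is not parallelized.
    val scoreBatch: (DataFrame, Long) => Unit = (batch, batchId) => {
      val scored = batch.collect().map { row =>
        val key = Seq(row(0), row(3), row(5)).mkString("/")
        val prediction = modelsMap.get(key)
          .map(_.estimate(Array(row.getDouble(6)))(0))   // runs on the driver
          .getOrElse(Double.NaN)
        row.mkString(",") + "," + prediction
      }
      // scored could now be sent to Kafka with a plain KafkaProducer on the driver
      scored.foreach(println)
    }

    streamingData.writeStream
      .foreachBatch(scoreBatch)
      .outputMode("update")
      .start()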

References:

  1. https://www.slideshare.net/databricks/realtime-machine-learning-analytics-using-structured-streaming-and-kinesis-firehose (slide #11 mentions the limitation)

  2. https://www.oreilly.com/learning/extend-structured-streaming-for-spark-ml

  3. https://github.com/holdenk/spark-structured-streaming-ml (a proposed solution)

  4. https://issues.apache.org/jira/browse/SPARK-16454

  5. https://issues.apache.org/jira/browse/SPARK-16407

0 Answers:

There are no answers.