Unable to evaluate an ML model on Structured Streaming, because RDD transformations and actions are invoked inside other transformations

Date: 2018-05-03 20:16:46

Tags: apache-spark apache-spark-ml spark-structured-streaming

This is a well-known limitation of Structured Streaming [1], and I am trying to work around it using a custom sink.

In what follows, modelsMap is a map from string keys to org.apache.spark.mllib.stat.KernelDensity models, and streamingData is a streaming DataFrame: org.apache.spark.sql.DataFrame = [id1: string, id2: string ... 6 more fields].

I am trying to score each row of streamingData against the corresponding model in modelsMap, augment the row with the prediction, and then write the result to Kafka.

An obvious approach is .withColumn, using a UDF to compute the prediction and the Kafka sink to write the output, roughly as sketched below.
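
A minimal sketch of that approach, assuming the key columns are named id1/id2/id3 and the time column time_0 (names inferred from the sink code further down), with placeholder broker, topic, and checkpoint values:

    import org.apache.spark.sql.functions.{col, concat_ws, udf}

    // UDF that captures modelsMap; calling model.estimate here is what ultimately fails,
    // because estimate() runs an RDD aggregate from inside executor code.
    val predict = udf { (key: String, time_0: Double) =>
      modelsMap.get(key)
        .map(_.estimate(Array(time_0))(0))
        .getOrElse(Double.NaN)
    }

    streamingData
      .withColumn("prediction",
        predict(concat_ws("/", col("id1"), col("id2"), col("id3")), col("time_0")))
      .selectExpr("id1 AS key", "to_json(struct(*)) AS value")
      .writeStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "<broker>")
      .option("topic", "<topic>")
      .option("checkpointLocation", "/tmp/checkpoint")   // placeholder path
      .outputMode("update")
      .start()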

But this is not allowed, because:

    org.apache.spark.SparkException: This RDD lacks a SparkContext. It
    could happen in the following cases: (1) RDD transformations and
    actions are NOT invoked by the driver, but inside of other
    transformations; for example, rdd1.map(x => rdd2.values.count() * x) is
    invalid because the values transformation and count action cannot be
    performed inside of the rdd1.map transformation. For more information,
    see SPARK-5063.

Somewhat unexpectedly, I get the same error with a custom sink that implements ForeachWriter:

    import org.apache.spark.sql.ForeachWriter
    import org.apache.spark.sql.streaming.Trigger
    import java.util.Properties
    import scala.concurrent.duration._
    import kafkashaded.org.apache.kafka.clients.producer._

    class customSink(topic: String, servers: String) extends ForeachWriter[org.apache.spark.sql.Row] {
      val kafkaProperties = new Properties()
      kafkaProperties.put("bootstrap.servers", servers)
      kafkaProperties.put("key.serializer", "kafkashaded.org.apache.kafka.common.serialization.StringSerializer")
      kafkaProperties.put("value.serializer", "kafkashaded.org.apache.kafka.common.serialization.StringSerializer")
      var producer: KafkaProducer[String, String] = _

      def open(partitionId: Long, version: Long): Boolean = {
        producer = new KafkaProducer(kafkaProperties)
        true
      }

      def process(value: org.apache.spark.sql.Row): Unit = {
        // Score the row against the model for its id1/id2/id3 key;
        // fall back to NaN when no model exists for that key.
        var prediction = Double.NaN
        try {
          val id1 = value(0)
          val id2 = value(3)
          val id3 = value(5)
          val time_0 = value(6).asInstanceOf[Double]
          val key = f"$id1/$id2/$id3"
          println("Looking up key: " + key)
          val model = modelsMap(key)
          // This is the call that fails: estimate() runs an RDD aggregate on the executor.
          prediction = model.estimate(Array[Double](time_0))(0)
          println(prediction)
        } catch {
          case e: NoSuchElementException =>
            println(prediction)
        }
        producer.send(new ProducerRecord(topic, value.mkString(",") + "," + prediction.toString))
      }

      def close(errorOrNull: Throwable): Unit = {
        producer.close()
      }
    }

    val writer = new customSink("<topic>", "<broker>")

    val query = streamingData
      .writeStream
      .foreach(writer)
      .outputMode("update")
      .trigger(Trigger.ProcessingTime(10.seconds))
      .start()

model.estimate is implemented in mllib.stat using aggregate, and there is no way to get around that.
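
A minimal, streaming-free reproduction of that limitation, as a sketch assuming a SparkSession named spark and toy sample data (not from the question):

    import org.apache.spark.mllib.stat.KernelDensity

    // Each KernelDensity wraps an RDD of samples, so estimate() is really an RDD
    // aggregate that must be triggered from the driver.
    val kd = new KernelDensity()
      .setSample(spark.sparkContext.parallelize(Seq(1.0, 2.0, 3.0)))
      .setBandwidth(3.0)

    kd.estimate(Array(2.0))                      // fine: called on the driver

    spark.sparkContext.parallelize(Seq(2.0))
      .map(x => kd.estimate(Array(x))(0))        // fails with the SPARK-5063 error above
      .collect()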

What changes can I make? (I could collect each batch and run a for loop on the driver, but then I am not using Spark the way it is intended.)
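
For reference, a sketch of that collect-and-loop fallback, assuming a Spark version with foreachBatch (2.4+, which is newer than this question); everything here other than modelsMap and streamingData is illustrative:

    import org.apache.spark.sql.DataFrame

    // Collect each micro-batch to the driver and score it there, so estimate()'s
    // RDD aggregate stays legal; the scoring itself is not parallelized.
    val scoreBatch: (DataFrame, Long) => Unit = (batch, batchId) => {
      val scored = batch.collect().map { row =>
        val key = Seq(row(0), row(3), row(5)).mkString("/")
        val prediction = modelsMap.get(key)
          .map(_.estimate(Array(row.getDouble(6)))(0))   // runs on the driver
          .getOrElse(Double.NaN)
        row.mkString(",") + "," + prediction
      }
      // scored could now be sent to Kafka with a plain KafkaProducer on the driver
      scored.foreach(println)
    }

    streamingData.writeStream
      .foreachBatch(scoreBatch)
      .outputMode("update")
      .start()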

References:

  1. https://www.slideshare.net/databricks/realtime-machine-learning-analytics-using-structured-streaming-and-kinesis-firehose (slide #11 mentions the limitation)

  2. https://www.oreilly.com/learning/extend-structured-streaming-for-spark-ml

  3. https://github.com/holdenk/spark-structured-streaming-ml (a proposed solution)

  4. https://issues.apache.org/jira/browse/SPARK-16454

  5. https://issues.apache.org/jira/browse/SPARK-16407

0 Answers:

There are no answers.