This is a well-known limitation of Structured Streaming [1], and I am trying to work around it with a custom sink.
In what follows, modelsMap is a map from string keys to org.apache.spark.mllib.stat.KernelDensity models, and streamingData is a streaming dataframe org.apache.spark.sql.DataFrame = [id1: string, id2: string ... 6 more fields].
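For context, modelsMap might be built along the following lines (buildModels, trainingSamples, and the bandwidth value are placeholders rather than my actual code; the relevant point is that each KernelDensity holds a reference to the sample RDD it was fitted on):

import org.apache.spark.mllib.stat.KernelDensity
import org.apache.spark.rdd.RDD

// trainingSamples: historical (key, time) pairs -- a stand-in for whatever the models were fitted on
def buildModels(trainingSamples: RDD[(String, Double)]): Map[String, KernelDensity] =
  trainingSamples
    .groupByKey()
    .collectAsMap()
    .map { case (key, times) =>
      // each model keeps a reference to its sample RDD, which is what later
      // makes estimate() unusable from inside an executor task
      val sample = trainingSamples.sparkContext.parallelize(times.toSeq)
      key -> new KernelDensity().setSample(sample).setBandwidth(1.0)
    }
    .toMap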
I am trying to evaluate each row of streamingData against its corresponding model in modelsMap, augment each row with the prediction, and then write to Kafka.
One obvious approach is .withColumn with a UDF that does the prediction, followed by the built-in Kafka sink for writing; a rough sketch of that attempt is shown below.
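Roughly, that attempt looks like this (predictUdf is a placeholder name, and the id3 / time_0 column names are inferred from the row positions used in the sink further down):

import org.apache.spark.sql.functions.{col, concat_ws, udf}

// look up the model for the row's key and evaluate it on time_0
val predictUdf = udf { (key: String, time0: Double) =>
  modelsMap.get(key)
    .map(_.estimate(Array(time0))(0))   // KernelDensity.estimate ends up running inside a task
    .getOrElse(Double.NaN)
}

val augmented = streamingData
  .withColumn("prediction",
    predictUdf(concat_ws("/", col("id1"), col("id2"), col("id3")), col("time_0")))

// then write the augmented rows with the built-in Kafka sink
val kafkaQuery = augmented
  .selectExpr("CAST(id1 AS STRING) AS key", "to_json(struct(*)) AS value")
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "<broker>")
  .option("topic", "<topic>")
  .option("checkpointLocation", "<checkpoint-dir>")
  .start()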
But this is not allowed, because:
org.apache.spark.SparkException: This RDD lacks a SparkContext. It
could happen in the following cases: (1) RDD transformations and
actions are NOT invoked by the driver, but inside of other
transformations; for example, rdd1.map(x => rdd2.values.count() * x) is
invalid because the values transformation and count action cannot be
performed inside of the rdd1.map transformation. For more information,
see SPARK-5063.
I got the same error with a custom sink that implements ForeachWriter, which was a bit unexpected:
import org.apache.spark.sql.ForeachWriter
import org.apache.spark.sql.streaming.Trigger
import java.util.Properties
import scala.concurrent.duration._
import kafkashaded.org.apache.kafka.clients.producer._

class customSink(topic: String, servers: String) extends ForeachWriter[org.apache.spark.sql.Row] {
  val kafkaProperties = new Properties()
  kafkaProperties.put("bootstrap.servers", servers)
  kafkaProperties.put("key.serializer", "kafkashaded.org.apache.kafka.common.serialization.StringSerializer")
  kafkaProperties.put("value.serializer", "kafkashaded.org.apache.kafka.common.serialization.StringSerializer")

  val results = new scala.collection.mutable.HashMap[String, String]
  var producer: KafkaProducer[String, String] = _

  def open(partitionId: Long, version: Long): Boolean = {
    producer = new KafkaProducer(kafkaProperties)
    true
  }

  def process(value: org.apache.spark.sql.Row): Unit = {
    // prediction stays NaN when no model is found for the key
    var prediction = Double.NaN
    try {
      val id1 = value(0)
      val id2 = value(3)
      val id3 = value(5)
      val time_0 = value(6).asInstanceOf[Double]
      val key = f"$id1/$id2/$id3"
      println(s"Looking up key: $key")
      val model = modelsMap(key)
      // this is where the SparkException is thrown: estimate aggregates over the model's sample RDD
      prediction = model.estimate(Array[Double](time_0))(0)
      println(prediction)
    } catch {
      case e: NoSuchElementException =>
        println(prediction)
    }
    producer.send(new ProducerRecord(topic, value.mkString(",") + "," + prediction.toString))
  }

  def close(errorOrNull: Throwable): Unit = {
    producer.close()
  }
}
val writer = new customSink("<topic>", "<broker>")   // constructor arguments are (topic, servers)

val query = streamingData
  .writeStream
  .foreach(writer)
  .outputMode("update")
  .trigger(Trigger.ProcessingTime(10.seconds))
  .start()
model.estimate is implemented with aggregate in mllib.stat, and there is no way around that.
What changes should I make? (I could collect each batch and run a for loop on the driver, but then I'm not using Spark the way it's intended; a rough version of that fallback follows.)
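For completeness, that collect-per-batch fallback could be written with foreachBatch (assuming Spark 2.4+; scoreBatch, fallbackTopic, fallbackKafkaProperties and fallbackQuery are placeholder names):

import java.util.Properties
import scala.concurrent.duration._
import kafkashaded.org.apache.kafka.clients.producer._
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.streaming.Trigger

val fallbackTopic = "<topic>"
val fallbackKafkaProperties = new Properties()
fallbackKafkaProperties.put("bootstrap.servers", "<broker>")
fallbackKafkaProperties.put("key.serializer", "kafkashaded.org.apache.kafka.common.serialization.StringSerializer")
fallbackKafkaProperties.put("value.serializer", "kafkashaded.org.apache.kafka.common.serialization.StringSerializer")

// runs on the driver for every micro-batch, so the KernelDensity models in modelsMap
// (and the sample RDDs they hold) can be evaluated without hitting SPARK-5063
def scoreBatch(batch: DataFrame, batchId: Long): Unit = {
  val rows = batch.collect()   // pulls the whole micro-batch onto the driver
  val producer = new KafkaProducer[String, String](fallbackKafkaProperties)
  rows.foreach { row =>
    val key = f"${row(0)}/${row(3)}/${row(5)}"
    val time_0 = row(6).asInstanceOf[Double]
    val prediction = modelsMap.get(key)
      .map(_.estimate(Array(time_0))(0))
      .getOrElse(Double.NaN)
    producer.send(new ProducerRecord(fallbackTopic, row.mkString(",") + "," + prediction.toString))
  }
  producer.close()
}

val fallbackQuery = streamingData
  .writeStream
  .foreachBatch(scoreBatch _)
  .outputMode("update")
  .trigger(Trigger.ProcessingTime(10.seconds))
  .start()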
References: