Question

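What is the "recommended" way to process each message as it comes through a Structured Streaming pipeline (I'm on Spark 2.1.1, with Kafka 0.10.2.1 as the source)?

So far I am looking at dataframe.mapPartitions (since I need to connect to HBase, whose client connection classes are not serializable, hence mapPartitions).

Thoughts?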

Answer 1

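You should be able to use the foreach output sink: https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#output-sinks and https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#using-foreach

Even if the client is not serializable, you don't have to open it in the ForeachWriter constructor. Just leave it as None / null and initialize it in the open method, which is called after serialization, but only once per task.

In pseudocode: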

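import org.apache.spark.sql.ForeachWriter

class HBaseForeachWriter extends ForeachWriter[MyType] {
  // Left empty in the constructor so the writer itself stays serializable.
  var client: Option[HBaseClient] = None

  def open(partitionId: Long, version: Long): Boolean = {
    client = Some(... open a client ...)
    true  // returning true tells Spark to process this partition
  }

  def process(record: MyType): Unit = client match {
    case None => throw new IllegalStateException("shouldn't happen")
    case Some(cl) => ... use cl to write record ...
  }

  // Called when the partition is finished, with or without an error.
  def close(errorOrNull: Throwable): Unit = client.foreach(cl => cl.close())
}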
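For completeness, here is a rough sketch of how a writer like the one above might be attached to a Kafka-backed stream. Everything specific in it is an assumption for illustration: DemoClient is a hypothetical stand-in for a real HBase client, the broker address localhost:9092 and the topic name events are placeholders, and the Kafka source is assumed to be on the classpath (the spark-sql-kafka-0-10 package).

```scala
import org.apache.spark.sql.{ForeachWriter, SparkSession}

// Hypothetical stand-in for a real HBase client; deliberately NOT Serializable.
class DemoClient {
  def put(value: String): Unit = println(s"put: $value")
  def close(): Unit = println("client closed")
}

// The client is created in open(), on the executor, never in the constructor,
// so the writer holds only None when Spark serializes it.
class DemoForeachWriter extends ForeachWriter[String] {
  private var client: Option[DemoClient] = None

  override def open(partitionId: Long, version: Long): Boolean = {
    client = Some(new DemoClient)
    true // process this partition
  }

  override def process(record: String): Unit =
    client.foreach(_.put(record))

  override def close(errorOrNull: Throwable): Unit = {
    client.foreach(_.close())
    client = None
  }
}

object ConsumeEachMessage {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("consume-each-message").getOrCreate()
    import spark.implicits._

    // Read the Kafka topic and keep only the message value as a string.
    val messages = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092") // assumed broker address
      .option("subscribe", "events")                       // assumed topic name
      .load()
      .selectExpr("CAST(value AS STRING)")
      .as[String]

    // Each record of each micro-batch partition goes through process().
    val query = messages.writeStream
      .foreach(new DemoForeachWriter)
      .start()

    query.awaitTermination()
  }
}
```

Because the sink works on a typed Dataset[String] here, the writer is a ForeachWriter[String]; with an untyped DataFrame it would be a ForeachWriter[Row] instead.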

Structured Streaming - consuming each message
