For my Spark 2.1.1 and Kafka 0.10.2.1 structured streaming example, I am able to get the foreach sink working. My stream source is configured to push 2 messages every 10 seconds. I can see the first few messages go through the foreach sink (open-process-close) construct. However, after that first push, the process never reads from the queue again. Any ideas?
import org.apache.spark.sql.functions.from_json
import spark.implicits._   // for the $"json" column syntax

val stream = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "kafka10.broker")
  .option("subscribe", src_topic)
  .load()

val df = stream.selectExpr("cast (value as string) as json")
  .select(from_json($"json", txnSchema).as("data"))
  .select("data.*")
Writer implementation:
import org.apache.spark.sql.{ForeachWriter, Row}

val writer = new ForeachWriter[Row] {
  override def open(partitionId: Long, version: Long): Boolean = {
    /*
    Prepare HBase connection
    Prepare Kafka producer
    */
    true
  }
  override def process(row: Row) = {
    try {
      /*
      Do biz logic
      get data from HBase
      at the end, write to a Kafka queue
      */
    }
    catch {
      case tt: Throwable =>
        rlog("Something else weird happened.")
    }
  }
  override def close(errorOrNull: Throwable) = {
    println("-------------------------------In close now. Checking whether it was called due to some error")
    if (errorOrNull != null)
      errorOrNull.printStackTrace()
    println("-------------------------------Closing HBase connection")
    // closing hbase connections
    println("-------------------------------Closing Kafka connection now")
    // closing kafka producer object
    // Other cleanup
    println("-------------------------------All done. Exiting now.")
  }
}
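For context, the query is wired to the writer roughly like this (a simplified sketch: the checkpoint location and trigger interval below are placeholders, not my exact settings):

import org.apache.spark.sql.streaming.ProcessingTime

val query = df.writeStream
  .foreach(writer)                                 // route each output row through the ForeachWriter above
  .option("checkpointLocation", "/tmp/chk/txn")    // placeholder path
  .trigger(ProcessingTime("10 seconds"))           // placeholder trigger interval
  .start()

query.awaitTermination()                           // keep the driver alive so the stream keeps running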
During my business-logic processing, I need to turn the data I read from HBase into a DataFrame for further processing, and my attempts at that have failed (a rough sketch of what I tried is below). Another gentleman mentioned that this is not "allowed", because that part of the code runs on the executors.
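Roughly, inside process() I was trying something along these lines (a simplified sketch: scanHbase and hbaseSchema are stand-ins for my actual HBase lookup and schema, and spark is the SparkSession created on the driver):

// inside ForeachWriter.process(row), i.e. running on an executor
val hbaseRows: Seq[Row] = scanHbase(row)          // hypothetical helper returning plain Rows from HBase
val hbaseDf = spark.createDataFrame(              // this is the step that fails: the SparkSession and
  spark.sparkContext.parallelize(hbaseRows),      // SparkContext only exist on the driver, not inside
  hbaseSchema)                                    // code running on the executors
// ... intended DataFrame-based processing here ...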
What would you suggest for this?