I have a Streaming Dataset in Spark with a specific schema. When I want to compute a query on it, I call:
StreamingQuery query = querydf
        .writeStream()
        .outputMode(OutputMode.Update())
        .format("console")
        .start();
query.awaitTermination();
This way I can see the query result on the console at every trigger. How can I write the result DataFrame to MongoDB? Writing it directly is not possible for a Streaming Dataset. Should I convert the streaming Dataset into a static Dataset at every trigger and then save that? What should I do?
Answer 0 (score: 0)
You can create a MongoDbSink:
import org.apache.spark.internal.Logging
import org.apache.spark.sql.catalyst.CatalystTypeConverters
import org.apache.spark.sql.execution.streaming.Sink
import org.apache.spark.sql.sources.{DataSourceRegister, StreamSinkProvider}
import org.apache.spark.sql.streaming.OutputMode
import org.apache.spark.sql.{DataFrame, Row, SQLContext}

class MongoDbSink(options: Map[String, String]) extends Sink with Logging {
  override def addBatch(batchId: Long, data: DataFrame): Unit = synchronized {
    val schema = data.schema
    // The DataFrame handed to addBatch wraps internal (Catalyst) rows, so
    // convert them back to ordinary Scala Rows before writing them out.
    val rdd = data.queryExecution.toRdd.mapPartitions { rows =>
      val converter = CatalystTypeConverters.createToScalaConverter(schema)
      rows.map(converter(_).asInstanceOf[Row])
    }
    // write RDD to MongoDB!!
  }
}
class MongoDbSinkProvider extends StreamSinkProvider with DataSourceRegister {
  def createSink(sqlContext: SQLContext,
                 parameters: Map[String, String],
                 partitionColumns: Seq[String],
                 outputMode: OutputMode): Sink = {
    new MongoDbSink(parameters)
  }

  def shortName(): String = "my-mongo-sink"
}
Then implement the write to MongoDB however you like; a minimal sketch is given below.
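For example, the "// write RDD to MongoDB!!" comment inside addBatch could be replaced by something like the following sketch. It assumes the MongoDB Java driver (mongodb-driver-sync) is on the classpath, that the rows hold only flat, BSON-compatible values, and that the option keys "uri", "database" and "collection" are names you choose yourself:

import com.mongodb.client.MongoClients
import org.bson.Document
import scala.collection.JavaConverters._

// Copy into locals so the closure captures these values,
// not the (non-serializable) Sink instance itself.
val opts = options
val fields = schema.fields

rdd.foreachPartition { rows =>
  // One client per partition; closed when the partition is done.
  val client = MongoClients.create(opts("uri"))
  try {
    val collection = client
      .getDatabase(opts("database"))
      .getCollection(opts("collection"))
    // Map each Row to a BSON Document field by field (flat schema assumed).
    val docs = rows.map { row =>
      val doc = new Document()
      fields.zipWithIndex.foreach { case (field, i) =>
        doc.append(field.name, row.get(i))
      }
      doc
    }.toList
    if (docs.nonEmpty) collection.insertMany(docs.asJava)
  } finally {
    client.close()
  }
}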
Finally, pass MongoDbSinkProvider to writeStream's .format().
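A sketch of the usage (the package name and option values are placeholders). Note that the short name "my-mongo-sink" only resolves if the provider is registered in a META-INF/services/org.apache.spark.sql.sources.DataSourceRegister file; otherwise use the fully-qualified class name:

import org.apache.spark.sql.streaming.OutputMode

val query = querydf
  .writeStream
  .format("com.example.MongoDbSinkProvider") // or "my-mongo-sink" if registered
  .option("uri", "mongodb://localhost:27017")
  .option("database", "mydb")
  .option("collection", "results")
  .outputMode(OutputMode.Update())
  .start()
query.awaitTermination()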