Writing a Spark streaming DataFrame to MongoDB

Date: 2018-06-11 10:51:12

Tags: java mongodb apache-spark spark-structured-streaming

I have a streaming Dataset in Spark with a given schema. When I want to run a query on it, I call:

StreamingQuery query = querydf
    .writeStream()
    .outputMode(OutputMode.Update())
    .format("console")
    .start();

query.awaitTermination();

This way I can see the query result in the console on every trigger. How can I write the resulting DataFrame to MongoDB? That is not possible directly on a streaming Dataset. Should I convert the streaming Dataset to a static Dataset on every trigger and then save it? What should I do?

1 Answer:

Answer 0 (score: 0)

You can create your own MongoDbSink:

import org.apache.spark.internal.Logging
import org.apache.spark.sql.catalyst.CatalystTypeConverters
import org.apache.spark.sql.execution.streaming.Sink
import org.apache.spark.sql.sources.{DataSourceRegister, StreamSinkProvider}
import org.apache.spark.sql.streaming.OutputMode
import org.apache.spark.sql.{DataFrame, Row, SQLContext}

class MongoDbSink(options: Map[String, String]) extends Sink with Logging {

  override def addBatch(batchId: Long, data: DataFrame): Unit = synchronized {
    val schema = data.schema
    // The DataFrame handed to a Sink is backed by Catalyst InternalRows,
    // so convert each partition back to ordinary Rows before writing them out.
    val rdd = data.queryExecution.toRdd.mapPartitions { rows =>
      val converter = CatalystTypeConverters.createToScalaConverter(schema)
      rows.map(converter(_).asInstanceOf[Row])
    }

    // write RDD to MongoDB!!
  }
}

class MongoDbSinkProvider extends StreamSinkProvider with DataSourceRegister {
  def createSink(sqlContext: SQLContext,
                 parameters: Map[String, String],
                 partitionColumns: Seq[String],
                 outputMode: OutputMode): Sink = {
    new MongoDbSink(parameters)
  }

  def shortName(): String = "my-mongo-sink"
}

Then implement the write to MongoDB however you like.
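For example, the "write RDD to MongoDB" placeholder inside addBatch could be filled in with a per-partition insert using the MongoDB Java driver. This is only a minimal sketch: it assumes the mongodb-driver-sync artifact is on the classpath and that hypothetical "uri", "database" and "collection" options are passed to the sink; rdd and schema are the values already defined in addBatch above, and the imports belong at the top of the file.

import com.mongodb.client.MongoClients
import org.bson.Document
import scala.collection.JavaConverters._

// Copy option values into local vals so the closure below does not capture the sink itself.
val uri  = options("uri")
val db   = options("database")
val coll = options("collection")

rdd.foreachPartition { partition =>
  // One client per partition; each Row is mapped field by field into a BSON Document
  // (assumes a flat schema whose values the driver can encode directly).
  val client = MongoClients.create(uri)
  try {
    val collection = client.getDatabase(db).getCollection(coll)
    val docs = partition.map { row =>
      val doc = new Document()
      schema.fields.zipWithIndex.foreach { case (field, i) =>
        doc.append(field.name, row.get(i).asInstanceOf[AnyRef])
      }
      doc
    }.toList
    if (docs.nonEmpty) collection.insertMany(docs.asJava)
  } finally {
    client.close()
  }
}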

Then specify the path of your MongoDbSinkProvider in the .format() of writeStream.
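A sketch of how the query from the question could then be wired to this sink; the package name com.example.streaming, the URI, and the option names are assumptions and should match whatever your sink actually reads:

import org.apache.spark.sql.streaming.OutputMode

// Reference the provider by its fully qualified class name (package assumed here).
val query = querydf
  .writeStream
  .outputMode(OutputMode.Update())
  .format("com.example.streaming.MongoDbSinkProvider")
  .option("uri", "mongodb://localhost:27017")
  .option("database", "mydb")
  .option("collection", "results")
  .option("checkpointLocation", "/tmp/mongo-sink-checkpoint")
  .start()

query.awaitTermination()

Alternatively, if the provider is registered in a META-INF/services/org.apache.spark.sql.sources.DataSourceRegister file, the shortName "my-mongo-sink" can be passed to format() instead of the full class name.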