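Can anyone point me to a good example of writing Avro to S3 or any other file system? I am using a custom sink, and I would like to pass a Map of properties through the SinkProvider's constructor so that it can be passed on to the sink, I guess?
Updated code: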
val query = df.mapPartitions { itr =>
  itr.map { row =>
    val rowInBytes = row.getAs[Array[Byte]]("value")
    MyUtils.deserializeAvro[GenericRecord](rowInBytes).toString
  }
}.writeStream
  .format("com.test.MyStreamingSinkProvider")
  .outputMode(OutputMode.Append())
  .queryName("testQ")
  .trigger(ProcessingTime("10 seconds"))
  .option("checkpointLocation", "my_checkpoint_dir")
  .start()
query.awaitTermination()
Sink provider:
class MyStreamingSinkProvider extends StreamSinkProvider {
  override def createSink(sqlContext: SQLContext, parameters: Map[String, String], partitionColumns: Seq[String], outputMode: OutputMode): Sink = {
    new MyStreamingSink
  }
}
Sink:
class MyStreamingSink extends Sink with Serializable {
  final val log: Logger = LoggerFactory.getLogger(classOf[MyStreamingSink])
  override def addBatch(batchId: Long, data: DataFrame): Unit = {
    // Save the batch as text files
    data.rdd.saveAsTextFile("path")
    log.warn(s"Total records processed: ${data.count()}")
    log.warn("Data saved.")
  }
}
Answer 0 (score: 1)
You should be able to pass the parameters via writeStream.option(key, value):
// start() returns a StreamingQuery, not a DataStreamWriter
val query = dataset.writeStream
  .format("com.test.MyStreamingSinkProvider")
  .outputMode(OutputMode.Append())
  .queryName("testQ")
  .trigger(ProcessingTime("10 seconds"))
  .option("key_1", "value_1")
  .option("key_2", "value_2")
  .start()
In this case, the parameters argument of MyStreamingSinkProvider.createSink(...) will contain key_1 and key_2.
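To get those values into the sink itself, one option is to hand the parameters map to the sink's constructor. The following is only a minimal sketch of that idea, reusing the class names from the question. The "path" option key and the outputPath field are assumptions made for illustration, and the write logic inside addBatch is deliberately left out.

```scala
import org.apache.spark.sql.{DataFrame, SQLContext}
import org.apache.spark.sql.execution.streaming.Sink
import org.apache.spark.sql.sources.StreamSinkProvider
import org.apache.spark.sql.streaming.OutputMode

// The sink receives the writer options through its constructor.
class MyStreamingSink(val options: Map[String, String]) extends Sink with Serializable {

  // "path" is a hypothetical option key chosen for this sketch; fail early if it is missing.
  val outputPath: String = options.getOrElse(
    "path",
    throw new IllegalArgumentException("Missing required option 'path'"))

  override def addBatch(batchId: Long, data: DataFrame): Unit = {
    // The write logic from the question would go here, using outputPath
    // (and any other entry of options) instead of a hard-coded path.
  }
}

class MyStreamingSinkProvider extends StreamSinkProvider {
  override def createSink(
      sqlContext: SQLContext,
      parameters: Map[String, String],
      partitionColumns: Seq[String],
      outputMode: OutputMode): Sink = {
    // parameters holds everything set with .option(key, value) on the writer.
    new MyStreamingSink(parameters)
  }
}
```

With this in place, adding something like .option("path", "my_output_dir") to the writeStream chain would make the value available as outputPath inside the sink. Depending on the Spark version, option keys may be looked up case-insensitively or stored in lower case, so all-lowercase keys are probably the safest choice.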