// Streaming read from Kafka
val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", brokerPool)
  .option("subscribe", topic)
  .option("startingOffsets", "latest")
  .load()
// Trying to decode it to a Dataset of type MyCaseClass (Avro-generated class).
// Note: mapPartitions requires an implicit Encoder[MyCaseClass] in scope.
val result: Dataset[MyCaseClass] = df.mapPartitions { partition =>
  // Create the readers once per partition rather than once per row
  val flumeReader = new SpecificDatumReader[AvroFlumeEvent](classOf[AvroFlumeEvent])
  val datumReader = new SpecificDatumReader[MyCaseClass](classOf[MyCaseClass])
  partition
    .map { row =>
      // Outer layer: the Kafka value holds a serialized AvroFlumeEvent
      val flumeBytes = row.getAs[Array[Byte]]("value")
      val flumeBinaryDecoder =
        DecoderFactory.get.binaryDecoder(flumeBytes, null)
      flumeReader.read(null, flumeBinaryDecoder)
    }
    .map { flumeRecord =>
      // Inner layer: the Flume event body holds the serialized MyCaseClass record
      val recordBytes = flumeRecord.getBody
      val recordBinaryDecoder =
        DecoderFactory.get.binaryDecoder(recordBytes.array(), null)
      datumReader.read(null, recordBinaryDecoder)
    }
}
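// Streaming write to HDFS
// ("startingOffsets" is a Kafka source option, so it is not set on the sink)
val streamingQuery = result.writeStream
  .format("parquet")
  .option("path", "/foo/bar")
  .option("checkpointLocation", "bar/foo")
  .start()

// start() returns the StreamingQuery; block on it separately
streamingQuery.awaitTermination()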
When I try to read the output with parquet-tools, I can only see binary data.
The only way I can see the actual data is by reading it like this:
result.map(_.getName.toString)(Encoders.STRING)
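
The single-field projection above might be generalised by copying each Avro record into a plain case class before writing. This is only a sketch, and it assumes the binary output comes from `result` carrying a binary (for example Kryo) encoder for the Avro class, which the code shown does not confirm. `AvroLike`, `FlatRecord` and `Flatten.toFlat` are hypothetical names; `AvroLike` stands in for the Avro-generated class, and `getName` is the only accessor taken from the original.

```scala
// Hypothetical stand-in for the Avro-generated class.
// Only getName appears in the original code.
trait AvroLike {
  def getName: CharSequence
}

// Plain case class with simple field types. Spark can derive a
// columnar encoder for a top-level case class like this one.
final case class FlatRecord(name: String)

object Flatten {
  // Copy Avro fields into plain Scala types.
  // Avro strings are CharSequence values, hence the toString call.
  def toFlat(r: AvroLike): FlatRecord = FlatRecord(r.getName.toString)
}
```

With `import spark.implicits._` in scope, something like `result.map(r => FlatRecord(r.getName.toString))` should give a `Dataset[FlatRecord]`, and writing that with the same `writeStream` call would be expected to produce Parquet files with a readable `name` column.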