Converting a FlumeEvent to a Case Class

Date: 2018-12-18 23:09:34

Tags: apache-spark spark-streaming

//Streaming read from Kafka
val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", brokerPool)
  .option("subscribe", topic)
  .option("startingOffsets", "latest")
  .load()
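
For context, the Kafka source produces rows with a fixed schema; the Avro-encoded Flume event arrives in the binary value column. The sketch below just prints that schema:

//The Kafka source schema is fixed; the Flume event bytes live in `value`
df.printSchema()
// root
//  |-- key: binary (nullable = true)
//  |-- value: binary (nullable = true)
//  |-- topic: string (nullable = true)
//  |-- partition: integer (nullable = true)
//  |-- offset: long (nullable = true)
//  |-- timestamp: timestamp (nullable = true)
//  |-- timestampType: integer (nullable = true)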

//Trying to decode it to a Dataset of type MyCaseClass (an Avro-generated class)
import org.apache.avro.io.DecoderFactory
import org.apache.avro.specific.SpecificDatumReader
import org.apache.flume.source.avro.AvroFlumeEvent
import org.apache.spark.sql.Dataset

val result: Dataset[MyCaseClass] = df.mapPartitions(
  partition => {
    //One reader per partition, reused across all rows in it
    val flumeReader = new SpecificDatumReader[AvroFlumeEvent](classOf[AvroFlumeEvent])
    val datumReader = new SpecificDatumReader[MyCaseClass](classOf[MyCaseClass])
    partition
      .map(
        row => {
          val flumeBytes = row.getAs[Array[Byte]]("value")
          val flumeBinaryDecoder =
            DecoderFactory.get.binaryDecoder(flumeBytes, null)
          flumeReader.read(null, flumeBinaryDecoder)
        }
      )
      .map(flumeRecord => {
        val recordBytes = flumeRecord.getBody
        val recordBinaryDecoder =
          DecoderFactory.get.binaryDecoder(recordBytes.array(), null)
        datumReader.read(null, recordBinaryDecoder)
      })
  }
)
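
One thing worth noting: for mapPartitions to compile, an implicit Encoder[MyCaseClass] must be in scope. If MyCaseClass is an Avro-generated SpecificRecord rather than a plain case class, Spark cannot derive a product encoder for it, and a common fallback is a Kryo encoder, sketched below. A Kryo encoder serializes the whole object into a single binary column, which is one plausible explanation for the opaque bytes seen later in the Parquet output:

import org.apache.spark.sql.{Encoder, Encoders}

//Fallback for classes Spark cannot derive an encoder for: Kryo-serialize
//the whole object into a single `value: binary` column
implicit val myCaseClassEncoder: Encoder[MyCaseClass] = Encoders.kryo[MyCaseClass]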

//Streaming write to HDFS
val streamingQuery = result.writeStream
  .format("parquet")
  .option("path", "/foo/bar")
  .option("checkpointLocation", "bar/foo")
  .start()

streamingQuery.awaitTermination()
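
A quick way to check what schema the Parquet sink will write is to print the Dataset's schema before starting the query; with a Kryo-backed encoder it collapses to a single binary column:

result.printSchema()
// With a Kryo-backed encoder this typically prints:
// root
//  |-- value: binary (nullable = true)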

When I try to read the output back with parquet-tools, all I see is binary data.

The only way I've found to see the actual data is to read it like this:

result.map(_.getName.toString)(Encoders.STRING)
  1. Is there a better way to do this?
  2. How would I read a more complex data structure, such as a Map?
  3. Beyond that, can I convert the AvroFlumeEvent directly into a Scala case class? If so, how? (See the sketch after this list.)
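
On question 3, one direction is to define a plain Scala case class mirroring the Avro schema and map each deserialized record into it; with spark.implicits._ in scope, Spark derives a product encoder and the Parquet files get real typed columns, including Map fields, which also covers question 2. A minimal sketch, assuming a hypothetical schema with a name string and an attributes map; the field names and getters here are illustrative, not from the original code:

import scala.collection.JavaConverters._
import org.apache.spark.sql.Dataset

//Hypothetical case class mirroring the Avro schema (field names assumed)
case class MyEvent(name: String, attributes: Map[String, String])

import spark.implicits._  //provides the product encoder for MyEvent

val typed: Dataset[MyEvent] = result.map { rec =>
  MyEvent(
    rec.getName.toString,  //Avro strings come back as CharSequence/Utf8
    //Avro maps also use CharSequence keys and values; convert both to String
    rec.getAttributes.asScala.map { case (k, v) => k.toString -> v.toString }.toMap
  )
}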

0 Answers:

No answers yet