apache-spark - How to partition a Spark Dataset by a column and save it with a different Avro schema

Records are stored in avroA under a generic Avro schema, schemaA, as { shard, payload } records, and the payload itself has the Avro schema schemaB.

I want to partition the data stored in avroA by shard and store it in avroB using schemaB. How can I do that?

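  // Read the { shard, payload } records written with schemaA
  Dataset<Row> avroDs = spark.read()
                             .format("avro")
                             .load(avroA);
  String schemaB = "{ json schema }";  // placeholder for the schemaB JSON
  // Attempt: write one directory per shard, asking for schemaB on output
  avroDs.select("shard", "payload")
        .write()
        .partitionBy("shard")
        .format("avro")
        .option("avroSchema", schemaB)
        .save(avroB);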

The above will most likely fail, because the selected columns conform to schemaA rather than schemaB.
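
One direction that might work, as an untested sketch rather than a confirmed solution: reshape the rows so that the payload fields become the top-level columns before writing. `partitionBy("shard")` moves the shard value into the directory name (`shard=...`) and leaves it out of the data files, so what gets written is just the payload fields. The sketch assumes the column is really named `payload`, as in the { shard, payload } description, and covers two guesses about its type: a nested record (struct), or Avro-encoded bytes that need `from_avro` from the spark-avro module (the Spark 3.x `org.apache.spark.sql.avro.functions` entry point). The class and method names are made up for illustration.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

import static org.apache.spark.sql.avro.functions.from_avro;
import static org.apache.spark.sql.functions.col;

// Hypothetical helper class; requires spark-sql and spark-avro on the classpath.
public class ReshardAvro {

    // Assumption 1: payload is a nested record (a struct column) inside schemaA.
    public static void writeStructPayload(SparkSession spark, String avroA, String avroB) {
        Dataset<Row> avroDs = spark.read().format("avro").load(avroA);
        avroDs.select("shard", "payload.*")  // lift the payload fields to top-level columns
              .write()
              .partitionBy("shard")          // shard becomes the directory name, not a file column
              .format("avro")
              .save(avroB);
    }

    // Assumption 2: payload is a binary column holding Avro-encoded schemaB records.
    public static void writeBinaryPayload(SparkSession spark, String avroA, String avroB,
                                          String schemaB) {
        Dataset<Row> avroDs = spark.read().format("avro").load(avroA);
        avroDs.select(col("shard"), from_avro(col("payload"), schemaB).as("record"))
              .select("shard", "record.*")   // decoded fields become top-level columns
              .write()
              .partitionBy("shard")
              .format("avro")
              .save(avroB);
    }
}
```

Once the written columns line up with schemaB, the `avroSchema` option from the attempt above can presumably be added back to pin the exact output schema, though I have not verified that here.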

Any pointers would be helpful.

Thanks in advance.

0 answers: