Question

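I am new to Spark and I want to make a change to an existing protobuf message. After making the change, I want to map that protobuf message to a Spark Dataset Row. The protobuf message is complex and deeply nested.

I don't want to create the schema by hand and then copy the values over, because that is tedious and hard to code.

Something like this: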

Dataset<Row> events = spark
        .readStream()
        .format("kafka")
        .load();

//call mapper
events.mapPartitions()

....
...


//mapper code

ProtoMessage.message

//create schema
StructType SCHEMA = new StructType()
        .add("value1", DataTypes.StringType, false)
        .add("value2", DataTypes.StringType, false);


//create columns
Object[] columns = {
        message.getValue1(),
        message.getValue2()
        ....

};

//create a row
Stream.<Row>of(new GenericRowWithSchema(columns, SCHEMA));

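But I don't know the exact number of columns (well, I do, but writing all of that code by hand is practically impossible). Basically, what I want to do is take the protobuf, change one field, and then convert the whole object into a Dataset Row.

I have looked into sparksql-protobuf, but I also want the values to be copied over once the schema has been inferred.

Thanks for your help!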
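To make the goal concrete, here is a minimal sketch of the idea of deriving both the schema and the column values from the message itself, in one recursive walk, instead of listing every field by hand. It uses only the JDK: a nested `Map<String, Object>` stands in for the protobuf message, and the class and method names (`SchemaAndValuesSketch`, `schemaOf`, `valuesOf`) are hypothetical. With a real protobuf message the field list would presumably come from its descriptor (`getDescriptorForType().getFields()`) and the values from `getField(...)`, and the results would feed a `StructType` and a `Row` rather than a string and an `Object[]`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration: a nested Map stands in for a protobuf message.
public class SchemaAndValuesSketch {

    // Builds a schema description such as struct<a:String,b:struct<c:Integer>>
    // by visiting the fields in their declared order.
    public static String schemaOf(Map<String, Object> message) {
        List<String> parts = new ArrayList<>();
        for (Map.Entry<String, Object> field : message.entrySet()) {
            parts.add(field.getKey() + ":" + typeOf(field.getValue()));
        }
        return "struct<" + String.join(",", parts) + ">";
    }

    @SuppressWarnings("unchecked")
    private static String typeOf(Object value) {
        if (value == null) {
            return "null";
        }
        if (value instanceof Map) {
            // A nested message becomes a nested struct.
            return schemaOf((Map<String, Object>) value);
        }
        return value.getClass().getSimpleName();
    }

    // Collects the values in the same field order as schemaOf;
    // a nested message becomes a nested Object[].
    @SuppressWarnings("unchecked")
    public static Object[] valuesOf(Map<String, Object> message) {
        Object[] values = new Object[message.size()];
        int i = 0;
        for (Object value : message.values()) {
            values[i++] = (value instanceof Map)
                    ? valuesOf((Map<String, Object>) value)
                    : value;
        }
        return values;
    }

    public static void main(String[] args) {
        Map<String, Object> inner = new LinkedHashMap<>();
        inner.put("value2", "b");

        Map<String, Object> message = new LinkedHashMap<>();
        message.put("value1", "a");
        message.put("nested", inner);

        System.out.println(schemaOf(message));
        System.out.println(Arrays.deepToString(valuesOf(message)));
    }
}
```

Because the schema and the values come from the same traversal, they stay aligned no matter how many fields the message has, which is the property I am looking for in the Spark case.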

I want to change a value in a protobuf and then convert the protobuf to a Spark Dataset Row

0 answers: