Kafka Streams: adding dynamic fields to an Avro record at runtime

Date: 2017-09-25 22:20:19

Tags: scala apache-kafka avro apache-kafka-streams

I want to implement a configurable Kafka Streams application that reads a row of data and applies a list of transforms, such as applying a function to a record's field, renaming a field, and so on. The stream should be fully configurable, so I can specify which transforms are applied to which field. I use Avro to encode the data as GenericRecords. My problem is that I also need transforms that create new columns. Instead of overwriting a field's previous value, they should append a new field to the record, which means the record's schema changes. The solution I have come up with so far is to first iterate over the list of transforms to determine which fields I need to add to the schema, and then create a new schema containing both the old fields and the new ones.

The list of transforms (there is always a source field that is passed to the transform method, and the result is then written back to targetField):

val transforms: List[Transform] = List(
    FieldTransform(field = "referrer", targetField = "referrer", method = "mask"),
    FieldTransform(field = "name", targetField = "name_clean", method = "replaceUmlauts")
)

case class FieldTransform(field: String, targetField: String, method: String)
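The `method` field is just a string, so at some point it has to be resolved to an actual function. A minimal sketch of such a lookup table, with hypothetical implementations for the `mask` and `replaceUmlauts` methods named above (the implementations are assumptions, not taken from the question):

```scala
// Sketch: resolve a transform method name to a function.
// The concrete behavior of "mask" and "replaceUmlauts" is assumed here.
object TransformRegistry {
  private val methods: Map[String, String => String] = Map(
    "mask" -> (s => "*" * s.length), // assumed: mask every character
    "replaceUmlauts" -> (s =>
      s.replace("ä", "ae").replace("ö", "oe").replace("ü", "ue").replace("ß", "ss"))
  )

  // Fall back to the identity function for unknown method names.
  def apply(method: String, value: String): String =
    methods.getOrElse(method, (s: String) => s)(value)
}
```

For example, `TransformRegistry("replaceUmlauts", "Müller")` yields `"Mueller"`, and an unknown method name leaves the value untouched.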

The method that creates the new schema based on the old schema and the list of transforms:

import org.apache.avro.{Schema, SchemaBuilder}
import scala.collection.JavaConverters._

def getExtendedSchema(schema: Schema, transforms: List[Transform]): Schema = {
  val newSchema = SchemaBuilder
    .builder(schema.getNamespace)
    .record(schema.getName)
    .fields()

  // create a new schema with the existing fields from the old schema plus
  // the new fields which are created through transforms
  val fields = schema.getFields.asScala.toList ++ getNewFields(schema, transforms)

  fields
    .foldLeft(newSchema)((assembler, field: Schema.Field) => {
      assembler
        .name(field.name)
        .`type`(field.schema())
        .noDefault()
        // TODO: find a way to differentiate between explicitly set null defaults
        // and fields which have no default
        //.withDefault(field.defaultValue())
    })
    .endRecord()
}
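Regarding the TODO in the method above: Avro distinguishes a field with no default at all from a field whose default is explicitly null, so the copy logic needs three cases, not two. A sketch of modeling that decision with a small ADT before choosing the builder call (all type and method names here are made up for illustration):

```scala
// Sketch: the three distinct default states a copied field can be in.
sealed trait FieldDefault
case object NoDefault extends FieldDefault           // no default was set
case object ExplicitNullDefault extends FieldDefault // default explicitly set to null
final case class SomeDefault(value: Any) extends FieldDefault

// The builder call each state should translate to (as a string, for clarity).
def builderCall(d: FieldDefault): String = d match {
  case NoDefault           => "noDefault()"
  case ExplicitNullDefault => "withDefault(null)"
  case SomeDefault(v)      => s"withDefault($v)"
}
```

Classifying each source field into one of these states first makes the fold over the `FieldAssembler` a simple three-way match instead of an ambiguous null check.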



def getNewFields(schema: Schema, transforms: List[Transform]): List[Schema.Field] = {
  transforms
    .filter { // only select targetFields which are not in the schema
      case FieldTransform(field, targetField, method) => schema.getField(targetField) == null
      case _ => false
    }
    .distinct
    .map { // create a new Field object for each targetField
      case FieldTransform(field, targetField, method) =>
        val sourceField = schema.getField(field)
        new Schema.Field(targetField, sourceField.schema(), sourceField.doc(), sourceField.defaultValue())
    }
}

Instantiating the new GenericRecord based on the old record:
 val extendedSchema = getExtendedSchema(row.getSchema, transforms)
 val extendedRow = new GenericData.Record(extendedSchema)

 for (field <- row.getSchema.getFields) {
     extendedRow.put(field.name, row.get(field.name))
 }
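Copying the old fields is only half the job; the transforms still have to be applied to fill the target fields. A self-contained sketch of that step, using a plain `Map[String, Any]` to stand in for the `GenericRecord` and taking the method lookup as a parameter (everything beyond the `FieldTransform` case class is illustrative, not the question's actual code):

```scala
// From the question: describes one transform.
case class FieldTransform(field: String, targetField: String, method: String)

// Sketch: apply each transform by reading the source field, running the
// method's function, and writing the result to the target field.
def applyTransforms(
    row: Map[String, Any],
    transforms: List[FieldTransform],
    methods: Map[String, String => String]
): Map[String, Any] =
  transforms.foldLeft(row) { (acc, t) =>
    acc.get(t.field) match {
      case Some(v: String) => acc + (t.targetField -> methods(t.method)(v))
      case _               => acc // missing or non-string source field: skip
    }
  }
```

With a hypothetical `methods = Map("mask" -> ((s: String) => "*" * s.length))`, a transform whose `targetField` equals `field` overwrites the value in place, while a different `targetField` appends a new key, mirroring the overwrite-vs-append distinction in the question.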

I have tried to look for other solutions, but could not find any example that changes the data types. I feel there must be a simpler, cleaner solution for handling Avro schemas that change at runtime. Any ideas are appreciated.

Thanks, Paul

1 answer:

Answer 0 (score: 0)

I have implemented passing dynamic values into your Avro schema and validating them against the union types in the schema.

Example:

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;
import org.json.JSONObject;
import org.springframework.http.HttpEntity;
import org.springframework.http.HttpHeaders;
import org.springframework.http.HttpMethod;
import org.springframework.http.MediaType;
import org.springframework.http.ResponseEntity;
import org.springframework.web.client.RestTemplate;

// fetch the schema for the topic from the schema registry
RestTemplate template = new RestTemplate();
HttpHeaders headers = new HttpHeaders();
headers.setContentType(MediaType.APPLICATION_JSON);
HttpEntity<String> entity = new HttpEntity<String>(headers);
ResponseEntity<String> response = template.exchange(registryUrl + "/subjects/" + topic + "/versions/" + version, HttpMethod.GET, entity, String.class);
String responseData = response.getBody();
JSONObject jsonObject = new JSONObject(responseData);
JSONObject jsonObjectResult = new JSONObject(jsonResult); // add your json string which you will pass from postman
String getData = jsonObject.get("schema").toString();

// parse the schema and populate a GenericRecord from the incoming json
Schema.Parser parser = new Schema.Parser();
Schema schema = parser.parse(getData);
GenericRecord genericRecord = new GenericData.Record(schema);
schema.getFields().stream().forEach(field ->
    genericRecord.put(field.name(), jsonObjectResult.get(field.name())));

// validate the populated record against the schema
GenericDatumReader<GenericRecord> reader = new GenericDatumReader<GenericRecord>(schema);
boolean data = reader.getData().validate(schema, genericRecord);