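I have an Avro file whose records contain fields (with union types) that hold other records, and those nested records in turn have fields with union types. Some of those types carry a connect.name property, and I need to check whether it equals io.debezium.time.NanoTimestamp. I am doing this in Apache NiFi, using an ExecuteScript processor with a Groovy script.
A simplified example of the Avro schema: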
{
  "type": "record",
  "name": "Envelope",
  "namespace": "data.none.bpm.pruitsmdb_nautilus_dbo.fast_frequency_tables.avro.test",
  "fields": [
    {
      "name": "before",
      "type": [
        "null",
        {
          "type": "record",
          "name": "Value",
          "fields": [
            {
              "name": "Id",
              "type": {
                "type": "string",
                "connect.parameters": {
                  "__debezium.source.column.type": "UNIQUEIDENTIFIER",
                  "__debezium.source.column.length": "36"
                }
              }
            },
            {
              "name": "CreatedOn",
              "type": [
                "null",
                {
                  "type": "long",
                  "connect.version": 1,
                  "connect.parameters": {
                    "__debezium.source.column.type": "DATETIME2",
                    "__debezium.source.column.length": "27",
                    "__debezium.source.column.scale": "7"
                  },
                  "connect.name": "io.debezium.time.NanoTimestamp"
                }
              ],
              "default": null
            },
            {
              "name": "CreatedById",
              "type": [
                "null",
                {
                  "type": "string",
                  "connect.parameters": {
                    "__debezium.source.column.type": "UNIQUEIDENTIFIER",
                    "__debezium.source.column.length": "36"
                  }
                }
              ],
              "default": null
            }
          ],
          "connect.name": "data.none.bpm.pruitsmdb_nautilus_dbo.fast_frequency_tables.avro.test.Value"
        }
      ],
      "default": null
    },
    {
      "name": "after",
      "type": [
        "null",
        "Value"
      ],
      "default": null
    },
    {
      "name": "source",
      "type": {
        "type": "record",
        "name": "Source",
        "namespace": "io.debezium.connector.sqlserver",
        "fields": [
          {
            "name": "version",
            "type": "string"
          },
          {
            "name": "ts_ms",
            "type": "long"
          },
          {
            "name": "snapshot",
            "type": [
              {
                "type": "string",
                "connect.version": 1,
                "connect.parameters": {
                  "allowed": "true,last,false"
                },
                "connect.default": "false",
                "connect.name": "io.debezium.data.Enum"
              },
              "null"
            ],
            "default": "false"
          }
        ],
        "connect.name": "io.debezium.connector.sqlserver.Source"
      }
    },
    {
      "name": "op",
      "type": "string"
    },
    {
      "name": "ts_ms",
      "type": [
        "null",
        "long"
      ],
      "default": null
    }
  ],
  "connect.name": "data.none.bpm.pruitsmdb_nautilus_dbo.fast_frequency_tables.avro.test.Envelope"
}
My Groovy code apparently checks only the top-level record, and I am not sure I am checking the connect.name property correctly:
reader.forEach{ GenericRecord record ->
    record.getSchema().getFields().forEach{ Schema.Field field ->
        try {
            field.schema().getTypes().forEach{ Schema typeSchema ->
                if(typeSchema.getProp("connect.name") == "io.debezium.time.NanoTimestamp"){
                    record.put(field.name(), Long(record.get(field.name()).toString().substring(0, 13)))
                    typeSchema.addProp("logicalType", "timestamp-millis")
                }
            }
        } catch(Exception ex){
            println("Catching the exception")
        }
    }
    writer.append(record)
}
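My question: how do I traverse all the nested records in the Avro file (top-level record fields that are themselves of type "record", and the records inside them)? And while traversing their fields, how do I correctly check whether one of a field's types (which may be a union) has the property connect.name == io.debezium.time.NanoTimestamp and, if so, convert the field value and add the logicalType property to the field's type?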
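Answer 0: (score: 1)
I think what you are looking for here is recursion: write a function that takes a Record as its parameter, and call it recursively whenever you hit a field that holds a nested record.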
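To make the idea concrete, here is a minimal sketch that recurses over the schema rather than the data and lists every field whose type, or any branch of its union, carries the NanoTimestamp connect.name. The helper name findNanoTimestampFields, the trimmed-down schema, and the @Grab coordinates are assumptions made for illustration; inside NiFi, Avro is normally already available to the script, so the @Grab line would not be needed. Arrays and maps are not descended into.

```groovy
@Grab('org.apache.avro:avro:1.11.3') // assumption: only needed when run outside NiFi
import org.apache.avro.Schema

// Hypothetical helper: returns the dotted paths of all fields whose type
// (or any union branch) has connect.name == io.debezium.time.NanoTimestamp.
List<String> findNanoTimestampFields(Schema schema, String path = '',
                                     Set<String> inProgress = new HashSet<String>()) {
    List<String> found = []
    switch (schema.getType()) {
        case Schema.Type.UNION:
            // check every branch of the union
            schema.getTypes().each { found.addAll(findNanoTimestampFields(it, path, inProgress)) }
            break
        case Schema.Type.RECORD:
            // guard against record types that reference themselves
            if (!inProgress.add(schema.getFullName())) break
            schema.getFields().each { Schema.Field f ->
                String childPath = path ? "${path}.${f.name()}" : f.name()
                found.addAll(findNanoTimestampFields(f.schema(), childPath, inProgress))
            }
            inProgress.remove(schema.getFullName())
            break
        default:
            if (schema.getProp('connect.name') == 'io.debezium.time.NanoTimestamp') found << path
    }
    return found
}

// Trimmed-down version of the schema from the question (an assumption for the demo)
def schemaJson = '''
{"type":"record","name":"Envelope","fields":[
  {"name":"before","type":["null",{"type":"record","name":"Value","fields":[
    {"name":"Id","type":"string"},
    {"name":"CreatedOn","type":["null",
      {"type":"long","connect.name":"io.debezium.time.NanoTimestamp"}],"default":null}
  ]}],"default":null},
  {"name":"after","type":["null","Value"],"default":null},
  {"name":"op","type":"string"}
]}
'''
def paths = findNanoTimestampFields(new Schema.Parser().parse(schemaJson))
println paths
```

For this schema the list should contain before.CreatedOn and after.CreatedOn, because the after field reuses the Value record by name. Walking the schema finds the fields even when a nested record value happens to be null in the data.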
Answer 1: (score: 1)
Jiri's suggested approach works. Using a recursive function, here is the complete code:
import org.apache.avro.*
import org.apache.avro.file.*
import org.apache.avro.generic.*

// Define the input and output files
DataInputStream inputStream = new File('input.avro').newDataInputStream()
DataOutputStream outputStream = new File('output.avro').newDataOutputStream()
DataFileStream<GenericRecord> reader = new DataFileStream<>(inputStream, new GenericDatumReader<GenericRecord>())
DataFileWriter<GenericRecord> writer = new DataFileWriter<>(new GenericDatumWriter<GenericRecord>())
def contentSchema = reader.schema // source Avro schema, adjusted in place during conversion
def records = [] // temporarily stores the processed records

// If typeSchema is a Debezium NanoTimestamp, tag it as timestamp-millis and truncate the value
def convertIfNanoTimestamp(record, Schema.Field field, Schema typeSchema) {
    if (typeSchema.getProp("connect.name") == "io.debezium.time.NanoTimestamp") {
        typeSchema.addProp("logicalType", "timestamp-millis")
        def value = record.get(field.name())
        if (value != null) { // a nullable union field may hold null
            // keep the first 13 digits (milliseconds) of the nanosecond value
            record.put(field.name(), Long.valueOf(value.toString().substring(0, 13)))
        }
    }
}

// Traverses all records, including nested ones
def convertAvroNanosecToMillisec(record) {
    record.getSchema().getFields().forEach { Schema.Field field ->
        def value = record.get(field.name())
        if (value instanceof GenericData.Record) {
            convertAvroNanosecToMillisec(value)
        }
        if (field.schema().getType() == Schema.Type.UNION) {
            field.schema().getTypes().forEach { Schema unionTypeSchema ->
                convertIfNanoTimestamp(record, field, unionTypeSchema)
            }
        } else {
            convertIfNanoTimestamp(record, field, field.schema())
        }
    }
    return record
}

// Read all records from the input file into the temporary list
reader.forEach { GenericRecord contentRecord ->
    records.add(convertAvroNanosecToMillisec(contentRecord))
}
reader.close()

// Create the file writer with the adjusted schema
writer.create(contentSchema, outputStream)
// Append the records from the temporary list to the output file, then close the writer
records.forEach { GenericRecord contentRecord ->
    writer.append(contentRecord)
}
writer.close()
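One caveat about substring(0, 13): it only yields milliseconds when the nanosecond value has exactly 19 digits, which holds for timestamps from late 2001 onward. A hedged alternative, not part of the original answer, is to divide by 1,000,000 instead. The sketch below uses a hypothetical helper named nanosToMillis and two made-up sample values to compare the two approaches.

```groovy
// Hypothetical helper: nanoseconds since the epoch -> milliseconds since the epoch.
// Math.floorDiv rounds toward negative infinity, so pre-1970 values stay consistent.
long nanosToMillis(long nanos) {
    return Math.floorDiv(nanos, 1_000_000L)
}

// Made-up sample values for illustration
long recent = 1_700_000_000_123_456_789L // 19 digits (year 2023)
long early  = 946_684_800_123_456_789L   // 18 digits (2000-01-01 UTC)

// For a 19-digit value both approaches agree
println nanosToMillis(recent)
println Long.valueOf(recent.toString().substring(0, 13))

// For an 18-digit value the substring keeps one digit too many
println nanosToMillis(early)
println Long.valueOf(early.toString().substring(0, 13))
```

In the script above, the record.put call could pass nanosToMillis(value as long) in place of the substring expression, assuming the field always holds a long.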