Question

I want to write Avro records with Spark 2.2.0, where the schema has a namespace and contains some nested records.

{
    "type": "record",
    "name": "userInfo",
    "namespace": "my.example",
    "fields": [
        {
            "name": "username",
            "type": "string"
        },
        {
            "name": "address",
            "type": [
                "null",
                {
                    "type": "record",
                    "name": "address",
                    "fields": [
                        {
                            "name": "street",
                            "type": [
                                "null",
                                "string"
                            ],
                            "default": null
                        },
                        {
                            "name": "box",
                            "type": [
                                "null",
                                {
                                    "type": "record",
                                    "name": "box",
                                    "fields": [
                                        {
                                            "name": "id",
                                            "type": "string"
                                        }
                                    ]
                                }
                            ],
                            "default": null
                        }
                    ]
                }
            ],
            "default": null
        }
    ]
}

I need to write out the following record:

{
    "username": "tom taylor",
    "address": {
        "my.example.address": {
            "street": {
                "string": "unknown"
            },
            "box": {
                "my.example.box": {
                    "id": "id1"
                }
            }
        }
    }
}

However, when I read some Avro GenericRecords with spark-avro (4.0.0), apply some transformations (for example, adding a namespace), and then try to write the output:

df.foreach {
    ...
    .write
    .option("recordName", "userInfo")
    .option("recordNamespace", "my.example")
    ...
}

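then in the resulting GenericRecord the namespace of each nested record contains the "full path" from the parent down to that element. That is, I get my.example.address.box instead of my.example.box. When I try to read the records back with the schema above, the record names do not match.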
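For reference, the names the reader schema expects can be checked directly with the Avro Java library, which spark-avro already pulls in. The following is only a minimal sketch: the object name NamespaceCheck and the helper recordBranch are made up for illustration, and the schema is an abbreviated copy of the one above (the street field is left out). It relies on the Avro rule that a nested named type without its own "namespace" attribute inherits the namespace of the enclosing record.

```scala
import org.apache.avro.Schema
import scala.collection.JavaConverters._

// Hypothetical helper object, not part of spark-avro or Avro.
object NamespaceCheck {
  // Abbreviated copy of the schema from the question (street field omitted).
  val schemaJson: String =
    """{
      |  "type": "record", "name": "userInfo", "namespace": "my.example",
      |  "fields": [
      |    {"name": "username", "type": "string"},
      |    {"name": "address", "default": null, "type": ["null", {
      |      "type": "record", "name": "address",
      |      "fields": [
      |        {"name": "box", "default": null, "type": ["null", {
      |          "type": "record", "name": "box",
      |          "fields": [{"name": "id", "type": "string"}]
      |        }]}
      |      ]
      |    }]}
      |  ]
      |}""".stripMargin

  // Returns the record branch of a ["null", record] union field.
  def recordBranch(parent: Schema, field: String): Schema =
    parent.getField(field).schema().getTypes.asScala
      .find(_.getType == Schema.Type.RECORD).get

  def main(args: Array[String]): Unit = {
    val root    = new Schema.Parser().parse(schemaJson)
    val address = recordBranch(root, "address")
    val box     = recordBranch(address, "box")

    // Nested records declare no namespace, so they inherit "my.example".
    println(root.getFullName)    // my.example.userInfo
    println(address.getFullName) // my.example.address
    println(box.getFullName)     // my.example.box
  }
}
```

Running the same check against the schema embedded in a file written by spark-avro should make the mismatch visible: per the behaviour described above, the nested record would report my.example.address.box rather than my.example.box.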

What is the correct way to define the namespace for the writer?

Namespace generation for nested structures in Spark Avro records
