Need help inferring an Avro schema for a JSON file in NiFi

Date: 2017-05-22 01:17:10

Tags: json avro apache-nifi

I am trying to create a flow in NiFi that takes a valid JSON file and puts it directly into a Hive table using the PutHiveStreaming processor. My JSON looks like this:

{
"Raw_Json": {
    "SystemInfo": {
        "Id": "a string ID",
        "TM": null,
        "CountID": "a string ID",
        "Topic": null,
        "AccountID": "some number",
        "StationID": "some number",
        "STime": "some Timestamp",
        "ETime": "some Timestamp"
    },
    "Profile": {
        "ID": "ID number",
        "ProductID": "Some Number",
        "City": "City Name",
        "State": "State Name",
        "Number": "XXX-XXX-XXXX",
        "ExtNumber": null,
        "Unit": null,
        "Name": "Person Name",
        "Service": "Purchase",
        "AddrID": "00000000",
        "Products": {
            "Product": [{
                "Code": "CODE",
                "Description": "some description"

            },
            {
                "Code": "CODE",
                "Description": "some description"

            },
            {
                "Code": "CODE",
                "Description": "some description"

            },
            {
                "Code": "CODE",
                "Description": "some description"

            },
            {
                "Code": "CODE",
                "Description": "some description"

            },
            {
                "Code": "CODE",
                "Description": "some description"

            },
            {
                "Code": "CODE",
                "Description": "some description"

            },
            {
                "Code": "CODE",
                "Description": "some description"

            },
            {
                "Code": "CODE",
                "Description": "some description"

            },
            {
                "Code": "CODE",
                "Description": "some description"
            }]
        }
    },
    "Total": {
        "Amount": "some amount",
        "Delivery": "some address",
        "Estimate": "some amount",
        "Tax": null,
        "Delivery_Type": null

    }

},
"partition_date":"2017-05-19"

}

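I am using the InferAvroSchema processor to take in the JSON, converting the JSON to Avro format with the inferred Avro schema, and sending the result to the PutHiveStreaming processor. My flow looks like this: [flow screenshot not included]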

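The main goal is to have the entire content of "Raw_Json" dumped into a single column of the Hive table. The table will be partitioned by "partition_date", which will be the table's second column. The problem is that, for some reason, NiFi has trouble inferring the nested JSON under "Raw_Json" and writes it to the table as null, as shown here: [table screenshot not included]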

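Does anyone know how to get NiFi to read the whole nested JSON under "Raw_Json" as a single column and send it to the Hive table? How could I create my own Avro schema for it? Any insights or ideas on how to solve this would be greatly appreciated!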

1 answer:

Answer 0 (score: 2)

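Generally, as long as your input file format is always the same, you only have to create or generate (infer) the Avro schema once, with two fields: Raw_Json and partition_date.

You should end up with something like this in a file, for example avro-schema.json: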

{
  "type" : "record",
  "name" : "test",
  "fields" : [ {
    "name" : "Raw_Json",
    "type" : 
    ...
  }, {
    "name" : "partition_date",
    "type" : "string",
    "doc" : "Type inferred from '\"2017-05-19\"'"
  } ]
}

Then use this file as the Record Schema in the ConvertJSONToAvro processor.

As for the type of Raw_Json, there are two options:

Either you fully define the complex data type, with all nested fields, arrays, and so on.
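As a rough illustration of the fully defined option, the sketch below builds such a schema in Python and prints it as JSON that could be saved to avro-schema.json. It rests on several assumptions that are not in the original post: every leaf value is treated as a nullable string, and the record names (RawJson, SystemInfo, Profile, Products, Product, Total) as well as the helper functions are hypothetical. Adjust the types to match your real data.

```python
import json


def nullable_string(name):
    # Assumption: every leaf is either null or a string.
    # "null" is listed first so that the default value can be null.
    return {"name": name, "type": ["null", "string"], "default": None}


def make_record(name, field_names):
    # Hypothetical helper: a record whose fields are all nullable strings.
    return {
        "type": "record",
        "name": name,
        "fields": [nullable_string(f) for f in field_names],
    }


system_info = make_record(
    "SystemInfo",
    ["Id", "TM", "CountID", "Topic", "AccountID", "StationID", "STime", "ETime"],
)

product = make_record("Product", ["Code", "Description"])

# "Products" wraps an array of Product records under the key "Product".
products = {
    "type": "record",
    "name": "Products",
    "fields": [{"name": "Product", "type": {"type": "array", "items": product}}],
}

profile = make_record(
    "Profile",
    ["ID", "ProductID", "City", "State", "Number", "ExtNumber",
     "Unit", "Name", "Service", "AddrID"],
)
profile["fields"].append({"name": "Products", "type": products})

total = make_record(
    "Total", ["Amount", "Delivery", "Estimate", "Tax", "Delivery_Type"]
)

raw_json = {
    "type": "record",
    "name": "RawJson",  # hypothetical record name
    "fields": [
        {"name": "SystemInfo", "type": system_info},
        {"name": "Profile", "type": profile},
        {"name": "Total", "type": total},
    ],
}

schema = {
    "type": "record",
    "name": "test",
    "fields": [
        {"name": "Raw_Json", "type": raw_json},
        {"name": "partition_date", "type": "string"},
    ],
}

print(json.dumps(schema, indent=2))
```

Note that with this option Raw_Json arrives in Hive as a nested structure rather than as a single string column, so the Hive table definition would have to match it.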

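Or, if you want to write the content of Raw_Json into a string column, you must convert it to a string before converting the file to Avro. You can use a sequence of EvaluateJsonPath -> AttributesToJson processors for that.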
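To make the string option concrete, here is a minimal Python sketch of the transformation that such a processor sequence is meant to achieve. It runs outside NiFi and is not NiFi configuration. The function name stringify_raw_json and the shortened sample document are hypothetical. The nested object under Raw_Json is serialized into one JSON string, so the resulting flat document has two plain string fields.

```python
import json


def stringify_raw_json(document: str) -> str:
    """Return a flat JSON document in which Raw_Json is a single string."""
    record = json.loads(document)
    flat = {
        # Serialize the nested object back into compact JSON text.
        "Raw_Json": json.dumps(record["Raw_Json"], separators=(",", ":")),
        "partition_date": record["partition_date"],
    }
    return json.dumps(flat)


# Shortened, hypothetical sample in the shape of the question's input.
sample = (
    '{"Raw_Json": {"SystemInfo": {"Id": "a string ID", "TM": null}},'
    ' "partition_date": "2017-05-19"}'
)

flat = json.loads(stringify_raw_json(sample))
print(flat["Raw_Json"])
print(flat["partition_date"])
```

After this step, the schema entry for Raw_Json could presumably be a plain "string" type, just like partition_date, because the nested structure no longer needs to be described field by field.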