I am trying to create a flow in NiFi that takes a valid JSON file and puts it directly into a Hive table using the PutHiveStreaming processor. My JSON looks like this:
{
"Raw_Json": {
"SystemInfo": {
"Id": "a string ID",
"TM": null,
"CountID": "a string ID",
"Topic": null,
"AccountID": "some number",
"StationID": "some number",
"STime": "some Timestamp",
"ETime": "some Timestamp"
},
"Profile": {
"ID": "ID number",
"ProductID": "Some Number",
"City": "City Name",
"State": "State Name",
"Number": "XXX-XXX-XXXX",
"ExtNumber": null,
"Unit": null,
"Name": "Person Name",
"Service": "Purchase",
"AddrID": "00000000",
"Products": {
"Product": [{
"Code": "CODE",
"Description": "some description"
},
{
"Code": "CODE",
"Description": "some description"
},
{
"Code": "CODE",
"Description": "some description"
},
{
"Code": "CODE",
"Description": "some description"
},
{
"Code": "CODE",
"Description": "some description"
},
{
"Code": "CODE",
"Description": "some description"
},
{
"Code": "CODE",
"Description": "some description"
},
{
"Code": "CODE",
"Description": "some description"
},
{
"Code": "CODE",
"Description": "some description"
},
{
"Code": "CODE",
"Description": "some description"
}]
}
},
"Total": {
"Amount": "some amount",
"Delivery": "some address",
"Estimate": "some amount",
"Tax": null,
"Delivery_Type": null
}
},
"partition_date":"2017-05-19"
}
I use the InferAvroSchema processor to read the JSON, convert the JSON to Avro format using the inferred Avro schema, and send it to the PutHiveStreaming processor. My flow looks like this:
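The main goal is to dump the entire "Raw_Json" column into a single column of the Hive table. The table will be partitioned by the "partition_date" column, which will be the table's second column. The problem is that, for some reason, NiFi has trouble inferring the nested JSON in the "Raw_Json" column and writes it to the table as null, as shown below: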
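Does anyone know how to get NiFi to read the entire nested JSON of the "Raw_Json" column as one column and send it to the Hive table? How can I create my own Avro schema for it? Any insight or ideas on how to solve this would be greatly appreciated!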
Answer 0 (score: 2)
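Generally, as long as your input file format is always the same, you only have to create or generate (infer) the Avro schema once, with two fields: Raw_Json and partition_date.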
You should end up with something like this in a file, for example avro-schema.json:
{
"type" : "record",
"name" : "test",
"fields" : [ {
"name" : "Raw_Json",
"type" :
...
}, {
"name" : "partition_date",
"type" : "string",
"doc" : "Type inferred from '\"2017-05-19\"'"
} ]
}
Then use this file as the Record Schema in the ConvertJSONToAvro processor.
As for the type of the Raw_Json column:
Either you fully define the complex data type, with all of its nested fields, arrays, and so on.
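To give a feel for what "fully defining" the nested type involves, here is a rough Python sketch that walks a parsed JSON value and builds the matching Avro type. It is not what InferAvroSchema does internally. The helper name avro_type and the record naming scheme are made up for illustration, and it assumes every leaf is a nullable string (the sample only contains strings and nulls) and that arrays are non-empty and homogeneous.

```python
import json


def avro_type(value, name):
    """Return an Avro type for a parsed JSON value (hypothetical helper)."""
    if isinstance(value, dict):
        # A JSON object becomes an Avro record; nested record names are
        # derived from the key path so that they stay unique.
        return {
            "type": "record",
            "name": name,
            "fields": [
                {"name": key, "type": avro_type(child, f"{name}_{key}")}
                for key, child in value.items()
            ],
        }
    if isinstance(value, list):
        # Assumption: non-empty, homogeneous array; the first element
        # decides the item type.
        return {"type": "array", "items": avro_type(value[0], f"{name}_item")}
    # Assumption: every leaf is a nullable string.
    return ["null", "string"]


# A cut-down version of the Raw_Json object from the question.
sample = {
    "SystemInfo": {"Id": "a string ID", "TM": None},
    "Products": {"Product": [{"Code": "CODE", "Description": "some description"}]},
}
schema = avro_type(sample, "Raw_Json")
print(json.dumps(schema, indent=2))
```

The printed value is what would go in place of the type of the Raw_Json field in avro-schema.json. With the full document the result gets long quickly, which is why the string approach is often simpler.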
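Or, if you want to write the content of Raw_Json into a string column, you have to convert it to a string before converting the file to Avro. You can use a sequence of EvaluateJsonPath and AttributesToJson processors for that.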
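Outside NiFi, the effect of that conversion can be sketched in a few lines of Python. This only illustrates the target shape of the record, not the processors themselves: the nested Raw_Json object is serialized to its JSON text, so both fields can be declared as plain strings in the Avro schema. The names stringify_raw_json and FLAT_SCHEMA are made up for this example; the record name "test" is reused from the schema above.

```python
import json

# Flat Avro schema for the converted record: both fields are plain strings.
FLAT_SCHEMA = {
    "type": "record",
    "name": "test",
    "fields": [
        {"name": "Raw_Json", "type": "string"},
        {"name": "partition_date", "type": "string"},
    ],
}


def stringify_raw_json(document):
    """Replace the nested Raw_Json object with its JSON text (hypothetical helper)."""
    return {
        "Raw_Json": json.dumps(document["Raw_Json"]),
        "partition_date": document["partition_date"],
    }


# A cut-down version of the document from the question.
sample = {
    "Raw_Json": {"SystemInfo": {"Id": "a string ID", "TM": None}},
    "partition_date": "2017-05-19",
}
flat = stringify_raw_json(sample)
print(json.dumps(flat, indent=2))
```

After this step the Raw_Json value is an ordinary string, so schema inference no longer has to deal with the nested structure, and the original object can still be recovered later by parsing that string.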