How do I convert a 10G JSON file to Avro?

Time: 2015-12-16 21:40:15

Tags: json avro

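I have a JSON file of about 10G. Each line contains exactly one JSON document. What is the best way to convert it to Avro? Ideally, I would like to keep several documents (say 10M) per file. As far as I know, Avro supports storing multiple documents in the same file.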

2 answers:

Answer 0 (score: 3)

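You should be able to use the Avro tools' fromjson command (see here for more information and examples). You may want to split the file into 10M chunks beforehand (for example with split(1)).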
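If split(1) is not available, the same splitting step can be sketched in Python. This is only a minimal sketch: it assumes "10M" means 10 million documents (lines) per chunk, and the function name `split_json_lines` and the `chunk_00000.json` naming scheme are made up for illustration.

```python
from pathlib import Path


def split_json_lines(source, out_dir, lines_per_chunk=10_000_000):
    """Split a line-delimited JSON file into chunks of at most lines_per_chunk lines.

    Returns the list of chunk paths in the order they were written.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    chunk_paths = []
    chunk_file = None
    try:
        # Stream the input line by line so the 10G file is never held in memory.
        with open(source, "r", encoding="utf-8") as src:
            for line_number, line in enumerate(src):
                # Start a new chunk every lines_per_chunk lines.
                if line_number % lines_per_chunk == 0:
                    if chunk_file is not None:
                        chunk_file.close()
                    # Hypothetical naming scheme: chunk_00000.json, chunk_00001.json, ...
                    path = out_dir / f"chunk_{len(chunk_paths):05d}.json"
                    chunk_file = open(path, "w", encoding="utf-8")
                    chunk_paths.append(path)
                chunk_file.write(line)
    finally:
        if chunk_file is not None:
            chunk_file.close()
    return chunk_paths
```

Because each line holds one complete JSON document, splitting on line boundaries never cuts a document in half, so every chunk can be converted on its own.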

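Answer 1 (score: 0)

The easiest way to convert a large JSON file to Avro is to use avro-tools from the Avro website.

After creating a simple schema such as the one below, the file can be converted directly.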

{
        "type": "record",
        "name": "cpc_schema",
        "namespace": "com.streambright.avro",
        "fields": [{
                "name": "section",
                "type": "string",
                "doc": "Section of the CPC"
        }, {
                "name": "class",
                "type": "string",
                "doc": "Class of the CPC"
        }, {
                "name": "subclass",
                "type": "string",
                "doc": "Subclass of the CPC"
        }, {
                "name": "main_group",
                "type": "string",
                "doc": "Main-group of the CPC"
        }, {
                "name": "subgroup",
                "type": "string",
                "doc": "Subgroup of the CPC"
        }, {
                "name": "classification_value",
                "type": "string",
                "doc": "Classification value of the CPC"
        }, {
                "name": "doc_number",
                "type": "string",
                "doc": "Patent doc_number"
        }, {
                "name": "updated_at",
                "type": "string",
                "doc": "Document update time"
        }],
        "doc": "A basic schema for CPC codes"
}
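With the schema saved to a file, the conversion itself is one avro-tools call. The sketch below wraps it in Python under some assumptions: the jar, schema, and data file names are placeholders, the snappy codec is an arbitrary choice, and the `--schema-file` and `--codec` options should be checked against the usage text printed by your avro-tools version.

```python
import subprocess


def build_fromjson_command(jar_path, schema_path, input_path, codec="snappy"):
    """Return the argument list for an avro-tools fromjson invocation."""
    return [
        "java", "-jar", str(jar_path),
        "fromjson",
        "--schema-file", str(schema_path),
        "--codec", codec,  # assumption: snappy; other codecs may suit your data better
        str(input_path),
    ]


def convert_to_avro(jar_path, schema_path, input_path, output_path, codec="snappy"):
    """Run fromjson and write the resulting Avro container file to output_path."""
    command = build_fromjson_command(jar_path, schema_path, input_path, codec)
    # fromjson writes the Avro data to standard output, so redirect it to a file.
    with open(output_path, "wb") as out:
        subprocess.run(command, stdout=out, check=True)
```

A call might look like `convert_to_avro("avro-tools.jar", "cpc.avsc", "cpc.json", "cpc.avro")`, where all four file names are hypothetical. Passing `check=True` makes a failed conversion raise an error instead of silently leaving a truncated output file.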
