With AVRO or Parquet

Date: 2018-05-09 01:34:49

Tags: google-bigquery avro parquet

I am trying to load Parquet data into Google BigQuery, to take advantage of the efficient columnar format and (I hope) to get around BigQuery's lack of support for logical types (DATE etc.) in AVRO files.

My data contains two levels of nested arrays.

Using JSON I can create and load a table with the desired structure:

bq mk temp.simple_interval simple_interval_bigquery_schema.json
bq load --source_format=NEWLINE_DELIMITED_JSON temp.simple_interval ~/Desktop/simple_interval.json
bq show temp.simple_interval

   Last modified                    Schema                   Total Rows   Total Bytes   Expiration   Time Partitioning   Labels
 ----------------- ---------------------------------------- ------------ ------------- ------------ ------------------- --------
  09 May 13:21:56   |- file_name: string (required)          3            246
                    |- file_created: timestamp (required)
                    |- id: string (required)
                    |- interval_length: integer (required)
                    +- days: record (repeated)
                    |  |- interval_date: date (required)
                    |  |- quality: string (required)
                    |  +- values: record (repeated)
                    |  |  |- interval: integer (required)
                    |  |  |- value: float (required)
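For reference, a single NDJSON row matching this table could look like the sketch below; all field values are invented for illustration, and the standard library is enough to serialize it:

```python
import json

# One illustrative row matching the schema above; all values are made up.
row = {
    "file_name": "simple_interval.csv",
    "file_created": "2018-05-09 01:00:00",
    "id": "meter-001",
    "interval_length": 30,
    "days": [
        {
            "interval_date": "2018-05-01",
            "quality": "ACTUAL",
            "values": [
                {"interval": 1, "value": 1.5},
                {"interval": 2, "value": 2.25},
            ],
        },
    ],
}

# bq load with NEWLINE_DELIMITED_JSON expects one JSON object per line.
print(json.dumps(row))
```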

I tried to create the same structure from a Parquet data file written with AvroParquetWriter. My AVRO schema is:

{
  "name": "simple_interval",
  "type": "record",
  "fields": [
    {"name": "file_name", "type": "string"},
    {"name": "file_created", "type": {"type": "long", "logicalType": "timestamp-millis"}},
    {"name": "id", "type": "string"},
    {"name": "interval_length", "type": "int"},
    {"name": "days", "type": {
      "type": "array",
      "items": {
        "name": "days_record",
        "type": "record",
        "fields": [
          {"name": "interval_date", "type": {"type": "int", "logicalType": "date"}},
          {"name": "quality", "type": "string"},
          {"name": "values", "type": {
            "type": "array",
            "items": {
              "name": "values_record",
              "type": "record",
              "fields": [
                {"name": "interval", "type": "int"},
                {"name": "value", "type": "float"}
              ]
            }
          }}
        ]
      }
    }}
  ]
}
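The record-inside-array nesting this schema uses can be made explicit with a small standard-library walk over the schema (my sketch, not part of the question): it lists every array whose items are a named record, which is exactly where the extra Parquet groups will come from.

```python
# The question's Avro schema, copied from above as a Python dict.
schema = {
    "name": "simple_interval",
    "type": "record",
    "fields": [
        {"name": "file_name", "type": "string"},
        {"name": "file_created",
         "type": {"type": "long", "logicalType": "timestamp-millis"}},
        {"name": "id", "type": "string"},
        {"name": "interval_length", "type": "int"},
        {"name": "days", "type": {
            "type": "array",
            "items": {
                "name": "days_record",
                "type": "record",
                "fields": [
                    {"name": "interval_date",
                     "type": {"type": "int", "logicalType": "date"}},
                    {"name": "quality", "type": "string"},
                    {"name": "values", "type": {
                        "type": "array",
                        "items": {
                            "name": "values_record",
                            "type": "record",
                            "fields": [
                                {"name": "interval", "type": "int"},
                                {"name": "value", "type": "float"},
                            ],
                        },
                    }},
                ],
            },
        }},
    ],
}


def array_item_records(node, path=""):
    """Yield (field_path, record_name) for every array whose items are a
    named record -- the nesting Avro requires for arrays of structured
    values."""
    if not isinstance(node, dict):
        return
    if node.get("type") == "array":
        items = node.get("items")
        if isinstance(items, dict) and items.get("type") == "record":
            yield path, items["name"]
            yield from array_item_records(items, path)
    for field in node.get("fields", []):
        child_path = f"{path}.{field['name']}".lstrip(".")
        yield from array_item_records(field.get("type"), child_path)


print(list(array_item_records(schema)))
# [('days', 'days_record'), ('days.values', 'values_record')]
```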

From the AVRO spec, and from what I found online, it seems to be necessary to nest 'record' nodes inside 'array' nodes like this.

When I create the Parquet file, parquet-tools reports the schema as:

message simple_interval {
  required binary file_name (UTF8);
  required int64 file_created (TIMESTAMP_MILLIS);
  required binary id (UTF8);
  required int32 interval_length;
  required group days (LIST) {
    repeated group array {
      required int32 interval_date (DATE);
      required binary quality (UTF8);
      required group values (LIST) {
        repeated group array {
          required int32 interval;
          required float value;
        }
      }
    }
  }
}
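As an aside (background I am adding, not from the question): the Parquet format specification also defines a standard three-level LIST encoding in which the repeated group is named list and wraps a single element field, rather than the two-level repeated group named array shown above. A spec-conformant writer would emit something like the sketch below for the outer array (fields abbreviated); whether a given reader or loader recognizes and collapses these wrapper groups is implementation-dependent.

```
required group days (LIST) {
  repeated group list {
    required group element {
      required int32 interval_date (DATE);
      ...
    }
  }
}
```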

I loaded the file into BigQuery and inspected the result:

bq load --source_format=PARQUET temp.simple_interval ~/Desktop/simple_interval.parquet
bq show temp.simple_interval

   Last modified                      Schema                      Total Rows   Total Bytes   Expiration   Time Partitioning   Labels
 ----------------- --------------------------------------------- ------------ ------------- ------------ ------------------- --------
  09 May 13:05:54   |- file_name: string (required)               3            246
                    |- file_created: timestamp (required)
                    |- id: string (required)
                    |- interval_length: integer (required)
                    +- days: record (required)
                    |  +- array: record (repeated)           <-- extra column
                    |  |  |- interval_date: date (required)
                    |  |  |- quality: string (required)
                    |  |  +- values: record (required)
                    |  |  |  +- array: record (repeated)     <-- extra column
                    |  |  |  |  |- interval: integer (required)
                    |  |  |  |  |- value: float (required)

This works, but I was wondering: is there a way to avoid the extra 'array' intermediate nodes/columns?

Am I missing something? For nested arrays, is there a way with AVRO/Parquet to get the simpler BigQuery table structure that I get with JSON?

1 answer:

Answer 0 (score: 0)

I used this avro schema:

{
  "name": "simple_interval",
  "type": "record",
  "fields": [
    {"name": "file_name", "type": "string"},
    {"name": "file_created", "type": {"type": "long", "logicalType": "timestamp-millis"}},
    {"name": "id", "type": "string"},
    {"name": "interval_length", "type": "int"},
    {"name": "days", "type": {"type": "record", "name": "days_", "fields": [
      {"name": "interval_date", "type": {"type": "int", "logicalType": "date"}},
      {"name": "quality", "type": "string"},
      {"name": "values", "type": {"type": "record", "name": "values_", "fields": [
        {"name": "interval", "type": "int"},
        {"name": "value", "type": "float"}
      ]}}
    ]}}
  ]
}

I created an empty avro file and then ran the command:

bq load --source_format=AVRO <dataset>.<table-name> <avro-file>.avro 

When I run bq show <dataset>.<table-name>, I get the following:

 Last modified                    Schema                    Total Rows   Total Bytes   Expiration   Time Partitioning   Labels   kmsKeyName  
 ----------------- ----------------------------------------- ------------ ------------- ------------ ------------------- -------- ------------ 
  22 May 09:46:02   |- file_name: string (required)           0            0                                                                   
                    |- file_created: integer (required)                                                                                        
                    |- id: string (required)                                                                                                   
                    |- interval_length: integer (required)                                                                                     
                    +- days: record (required)                                                                                                 
                    |  |- interval_date: integer (required)                                                                                    
                    |  |- quality: string (required)                                                                                           
                    |  +- values: record (required)                                                                                            
                    |  |  |- interval: integer (required)                                                                                      
                    |  |  |- value: float (required)
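Worth noting (my observation, not part of the original answer): this schema avoids the extra 'array' column by containing no Avro array types at all, which is why days and values come out as single required records rather than repeated ones. A quick standard-library check over the answer's schema confirms it:

```python
# The answer's Avro schema, copied from above as a Python dict.
answer_schema = {
    "name": "simple_interval",
    "type": "record",
    "fields": [
        {"name": "file_name", "type": "string"},
        {"name": "file_created",
         "type": {"type": "long", "logicalType": "timestamp-millis"}},
        {"name": "id", "type": "string"},
        {"name": "interval_length", "type": "int"},
        {"name": "days", "type": {"type": "record", "name": "days_", "fields": [
            {"name": "interval_date",
             "type": {"type": "int", "logicalType": "date"}},
            {"name": "quality", "type": "string"},
            {"name": "values",
             "type": {"type": "record", "name": "values_", "fields": [
                 {"name": "interval", "type": "int"},
                 {"name": "value", "type": "float"},
             ]}},
        ]}},
    ],
}


def contains_array(node):
    """Return True if any node in the schema is an Avro array type."""
    if not isinstance(node, dict):
        return False
    if node.get("type") == "array":
        return True
    if contains_array(node.get("items")):
        return True
    return any(contains_array(f.get("type")) for f in node.get("fields", []))


print(contains_array(answer_schema))
# False -- no arrays anywhere, so nothing is repeated in the loaded table
```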