Question

I have the following code:

job_config = bigquery.ExtractJobConfig()
job_config.compression = bigquery.Compression.GZIP
job_config.destination_format = (
    bigquery.DestinationFormat.NEWLINE_DELIMITED_JSON)

destination_uri = 'gs://{}/{}'.format(bucket_name, gcs_filename)

extract_job = client.extract_table(
    table,
    destination_uri,
    job_config=job_config,
    location='US')  # API request
extract_job.result()  # Waits for job to complete.

(Note that I obtain my table object elsewhere.)

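This works, and dumps the requested table to GCS as newline-delimited JSON. However, some columns in the table are nullable, and some of them do contain null values. To keep all of our data consistent, I would like to preserve the null values in the JSON output. Is there a way to do this without having to use Avro?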

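This post: Big Query table extract in JSON, preserve nulls? ...suggests actually querying the table. I don't think that is an option for me, because the tables I'm extracting each contain millions of rows. The one I'm looking at contains nearly 100M rows and weighs more than 25GB. But I haven't found a way to set up the extract job so that it preserves nulls.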
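To make the requirement concrete, here is a minimal sketch in plain Python (no BigQuery calls) of the difference being asked about. The rows and the two helper functions are hypothetical; `None` stands in for a SQL NULL, and the "dropping" variant only imitates the behavior described above, where null fields are missing from the exported JSON.

```python
import json

# Hypothetical rows; None stands in for a SQL NULL.
rows = [
    {"id": 1, "name": "alice", "score": None},
    {"id": 2, "name": None, "score": 7},
]


def dump_dropping_nulls(row):
    """Imitate the output described in the question: null keys are omitted."""
    return json.dumps({k: v for k, v in row.items() if v is not None})


def dump_keeping_nulls(row):
    """The desired output: every key is present, nulls are written as null."""
    return json.dumps(row)


dropped = [dump_dropping_nulls(r) for r in rows]
kept = [dump_keeping_nulls(r) for r in rows]

for line in dropped + kept:
    print(line)
```

With the keys dropped, a consumer cannot tell a missing column from a null one, which is the consistency problem described above.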

Answer 1

I think the best way to do this is to use a query job first.

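Run a query job that writes your table to a temporary table as JSON strings,
then extract that table as CSV with no header.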

Here is code that does this:

from google.cloud import bigquery

client = bigquery.Client()

# dataset_id and bucket_name are assumed to be defined elsewhere.
job_config = bigquery.QueryJobConfig()
gcs_filename = 'file_with_nulls*.json.gzip'

# Temporary table that receives one JSON string per source row.
table_ref = client.dataset(dataset_id).table('my_null_table')
job_config.destination = table_ref

# Overwrite the temporary table if it already exists.
job_config.write_disposition = bigquery.WriteDisposition.WRITE_TRUNCATE

# Start the query, passing in the extra configuration.
query_job = client.query(
    """#standardSql
    select TO_JSON_STRING(t) AS json from `project.dataset.table` as t ;""",
    location='US',
    job_config=job_config)

# Wait for the query to finish; this raises if the job failed.
query_job.result()
print("query completed")

# Extract the temporary table as gzip-compressed CSV without a header row.
job_config = bigquery.ExtractJobConfig()
job_config.compression = bigquery.Compression.GZIP
job_config.destination_format = (
    bigquery.DestinationFormat.CSV)
job_config.print_header = False

destination_uri = 'gs://{}/{}'.format(bucket_name, gcs_filename)

extract_job = client.extract_table(
    table_ref,
    destination_uri,
    job_config=job_config,
    location='US')  # API request
extract_job.result()  # Waits for the extract job to complete.
print("extract completed")
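Because the JSON is written through the CSV format, it may be worth checking how each exported line is quoted before feeding the files to a JSON reader. The sketch below is an assumption-laden illustration, not part of the original answer: `parse_line` is a hypothetical helper, and the sample lines are made up to show a CSV-quoted JSON string (inner quotes doubled) next to a bare JSON line. It accepts either form using only the standard library.

```python
import csv
import json


def parse_line(line):
    """Parse one exported line that holds a single JSON object.

    Tries plain JSON first; if that fails, treats the line as a
    single-column CSV record whose only field is the JSON string.
    """
    line = line.rstrip("\r\n")
    try:
        return json.loads(line)
    except json.JSONDecodeError:
        # Exactly one CSV field is expected; more than one raises ValueError.
        (field,) = next(csv.reader([line]))
        return json.loads(field)


# Made-up sample lines: the first two are CSV-quoted, the third is bare JSON.
sample_lines = [
    '"{""id"":1,""name"":""alice"",""score"":null}"\n',
    '"{""id"":2,""name"":null,""score"":7}"\n',
    '{"id":3,"name":"carol","score":null}\n',
]

records = [parse_line(line) for line in sample_lines]
for record in records:
    print(record)
```

In each parsed record the nullable keys are still present, with `None` as their value.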

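Once this is done, you can delete the temporary table created in step 1. If you do it quickly, the cost will be very low: storage is $20 per TB per month, so keeping 1TB for one hour costs 20/30/24 ≈ 3 cents, and 25GB for one hour costs only a small fraction of that.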
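The storage estimate can be worked through as follows. This is a rough sketch that takes the $20 per TB per month figure from the answer at face value and assumes a 30-day month and 1024GB per TB; the function name is made up for illustration. It covers storage of the temporary table only, since the query job itself is billed separately.

```python
PRICE_PER_TB_MONTH_USD = 20.0  # figure quoted in the answer
HOURS_PER_MONTH = 30 * 24      # assumption: 30-day month
GB_PER_TB = 1024               # assumption: binary terabyte


def storage_cost_usd(size_gb, hours):
    """Prorated storage cost for keeping size_gb around for a number of hours."""
    price_per_tb_hour = PRICE_PER_TB_MONTH_USD / HOURS_PER_MONTH
    return price_per_tb_hour * hours * (size_gb / GB_PER_TB)


one_tb_one_hour = storage_cost_usd(GB_PER_TB, 1)
temp_table_one_hour = storage_cost_usd(25, 1)

print("1 TB for 1 hour:  ${:.4f}".format(one_tb_one_hour))
print("25 GB for 1 hour: ${:.5f}".format(temp_table_one_hour))
```

Under these assumptions, 1TB for an hour comes to a little under 3 cents, and the 25GB temporary table to well under a tenth of a cent.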

Answer 2

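This question has come up on SO before. I suggest you take a look at this post, which includes an explanation of the problem and a workaround.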

There are some good answers there, such as this one from Mosha (a Google software engineer):

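This is the standard behavior of NULL in SQL, and all SQL databases (Oracle, Microsoft SQL Server, PostgreSQL, MySQL, etc.) behave in exactly the same way. If the IS NULL check is too tedious, an alternative solution is to use the IFNULL or COALESCE function to convert NULL into a non-NULL value, i.e.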
select * from
(select NULL as some_nullable_col, "name1" as name),
(select 4 as some_nullable_col, "name2" as name),
(select 1 as some_nullable_col, "name3" as name),
(select 7 as some_nullable_col, "name4" as name),
(select 3 as some_nullable_col, "name5" as name)
WHERE ifnull(some_nullable_col,0) != 3
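The behavior quoted above can be reproduced locally. The following sketch uses Python's built-in `sqlite3` module rather than BigQuery, purely as a stand-in to show the same NULL comparison semantics on the same five rows; the table name `t` is made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (some_nullable_col INTEGER, name TEXT)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?)",
    [(None, "name1"), (4, "name2"), (1, "name3"), (7, "name4"), (3, "name5")],
)

# NULL != 3 evaluates to NULL, not TRUE, so the NULL row is filtered out.
plain = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM t WHERE some_nullable_col != 3 ORDER BY name"
    )
]

# ifnull() replaces NULL with 0 before comparing, so the NULL row is kept.
with_ifnull = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM t WHERE ifnull(some_nullable_col, 0) != 3 ORDER BY name"
    )
]
conn.close()

print(plain)
print(with_ifnull)
```

The plain comparison silently drops the `name1` row, while the `ifnull` version keeps it.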

BigQuery Python API: preserving null fields during an extract_table job
