How to copy a non-partitioned table into an ingestion-time partitioned table in BigQuery using Python?

Asked: 2021-03-22 13:45:08

Tags: python google-bigquery partitioning

The use case is as follows: we have a table foo whose data is replaced every day. We want to start keeping the old data in an ingestion-time partitioned history table named foo_HIST.

I have the following code, using google-cloud-bigquery 1.6.1:

from google.cloud import bigquery

bq_client = bigquery.Client(project=env_conf.gcp_project_id)
dataset = bigquery.dataset.DatasetReference(
    env_conf.gcp_project_id, env_conf.bq_dataset
)

full_table_src = table_conf.table_name()
table_src = dataset.table(full_table_src)
table_dst_name = f"{full_table_src}_HIST"
table_dst = dataset.table(table_dst_name)
table_dst.time_partitioning = bigquery.TimePartitioning(
    type_=bigquery.TimePartitioningType.HOUR,
)

# Truncate per partition.
job_config = bigquery.CopyJobConfig(
    create_disposition="CREATE_IF_NEEDED",
    write_disposition="WRITE_TRUNCATE",
)

job = bq_client.copy_table(table_src, table_dst, job_config=job_config)
job.result()  # Wait for the copy job to finish.

The new table is indeed created, but when I inspect it with the bq CLI it does not appear to be a partitioned table. Here is the output:

bq show --format=prettyjson dataset_id.foo_HIST

{
  "creationTime": "1616418131814",
  "etag": "iqfdDzv2ifdsfERfwTiFjQ==",
  "id": "project_id:dataset_id.foo_HIST",
  "kind": "bigquery#table",
  "lastModifiedTime": "1616418131814",
  "location": "EU",
  "numBytes": "32333",
  "numLongTermBytes": "0",
  "numRows": "406",
  "schema": {
    "fields": [
      {
        "mode": "NULLABLE",
        "name": "MPG",
        "type": "FLOAT"
      },
    ]
  },
  "selfLink": "https://bigquery.googleapis.com/bigquery/v2/projects/project_id/datasets/dataset_id/tables/foo_HIST",
  "tableReference": {
    "datasetId": "dataset_id",
    "projectId": "project_id",
    "tableId": "foo_HIST"
  },
  "type": "TABLE"
}
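
The same check can also be done from Python instead of the bq CLI. A minimal sketch, reusing the bq_client and table_dst objects from the snippet above:

    # Fetch the destination table's metadata and inspect its partitioning.
    table = bq_client.get_table(table_dst)
    print(table.time_partitioning)  # None here, confirming the copied table is not partitioned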

1 Answer:

Answer 0 (score: 1):

For anyone wondering how to copy a non-partitioned table into a partitioned table (creating it if needed) in Python:

This does not seem to be supported out of the box with a CopyJob, but it is with a QueryJob. Below is the final snippet using a QueryJob:

from google.cloud import bigquery

bq_client = bigquery.Client(project=gcp_project_id)
dataset = bigquery.dataset.DatasetReference(
    gcp_project_id, dataset_id
)

table_src = dataset.table(table_name)
table_dst_name = f"{table_name}_HIST"
table_dst = dataset.table(table_dst_name)

# Select everything from the source table.
query = f"""
SELECT *
FROM `{gcp_project_id}.{dataset_id}.{table_name}`
"""

# Append the query results into the hour-partitioned destination table,
# creating it if it does not exist yet.
job_config = bigquery.QueryJobConfig(
    create_disposition="CREATE_IF_NEEDED",
    write_disposition="WRITE_APPEND",
    time_partitioning=bigquery.TimePartitioning(
        type_=bigquery.TimePartitioningType.HOUR,
    ),
    use_legacy_sql=False,
    allow_large_results=True,
    destination=table_dst,
)
job = bq_client.query(query, job_config=job_config)
job.result()  # Wait for job to finish
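
To confirm the result, you can inspect the destination table's partitioning metadata and read back a single ingestion-time partition via the _PARTITIONTIME pseudo-column. A minimal sketch, reusing bq_client, table_dst and the name variables from above (the timestamp is just an example value):

# Verify partitioning and read back one ingestion-time partition.
table = bq_client.get_table(table_dst)
print(table.time_partitioning)  # e.g. TimePartitioning(type_=HOUR)

# _PARTITIONTIME is the pseudo-column BigQuery adds to ingestion-time
# partitioned tables; filter on it to read a single partition.
rows = bq_client.query(
    f"""
    SELECT *
    FROM `{gcp_project_id}.{dataset_id}.{table_dst_name}`
    WHERE _PARTITIONTIME = TIMESTAMP("2021-03-22 13:00:00")
    """
).result()
print(f"Rows in that partition: {rows.total_rows}")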