BigQuery: Running a query against a large dataset

Time: 2018-07-24 03:17:46

Tags: google-cloud-platform google-bigquery

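I have about 100 TB of data that I need to backfill by running a query that transforms some fields and then writes the transformed result to another table. The table is partitioned by ingestion-time timestamp. As you can see below, I do both operations as part of a single query. I plan to run this query manually, multiple times, in smaller chunks selected by ingestion timestamp range.

Is there a better way to handle this process than running the query in manual chunks? For example, perhaps using Dataflow or another framework.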

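def getMax4(numbers):
    # Sentinel value for empty input.
    if len(numbers) == 0:
        return -999
    # Largest multiple of 4 seen so far; None means none found yet.
    highest = None
    for number in numbers:
        if number % 4 == 0 and (highest is None or number > highest):
            highest = number
    # Fall back to 0 when no element is divisible by 4.
    return highest if highest is not None else 0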

1 answer:

Answer 0: (score: 1)

You need to size the queries you run carefully, because the quotas are enforced strictly.

Partitioned tables

  • Maximum number of partitions per partitioned table — 4,000

  • Maximum number of partitions modified by a single job — 2,000

  • Each job operation (query or load) can affect a maximum of 2,000 partitions. Any query or load job that affects more than 2,000 partitions is rejected by Google BigQuery.

  • Maximum number of partition modifications per day per table — 5,000. You are limited to a total of 5,000 partition modifications per day for a partitioned table. A partition can be modified by using an operation that appends to or overwrites data in the partition. Operations that modify partitions include: a load job, a query that writes results to a partition, or a DML statement (INSERT, DELETE, UPDATE, or MERGE) that modifies data in a partition.

  • More than one partition may be affected by a single job. For example, a DML statement can update data in multiple partitions (for both ingestion-time and partitioned tables). Query jobs and load jobs can also write to multiple partitions but only for partitioned tables. Google BigQuery uses the number of partitions affected by a job when determining how much of the quota the job consumes. Streaming inserts do not affect this quota.

  • Maximum rate of partition operations — 50 partition operations every 10 seconds
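The last limit in the list, 50 partition operations every 10 seconds, can be respected on the client side before each job is submitted. The following is only a minimal sketch of a sliding-window throttle; the class name `PartitionOpThrottle` and its parameters are hypothetical, and it does not call BigQuery at all. It assumes you know how many partitions each job will touch.

```python
import time
from collections import deque


class PartitionOpThrottle:
    """Sliding-window limiter: at most max_ops operations per window_s seconds."""

    def __init__(self, max_ops=50, window_s=10.0,
                 clock=time.monotonic, sleep=time.sleep):
        self.max_ops = max_ops
        self.window_s = window_s
        self._clock = clock      # injectable so the limiter can be tested
        self._sleep = sleep
        self._stamps = deque()   # one timestamp per recorded operation

    def acquire(self, ops=1):
        """Block until `ops` more partition operations fit in the window."""
        if ops > self.max_ops:
            raise ValueError("a single request exceeds the window budget")
        while True:
            now = self._clock()
            # Drop operations that have left the window.
            while self._stamps and now - self._stamps[0] >= self.window_s:
                self._stamps.popleft()
            if len(self._stamps) + ops <= self.max_ops:
                self._stamps.extend([now] * ops)
                return
            # Wait until the oldest recorded operation expires, then retry.
            self._sleep(self.window_s - (now - self._stamps[0]))
```

Calling `throttle.acquire(n)` before submitting a job that writes `n` partitions keeps the client under the rate as long as all jobs for the table go through the same throttle instance.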

Most of the time you hit the second limit first: a single job cannot modify more than 2,000 partitions. If you parallelise further, you hit the last one: 50 partition operations every 10 seconds.
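To stay under the per-job limit, the backfill can be planned as a list of date ranges before any query runs. The sketch below assumes daily partitions and uses hypothetical helper names (`plan_backfill`, `days_needed`); it only does the date arithmetic against the 2,000-per-job and 5,000-per-day figures quoted above and does not submit anything to BigQuery.

```python
from datetime import date, timedelta

MAX_PARTITIONS_PER_JOB = 2000     # per-job limit quoted above
MAX_MODIFICATIONS_PER_DAY = 5000  # per-table daily limit quoted above


def plan_backfill(start, end, partitions_per_job=MAX_PARTITIONS_PER_JOB):
    """Split [start, end] (inclusive, one partition per day) into job-sized ranges."""
    if not 1 <= partitions_per_job <= MAX_PARTITIONS_PER_JOB:
        raise ValueError("partitions_per_job must be between 1 and 2000")
    chunks = []
    cursor = start
    while cursor <= end:
        last = min(cursor + timedelta(days=partitions_per_job - 1), end)
        chunks.append((cursor, last))
        cursor = last + timedelta(days=1)
    return chunks


def days_needed(chunks, daily_budget=MAX_MODIFICATIONS_PER_DAY):
    """Greedily count the calendar days needed to run all chunks within the budget."""
    if not chunks:
        return 0
    days, used = 1, 0
    for first, last in chunks:
        size = (last - first).days + 1
        if used + size > daily_budget:
            # This chunk would exceed today's budget; schedule it tomorrow.
            days += 1
            used = 0
        used += size
    return days


if __name__ == "__main__":
    for first, last in plan_backfill(date(2018, 1, 1), date(2018, 1, 10), 4):
        print(first, last)
```

Each returned `(first, last)` pair can then be used as the ingestion-time range filter for one run of the transformation query. Smaller chunks make retries cheaper, but the total number of partitions written per day still counts against the daily limit.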

On the other hand, the DML MERGE statement could help you here.

If you have a sales representative, reach out to the BigQuery team through them; if they are able to increase some of your quotas, they will respond positively.

Also, I've seen people use multiple projects to run jobs beyond the quotas.