Question
Is there a more efficient way to streamline uploading CSV files to BigQuery from a Python script, or by any other means?
Description
I have 1,528,596 CSV files that need to be uploaded to BigQuery [the tables have already been created]. My current approach has proven to be slow, which I believe is caused by the Google BigQuery upload quotas. Exceeding the quota raises the following exception:
Traceback (most recent call last):
File "name_of_file.py", line 220, in <module>
File "name_of_file.py", line 122, in upload_csv_to_bigquery
job.result() # Waits for table load to complete.
File "/home/bongani/.local/lib/python3.6/site-packages/google/cloud/bigquery/job.py", line 660, in result
return super(_AsyncJob, self).result(timeout=timeout)
File "/home/bongani/.local/lib/python3.6/site-packages/google/api_core/future/polling.py", line 120, in result
raise self._exception
google.api_core.exceptions.Forbidden: 403 Quota exceeded: Your project exceeded quota for imports per project. For more information, see https://cloud.google.com/bigquery/troubleshooting-errors
I have emailed Google support to try to get the quota increased, but they replied saying that they are unable to do so.
My current implementation:
import os
import time
from concurrent.futures import ProcessPoolExecutor, as_completed

from google.cloud import bigquery
from google.cloud.bigquery import LoadJobConfig

root_dir = "/path/to/some/directory"
dataset_id = 'dataset_namex'

bigquery_client = bigquery.Client()


def upload_csv_to_bigquery(table_name, csv_full_path):
    s = time.time()
    load_config = LoadJobConfig()
    load_config.skip_leading_rows = 1
    table_ref = bigquery_client.dataset(dataset_id).table(table_name)
    with open(csv_full_path, 'rb') as source_file:
        job = bigquery_client.load_table_from_file(source_file, table_ref, job_config=load_config)  # API request
        job.result()  # Waits for table load to complete.
    print(f"upload time: {time.time() - s}")


def run():
    with ProcessPoolExecutor(max_workers=30) as process_executor:
        futures = []
        for csvfile in os.listdir(root_dir):
            table_name = csvfile.split('_')[-1]  # destination table derived from the file name
            futures.append(process_executor.submit(upload_csv_to_bigquery, table_name, os.path.join(root_dir, csvfile)))
        for future in as_completed(futures):
            future.result()
        print("DONE!!!")


run()
This graph shows my upload requests per second. [Image: Metrics from Google Cloud Platform]
Answer 0 (score: 1)
Make your script read the CSVs row by row and upload them with streaming inserts. The streaming insert limit is 100,000 rows per second or 100 MB per second, whichever you hit first.
There is no rate limit on the number of bigquery.tabledata.insertAll API calls, whereas uploading lots of small files is what runs into the bigquery.tables.insert (load job) quota.
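A minimal sketch of that approach, assuming each CSV has a header row whose column names match the destination table (numeric columns may need the string values converted first), and assuming a recent google-cloud-bigquery release that exposes insert_rows_json; the batch_size of 500 rows per request is an arbitrary choice, and table_name/csv_full_path mirror the placeholders from the question:

import csv

from google.cloud import bigquery

bigquery_client = bigquery.Client()
dataset_id = 'dataset_namex'


def stream_csv_to_bigquery(table_name, csv_full_path, batch_size=500):
    # tabledata.insertAll is exposed as insert_rows_json in the Python client.
    table_ref = bigquery_client.dataset(dataset_id).table(table_name)
    batch = []
    with open(csv_full_path, newline='') as source_file:
        for row in csv.DictReader(source_file):  # one dict per CSV row, keyed by the header
            batch.append(row)
            if len(batch) >= batch_size:
                errors = bigquery_client.insert_rows_json(table_ref, batch)
                if errors:
                    print(f"insert errors for {csv_full_path}: {errors}")
                batch = []
    if batch:  # flush the final partial batch
        errors = bigquery_client.insert_rows_json(table_ref, batch)
        if errors:
            print(f"insert errors for {csv_full_path}: {errors}")

Keep in mind the trade-off: streaming inserts are billed separately from (free) load jobs, and streamed rows sit in the streaming buffer for a while before they can be copied or exported, so you are exchanging the load-job quota for streaming cost and limits.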