Question
Is there a more efficient way to streamline uploading CSV files to BigQuery from a Python script, or by any other means?
Description
I have 1,528,596 CSV files that need to be uploaded to BigQuery [the tables have already been created]. My current approach has proven to be slow, which I believe is caused by the Google BigQuery upload quotas. Exceeding the quota raises the following exception:
Traceback (most recent call last):
File "name_of_file.py", line 220, in <module>
File "name_of_file.py", line 122, in upload_csv_to_bigquery
job.result() # Waits for table load to complete.
File "/home/bongani/.local/lib/python3.6/site-packages/google/cloud/bigquery/job.py", line 660, in result
return super(_AsyncJob, self).result(timeout=timeout)
File "/home/bongani/.local/lib/python3.6/site-packages/google/api_core/future/polling.py", line 120, in result
raise self._exception
google.api_core.exceptions.Forbidden: 403 Quota exceeded: Your project exceeded quota for imports per project. For more information, see https://cloud.google.com/bigquery/troubleshooting-errors
I have emailed Google support to try to get the quota increased, but they replied saying that they are unable to do so.
My current implementation:
import os
import time
from concurrent.futures import ProcessPoolExecutor, as_completed

from google.cloud import bigquery
from google.cloud.bigquery import LoadJobConfig

root_dir = "/path/to/some/directory"
dataset_id = 'dataset_namex'

bigquery_client = bigquery.Client()


def upload_csv_to_bigquery(table_name, csv_full_path):
    s = time.time()
    load_config = LoadJobConfig()
    load_config.skip_leading_rows = 1
    table_ref = bigquery_client.dataset(dataset_id).table(table_name)
    with open(csv_full_path, 'rb') as source_file:
        job = bigquery_client.load_table_from_file(source_file, table_ref, job_config=load_config)  # API request
        job.result()  # Waits for table load to complete.
    print(f"upload time: {time.time() - s}")


def run():
    with ProcessPoolExecutor(max_workers=30) as process_executor:
        futures = []
        for csvfile in os.listdir(root_dir):
            table_name = csvfile.split('_')[-1]  # destination table derived from the file name
            futures.append(process_executor.submit(upload_csv_to_bigquery, table_name, os.path.join(root_dir, csvfile)))
        for future in as_completed(futures):
            future.result()
        print("DONE!!!")


run()
This graph shows my upload requests per second. [Image: Metrics from Google Cloud Platform]
Answer 0 (score: 1)
Make your script read the CSVs row by row and upload them with streaming inserts. The streaming insert limit is 100,000 rows per second or 100 MB per second, whichever you hit first.
There is no rate limit on the number of bigquery.tabledata.insertAll API calls, whereas uploading lots of small files is what runs into the bigquery.tables.insert (load job) quota.
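A minimal sketch of that approach, assuming each CSV has a header row whose column names match the destination table (numeric columns may need the string values converted first), and assuming a recent google-cloud-bigquery release that exposes insert_rows_json; the batch_size of 500 rows per request is an arbitrary choice, and table_name/csv_full_path mirror the placeholders from the question:

import csv

from google.cloud import bigquery

bigquery_client = bigquery.Client()
dataset_id = 'dataset_namex'


def stream_csv_to_bigquery(table_name, csv_full_path, batch_size=500):
    # tabledata.insertAll is exposed as insert_rows_json in the Python client.
    table_ref = bigquery_client.dataset(dataset_id).table(table_name)
    batch = []
    with open(csv_full_path, newline='') as source_file:
        for row in csv.DictReader(source_file):  # one dict per CSV row, keyed by the header
            batch.append(row)
            if len(batch) >= batch_size:
                errors = bigquery_client.insert_rows_json(table_ref, batch)
                if errors:
                    print(f"insert errors for {csv_full_path}: {errors}")
                batch = []
    if batch:  # flush the final partial batch
        errors = bigquery_client.insert_rows_json(table_ref, batch)
        if errors:
            print(f"insert errors for {csv_full_path}: {errors}")

Keep in mind the trade-off: streaming inserts are billed separately from (free) load jobs, and streamed rows sit in the streaming buffer for a while before they can be copied or exported, so you are exchanging the load-job quota for streaming cost and limits.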