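I am trying to build an automated process that loads data from Google Cloud Storage and writes it to Google BigQuery, using Cloud Functions and Python / pandas.
I have developed code that works in Google Colab. The script uploads the data to BQ and then moves the file into a sub-bucket. However, when I move this code into a Google Cloud Function, it fails at the point where it writes to BQ: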
from google.cloud import storage
import re
import pandas as pd
from io import BytesIO
import pandas_gbq as gbq
from datetime import datetime
from google.cloud import bigquery as bq

project_id = 'project_id'
# BQ
destination_table = 'Dataset.Table'
fail_blob = "failure/"
success_blob = "success/"
# GCS
trigger_bucket = 'bucket'
pattern = re.compile(r"\w+(.csv)")
re_date = re.compile('([0-9]+-[0-9]+-[0-9]+)')

#def write_to_bq(data="", context=""):
# storage
storage_client = storage.Client(project=project_id)
# storage_client = storage.Client()
bucket = storage_client.get_bucket(trigger_bucket)
blobs = bucket.list_blobs()
for blob in blobs:
    if '/' not in blob.name and 'risks' in blob.name:  # only looks in trigger_bucket
        fn = blob.name  # fn - filename
        try:
            print('---> Processing:\t %s' % fn)
            dt = re_date.search(fn).group(1)  # date from fn
            csv = blob.download_as_string()  # load csv as string
            # Create df
            df = pd.read_csv(BytesIO(csv), low_memory=False)  # create df
            df.columns = [x.replace(' ', '_') for x in df.columns]  # remove spaces in columns
            df['dt'] = dt  # create column with date from file name
            # Create schema
            schema = [{'name': 'initiator', 'type': 'STRING'},
                      {'name': 'owner', 'type': 'STRING'},
                      {'name': 'title', 'type': 'STRING'},
                      {'name': 'date', 'type': 'STRING'},
                      ]
            # rename columns
            df.columns = ['initiator', 'owner', 'title', 'date']
            # Append data to BQ - make sure there are the same columns
            df.to_gbq(destination_table, project_id,
                      if_exists='append', table_schema=schema)
            # when uploaded move to success
            bucket.rename_blob(blob, success_blob + fn)
        except:
            print('\t\tFailed %s' % fn)
            # when failed
            bucket.rename_blob(blob, fail_blob + fn)
So the problem is in this call: df.to_gbq(destination_table, project_id, if_exists='append', table_schema=schema). However, I don't know how to replace it, or whether I need to add something else here.
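The bare except in the loop swallows whatever exception to_gbq raises, so the Cloud Functions log only shows "Failed". A minimal sketch of how the real error could be surfaced, assuming a hypothetical helper named run_step that wraps any step of the loop (the name and shape are illustrative, not part of the script above):

```python
import logging
import traceback


def run_step(label, step):
    """Call step(); return True on success, log the full traceback on failure."""
    try:
        step()
        return True
    except Exception:
        # Record the exception type, message and stack instead of hiding them.
        logging.error("Failed %s\n%s", label, traceback.format_exc())
        return False
```

With something like this in place, the log entry for a failing file should contain the exception that to_gbq actually raised, which narrows down whether the cause is credentials, a missing dependency, or the data itself.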
The code needs to use pandas, because I modify the contents of the file before loading it into BQ.
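One possible replacement that keeps pandas is to hand the DataFrame to the google-cloud-bigquery client with load_table_from_dataframe instead of going through pandas-gbq. The sketch below is untested against this setup and makes several assumptions: append_dataframe is a hypothetical helper name, client is a google.cloud.bigquery.Client, table_id is a full 'project.dataset.table' string (older client versions may require a TableReference instead), and pyarrow is listed in requirements.txt because the client serializes the DataFrame with it.

```python
def append_dataframe(client, df, table_id):
    """Load df into table_id with a BigQuery load job and wait for it.

    Assumptions: client is a google.cloud.bigquery.Client and table_id
    looks like 'project.dataset.table'. Both names are illustrative.
    """
    job = client.load_table_from_dataframe(df, table_id)
    job.result()  # block until the job finishes; raises if the load failed
    return job


# Hypothetical usage inside the loop, replacing df.to_gbq(...):
# from google.cloud import bigquery
# client = bigquery.Client(project=project_id)
# append_dataframe(client, df, "project_id.Dataset.Table")
```

As far as I know, load jobs append to an existing table by default; passing a job_config with write_disposition set to WRITE_APPEND would make that explicit.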
My requirements.txt:
google-cloud-storage==1.13.0
google-cloud-bigquery
pandas==0.23.4
pandas-gbq==0.8.0
Alternatively, I could save the modified data back to GCS, and another Cloud Function could upload it to BQ.
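For the save-back-to-GCS route, a minimal sketch could look like the following. It assumes bucket is a google.cloud.storage Bucket (such as the one returned by get_bucket in the script) and df is the modified DataFrame; the helper name save_dataframe_csv and the "processed/" prefix in the usage comment are hypothetical.

```python
def save_dataframe_csv(bucket, df, blob_name):
    """Write df as CSV text to blob_name in bucket and return the blob."""
    csv_text = df.to_csv(index=False)  # drop the pandas index column
    blob = bucket.blob(blob_name)
    blob.upload_from_string(csv_text, content_type="text/csv")
    return blob


# Hypothetical usage inside the loop:
# save_dataframe_csv(bucket, df, "processed/" + fn)
```

A second function triggered on that prefix could then load the CSV into BQ without needing pandas at all.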