Google Cloud Functions issue when writing a pandas DataFrame to Google BigQuery

Asked: 2020-03-04 14:38:44

Tags: python pandas google-bigquery google-cloud-functions google-cloud-storage

I am trying to set up an automated process that loads data from Google Cloud Storage and writes it to Google BigQuery, using Cloud Functions and Python/pandas.

I have code that works in Google Colab: the script uploads the data to BQ and then moves the file into a sub-bucket. However, when I move this code into a Google Cloud Function, it fails at the point where it writes to BQ:

from google.cloud import storage
import re
import pandas as pd
from io import BytesIO
import pandas_gbq as gbq
from datetime import datetime
from google.cloud import bigquery as bq
project_id = 'project_id'

# BQ 
destination_table = 'Dataset.Table'
fail_blob = "failure/"
success_blob = "success/"

# GCS
trigger_bucket = 'bucket'
pattern = re.compile(r"\w+(.csv)")
re_date = re.compile('([0-9]+-[0-9]+-[0-9]+)')
#def write_to_bq(data="", context=""):   # Cloud Functions entry point (commented out for now)
# storage
storage_client = storage.Client(project=project_id)
# storage_client = storage.Client()
bucket = storage_client.get_bucket(trigger_bucket)
blobs = bucket.list_blobs()

for blob in blobs:
  if '/' not in blob.name and 'risks' in blob.name:   # only process files sitting directly in trigger_bucket
    fn = blob.name  # fn - filename
    try:
      print('---> Processing:\t %s' % fn)
      dt = re_date.search(fn).group(1)  # date from fn
      csv = blob.download_as_string()   # load csv contents as bytes

      # Create df
      df = pd.read_csv(BytesIO(csv), low_memory=False)
      df.columns = [x.replace(' ', '_') for x in df.columns]  # remove spaces in column names
      df['dt'] = dt                                           # add column with date from file name

      # Create schema
      schema = [{'name': 'initiator', 'type': 'STRING'},
                {'name': 'owner', 'type': 'STRING'},
                {'name': 'title', 'type': 'STRING'},
                {'name': 'date', 'type': 'STRING'},
                ]
      # rename columns to match the schema
      df.columns = ['initiator', 'owner', 'title', 'date']

      # Append data to BQ - the columns must match the schema
      df.to_gbq(destination_table, project_id,
                if_exists='append', table_schema=schema)

      # when uploaded, move the file to success/
      bucket.rename_blob(blob, success_blob + fn)
    except Exception:
      print('\t\tFailed %s' % fn)
      # when failed, move the file to failure/
      bucket.rename_blob(blob, fail_blob + fn)

So the problem is this line: df.to_gbq(destination_table, project_id, if_exists='append', table_schema=schema). But I don't know what to replace it with, or whether I need to add something else here.

The code needs to use pandas, because I modify the contents of the file before loading it into BQ.
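If it matters, one alternative I could try is the google-cloud-bigquery client's load_table_from_dataframe, which still takes the pandas DataFrame built above (the client is already imported as bq). A rough, untested sketch; the job configuration values are just my guesses:

# Possible alternative (untested in the function): append via the BigQuery
# client instead of pandas-gbq, still using the DataFrame built above.
client = bq.Client(project=project_id)

job_config = bq.LoadJobConfig(
    schema=[
        bq.SchemaField('initiator', 'STRING'),
        bq.SchemaField('owner', 'STRING'),
        bq.SchemaField('title', 'STRING'),
        bq.SchemaField('date', 'STRING'),
    ],
    write_disposition='WRITE_APPEND',
)

# load_table_from_dataframe needs pyarrow available in the environment
load_job = client.load_table_from_dataframe(df, destination_table, job_config=job_config)
load_job.result()  # wait for the load job to finish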

My requirements.txt:

google-cloud-storage==1.13.0
google-cloud-bigquery
pandas==0.23.4
pandas-gbq==0.8.0

Alternatively, I could write the result back to GCS and have another Cloud Function upload it to BQ.
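Roughly, that fallback would look something like this (the 'processed/' prefix and the name of the second function are just placeholders; project_id and destination_table would have to be defined in that function's module too):

# Sketch of the fallback: write the cleaned CSV back to GCS...
out_blob = bucket.blob('processed/' + fn)  # 'processed/' prefix is a placeholder
out_blob.upload_from_string(df.to_csv(index=False), content_type='text/csv')

# ...and load it into BQ from a second, GCS-triggered Cloud Function
def load_csv_to_bq(data, context):
    client = bq.Client(project=project_id)
    uri = 'gs://%s/%s' % (data['bucket'], data['name'])
    job_config = bq.LoadJobConfig(
        source_format=bq.SourceFormat.CSV,
        skip_leading_rows=1,
        write_disposition='WRITE_APPEND',
        autodetect=True,  # or pass the explicit schema instead
    )
    client.load_table_from_uri(uri, destination_table, job_config=job_config).result()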

0 Answers:

No answers yet