Loading a CSV file into BigQuery with schema auto-detection using the Python API

Date: 2017-07-06 11:15:30

Tags: python google-bigquery

I am trying to load a CSV file with schema auto-detection, but I am unable to load the file into BigQuery. Can anyone help me?

Please find my code below:

import time

from google.cloud import bigquery

def load_data_from_file(dataset_name, table_name, source_file_name):
    bigquery_client = bigquery.Client()
    dataset = bigquery_client.dataset(dataset_name)
    table = dataset.table(table_name)
    table.reload()
    with open(source_file_name, 'rb') as source_file:
        job = table.upload_from_file(
            source_file, source_format='text/csv')
    wait_for_job(job)
    print('Loaded {} rows into {}:{}.'.format(
        job.output_rows, dataset_name, table_name))

def wait_for_job(job):
    while True:
        job.reload()
        if job.state == 'DONE':
            if job.error_result:
                raise RuntimeError(job.errors)
            return
        time.sleep(1)

4 Answers:

Answer 0 (score: 2)

According to the Google BigQuery Python API documentation, you should set source_format to 'CSV' instead of 'text/csv':

source_format='CSV'

Code example:

with open(csv_file.name, 'rb') as readable:
    table.upload_from_file(
        readable, source_format='CSV', skip_leading_rows=1)

Source: https://googlecloudplatform.github.io/google-cloud-python/stable/bigquery-usage.html#datasets

If this does not solve your problem, please provide more details about the error you are observing.

Answer 1 (score: 0)

At the moment, the Python client does not support loading data from a file with the schema auto-detection flag (I plan to submit a pull request to add this support, but I would first like to discuss with the maintainers what their opinion on this implementation is).

There are still ways to work around this. So far I have not found a very elegant solution, but the following code lets you enable schema detection as an input flag:

from google.cloud.bigquery import Client
import os
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = 'path/your/json.key'
import google.cloud.bigquery.table as mtable

def _configure_job_metadata(metadata,
                            allow_jagged_rows,
                            allow_quoted_newlines,
                            create_disposition,
                            encoding,
                            field_delimiter,
                            ignore_unknown_values,
                            max_bad_records,
                            quote_character,
                            skip_leading_rows,
                            write_disposition):
    load_config = metadata['configuration']['load']

    if allow_jagged_rows is not None:
        load_config['allowJaggedRows'] = allow_jagged_rows

    if allow_quoted_newlines is not None:
        load_config['allowQuotedNewlines'] = allow_quoted_newlines

    if create_disposition is not None:
        load_config['createDisposition'] = create_disposition

    if encoding is not None:
        load_config['encoding'] = encoding

    if field_delimiter is not None:
        load_config['fieldDelimiter'] = field_delimiter

    if ignore_unknown_values is not None:
        load_config['ignoreUnknownValues'] = ignore_unknown_values

    if max_bad_records is not None:
        load_config['maxBadRecords'] = max_bad_records

    if quote_character is not None:
        load_config['quote'] = quote_character

    if skip_leading_rows is not None:
        load_config['skipLeadingRows'] = skip_leading_rows

    if write_disposition is not None:
        load_config['writeDisposition'] = write_disposition
    load_config['autodetect'] = True  # --> Here you can add the option for schema auto-detection

mtable._configure_job_metadata = _configure_job_metadata

bq_client = Client()
ds = bq_client.dataset('dataset_name')
ds.table = lambda: mtable.Table('table_name', ds)
table = ds.table()

with open(source_file_name, 'rb') as source_file:
    job = table.upload_from_file(
        source_file, source_format='CSV')
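Note that this works by overriding _configure_job_metadata, a private helper of the client library, so the workaround is tied to the internals of the specific google-cloud-bigquery version in use and may break after an upgrade.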

Answer 2 (score: 0)

You can use the following snippet to create a table and load data (in CSV format) from Cloud Storage into BigQuery with schema auto-detection:

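A minimal sketch of such a load job, assuming placeholder names for the dataset, table, and Cloud Storage URI (replace them with your own):

from google.cloud import bigquery

client = bigquery.Client()

# Placeholder identifiers -- substitute your own dataset, table, and bucket.
dataset_name = 'my_dataset'
table_name = 'my_table'
uri = 'gs://my-bucket/my-file.csv'

job_config = bigquery.LoadJobConfig()
job_config.source_format = bigquery.SourceFormat.CSV
job_config.skip_leading_rows = 1
job_config.autodetect = True  # let BigQuery infer the schema from the CSV

table_ref = client.dataset(dataset_name).table(table_name)
load_job = client.load_table_from_uri(uri, table_ref, job_config=job_config)
load_job.result()  # block until the load job finishes
print('Loaded {} rows.'.format(load_job.output_rows))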

Answer 3 (score: -1)

Just want to show how I used the Python client.

Below is my function that creates a table and loads it from a CSV file.

Also, self.client is my bigquery.Client().

def insertTable(self, datasetName, tableName, csvFilePath, schema=None):
    """
    This function creates a table in the given dataset in our default project
    and inserts the data given via a csv file.

    :param datasetName: The name of the dataset in which the table is created
    :param tableName: The name of the table to be created
    :param csvFilePath: The path of the file to be inserted
    :param schema: The schema of the table to be created
    :return: returns nothing
    """
    csv_file = open(csvFilePath, 'rb')

    dataset_ref = self.client.dataset(datasetName)
    # <import>: from google.cloud.bigquery import Dataset
    dataset = Dataset(dataset_ref)

    table_ref = dataset.table(tableName)
    if schema is not None:
        table = bigquery.Table(table_ref,schema)
    else:
        table = bigquery.Table(table_ref)

    try:
        # Delete the table if it already exists so it can be recreated.
        self.client.delete_table(table)
    except Exception:
        pass

    table = self.client.create_table(table)

    # <import>: from google.cloud.bigquery import LoadJobConfig
    job_config = LoadJobConfig()
    table_ref = dataset.table(tableName)
    job_config.source_format = 'CSV'
    job_config.skip_leading_rows = 1
    job_config.autodetect = True
    job = self.client.load_table_from_file(
        csv_file, table_ref, job_config=job_config)
    job.result()  # wait for the load job to complete
    csv_file.close()
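For reference, a hypothetical call (the instance, dataset, table, and file path below are made-up examples) could look like this, leaving schema=None so the schema is auto-detected:

loader.insertTable('my_dataset', 'my_table', '/path/to/data.csv')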

Let me know if this solves your problem.