I am trying to load a CSV file into BigQuery with schema auto-detection, but I am unable to load the file. Can anyone help me?
Please find my code below:
import time

from google.cloud import bigquery


def load_data_from_file(dataset_name, table_name, source_file_name):
    bigquery_client = bigquery.Client()
    dataset = bigquery_client.dataset(dataset_name)
    table = dataset.table(table_name)
    table.reload()
    with open(source_file_name, 'rb') as source_file:
        job = table.upload_from_file(
            source_file, source_format='text/csv')
    wait_for_job(job)
    print('Loaded {} rows into {}:{}.'.format(
        job.output_rows, dataset_name, table_name))


def wait_for_job(job):
    # Poll the job once per second until it finishes or fails.
    while True:
        job.reload()
        if job.state == 'DONE':
            if job.error_result:
                raise RuntimeError(job.errors)
            return
        time.sleep(1)
Answer 0 (score: 2)
According to the Google BigQuery Python API documentation, you should set source_format to 'CSV' instead of 'text/csv':
source_format='CSV'
Code example:
with open(csv_file.name, 'rb') as readable:
    table.upload_from_file(
        readable, source_format='CSV', skip_leading_rows=1)
Source: https://googlecloudplatform.github.io/google-cloud-python/stable/bigquery-usage.html#datasets
If this does not solve your problem, please provide more details about the error you are observing.
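Answer 1 (score: 0)
Currently, the Python client does not support loading data from a file with the schema auto-detection flag (I plan to open a pull request to add this support, but I would first like to hear what the maintainers think of the implementation).
There are still ways to work around this. I have not found a very elegant solution so far, but the code below replaces the library's private _configure_job_metadata helper so that schema auto-detection is switched on for the load: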
from google.cloud.bigquery import Client
import os
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = 'path/your/json.key'
import google.cloud.bigquery.table as mtable


def _configure_job_metadata(metadata,
                            allow_jagged_rows,
                            allow_quoted_newlines,
                            create_disposition,
                            encoding,
                            field_delimiter,
                            ignore_unknown_values,
                            max_bad_records,
                            quote_character,
                            skip_leading_rows,
                            write_disposition):
    load_config = metadata['configuration']['load']
    if allow_jagged_rows is not None:
        load_config['allowJaggedRows'] = allow_jagged_rows
    if allow_quoted_newlines is not None:
        load_config['allowQuotedNewlines'] = allow_quoted_newlines
    if create_disposition is not None:
        load_config['createDisposition'] = create_disposition
    if encoding is not None:
        load_config['encoding'] = encoding
    if field_delimiter is not None:
        load_config['fieldDelimiter'] = field_delimiter
    if ignore_unknown_values is not None:
        load_config['ignoreUnknownValues'] = ignore_unknown_values
    if max_bad_records is not None:
        load_config['maxBadRecords'] = max_bad_records
    if quote_character is not None:
        load_config['quote'] = quote_character
    if skip_leading_rows is not None:
        load_config['skipLeadingRows'] = skip_leading_rows
    if write_disposition is not None:
        load_config['writeDisposition'] = write_disposition
    load_config['autodetect'] = True  # --> Here you can add the option for schema auto-detection


mtable._configure_job_metadata = _configure_job_metadata

bq_client = Client()
ds = bq_client.dataset('dataset_name')
ds.table = lambda: mtable.Table('table_name', ds)
table = ds.table()
with open(source_file_name, 'rb') as source_file:
    # source_format must be 'CSV'; 'text/csv' is not a BigQuery source format.
    job = table.upload_from_file(
        source_file, source_format='CSV')
Answer 2 (score: 0)
You can also create the table and load the data (CSV format) from Cloud Storage into BigQuery with schema auto-detection.
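A minimal sketch of a load from Cloud Storage with schema auto-detection might look like the following. It assumes a recent google-cloud-bigquery release that provides Client.load_table_from_uri and a LoadJobConfig accepting keyword arguments. The function name load_csv_from_gcs, the gs:// URI and the table ID are placeholders, not taken from the original answer.

```python
def load_csv_from_gcs(client, uri, table_id):
    """Load a CSV file from Cloud Storage, letting BigQuery infer the schema.

    client   -- a google.cloud.bigquery.Client
    uri      -- placeholder such as "gs://your-bucket/data.csv"
    table_id -- placeholder such as "your-project.your_dataset.your_table"
    Returns the number of rows in the destination table after the load.
    """
    # Imported inside the function so this module can be imported even
    # where the library is not installed.
    from google.cloud import bigquery

    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,  # assumes the file has a single header row
        autodetect=True,      # ask BigQuery to infer column names and types
    )
    load_job = client.load_table_from_uri(uri, table_id, job_config=job_config)
    load_job.result()  # blocks until the job completes; raises on failure
    return client.get_table(table_id).num_rows
```

It could be called as load_csv_from_gcs(bigquery.Client(), "gs://your-bucket/data.csv", "your-project.your_dataset.your_table"), with credentials already configured in the environment.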
Answer 3 (score: -1)
I just want to show how I used the Python client.
Below is my function that creates a table and loads it from a CSV file.
Note that self.client is my bigquery.Client().
def insertTable(self, datasetName, tableName, csvFilePath, schema=None):
    """
    Create a table in the given dataset of our default project and load
    it with the data from a CSV file.
    :param datasetName: The name of the dataset in which the table is created
    :param tableName: The name of the table to be created
    :param csvFilePath: The path of the CSV file to be loaded
    :param schema: The schema of the table to be created (optional)
    :return: returns nothing
    """
    dataset_ref = self.client.dataset(datasetName)
    # <import>: from google.cloud.bigquery import Dataset
    dataset = Dataset(dataset_ref)
    table_ref = dataset.table(tableName)
    if schema is not None:
        table = bigquery.Table(table_ref, schema)
    else:
        table = bigquery.Table(table_ref)
    # Drop any existing table so that it is recreated from scratch.
    # <import>: from google.api_core.exceptions import NotFound
    try:
        self.client.delete_table(table)
    except NotFound:
        pass
    table = self.client.create_table(table)
    # <import>: from google.cloud.bigquery import LoadJobConfig
    job_config = LoadJobConfig()
    job_config.source_format = 'CSV'
    job_config.skip_leading_rows = 1
    job_config.autodetect = True
    # Open the file in a with block so it is closed after the upload.
    with open(csvFilePath, 'rb') as csv_file:
        job = self.client.load_table_from_file(
            csv_file, table_ref, job_config=job_config)
    # Block until the load job finishes; raises if the job failed.
    job.result()
Let me know if this solves your problem.