All,
I'm getting some Python 2.7 BigQuery (BQ) data loads "operationally ready", and I'm trying to capture file-load errors properly, in a way comparable to the other big-data DW platforms I've worked with in the past.
In BQ I can access the errors from the load job returned by bigquery_client.load_table_from_uri (its .errors attribute); a sample follows:
[
 {'reason': 'invalid',
  'message': "Could not parse 'r2501' as int for field lineNum (position 0) starting at location 56708 ",
  'location': 'gs://bucketNameHere/fake-data.csv'},
 {'reason': 'invalid',
  'message': 'CSV table references column position 2, but line starting at position:56731 contains only 2 columns.',
  'location': 'gs://bucketNameHere/fake-data.csv'},
 {'reason': 'invalid',
  'message': "Could not parse 'a' as int for field lineNum (position 0) starting at location 56734 ",
  'location': 'gs://bucketNameHere/fake-data.csv'},
 {'reason': 'invalid',
  'message': "Could not parse 'a' as int for field lineNum (position 0) starting at location 56739 ",
  'location': 'gs://bucketNameHere/fake-data.csv'},
 {'reason': 'invalid',
  'message': 'CSV table references column position 1, but line starting at position:56751 contains only 1 columns.',
  'location': 'gs://bucketNameHere/fake-data.csv'}
]
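For reference, here is a minimal sketch of how a list like the one above is obtained, plus pulling the byte offsets out of the messages. The dataset/table/bucket names are placeholders, and the regex is an assumption based only on the two message phrasings shown above:
import re
from google.cloud import bigquery

client = bigquery.Client()  # assumes GOOGLE_APPLICATION_CREDENTIALS is set

# hypothetical placeholders for illustration
table_ref = client.dataset('my_dataset').table('fake_data')
job_config = bigquery.LoadJobConfig()
job_config.max_bad_records = 1000  # tolerate bad rows so the job completes

load_job = client.load_table_from_uri(
    'gs://bucketNameHere/fake-data.csv', table_ref, job_config=job_config)
load_job.result()  # wait for the load to finish

# the error messages embed a byte offset; this regex matches the two
# phrasings seen in the sample output above
OFFSET_RE = re.compile(r'starting at (?:location|position:)\s*(\d+)')

for err in load_job.errors or []:
    m = OFFSET_RE.search(err.get('message', ''))
    offset = int(m.group(1)) if m else None
    print('offset=%s  %s' % (offset, err['message']))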
This is good, but I really need some better information, in particular the row number of the error, which is the main problem I'm running into.
In Redshift: stl_loaderror_detail & stl_load_errors http://docs.aws.amazon.com/redshift/latest/dg/r_STL_LOAD_ERRORS.html
In SnowflakeDB: load_history & TABLE(VALIDATE(table_name, job_id => '_last')); https://docs.snowflake.net/manuals/sql-reference/functions/validate.html
In summary, I need to load the data that I can (setting my max_bad_records fairly high), and when records fail I need to know which rows failed and why.
Any guidance would be greatly appreciated.
Thank you, best... Rich
P.S. I'll follow up with a comment containing my load script; I think the way I'm grabbing the statistics may be helpful to people, since it took me a while to figure out.
P.P.S. Running on Linux, with GOOGLE_APPLICATION_CREDENTIALS set, Python 2.7
Library versions are as follows:
google-cloud==0.29.0
google-cloud-bigquery==0.28.0
google-cloud-core==0.28.0
from google.cloud import bigquery  # tested with google-cloud-bigquery==0.28.0


# load a table into BQ from GCS using the given schema
def load_table_from_gcs(dataset_name, table_name, schema, source,
                        skip_leading_rows=1, source_format='CSV',
                        max_bad_records=0, write_disposition='WRITE_EMPTY',
                        project=None):
    try:
        # convert the schema JSON string to a list of SchemaField objects
        # (convert_schema is my own helper, not shown here)
        schemaList = convert_schema(schema)

        bigquery_client = bigquery.Client(project=project)
        dataset_ref = bigquery_client.dataset(dataset_name)
        table_ref = dataset_ref.table(table_name)
        table = bigquery.Table(table_ref, schema=schemaList)
        bigquery_client.create_table(table)

        job_id_prefix = "bqTools_load_job"
        job_config = bigquery.LoadJobConfig()
        # the table was just created above, so the load job must not create it;
        # valid values are 'CREATE_IF_NEEDED' and 'CREATE_NEVER' (not 'NEVER')
        job_config.create_disposition = 'CREATE_NEVER'
        job_config.skip_leading_rows = skip_leading_rows
        job_config.source_format = source_format
        job_config.write_disposition = write_disposition

        if max_bad_records:
            job_config.max_bad_records = max_bad_records

        load_job = bigquery_client.load_table_from_uri(
            source, table_ref, job_config=job_config,
            job_id_prefix=job_id_prefix)

        # result() blocks until the table load completes (and returns the job)
        job_result = load_job.result()

        print("------ load_job\n")
        print("load_job: " + str(type(load_job)))
        print(dir(load_job))

        print("------ load_job.result\n")
        print("job_result: " + str(type(job_result)))
        print(job_result)

        job_exception = load_job.exception()  # exception() is a method, not an attribute
        job_id = load_job.job_id
        job_state = load_job.state
        error_result = load_job.error_result

        # _job_statistics() is a private helper on the job object that exposes
        # the load statistics BigQuery reports for the job
        job_statistics = load_job._job_statistics()
        badRecords = job_statistics['badRecords']
        outputRows = job_statistics['outputRows']
        inputFiles = job_statistics['inputFiles']
        inputFileBytes = job_statistics['inputFileBytes']
        outputBytes = job_statistics['outputBytes']

        print("\n ***************************** ")
        print(" job_state:      " + str(job_state))
        print(" error_result:   " + str(error_result))
        print(" job_id:         " + str(job_id))
        print(" badRecords:     " + str(badRecords))
        print(" outputRows:     " + str(outputRows))
        print(" inputFiles:     " + str(inputFiles))
        print(" inputFileBytes: " + str(inputFileBytes))
        print(" outputBytes:    " + str(outputBytes))
        print(" type(job_exception): " + str(type(job_exception)))
        print(" job_exception:  " + str(job_exception))
        print(" ***************************** ")

        print("------ load_job.errors \n")
        myErrors = load_job.errors
        for errorRecord in myErrors:
            print(errorRecord)

        print("------ ------ ------ ------\n")

        # TODO: need to figure out how many records failed, and which ones they are;
        # research showed "statistics.load_job" - but not sure how that works

        returnMsg = 'load_table_from_gcs {}:{} {}'.format(dataset_name, table_name, source)
        return returnMsg

    except Exception as e:
        errorStr = 'ERROR (load_table_from_gcs): ' + str(e)
        print(errorStr)
        raise
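An example call might look like the following; the project, dataset, table, and schema values are placeholders, and the write_disposition shown is just one reasonable choice for repeated reloads:
msg = load_table_from_gcs(
    dataset_name='my_dataset',
    table_name='fake_data',
    schema=schema_json,  # hypothetical schema JSON string for convert_schema()
    source='gs://bucketNameHere/fake-data.csv',
    max_bad_records=1000,                # let the job finish despite bad rows
    write_disposition='WRITE_TRUNCATE',
    project='my-project')
print(msg)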
Answer 0 (score: 1)
The reason BigQuery doesn't report the line number of an error is that files are split up and parsed in parallel by many workers. Say a worker is responsible for offsets 10000~20000 of a file: it seeks to 10000 and starts parsing from there. When it fails to parse a line, it only knows that line's starting offset. To know the line number, it would need to scan from the beginning of the file.
You can find the line for a given starting offset. Is there a specific reason you need the line number?
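If you do need the line number, one option is the scan described above, done client-side against a local copy of the file. A rough sketch, assuming the object has been downloaded from GCS, the reported offsets are byte offsets from the start of the file, and no quoted CSV fields contain embedded newlines:
def offset_to_line(path, target_offset):
    """Map a byte offset (as reported in the load errors) to a
    1-based line number by scanning the file from the beginning."""
    offset = 0
    with open(path, 'rb') as f:
        for line_num, line in enumerate(f, start=1):
            if offset + len(line) > target_offset:
                return line_num, line
            offset += len(line)
    return None, None  # offset is past the end of the file

line_num, bad_line = offset_to_line('fake-data.csv', 56708)
print('line %s: %r' % (line_num, bad_line))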