Python 2.7 & GCP Google BigQuery: capturing file load errors?

Time: 2017-11-16 13:23:55

Tags: python python-2.7 google-bigquery

All,

I'm getting some Python 2.7 BigQuery (BQ) data loads to an "operationally ready" state, and I'm struggling to capture file load errors properly, in a way comparable to the other big-data DW platforms I've used in the past.

In BQ I can access the errors from the load job, e.g. via bigquery_client.load_table_from_uri(...).errors; a sample is below:

[
    {'reason': 'invalid',
     'message': "Could not parse 'r2501' as int for field lineNum (position 0) starting at location 56708 ",
     'location': 'gs://bucketNameHere/fake-data.csv'},
    {'reason': 'invalid',
     'message': 'CSV table references column position 2, but line starting at position:56731 contains only 2 columns.',
     'location': 'gs://bucketNameHere/fake-data.csv'},
    {'reason': 'invalid',
     'message': "Could not parse 'a' as int for field lineNum (position 0) starting at location 56734 ",
     'location': 'gs://bucketNameHere/fake-data.csv'},
    {'reason': 'invalid',
     'message': "Could not parse 'a' as int for field lineNum (position 0) starting at location 56739 ",
     'location': 'gs://bucketNameHere/fake-data.csv'},
    {'reason': 'invalid',
     'message': 'CSV table references column position 1, but line starting at position:56751 contains only 1 columns.',
     'location': 'gs://bucketNameHere/fake-data.csv'}
]

This is good, but I really need some better information, in particular the line number of the error, which is the main problem I'm running into.

In Redshift: stl_loaderror_detail & stl_load_errors (http://docs.aws.amazon.com/redshift/latest/dg/r_STL_LOAD_ERRORS.html)

In SnowflakeDB: load_history & TABLE(VALIDATE(table_name, job_id => '_last')) (https://docs.snowflake.net/manuals/sql-reference/functions/validate.html)

In summary, I need to load what data I can (setting my max_bad_records fairly high), and when records fail I need to know:

  1. the name of the file being loaded (in case I loaded a wildcard of files); this is currently provided
  2. the line number on which the error occurred; this is not currently provided, although the byte offset is embedded in the message ("starting at location" or "position:"). I really need the line number as a standalone element (see the sketch after this list for pulling the offset out of the message)
  3. the error message; this is provided, and the current messages are more than adequate
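
A rough sketch of what I mean by pulling the offset out of item 2's message text (the regex and helper name here are just an illustration, not anything the BigQuery client provides):

    import re

    # the byte offset follows "starting at location" or "starting at position:" in the message
    OFFSET_RE = re.compile(r"starting at (?:location|position:?)\s*(\d+)")

    def extract_error_offsets(errors):
        # returns (file, byte_offset, message) tuples pulled from load_job.errors
        results = []
        for err in (errors or []):
            match = OFFSET_RE.search(err.get('message', ''))
            offset = int(match.group(1)) if match else None
            results.append((err.get('location'), offset, err.get('message')))
        return results
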
Any guidance would be greatly appreciated.

Thanks, Best... Rich

P.S. I will follow up with a comment containing my load script; the way I grab the statistics may be helpful to others, since it took me a while to figure out.

P.P.S. Running on Linux with GOOGLE_APPLICATION_CREDENTIALS set, Python 2.7.

Library versions are as follows:

    google-cloud==0.29.0 
    google-cloud-bigquery==0.28.0
    google-cloud-core==0.28.0
    
    
    
    
    from google.cloud import bigquery

    # load a table into BigQuery from GCS using the supplied schema
    # (convert_schema() is a helper, defined elsewhere, that turns the schema
    # JSON string into a list of bigquery.SchemaField objects)
    def load_table_from_gcs(dataset_name, table_name, schema, source,
                            skip_leading_rows=1, source_format='CSV',
                            max_bad_records=0, write_disposition='WRITE_EMPTY',
                            project=None):
        try:
    
            # convert the schema json string to a list
            schemaList = convert_schema(schema)
    
            bigquery_client = bigquery.Client(project=project)
            dataset_ref = bigquery_client.dataset(dataset_name)
            table_ref = dataset_ref.table(table_name)
            table = bigquery.Table(table_ref, schema=schemaList)
    
            bigquery_client.create_table(table)
    
            job_id_prefix = "bqTools_load_job"
            job_config = bigquery.LoadJobConfig()
            job_config.create_disposition = 'CREATE_NEVER'  # table was created above; valid values are CREATE_IF_NEEDED / CREATE_NEVER
            job_config.skip_leading_rows = skip_leading_rows
            job_config.source_format = source_format
            job_config.write_disposition = write_disposition
    
            if max_bad_records:
                job_config.max_bad_records = max_bad_records
    
            load_job = bigquery_client.load_table_from_uri(
                source, table_ref, job_config=job_config,
                job_id_prefix=job_id_prefix)
    
            # the following waits for table load to complete
            load_job.result()
    
            print("------ load_job\n")
            print("load_job: " + str(type(load_job)))
            print(dir(load_job))
    
            print("------ load_job.result\n")
            job_result = load_job.result()  # result() blocks until the job is done and returns the job itself
            print("job_result: " + str(type(job_result)))
            print(job_result)
    
            job_exception = load_job.exception()  # returns the job's exception, or None on success
            job_id = load_job.job_id
            job_state = load_job.state
            error_result = load_job.error_result
            job_statistics = load_job._job_statistics()  # note: a private helper on the job object
            badRecords = job_statistics['badRecords']
            outputRows = job_statistics['outputRows']
            inputFiles = job_statistics['inputFiles']
            inputFileBytes = job_statistics['inputFileBytes']
            outputBytes = job_statistics['outputBytes']
    
            print("\n ***************************** ")
            print(" job_state:      " + str(job_state))
            print(" error_result:   " + str(error_result))
            print(" job_id:         " + str(job_id))
            print(" badRecords:     " + str(badRecords))
            print(" outputRows:     " + str(outputRows))
            print(" inputFiles:     " + str(inputFiles))
            print(" inputFileBytes: " + str(inputFileBytes))
            print(" outputBytes:    " + str(outputBytes))
            print(" type(job_exception):  " + str(type(job_exception)))
            print(" job_exception:  " + str(job_exception))
            print(" ***************************** ")
    
            print("------ load_job.errors \n")
            myErrors = load_job.errors
            # print("myErrors: " + str(type(myErrors)))
            for errorRecord in (myErrors or []):  # errors is None when the job had no errors
                print(errorRecord)
    
            print("------ ------ ------ ------\n")
    
            # TODO:  need to figure out how to get # records failed, and which ones they are
            # research showed "statistics.load_job" - but not sure how that works
    
            returnMsg = 'load_table_from_gcs {}:{} {}'.format(dataset_name, table_name, source)
    
            return returnMsg
    
        except Exception as e:
            errorStr = 'ERROR (load_table_from_gcs): ' + str(e)
            print(errorStr)
            raise
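
For illustration, a hypothetical call to the function above; the dataset, table, schema string, project and GCS path are placeholders, and the exact schema-JSON layout depends on my convert_schema() helper:

    # hypothetical invocation -- all names below are placeholders
    example_schema = '[{"name": "lineNum", "type": "INTEGER"}, {"name": "colA", "type": "STRING"}]'
    msg = load_table_from_gcs('my_dataset', 'fake_data', example_schema,
                              'gs://bucketNameHere/fake-data*.csv',
                              max_bad_records=100, project='my-gcp-project')
    print(msg)
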
    

1 Answer:

Answer 0 (score: 1):

The reason BigQuery doesn't report the line number of an error is that files are split up and parsed in parallel by many workers. Say a worker is responsible for offsets 10000~20000 of the file; it will seek to 10000 and start parsing from there. When it fails to parse a line, it only knows the starting offset of that line. To know the line number, it would need to scan from the beginning of the file.

You can find the line for a given starting offset. Is there a specific reason you need the line number?
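
For example, a minimal sketch of mapping the reported byte offset back to a line number, assuming the file is small enough to download in full and that google-cloud-storage is available (the helper name and approach are an illustration, not something the BigQuery client provides):

    from google.cloud import storage

    def line_number_for_offset(gcs_uri, byte_offset):
        # turn a 'gs://bucket/path' URI into bucket and object names
        bucket_name, _, object_name = gcs_uri[len('gs://'):].partition('/')
        # download the object and count newlines up to the reported offset
        # (fine for modest files; large files would need ranged reads)
        data = storage.Client().bucket(bucket_name).blob(object_name).download_as_string()
        return data[:byte_offset].count(b'\n') + 1  # 1-based line number

    # e.g. line_number_for_offset('gs://bucketNameHere/fake-data.csv', 56708)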