Error loading a tab-delimited file from GCS into BigQuery in a Python script

Time: 2017-06-30 22:50:59

Tags: google-bigquery

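I am using a Python script to load files from GCS into BigQuery. Loading comma-delimited files works with the default settings. However, when I try to load a tab-delimited file after setting the following job properties: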

  job.allowQuotedNewlines=True
  job.fieldDelimiter='\t'
  job.skipLeadingRows=1
  job.maxBadRecords=9999999

and then inserting the job with:

 job.begin()

it fails with the following error:

   "errors": [
   {
    "reason": "invalid",
    "message": "Too many errors encountered."
   },
   {
    "reason": "invalid",
    "location": "gs://my-test/test-file",
    "message": "CSV table references column position 1, but line starting at position:0 contains only 1 columns."
   }
   ]

Is it still looking for a comma delimiter? It looks as if none of the properties I set in the script actually reach the API. What am I missing?

Here are two lines from the file I am trying to load:

0265cd91-3126-4f54-a7e3-54be3ef2d8f9    357215cb-c073-4e67-bfdb-7085f8709015    398a9017-1157-4891-aacb-8108c5fb6378    6bb1f59a-81bb-49da-9974-193a23cb3bca    test B  test2 B 0   2017-03-21 18:48:32 2017-03-21 18:48:32
02aa9715-e47b-4cd9-89f8-a091f7a6e81d    1186dfc3-3b2f-456a-be06-bf5f5a0f7c12    398a9017-1157-4891-aacb-8108c5fb6378    e1983ef2-d7a1-49ce-9fe2-a5cd439b8ca0    test A  test2 A 0   2017-06-26 14:37:43 2017-06-26 14:37:43

Here are the two lines after ":set list" in vim. As you can see, the delimiter is "^I":

  0265cd91-3126-4f54-a7e3-54be3ef2d8f9^I357215cb-c073-4e67-bfdb-7085f8709015^I398a9017-1157-4891-aacb-8108c5fb6378^I6bb1f59a-81bb-49da-9974-193a23cb3bca^IRockMedium B^IRockMedium
  02aa9715-e47b-4cd9-89f8-a091f7a6e81d^I1186dfc3-3b2f-456a-be06-bf5f5a0f7c12^I398a9017-1157-4891-aacb-8108c5fb6378^Ie1983ef2-d7a1-49ce-9fe2-a5cd439b8ca0^IStairModule A^IStairModu

Here is the full code:

  dest_dataset = "temp"
  dest_table = "lineItems_copy"
  destination = self.bq_client.dataset(dest_dataset).table(name=dest_table)
  source_files = "gs://my-test/test-*"
  job_id = "load_gcs_file_to_bq_" +  str(uuid.uuid4())
  print ("job_id= ", job_id)
  job = self.bq_client.load_table_from_storage(job_id, destination, source_files)
  job_properties = {'createDisposition': 'CREATE_NEVER', 'sourceFormat': 'CSV', 'writeDisposition': 'WRITE_APPEND'}

  #testing with tab-delimited:
  job.allowQuotedNewlines=True
  job.fieldDelimiter='\t'
  job.skipLeadingRows=1
  job.maxBadRecords=9999999
  submit_async_load_job(self.bq_client, self.cloud_logger, job, job_id, job_properties)

def submit_async_load_job(bq_client, logger, job, job_id, load_job_options): 
   for key, value in load_job_options.iteritems():
     print ("key value: " , key, " ", value)
     set_property = 'job.' + key + '="'+value+'"'
     print set_property
     exec(set_property)            
   job.begin()
   wait_for_job(logger, job, job_id)
   return

Thanks for your help.

1 Answer:

Answer 0: (score: 1)

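The problem is the property names: when you submit a load job through this API, the properties are spelled differently (snake_case rather than camelCase). For example, job.fieldDelimiter='\t' should be: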

 job.field_delimiter='\t'
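
To illustrate why the camelCase assignments produced no error and no effect, here is a minimal sketch. LoadJobStandIn is a hypothetical stand-in, not the real client class; it assumes the real job object behaves similarly, building the request only from its snake_case properties. camel_to_snake and set_job_option are likewise illustrative helpers, not part of any library.

```python
import re


class LoadJobStandIn(object):
    """Hypothetical stand-in for a load job (NOT the real client class).

    Only values set through the snake_case properties end up in the
    request body; any other attribute is ordinary Python state.
    """

    def __init__(self):
        self._config = {}

    @property
    def field_delimiter(self):
        return self._config.get("fieldDelimiter")

    @field_delimiter.setter
    def field_delimiter(self, value):
        self._config["fieldDelimiter"] = value

    @property
    def skip_leading_rows(self):
        return self._config.get("skipLeadingRows")

    @skip_leading_rows.setter
    def skip_leading_rows(self, value):
        self._config["skipLeadingRows"] = value

    def to_request(self):
        # Return a copy of what would be sent to the API.
        return dict(self._config)


def camel_to_snake(name):
    # "fieldDelimiter" -> "field_delimiter"
    return re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()


def set_job_option(job, api_name, value):
    # Use setattr instead of exec, and refuse names that are not
    # real properties so that typos fail loudly.
    attr = camel_to_snake(api_name)
    if not isinstance(getattr(type(job), attr, None), property):
        raise AttributeError("unknown load option: %s" % api_name)
    setattr(job, attr, value)


job = LoadJobStandIn()

# Python accepts this silently: it just creates a new attribute
# that the request builder never looks at.
job.fieldDelimiter = "\t"
silently_ignored = job.to_request()

# Going through the snake_case properties changes the request.
set_job_option(job, "fieldDelimiter", "\t")
set_job_option(job, "skipLeadingRows", 1)
applied = job.to_request()
```

Following the same pattern, the other settings from the question would presumably be allow_quoted_newlines, skip_leading_rows and max_bad_records, and the options applied through exec (createDisposition and so on) would need the same treatment. These names should be checked against the installed version of the client library. Using setattr also avoids the string quoting in the exec call, which turns every value into a string.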