I am trying to load a CSV file from Google Cloud Storage into an empty Google BigQuery table with a GoogleCloudStorageToBigQueryOperator task.
t8 = GoogleCloudStorageToBigQueryOperator(
    task_id='gcs_send_dim_report',
    bucket='report',
    source_objects=[
        'gs://report/test-dim-report/dim_report_{{ ds_nodash }}.csv'
    ],
    schema_fields=['filename_pdf', 'filename_png', 'week_date', 'code'],
    skip_leading_rows=1,
    source_format='CSV',
    create_disposition='CREATE_IF_NEEDED',
    write_disposition='WRITE_TRUNCATE',
    destination_project_dataset_table='xxxx-yyyy:report.test_dim_report_{{ ds_nodash }}',
    dag=dag
)
The target table already has a schema defined in BigQuery. Even so, to try to work around this error, I added the schema_fields parameter with the CSV columns I am using. Looking at the task logs, I first run into the following dependency errors:
from google.appengine.api import memcache
[2018-06-22 05:58:49,650] {base_task_runner.py:98} INFO - Subtask: ImportError: No module named 'google.appengine'
[2018-06-22 05:58:49,650] {base_task_runner.py:98} INFO - Subtask:
[2018-06-22 05:58:49,651] {base_task_runner.py:98} INFO - Subtask: During handling of the above exception, another exception occurred:
[2018-06-22 05:58:49,651] {base_task_runner.py:98} INFO - Subtask:
[2018-06-22 05:58:49,651] {base_task_runner.py:98} INFO - Subtask: Traceback (most recent call last):
[2018-06-22 05:58:49,652] {base_task_runner.py:98} INFO - Subtask: File "/usr/local/lib/python3.5/dist-packages/googleapiclient/discovery_cache/file_cache.py", line 33, in <module>
[2018-06-22 05:58:49,652] {base_task_runner.py:98} INFO - Subtask: from oauth2client.contrib.locked_file import LockedFile
[2018-06-22 05:58:49,652] {base_task_runner.py:98} INFO - Subtask: ImportError: No module named 'oauth2client.contrib.locked_file'
At the end of the log, the final error is shown:
Exception: BigQuery job failed. Final error was: {'reason': 'invalid', 'message': 'Empty schema specified for the load job. Please specify a schema that describes the data being loaded.'}.
I am looking for a way to resolve this error so that the CSV file loads successfully into Google BigQuery.
Answer 0 (score: 2)
There are two ways to achieve this. It all comes from the operator's code documentation, starting with this passage:

The schema to be used for the BigQuery table may be specified in one of two ways. You may either directly pass the schema fields in, or you may point the operator to a Google Cloud Storage object name. The object in Google Cloud Storage must be a JSON file with the schema fields in it.
schema_fields: as shown in the documentation of GoogleCloudStorageToBigQueryOperator: "If set, the schema field list as defined here: https://cloud.google.com/bigquery/docs/reference/v2/jobs#configuration.load Should not be set when source_format is 'DATASTORE_BACKUP'." An example of how to define a schema can be found at: https://cloud.google.com/bigquery/docs/schemas
Example (from the schema docs linked above):
schema = [
bigquery.SchemaField('full_name', 'STRING', mode='REQUIRED'),
bigquery.SchemaField('age', 'INTEGER', mode='REQUIRED'),
]
schema_object: if set, a GCS object path pointing to a .json file that contains the schema for the table. (templated)
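As a sketch of this second option: the schema file that schema_object points to is just the JSON-serialized field list. A minimal way to produce it, assuming the OP's four columns (the field types and the file name here are illustrative guesses, not from the original post):

```python
import json

# Illustrative schema for the OP's CSV columns; the types/modes are assumptions.
schema = [
    {"name": "filename_pdf", "type": "STRING", "mode": "REQUIRED"},
    {"name": "filename_png", "type": "STRING", "mode": "REQUIRED"},
    {"name": "week_date", "type": "DATE", "mode": "REQUIRED"},
    {"name": "code", "type": "INTEGER", "mode": "NULLABLE"},
]

# Write the JSON schema file. This file would then be uploaded to the
# 'report' bucket and referenced from the operator via, e.g.,
# schema_object='schemas/dim_report_schema.json' (hypothetical path).
with open("dim_report_schema.json", "w") as f:
    json.dump(schema, f, indent=2)
```

With this file in place, the operator would take bucket='report' and schema_object='schemas/dim_report_schema.json' instead of schema_fields.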
Answer 1 (score: 1)
As dboshardy noted, the answer provided by tobi6 leads to the following error:

ERROR - Object of type 'SchemaField' is not JSON serializable

As the error indicates, SchemaField is not a JSON-serializable class, while the schema_fields parameter expects a JSON-serializable object.
Instead, pass the schema as a list of dicts (example based on the OP's question):
schema = [
{"name": "filename_pdf", "type": "STRING", "mode": "REQUIRED"},
{"name": "filename_png", "type": "STRING", "mode": "REQUIRED"},
{"name": "week_date", "type": "DATE", "mode": "REQUIRED"},
{"name": "code", "type": "INTEGER", "mode": "NULLABLE"}
]
The provided solution was successfully tested on a similar problem with Google Cloud Composer (Airflow v1.10.6).
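To see why the dict form works where SchemaField objects fail, here is a minimal serializability check, pure Python with no Airflow required (a sketch, not the operator's internal code):

```python
import json

# Schema as plain dicts, as in the answer above.
schema = [
    {"name": "filename_pdf", "type": "STRING", "mode": "REQUIRED"},
    {"name": "filename_png", "type": "STRING", "mode": "REQUIRED"},
    {"name": "week_date", "type": "DATE", "mode": "REQUIRED"},
    {"name": "code", "type": "INTEGER", "mode": "NULLABLE"},
]

# schema_fields must be JSON-serializable. Plain dicts pass this check;
# a list of bigquery.SchemaField objects would raise TypeError here.
serialized = json.dumps(schema)
print(serialized[:60])
```

This is the same check that fails inside the BigQuery job configuration when SchemaField instances are passed instead of dicts.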