Error when uploading a 3 million row dataframe from Spark to BigQuery (using the Google connector)

Date: 2016-08-18 19:48:59

Tags: google-bigquery pyspark google-cloud-platform

At the end of a PySpark script, I try to save my dataframe to BigQuery using the Google-provided connector. It runs smoothly with fewer than 1 million rows, but returns an error when run on 3 million rows (even though the data structure is exactly the same).

My code follows the Google sample (modified for my project/dataset/dataframe):

#[START bigquery export]
import json        # needed for json.dumps below
import subprocess  # needed for subprocess.check_call below

# Output Parameters
output_dataset = 'product_recommendation'
output_table = 'spark_ALS_recommendations'

# Get Directory for output
output_directory = 'gs://{}/hadoop/tmp/bigquery/pyspark_output'.format(bucket)
# Delete content if already existing
output_path = sc._jvm.org.apache.hadoop.fs.Path(output_directory)
output_path.getFileSystem(sc._jsc.hadoopConfiguration()).delete(output_path, True)
# Stage data formatted as newline-delimited JSON in Google Cloud Storage.
partitions = range(RddToSave.getNumPartitions())
output_files = [output_directory + '/part-{:05}'.format(i) for i in partitions]

(RddToSave
 .map(lambda (c, s, p): json.dumps({'customer': c, 'sku_id': s, 'prediction': p}))
 .saveAsTextFile(output_directory))

# Shell out to bq CLI to perform BigQuery import.
subprocess.check_call(
    'bq load --source_format NEWLINE_DELIMITED_JSON '
    '--schema customer:STRING,sku_id:STRING,prediction:FLOAT '
    '{dataset}.{table} {files}'.format(
        dataset=output_dataset, table=output_table, files=','.join(output_files)
    ).split())

# Manually clean up the staging_directories, otherwise BigQuery
# files will remain indefinitely.
# (input_directory is assumed to be defined earlier in the full script,
# as the BigQuery input staging path; it is not shown in this snippet.)
input_path = sc._jvm.org.apache.hadoop.fs.Path(input_directory)
input_path.getFileSystem(sc._jsc.hadoopConfiguration()).delete(input_path, True)
output_path = sc._jvm.org.apache.hadoop.fs.Path(output_directory)
output_path.getFileSystem(sc._jsc.hadoopConfiguration()).delete(
    output_path, True)

#[END bigquery export]

The error is the following:

16/08/18 18:09:29 INFO com.google.cloud.hadoop.gcsio.CacheSupplementedGoogleCloudStorage: Populating missing itemInfo on-demand for entry: gs://dataproc-3a3edead-12ce-4609-ab1f-ca57b20e4da9-us/hadoop/tmp/bigquery/pyspark_output/_temporary/0/_temporary/attempt_201608181808_0543_m_000017_9912/
16/08/18 18:09:29 INFO com.google.cloud.hadoop.gcsio.CacheSupplementedGoogleCloudStorage: Populating missing itemInfo on-demand for entry: gs://dataproc-3a3edead-12ce-4609-ab1f-ca57b20e4da9-us/hadoop/tmp/bigquery/pyspark_output/_temporary/0/_temporary/attempt_201608181808_0543_m_000641_10536/
16/08/18 18:09:29 INFO com.google.cloud.hadoop.gcsio.CacheSupplementedGoogleCloudStorage: Populating missing itemInfo on-demand for entry: gs://dataproc-3a3edead-12ce-4609-ab1f-ca57b20e4da9-us/hadoop/tmp/bigquery/pyspark_output/_temporary/0/_temporary/attempt_201608181808_0543_m_000118_10013/
16/08/18 18:09:29 INFO com.google.cloud.hadoop.gcsio.CacheSupplementedGoogleCloudStorage: Supplementing missing matched StorageResourceId: gs://dataproc-3a3edead-12ce-4609-ab1f-ca57b20e4da9-us/hadoop/tmp/bigquery/pyspark_output/_temporary/0/_temporary/attempt_201608181808_0543_m_000017_9912/
16/08/18 18:09:29 INFO com.google.cloud.hadoop.gcsio.CacheSupplementedGoogleCloudStorage: Supplementing missing matched StorageResourceId: gs://dataproc-3a3edead-12ce-4609-ab1f-ca57b20e4da9-us/hadoop/tmp/bigquery/pyspark_output/_temporary/0/_temporary/attempt_201608181808_0543_m_000641_10536/
16/08/18 18:09:29 INFO com.google.cloud.hadoop.gcsio.CacheSupplementedGoogleCloudStorage: Supplementing missing matched StorageResourceId: gs://dataproc-3a3edead-12ce-4609-ab1f-ca57b20e4da9-us/hadoop/tmp/bigquery/pyspark_output/_temporary/0/_temporary/attempt_201608181808_0543_m_000118_10013/
Traceback (most recent call last):
  File "/tmp/5b991dda-2b91-46e9-b21e-12ebfd8f5363/product_recommendation_mllib_v2.py", line 308, in <module>
    dataset=output_dataset, table=output_table, files=','.join(output_files)
  File "/usr/lib/python2.7/subprocess.py", line 535, in check_call
    retcode = call(*popenargs, **kwargs)
  File "/usr/lib/python2.7/subprocess.py", line 522, in call
    return Popen(*popenargs, **kwargs).wait()
  File "/usr/lib/python2.7/subprocess.py", line 710, in __init__
    errread, errwrite)
  File "/usr/lib/python2.7/subprocess.py", line 1335, in _execute_child
    raise child_exception
OSError: [Errno 7] Argument list too long
16/08/18 18:09:49 INFO akka.remote.RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon.
16/08/18 18:09:49 INFO akka.remote.RemoteActorRefProvider$RemotingTerminator: Remote daemon shut down; proceeding with flushing remote transports.
Job output is complete

Given that it worked smoothly before, I am not sure whether I should give up on uploading to BigQuery (and find a workaround), or whether this is an error I can fix. wdyt?

1 Answer:

Answer (score: 1)

All, found the answer: when uploading to BigQuery I could not keep the repartitioning of my RDD: it had been set to 80 partitions, and I needed to bring it back to 1, after which it worked perfectly.
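A minimal sketch of that fix, assuming RddToSave is the RDD from the code above (the exact call used in the original job is not shown, so coalesce here is an assumption):

# Assumed fix: collapse the RDD to a single partition before staging to GCS,
# so only one part file is produced and passed to the bq command line.
RddToSave = RddToSave.coalesce(1)

partitions = range(RddToSave.getNumPartitions())   # now just [0]
output_files = [output_directory + '/part-{:05}'.format(i) for i in partitions]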

My guess is that with more than 1 partition, the code fires several parallel bq command-line calls to BigQuery at the same time, which causes the error (to be confirmed).
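One alternative worth noting (not part of the original answer, so treat it as an untested sketch): the OSError above is raised while subprocess launches the bq command, and that command grows with every part file joined into {files}. Since bq load accepts a Cloud Storage wildcard URI, a single pattern would keep the argument list short regardless of the partition count:

# Untested alternative: load every part file through one wildcard URI
# instead of enumerating all output files on the command line.
subprocess.check_call(
    'bq load --source_format NEWLINE_DELIMITED_JSON '
    '--schema customer:STRING,sku_id:STRING,prediction:FLOAT '
    '{dataset}.{table} {files}'.format(
        dataset=output_dataset,
        table=output_table,
        files=output_directory + '/part-*'
    ).split())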