Following on from my script in PySpark, I am trying to save my dataframe to BigQuery using the Google-provided connector. While it runs smoothly with fewer than 1M rows, it returns an error when run on 3M rows (even though the data structure is exactly the same).
My code follows the Google sample (modified for my project/dataset/dataframe):
# json and subprocess are used below; they are imported at the top of the full script
import json
import subprocess

#[START bigquery export]
# Output Parameters
output_dataset = 'product_recommendation'
output_table = 'spark_ALS_recommendations'
# Get Directory for output
output_directory = 'gs://{}/hadoop/tmp/bigquery/pyspark_output'.format(bucket)
# Delete content if already existing
output_path = sc._jvm.org.apache.hadoop.fs.Path(output_directory)
output_path.getFileSystem(sc._jsc.hadoopConfiguration()).delete(output_path, True)
# Stage data formatted as newline-delimited JSON in Google Cloud Storage.
partitions = range(RddToSave.getNumPartitions())
output_files = [output_directory + '/part-{:05}'.format(i) for i in partitions]
(RddToSave
 .map(lambda (c, s, p): json.dumps({'customer': c, 'sku_id': s, 'prediction': p}))
 .saveAsTextFile(output_directory))
# Shell out to bq CLI to perform BigQuery import.
subprocess.check_call(
    'bq load --source_format NEWLINE_DELIMITED_JSON '
    '--schema customer:STRING,sku_id:STRING,prediction:FLOAT '
    '{dataset}.{table} {files}'.format(
        dataset=output_dataset, table=output_table, files=','.join(output_files)
    ).split())
# Manually clean up the staging_directories, otherwise BigQuery
# files will remain indefinitely.
input_path = sc._jvm.org.apache.hadoop.fs.Path(input_directory)
input_path.getFileSystem(sc._jsc.hadoopConfiguration()).delete(input_path, True)
output_path = sc._jvm.org.apache.hadoop.fs.Path(output_directory)
output_path.getFileSystem(sc._jsc.hadoopConfiguration()).delete(
output_path, True)
#[END bigquery export]
The error is as follows:
16/08/18 18:09:29 INFO com.google.cloud.hadoop.gcsio.CacheSupplementedGoogleCloudStorage: Populating missing itemInfo on-demand for entry: gs://dataproc-3a3edead-12ce-4609-ab1f-ca57b20e4da9-us/hadoop/tmp/bigquery/pyspark_output/_temporary/0/_temporary/attempt_201608181808_0543_m_000017_9912/
16/08/18 18:09:29 INFO com.google.cloud.hadoop.gcsio.CacheSupplementedGoogleCloudStorage: Populating missing itemInfo on-demand for entry: gs://dataproc-3a3edead-12ce-4609-ab1f-ca57b20e4da9-us/hadoop/tmp/bigquery/pyspark_output/_temporary/0/_temporary/attempt_201608181808_0543_m_000641_10536/
16/08/18 18:09:29 INFO com.google.cloud.hadoop.gcsio.CacheSupplementedGoogleCloudStorage: Populating missing itemInfo on-demand for entry: gs://dataproc-3a3edead-12ce-4609-ab1f-ca57b20e4da9-us/hadoop/tmp/bigquery/pyspark_output/_temporary/0/_temporary/attempt_201608181808_0543_m_000118_10013/
16/08/18 18:09:29 INFO com.google.cloud.hadoop.gcsio.CacheSupplementedGoogleCloudStorage: Supplementing missing matched StorageResourceId: gs://dataproc-3a3edead-12ce-4609-ab1f-ca57b20e4da9-us/hadoop/tmp/bigquery/pyspark_output/_temporary/0/_temporary/attempt_201608181808_0543_m_000017_9912/
16/08/18 18:09:29 INFO com.google.cloud.hadoop.gcsio.CacheSupplementedGoogleCloudStorage: Supplementing missing matched StorageResourceId: gs://dataproc-3a3edead-12ce-4609-ab1f-ca57b20e4da9-us/hadoop/tmp/bigquery/pyspark_output/_temporary/0/_temporary/attempt_201608181808_0543_m_000641_10536/
16/08/18 18:09:29 INFO com.google.cloud.hadoop.gcsio.CacheSupplementedGoogleCloudStorage: Supplementing missing matched StorageResourceId: gs://dataproc-3a3edead-12ce-4609-ab1f-ca57b20e4da9-us/hadoop/tmp/bigquery/pyspark_output/_temporary/0/_temporary/attempt_201608181808_0543_m_000118_10013/
Traceback (most recent call last):
File "/tmp/5b991dda-2b91-46e9-b21e-12ebfd8f5363/product_recommendation_mllib_v2.py", line 308, in <module>
dataset=output_dataset, table=output_table, files=','.join(output_files)
File "/usr/lib/python2.7/subprocess.py", line 535, in check_call
retcode = call(*popenargs, **kwargs)
File "/usr/lib/python2.7/subprocess.py", line 522, in call
return Popen(*popenargs, **kwargs).wait()
File "/usr/lib/python2.7/subprocess.py", line 710, in __init__
errread, errwrite)
File "/usr/lib/python2.7/subprocess.py", line 1335, in _execute_child
raise child_exception
OSError: [Errno 7] Argument list too long
16/08/18 18:09:49 INFO akka.remote.RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon.
16/08/18 18:09:49 INFO akka.remote.RemoteActorRefProvider$RemotingTerminator: Remote daemon shut down; proceeding with flushing remote transports.
Job output is complete
Given that it ran smoothly before, I'm not sure whether I should give up on uploading to BigQuery (and find a workaround) or whether this is an error I can fix. What do you think?
Answer 0 (score: 1)
All, found the answer: when uploading to BigQuery, I could not keep the repartitioning of my RDD: it had been set to 80 partitions, and I needed to bring it back to 1, after which it worked perfectly.
My guess is that with more than 1 partition, the code issues several parallel command-line calls to BigQuery at the same time, which causes the error (tbd).
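For illustration, here is a minimal sketch of the fix described above, reusing the names from the question (RddToSave, output_directory, json); coalesce(1) is one way to bring the RDD back to a single partition before staging, assuming the data is small enough to fit in one output file:

# Assumption: RddToSave and output_directory are defined as in the question.
# Collapse the 80 partitions down to a single partition before staging.
staged_rdd = RddToSave.coalesce(1)
partitions = range(staged_rdd.getNumPartitions())
output_files = [output_directory + '/part-{:05}'.format(i) for i in partitions]
(staged_rdd
 .map(lambda (c, s, p): json.dumps({'customer': c, 'sku_id': s, 'prediction': p}))
 .saveAsTextFile(output_directory))
# Only one part-file is now staged and passed on to the bq load step.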