Question

I developed a Scala Spark application that streams data directly into Google BigQuery using Spotify's spark-bigquery connector.

It works fine locally. I configured my application following the instructions at https://github.com/spotify/spark-bigquery:

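import com.spotify.spark.bigquery._
import org.apache.spark.sql.SQLContext
import org.apache.spark.streaming.{Seconds, StreamingContext}

// sc is the application's existing SparkContext
val ssc = new StreamingContext(sc, Seconds(120))
val sqlContext = new SQLContext(sc)
sqlContext.setGcpJsonKeyFile("/opt/keyfile.json")
sqlContext.setBigQueryProjectId("projectid")
sqlContext.setBigQueryGcsBucket("gcsbucketname")
sqlContext.setBigQueryDatasetLocation("US")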

But when I submit the application to Spark on a YARN cluster, the job fails because it cannot find the GOOGLE_APPLICATION_CREDENTIALS environment variable:

The Application Default Credentials are not available. They are available if running in Google Compute Engine. Otherwise, the environment variable GOOGLE_APPLICATION_CREDENTIALS must be defined pointing to a file defining the credentials.

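I set the variable as an OS environment variable for the root user, pointing to the .json file that contains the required credentials, but it still fails.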

I also tried the following line:

System.setProperty("GOOGLE_APPLICATION_CREDENTIALS", "/opt/keyfile.json")

without success.

Any idea what I'm missing?

Thanks,

Leonardo

Answer 1

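The documentation suggests: "Environment variables need to be set using the spark.yarn.appMasterEnv.[EnvironmentVariableName] property in your conf/spark-defaults.conf file. Environment variables that are set in spark-env.sh will not be reflected in the YARN Application Master process in cluster mode."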
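For this question, that would mean setting spark.yarn.appMasterEnv.GOOGLE_APPLICATION_CREDENTIALS (and probably spark.executorEnv.GOOGLE_APPLICATION_CREDENTIALS for the executors) to /opt/keyfile.json, either in conf/spark-defaults.conf or with --conf on spark-submit. Note also that System.setProperty sets a JVM system property, not an environment variable, so it would not help here. Below is a minimal sketch of a startup check you could run in the driver to confirm the variable actually reached the process. The object name CredentialsCheck and the helper resolve are made up for illustration, and it assumes the key file has been copied to the same path on every node.

```scala
// Minimal sketch: fail fast if the credentials variable did not reach this JVM.
// Assumes the job was submitted with something like:
//   --conf spark.yarn.appMasterEnv.GOOGLE_APPLICATION_CREDENTIALS=/opt/keyfile.json
//   --conf spark.executorEnv.GOOGLE_APPLICATION_CREDENTIALS=/opt/keyfile.json
// and that /opt/keyfile.json exists on every node (an assumption, not verified here).
object CredentialsCheck {
  val EnvName = "GOOGLE_APPLICATION_CREDENTIALS"

  // Returns the key file path, or a message describing what is missing.
  // The environment is passed in as a Map so the check is easy to test.
  def resolve(env: Map[String, String]): Either[String, String] =
    env.get(EnvName) match {
      case Some(path) if new java.io.File(path).isFile => Right(path)
      case Some(path) => Left(s"$EnvName points to $path, which is not a file on this node")
      case None       => Left(s"$EnvName is not set in this process")
    }

  def main(args: Array[String]): Unit = {
    // sys.env reflects the real process environment; System.setProperty does not change it.
    val message = resolve(sys.env).fold(err => s"ERROR: $err", path => s"Using credentials at $path")
    println(message)
  }
}
```

Calling CredentialsCheck.resolve(sys.env) at the top of the application's main method, before creating the StreamingContext, should make it obvious whether the problem is the missing variable or a key file that is not present on the node running the driver.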

Spark on YARN and the spark-bigquery connector
