I have a PySpark application that runs fine when I set the properties as part of spark-submit on the command line, for example:
/usr/bin/spark-submit --master yarn --deploy-mode client --queue default --executor-memory 16G --executor-cores 5 --driver-memory 16G --conf spark.sql.shuffle.partitions=1000 --conf spark.port.maxRetries=64 --conf spark.executor.memoryOverhead=2048 --conf spark.shuffle.service.enabled=true --conf spark.shuffle.registration.timeout=300000 --conf spark.shuffle.registration.maxAttempts=5 --conf spark.sql.broadcastTimeout=3600 --conf spark.driver.maxResultSize=5g /home/hadoop/scripts/spark.py
However, if I try to save the parameters in a JSON file, like this:
"sparkRuntimeConfig" : [
  { "key" : "spark.executor.memory",                "value" : "16G" },
  { "key" : "spark.executor.cores",                 "value" : "5" },
  { "key" : "spark.driver.memory",                  "value" : "16G" },
  { "key" : "spark.sql.shuffle.partitions",         "value" : "1000" },
  { "key" : "spark.port.maxRetries",                "value" : "64" },
  { "key" : "spark.executor.memoryOverhead",        "value" : "2048" },
  { "key" : "spark.shuffle.service.enabled",        "value" : "true" },
  { "key" : "spark.shuffle.registration.timeout",   "value" : "300000" },
  { "key" : "spark.shuffle.registration.maxAttempts", "value" : "5" },
  { "key" : "spark.sql.broadcastTimeout",           "value" : "3600" },
  { "key" : "spark.driver.maxResultSize",           "value" : "5g" }
]
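For context, this is how a fragment like the one above can be parsed into key/value pairs in Python. This is only a sketch: the fragment is embedded as a string here, whereas in the real application it would presumably be read from the JSON file (whose name and enclosing structure are not shown in the question).

```python
import json

# Minimal sketch: the "sparkRuntimeConfig" fragment is assumed to live
# inside a top-level JSON object. Only two entries are shown for brevity.
raw = '''
{
  "sparkRuntimeConfig" : [
    { "key" : "spark.executor.memory", "value" : "16G" },
    { "key" : "spark.executor.cores",  "value" : "5" }
  ]
}
'''

config = json.loads(raw)
# Each entry is a {"key": ..., "value": ...} object; collect them as tuples.
pairs = [(c["key"], c["value"]) for c in config["sparkRuntimeConfig"]]
print(pairs)
```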
and then apply them with the following code:
from pyspark import SparkConf
from pyspark.sql import SparkSession

spark_conf = SparkConf()
for configuration in sparkRuntimeConfig:
    key = configuration['key']
    value = configuration['value']
    print("Setting run time configuration :- key: {0} , value: {1}".format(key, value))
    spark_conf.set(key, value)

spark = SparkSession \
    .builder \
    .appName("Spark Prod migration") \
    .config(conf=spark_conf) \
    .master("yarn-client") \
    .getOrCreate()
I get the following error:
#
# java.lang.OutOfMemoryError: Java heap space
# -XX:OnOutOfMemoryError="kill -9 %p"
# Executing /bin/sh -c "kill -9 17823"...
I tried printing out the properties, and they show the following:
('spark.driver.maxResultSize', '5g')
('spark.driver.memory', '16G')
('spark.executor.cores', '5')
('spark.executor.memory', '16G')
('spark.executor.memoryOverhead', '2048')
('spark.shuffle.registration.timeout', '300000')
('spark.shuffle.service.enabled', 'true')
('spark.sql.broadcastTimeout', '3600')
('spark.sql.shuffle.partitions', '1000')
Can someone help me figure out what is going on?