I have a PySpark application that runs fine when I set the properties as part of spark-submit on the command line, for example:
/usr/bin/spark-submit --master yarn --deploy-mode client --queue default --executor-memory 16G --executor-cores 5 --driver-memory 16G --conf spark.sql.shuffle.partitions=1000 --conf spark.port.maxRetries=64 --conf spark.executor.memoryOverhead=2048 --conf spark.shuffle.service.enabled=true --conf spark.shuffle.registration.timeout=300000 --conf spark.shuffle.registration.maxAttempts=5 --conf spark.sql.broadcastTimeout=3600 --conf spark.driver.maxResultSize=5g /home/hadoop/scripts/spark.py
However, if I try to save the parameters in a JSON file, like this:
"sparkRuntimeConfig" : [
  { "key" : "spark.executor.memory",                "value" : "16G" },
  { "key" : "spark.executor.cores",                 "value" : "5" },
  { "key" : "spark.driver.memory",                  "value" : "16G" },
  { "key" : "spark.sql.shuffle.partitions",         "value" : "1000" },
  { "key" : "spark.port.maxRetries",                "value" : "64" },
  { "key" : "spark.executor.memoryOverhead",        "value" : "2048" },
  { "key" : "spark.shuffle.service.enabled",        "value" : "true" },
  { "key" : "spark.shuffle.registration.timeout",   "value" : "300000" },
  { "key" : "spark.shuffle.registration.maxAttempts", "value" : "5" },
  { "key" : "spark.sql.broadcastTimeout",           "value" : "3600" },
  { "key" : "spark.driver.maxResultSize",           "value" : "5g" }
]
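For context, this is how a fragment like the one above can be parsed into key/value pairs in Python. This is only a sketch: the fragment is embedded as a string here, whereas in the real application it would presumably be read from the JSON file (whose name and enclosing structure are not shown in the question).

```python
import json

# Minimal sketch: the "sparkRuntimeConfig" fragment is assumed to live
# inside a top-level JSON object. Only two entries are shown for brevity.
raw = '''
{
  "sparkRuntimeConfig" : [
    { "key" : "spark.executor.memory", "value" : "16G" },
    { "key" : "spark.executor.cores",  "value" : "5" }
  ]
}
'''

config = json.loads(raw)
# Each entry is a {"key": ..., "value": ...} object; collect them as tuples.
pairs = [(c["key"], c["value"]) for c in config["sparkRuntimeConfig"]]
print(pairs)
```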
and then apply them with the following code:
from pyspark import SparkConf
from pyspark.sql import SparkSession

spark_conf = SparkConf()
for configuration in sparkRuntimeConfig:
    key = configuration['key']
    value = configuration['value']
    print("Setting run time configuration :- key: {0} , value: {1}".format(key, value))
    spark_conf.set(key, value)

spark = SparkSession \
    .builder \
    .appName("Spark Prod migration") \
    .config(conf=spark_conf) \
    .master("yarn-client") \
    .getOrCreate()
I get the following error:
#
# java.lang.OutOfMemoryError: Java heap space
# -XX:OnOutOfMemoryError="kill -9 %p"
# Executing /bin/sh -c "kill -9 17823"...
I tried printing out the properties, and they show the following:
('spark.driver.maxResultSize', '5g')
('spark.driver.memory', '16G')
('spark.executor.cores', '5')
('spark.executor.memory', '16G')
('spark.executor.memoryOverhead', '2048')
('spark.shuffle.registration.timeout', '300000')
('spark.shuffle.service.enabled', 'true')
('spark.sql.broadcastTimeout', '3600')
('spark.sql.shuffle.partitions', '1000')
Can someone help me figure out what is going on?