How do I configure maxResultSize correctly?

Time: 2019-05-25 05:43:27

Tags: apache-spark pyspark

I cannot find a way to set the driver's maximum result size. I get this error after joining two large tables and collecting the result:

SOME_PATH

Below is my configuration:

import pyspark
from pyspark.sql import SQLContext

conf = pyspark.SparkConf().setAll([("spark.driver.extraClassPath", "/usr/local/bin/postgresql-42.2.5.jar"),
                                   ("spark.executor.instances", "4"),
                                   ("spark.executor.cores", "4"),
                                   ("spark.executor.memories", "10g"),
                                   ("spark.driver.memory", "15g"),
                                   ("spark.dirver.maxResultSize", "0"),
                                   ("spark.memory.offHeap.enabled", "true"),
                                   ("spark.memory.offHeap.size", "20g")])

sc = pyspark.SparkContext(conf=conf)
sc.getConf().getAll()
sqlContext = SQLContext(sc)

I found a similar question on Stack Overflow whose answers suggested setting maxResultSize, but I could not figure out how to do it correctly.

1 Answer:

Answer 0 (score: 1)

The following should do the trick. Note that you misspelled ("spark.executor.memories", "10g"); the correct key is 'spark.executor.memory'. The same applies to ("spark.dirver.maxResultSize", "0"): the key should be 'spark.driver.maxResultSize', so your setting of "0" (which would remove the result-size limit entirely) never took effect.

from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .master('yarn')  # depends on the cluster manager of your choice
    .appName('StackOverflow')
    .config('spark.driver.extraClassPath', '/usr/local/bin/postgresql-42.2.5.jar')
    .config('spark.executor.instances', 4)
    .config('spark.executor.cores', 4)
    .config('spark.executor.memory', '10g')
    .config('spark.driver.memory', '15g')
    .config('spark.memory.offHeap.enabled', True)
    .config('spark.memory.offHeap.size', '20g')
    .config('spark.driver.maxResultSize', '4g')  # note: 'driver', not 'dirver'; '0' removes the limit
    .getOrCreate()  # without this the builder never actually creates the session
)
sc = spark.sparkContext
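
To verify that the setting actually took effect, read it back from the running session; Spark stores a misspelled key like 'spark.dirver.maxResultSize' verbatim but never consults it, so this is a quick sanity check. A minimal sketch:

# Confirm the correctly spelled key is set on the live session.
print(spark.conf.get('spark.driver.maxResultSize'))  # expected: '4g'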

Alternatively, try the following:

from pyspark import SparkContext
from pyspark import SparkConf

conf = (SparkConf()
        .setMaster('yarn')
        .setAppName('StackOverflow')
        .set('spark.driver.extraClassPath', '/usr/local/bin/postgresql-42.2.5.jar')
        .set('spark.executor.instances', '4')
        .set('spark.executor.cores', '4')
        .set('spark.executor.memory', '10g')
        .set('spark.driver.memory', '15g')
        .set('spark.memory.offHeap.enabled', 'true')
        .set('spark.memory.offHeap.size', '20g')
        .set('spark.driver.maxResultSize', '4g'))  # again: 'driver', not 'dirver'

spark_context = SparkContext(conf=conf)
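
Raising spark.driver.maxResultSize only treats the symptom, though: collecting the join of two large tables pulls the entire result into driver memory. If the rows only need to be processed one by one, or simply persisted, the limit can be sidestepped altogether. A minimal sketch, assuming a SparkSession named spark as above; df_a, df_b, process, and the output path are hypothetical placeholders:

joined = df_a.join(df_b, 'id')  # two large DataFrames, as in the question

# Option 1: pull one partition at a time instead of collect(),
# so only a single partition has to fit on the driver.
for row in joined.toLocalIterator():
    process(row)

# Option 2: bypass the driver entirely and write the result to storage.
joined.write.parquet('/some/output/path')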