I get this error after joining two large tables and calling collect:

SOME_PATH

I can't find a way to set the driver's maximum result size. Below is my configuration:
import pyspark
from pyspark.sql import SQLContext

conf = pyspark.SparkConf().setAll([
    ("spark.driver.extraClassPath", "/usr/local/bin/postgresql-42.2.5.jar"),
    ("spark.executor.instances", "4"),
    ("spark.executor.cores", "4"),
    ("spark.executor.memories", "10g"),
    ("spark.driver.memory", "15g"),
    ("spark.dirver.maxResultSize", "0"),
    ("spark.memory.offHeap.enabled", "true"),
    ("spark.memory.offHeap.size", "20g"),
])
sc = pyspark.SparkContext(conf=conf)
sc.getConf().getAll()
sqlContext = SQLContext(sc)
I also found a similar question on Stack Overflow that suggested setting maxResultSize, but I couldn't figure out how to do it correctly.
Answer 0 (score: 1)
The following should do the trick. Also note that you misspelled ("spark.executor.memories", "10g"); the correct setting is 'spark.executor.memory'. The same applies to ("spark.dirver.maxResultSize", "0"): the key you are looking for is 'spark.driver.maxResultSize'.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master('yarn')  # depends on the cluster manager of your choice
         .appName('StackOverflow')
         .config('spark.driver.extraClassPath', '/usr/local/bin/postgresql-42.2.5.jar')
         .config('spark.executor.instances', 4)
         .config('spark.executor.cores', 4)
         .config('spark.executor.memory', '10g')
         .config('spark.driver.memory', '15g')
         .config('spark.memory.offHeap.enabled', True)
         .config('spark.memory.offHeap.size', '20g')
         # a bare number is read as bytes, so give an explicit unit ('0' disables the limit)
         .config('spark.driver.maxResultSize', '4g')
         .getOrCreate())  # without getOrCreate() this is a Builder, not a session
sc = spark.sparkContext
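
If you want to verify that the setting actually took effect, you can read it back from the running session; a quick check, assuming the spark and sc objects from the snippet above:

# Read back the effective value from the session's runtime config.
print(spark.conf.get('spark.driver.maxResultSize'))
# Or inspect the SparkConf the context was created with.
print(sc.getConf().get('spark.driver.maxResultSize'))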
Alternatively, try the following:
from pyspark import SparkContext
from pyspark import SparkConf

conf = SparkConf() \
    .setMaster('yarn') \
    .setAppName('StackOverflow') \
    .set('spark.driver.extraClassPath', '/usr/local/bin/postgresql-42.2.5.jar') \
    .set('spark.executor.instances', 4) \
    .set('spark.executor.cores', 4) \
    .set('spark.executor.memory', '10g') \
    .set('spark.driver.memory', '15g') \
    .set('spark.memory.offHeap.enabled', True) \
    .set('spark.memory.offHeap.size', '20g') \
    .set('spark.driver.maxResultSize', '4g')
spark_context = SparkContext(conf=conf)
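
Keep in mind that spark.driver.maxResultSize only raises the ceiling: collecting the join of two large tables can still exhaust spark.driver.memory. A minimal sketch of two common ways to avoid pulling everything to the driver, assuming a joined DataFrame named joined and an output path that are both placeholders here:

# Option 1: keep the result distributed and write it out instead of collecting.
joined.write.mode('overwrite').parquet('/tmp/joined_output')

# Option 2: pull rows to the driver one partition at a time
# rather than materializing the whole result at once.
for row in joined.toLocalIterator():
    process(row)  # process() stands in for your own row handling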