Spark 1.4 increase maxResultSize memory

Date: 2015-06-25 18:51:55

Tags: python memory apache-spark pyspark jupyter

I am using Spark 1.4 for my research and struggling with the memory settings. My machine has 16GB of memory, so that should not be a problem, since my file is only 300MB. However, when I try to convert the Spark RDD to a pandas DataFrame using the toPandas() function, I receive the following error:

serialized results of 9 tasks (1096.9 MB) is bigger than spark.driver.maxResultSize (1024.0 MB)

I tried to fix this by changing the spark-config file and am still getting the same error. I've heard that this is a problem with Spark 1.4 and am wondering if you know how to solve it. Any help is much appreciated.
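A minimal sketch of the kind of call that triggers this error (the reader and file path are illustrative assumptions, not taken from the question):

from pyspark.sql import SQLContext

sqlContext = SQLContext(sc)                            # sc is the existing SparkContext
df = sqlContext.read.json("path/to/300mb_file.json")   # any ~300MB input, path is hypothetical
pdf = df.toPandas()   # serializes every partition back to the driver;
                      # the collected results can exceed spark.driver.maxResultSize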

7 Answers:

Answer 0 (score: 43)

You can set the spark.driver.maxResultSize parameter in the SparkConf object:

from pyspark import SparkConf, SparkContext

# In Jupyter you have to stop the current context first
sc.stop()

# Create new config
conf = (SparkConf()
    .set("spark.driver.maxResultSize", "2g"))

# Create new context
sc = SparkContext(conf=conf)

You should probably create a new SQLContext as well:

from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
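To double-check that the new limit is in effect, you can read the value back from the SparkConf object built above (a small sanity check, nothing more):

# Read the value back from the conf used to build the new context
print(conf.get("spark.driver.maxResultSize"))   # expected output: 2g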

Answer 1 (score: 21)

From the command line, for example when launching pyspark, --conf spark.driver.maxResultSize=3g can also be used to increase the maximum result size.
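For example, a full invocation might look like this (the 3g value comes from the sentence above; pick a value that fits your driver memory):

pyspark --conf spark.driver.maxResultSize=3g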

Answer 2 (score: 9)

Tuning spark.driver.maxResultSize is good practice considering the running environment. However, it is not the solution to your problem, since the amount of data may change over time. As @Zia-Kayani mentioned, it is better to collect the data wisely. So, if you have a DataFrame df, you can call df.rdd and do all the magic on the cluster instead of in the driver. However, if you do need to collect the data, I would suggest:

  • Do not turn on spark.sql.parquet.binaryAsString; string objects take more space
  • Use spark.rdd.compress to compress RDDs when you collect them
  • Try to collect it using pagination (code in Scala, from another answer, Scala: How to get a range of rows in a dataframe; a PySpark sketch of the same idea follows the Scala code below):

    // assumes df is declared as a var so it can be reassigned below
    var count = df.count()
    val limit = 50
    while (count > 0) {
      val df1 = df.limit(limit)
      df1.show()            // prints the first 50 rows, then the next 50, and so on
      df = df.except(df1)
      count = count - limit
    }
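Since the question itself is about PySpark, here is a rough Python equivalent of the same pagination idea (a sketch only; df is assumed to be the DataFrame you want to inspect, and DataFrame.subtract plays the role of Scala's except):

# Page through a DataFrame in chunks of `limit` rows instead of collecting it all at once
limit = 50
count = df.count()
while count > 0:
    page = df.limit(limit)
    page.show()             # prints the next batch of up to 50 rows
    df = df.subtract(page)  # remove the rows already shown
    count -= limit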

Answer 3 (score: 7)

Looks like you are collecting the RDD, so it will definitely bring all the data to the driver node, which is why you are facing this issue. You should avoid collecting the RDD if it is not required; if it is necessary, then increase spark.driver.maxResultSize. There are two ways of setting this variable:

1 - create a Spark config that sets this variable, as in conf.set("spark.driver.maxResultSize", "3g")
2 - or set this variable in the spark-defaults.conf file present in the conf folder of Spark, e.g. spark.driver.maxResultSize 3g, and restart Spark.
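For the second option, the entry in spark-defaults.conf would look like this (the file lives under the conf directory of your Spark installation):

# $SPARK_HOME/conf/spark-defaults.conf
spark.driver.maxResultSize  3g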

Answer 4 (score: 2)

There is also a Spark bug, https://issues.apache.org/jira/browse/SPARK-12837, that produces the same error:

 serialized results of X tasks (Y MB) is bigger than spark.driver.maxResultSize

even though you may not be pulling any data to the driver explicitly.

SPARK-12837 addresses a Spark bug in which, prior to Spark 2, accumulators/broadcast variables were pulled to the driver unnecessarily, causing this problem.

Answer 5 (score: 2)

When launching a job or the terminal, you can use

--conf spark.driver.maxResultSize="0"

to remove the bottleneck (a value of 0 means unlimited).

Answer 6 (score: 0)

When launching the pyspark shell, you can set spark.driver.maxResultSize to 2GB:

pyspark  --conf "spark.driver.maxResultSize=2g"

This allows 2GB to be used for spark.driver.maxResultSize.