Watson Studio "Spark Environment" - how can I increase `spark.driver.maxResultSize`?

Asked: 2018-11-24 14:34:21

Tags: watson-studio

I am running a Spark job that reads, manipulates, and merges many txt files into a single file, and I am running into this error:

Py4JJavaError: An error occurred while calling o8483.collectToPython. : org.apache.spark.SparkException: Job aborted due to stage failure: Total size of serialized results of 838 tasks (1025.6 MB) is bigger than spark.driver.maxResultSize (1024.0 MB)

Is there any way to increase the value of spark.driver.maxResultSize?

Note: this question is about the WS Spark "Environments", not about Analytics Engine.

1 answer:

Answer 0: (score: 0)

If you are using an "Analytics Engine" Spark cluster instance, you can increase the default value through the Ambari console. You can get the link and credentials for the Ambari console from your IAE instance in console.bluemix.net. In the Ambari console, add a new property under

Spark2 -> "Custom spark2-defaults" -> Add Property -> spark.driver.maxResultSize = 2GB

Make sure the spark.driver.maxResultSize value is smaller than the driver memory, which is set under

Spark2 -> "Advanced spark2-env" -> content -> SPARK_DRIVER_MEMORY
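
For the Watson Studio Spark "Environments" case the question actually asks about, there is no Ambari console. One common workaround (a minimal sketch, assuming the environment lets you stop and recreate the notebook's Spark session; the property cannot be changed on an already running driver) is to rebuild the session with a larger limit:

# Sketch only: stop the pre-configured session and rebuild it with a larger
# spark.driver.maxResultSize. Whether a Watson Studio environment honours this
# restart is an assumption; the "2g" value is illustrative.
from pyspark.sql import SparkSession

spark.stop()   # `spark` is the session provided by the notebook kernel

spark = (SparkSession.builder
         .config("spark.driver.maxResultSize", "2g")   # keep below driver memory
         .getOrCreate())

print(spark.conf.get("spark.driver.maxResultSize"))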

If you are simply trying to create a single CSV file and don't want to change the Spark conf values (because you don't know how large the final file will be), another suggestion is to use a function like the one below, which uses the hdfs getmerge command to create a single CSV file, much as you would with pandas.

import os
import tempfile

def writeSparkDFAsCSV_HDFS(spark_df, file_location, file_name, csv_sep=',', csv_quote='"'):
    """
    It can be used to write large spark dataframe as a csv file without running 
    into memory issues while converting to pandas dataframe.
    It first writes the spark df to a temp hdfs location and uses getmerge to create 
    a single file. After adding a header, the merged file is moved to hdfs.

    Args:
        spark_df (spark dataframe) : Data object to be written to file.
        file_location (String) : Directory location of the file.
        file_name (String) : Name of file to write to.
        csv_sep (character) : Field separator to use in csv file
        csv_quote (character) : Quote character to use in csv file
    """
    # define temp and final paths
    file_path= os.path.join(file_location,file_name)
    temp_file_location = tempfile.NamedTemporaryFile().name 
    temp_file_path = os.path.join(temp_file_location,file_name)

    print("Create directories")
    # create directories (if they do not already exist) in both local and HDFS
    !mkdir -p $temp_file_location
    !hdfs dfs -mkdir -p $file_location
    !hdfs dfs -mkdir -p $temp_file_location

    # write to temp hdfs location
    print("Write to temp hdfs location : {}".format("hdfs://" + temp_file_path))
    spark_df.write.csv("hdfs://" + temp_file_path, sep=csv_sep, quote=csv_quote)


    # merge file from hadoop to local
    print("Merge and put file at {}".format(temp_file_path))
    !hdfs dfs -getmerge $temp_file_path $temp_file_path

    # Add header to the merged file
    header = ",".join(spark_df.columns)
    !rm $temp_file_location/.*crc
    line_prepender(temp_file_path, header)

    #move the final file to hdfs
    !hdfs dfs -put -f $temp_file_path $file_path

    #cleanup temp locations
    print("Cleanup..")
    !rm -rf $temp_file_location
    !hdfs dfs -rm -r $temp_file_location
    print("Done!")