Why does repartitioning not take effect on a huge PySpark DataFrame?

Date: 2019-04-22 14:55:08

Tags: python python-3.x apache-spark dataframe pyspark

I have 10 nodes, each with 32 cores and 125 GB of memory. I also have a DataFrame named oldEmployee with two columns, employeeName and its salary.

from pyspark.sql.types import StructType, StructField, StringType, DoubleType

df = ..
oldEmployee = df.rdd.map(lambda item: ....)
mySchema = StructType([StructField("EmployeeName", StringType(), True), StructField("Salary", DoubleType(), True)])
oldEmployeeDF = spark.createDataFrame(oldEmployee, schema=mySchema)

Now I have created the new employee DataFrame, NewEmployeeDF, as follows:

d = df.rdd.flatMap(lambda ....)
mySchema = StructType([StructField("EmployeeName", StringType(), True), StructField("salary", DoubleType(), True)])
NewEmployeeDF = spark.createDataFrame(d, schema=mySchema)

Now I take the union of the two DataFrames:

NewEmployeeDF = NewEmployeeDF.union(oldEmployeeDF)

Now I compute the total salary of each employee:

NewEmployeeDF.registerTempTable("df_table")
salaryDF = spark.sql("SELECT EmployeeName, round(SUM(salary),2) as salary FROM df_table GROUP BY EmployeeName")
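
For reference, the same aggregation can also be written with the DataFrame API instead of SQL; this is just a sketch of an equivalent form, not something the question depends on:

from pyspark.sql import functions as F

# Equivalent to the SQL above: sum the salaries per employee, rounded to 2 decimals.
salaryDF = NewEmployeeDF.groupBy("EmployeeName").agg(F.round(F.sum("salary"), 2).alias("salary"))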

My problem is with the following step, where I want to get the maximum salary. What I do is:

maxSalary = salaryDF.agg({"salary": "max"}).collect()[0][0]
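
The same maximum can be expressed with an explicit function call; again, only a sketch of an equivalent form:

from pyspark.sql import functions as F

# Same result as agg({"salary": "max"}): the single maximum value of the salary column.
maxSalary = salaryDF.agg(F.max("salary")).collect()[0][0]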

This line of code has been running for more than 6 hours and still has not finished. In the log file I noticed that, after running it several times with different parameters, the number of partitions is always set to 400, and execution reaches partition 200/400 and then freezes:

19/04/22 06:46:27 INFO TaskSetManager: Finished task 53.0 in stage 37.0 (TID 4288) in 2017303 ms on 172.16.140.175 (executor 41) (199/400)
19/04/22 06:46:37 INFO TaskSetManager: Finished task 192.0 in stage 37.0 (TID 4427) in 2027473 ms on 172.16.140.254 (executor 1) (200/400)

You can see that each task takes a very long time (2027473 ms ≈ 33 minutes), and I don't understand why it does this.

First, salaryDF is very large, but how can I solve this problem? Why does Spark split salaryDF into 400 partitions instead of the 590 I defined? Thanks for any suggestions.
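
To make the partition question concrete, this sketch (assuming the spark object used above is the SparkSession) shows how I can inspect the actual partition count of salaryDF and change the shuffle setting at runtime:

# Actual number of partitions salaryDF ends up with after the shuffle.
print(salaryDF.rdd.getNumPartitions())

# spark.default.parallelism only applies to RDDs; DataFrame shuffles
# (groupBy, join, ...) are controlled by spark.sql.shuffle.partitions.
spark.conf.set("spark.sql.shuffle.partitions", "590")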

Second, for a huge DataFrame, do you think it is better to use coalesce instead of repartition? The sketch below shows the two calls I mean.
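
For context, these are the two calls the second question is about (the variable names and the target counts 590 and 200 are only illustrative):

# repartition(n) performs a full shuffle and can increase or decrease the partition count.
repartitionedDF = salaryDF.repartition(590)

# coalesce(n) avoids a full shuffle but can only reduce the partition count.
coalescedDF = salaryDF.coalesce(200)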

Note that the conf variable is defined as follows:

conf = (SparkConf()
         #.setMaster("local[*]")
         .setAppName(appName)
         .set("spark.executor.memory", "18g")
         .set('spark.driver.memory', '18g')
         .set('spark.executor.memoryOverhead',"2g")
         .set("spark.network.timeout", "800s")
         #.set("spark.eventLog.enabled", True)
         .set("spark.files.overwrite", "true")
         .set("spark.executor.heartbeatInterval", "20s")
         .set("spark.driver.maxResultSize", "1g")
         .set("spark.executor.instances", 59)
         .set("spark.executor.cores", 5)
         .set("spark.driver.cores", 5)
         .set("spark.default.parallelism", 590)# this takes effect only for RDD you must do repartition for dataframe
         )
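
For completeness, the conf is then used to build the session roughly like this (a sketch; I am assuming the session is created directly from this conf, and appName is defined elsewhere in my script):

from pyspark import SparkConf
from pyspark.sql import SparkSession

# Build the SparkSession from the SparkConf defined above.
spark = SparkSession.builder.config(conf=conf).getOrCreate()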

When I set .set("spark.sql.shuffle.partitions", 590), I get the following:

Traceback (most recent call last):
  File "/home/moudi/main.py", line 778, in <module>
    d= maxSalary.agg({"salary": "max"}).collect()[0][0]# the new added
  File "/cvmfs/soft.computecanada.ca/easybuild/software/2017/Core/spark/2.3.0/python/lib/pyspark.zip/pyspark/sql/dataframe.py", line 466, in collect
  File "/cvmfs/soft.computecanada.ca/easybuild/software/2017/Core/spark/2.3.0/python/lib/py4j-0.10.6-src.zip/py4j/java_gateway.py", line 1160, in __call__
  File "/cvmfs/soft.computecanada.ca/easybuild/software/2017/Core/spark/2.3.0/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
  File "/cvmfs/soft.computecanada.ca/easybuild/software/2017/Core/spark/2.3.0/python/lib/py4j-0.10.6-src.zip/py4j/protocol.py", line 320, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o829.collectToPython.
: org.apache.spark.SparkException: Job aborted due to stage failure: ShuffleMapStage 37 (collect at /home/tamouze/main.py:778) has failed the maximum allowable number of times: 4. Most recent failure reason: org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 19    at org.apache.spark.MapOutputTracker$$anonfun$convertMapStatuses$2.apply(MapOutputTracker.scala:867)    at org.apache.spark.MapOutputTracker$$anonfun$convertMapStatuses$2.apply(MapOutputTracker.scala:863)    at ...

0 Answers:

There are no answers yet.