I have 10 nodes, each node with 32 cores and 125 GB of memory. I also have a DataFrame named oldEmployee with two columns: employeeName and salary.
df = ..
oldEmployee = df.rdd.map(lambda item:....)
mySchema = StructType([StructField("EmployeeName", StringType(), True),
                       StructField("Salary", DoubleType(), True)])
oldEmployeeDF = spark.createDataFrame(oldEmployee, schema = mySchema)
Now, I have created UpdatedEmployee as shown below:
d = df.rdd.flatMap(lambda ....)
mySchema = StructType([StructField("EmployeeName", StringType(), True),
                       StructField("salary", DoubleType(), True)])
NewEmployeeDF = spark.createDataFrame(d, schema = mySchema)
Now, I union the two DataFrames:
NewEmployeeDF = NewEmployeeDF.union(oldEmployeeDF)
Now, I compute the total salary for each employee:
NewEmployeeDF.registerTempTable("df_table")
salaryDF = spark.sql("SELECT EmployeeName, round(SUM(salary),2) as salary FROM df_table GROUP BY EmployeeName")
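For reference, the same aggregation could also be written with the DataFrame API instead of Spark SQL (a minimal sketch, assuming the column names shown above; this is not the code I actually ran):

from pyspark.sql import functions as F

salaryDF = (NewEmployeeDF
            .groupBy("EmployeeName")
            .agg(F.round(F.sum("salary"), 2).alias("salary")))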
My problem is in the following step, where I want to get the maximum salary. What I do is this:
maxSalary = salaryDF.agg({"salary": "max"}).collect()[0][0]
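The same maximum could also be expressed with an explicit aggregate function or an ordering; both forms below are only equivalent sketches of the same step, not what I originally ran:

from pyspark.sql import functions as F

maxSalary = salaryDF.agg(F.max("salary")).collect()[0][0]
# or, equivalently:
maxSalary = salaryDF.orderBy(F.desc("salary")).limit(1).collect()[0]["salary"]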
This step has been running for more than 6 hours and still has not finished. In the log file, I noticed that after several runs with different parameters, the number of partitions is always set to 400, and the job reaches partition 200/400 and then freezes:
19/04/22 06:46:27 INFO TaskSetManager: Finished task 53.0 in stage 37.0 (TID 4288) in 2017303 ms on 172.16.140.175 (executor 41) (199/400)
19/04/22 06:46:37 INFO TaskSetManager: Finished task 192.0 in stage 37.0 (TID 4427) in 2027473 ms on 172.16.140.254 (executor 1) (200/400)
Notice that each task takes a very long time (2027473 ms, about 33 minutes), and I do not understand why.
First, salaryDF is very large, but how can I solve this problem? And why does Spark split salaryDF into 400 partitions instead of the 590 I defined? Thanks for your advice.
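For context on that question: the partition count after a DataFrame/SQL shuffle is controlled by spark.sql.shuffle.partitions, while spark.default.parallelism only applies to RDD operations. A minimal sketch for checking and changing it at runtime (assuming the session object is named spark):

print(spark.conf.get("spark.sql.shuffle.partitions"))  # current setting for DataFrame/SQL shuffles
spark.conf.set("spark.sql.shuffle.partitions", "590")  # applies to subsequent shuffles in this session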
Second, for a huge DataFrame, do you think it is better to use coalesce instead of repartition?
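For reference, my understanding of the difference (the partition counts below are only illustrative): coalesce(n) merely merges existing partitions without a full shuffle, while repartition(n) redistributes the rows evenly with a full shuffle:

smallerDF = salaryDF.coalesce(100)    # narrow transformation, no full shuffle, partitions may be uneven
evenDF = salaryDF.repartition(590)    # full shuffle, evenly sized partitions, but more expensive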
Note that the conf variable is defined as follows:
conf = (SparkConf()
        # .setMaster("local[*]")
        .setAppName(appName)
        .set("spark.executor.memory", "18g")
        .set("spark.driver.memory", "18g")
        .set("spark.executor.memoryOverhead", "2g")
        .set("spark.network.timeout", "800s")
        # .set("spark.eventLog.enabled", True)
        .set("spark.files.overwrite", "true")
        .set("spark.executor.heartbeatInterval", "20s")
        .set("spark.driver.maxResultSize", "1g")
        .set("spark.executor.instances", 59)
        .set("spark.executor.cores", 5)
        .set("spark.driver.cores", 5)
        .set("spark.default.parallelism", 590)  # takes effect only for RDDs; for a DataFrame you must repartition
        )
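As a sanity check (a sketch, not part of the original job), the values the running session actually picked up can be printed like this:

for key in ["spark.executor.instances", "spark.executor.cores", "spark.default.parallelism"]:
    print(key, spark.sparkContext.getConf().get(key, "not set"))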
By setting .set("spark.sql.shuffle.partitions", 590), I get the following:
Traceback (most recent call last):
File "/home/moudi/main.py", line 778, in <module>
d= maxSalary.agg({"salary": "max"}).collect()[0][0]# the new added
File "/cvmfs/soft.computecanada.ca/easybuild/software/2017/Core/spark/2.3.0/python/lib/pyspark.zip/pyspark/sql/dataframe.py", line 466, in collect
File "/cvmfs/soft.computecanada.ca/easybuild/software/2017/Core/spark/2.3.0/python/lib/py4j-0.10.6-src.zip/py4j/java_gateway.py", line 1160, in __call__
File "/cvmfs/soft.computecanada.ca/easybuild/software/2017/Core/spark/2.3.0/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
File "/cvmfs/soft.computecanada.ca/easybuild/software/2017/Core/spark/2.3.0/python/lib/py4j-0.10.6-src.zip/py4j/protocol.py", line 320, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o829.collectToPython.
: org.apache.spark.SparkException: Job aborted due to stage failure: ShuffleMapStage 37 (collect at /home/tamouze/main.py:778) has failed the maximum allowable number of times: 4. Most recent failure reason: org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 19
    at org.apache.spark.MapOutputTracker$$anonfun$convertMapStatuses$2.apply(MapOutputTracker.scala:867)
    at org.apache.spark.MapOutputTracker$$anonfun$convertMapStatuses$2.apply(MapOutputTracker.scala:863)
    at ...