Spark very slow when joining multiple DataFrames

Asked: 2017-12-21 22:31:39

Tags: performance scala apache-spark join

I am trying to join multiple DataFrames:

DF1: count = 2296872, No of partitions = 8
DF2: count = 113274, No of partitions = 4
DF3: count = 1351189, No of partitions = 8
DF4: count = 152291, No of partitions = 2
DF5: count = 481527, No of partitions = 8
DF6: count = 481518, No of partitions = 8
DF7: count = 7714, No of partitions = 1
DF8: count = 39521, No of partitions = 1
DF9: count = 4086, No of partitions = 1
DF10: count = 481527, No of partitions = 8

Also, I am using SPARK_SQL_SHUFFLE_PARTITIONS = 2000.
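(For reference, a minimal sketch of how that value is typically applied when the session is created; the app name below is a placeholder:)

    import org.apache.spark.sql.SparkSession

    // Hypothetical session setup; "JoinJob" is a made-up name.
    // spark.sql.shuffle.partitions controls how many partitions Spark SQL
    // uses for shuffles in joins and aggregations (default is 200).
    val spark = SparkSession.builder()
      .appName("JoinJob")
      .config("spark.sql.shuffle.partitions", "2000")
      .getOrCreate()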

Code:

    val joinedDfs = listContainingMultipleDfs.foldLeft(broadcast(smallerDf))(
      (a, b) => a.join(b, Seq("key"), "left"))

    joinedDfs.write
      .option("header", "false")
      .option("quote", null)
      .option("delimiter", Delimiter)
      .csv(tempPath) // AWS path
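(For context, a self-contained sketch of what that fold-left join chain does, using made-up toy data and column names; the real schemas are not shown in the question:)

    import org.apache.spark.sql.{DataFrame, SparkSession}
    import org.apache.spark.sql.functions.broadcast

    val spark = SparkSession.builder()
      .appName("FoldJoinSketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Toy stand-ins for the real DataFrames; columns are assumptions.
    val smallerDf: DataFrame = Seq((1, "a"), (2, "b")).toDF("key", "v0")
    val listContainingMultipleDfs: List[DataFrame] = List(
      Seq((1, "x"), (3, "y")).toDF("key", "v1"),
      Seq((2, "p"), (3, "q")).toDF("key", "v2")
    )

    // Same shape as the question's code: a fold-left chain of left joins
    // on "key". Note that broadcast() hints only the initial DataFrame;
    // the growing intermediate result of each join is what gets shuffled.
    val joinedDfs = listContainingMultipleDfs.foldLeft(broadcast(smallerDf))(
      (a, b) => a.join(b, Seq("key"), "left"))

    joinedDfs.explain() // inspect which joins became broadcast vs. sort-merge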

How can I improve Spark's performance here?

Stack trace:

java.lang.OutOfMemoryError: Java heap space
    at java.io.ObjectStreamClass.lookup(ObjectStreamClass.java:326)
    at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1134)
    at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1378)
    at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1174)
    at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
    at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
    at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)

The joins take 4 hours, and the job then dies with the above error while trying to save to S3.

Specs:

TOTAL_EXECUTOR_CORES: 8

EXECUTOR_MEMORY: '4G'

DRIVER_MEMORY: '4G'
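(For reference, a sketch of how those values correspond to standard Spark configuration keys, assuming a standalone cluster; in practice they are usually passed at submit time rather than in code:)

    import org.apache.spark.sql.SparkSession

    // Mapping of the spec values onto Spark configuration keys.
    // Driver memory must be supplied at launch (e.g. via spark-submit),
    // since the driver JVM is already running when this code executes.
    val spark = SparkSession.builder()
      .config("spark.cores.max", "8")        // total executor cores (standalone/Mesos)
      .config("spark.executor.memory", "4g")
      .config("spark.driver.memory", "4g")   // effective only if set before launch
      .getOrCreate()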

0 Answers