I am trying to join multiple DataFrames:
DF1: count = 2296872, No of partitions = 8
DF2: count = 113274, No of partitions = 4
DF3: count = 1351189, No of partitions = 8
DF4: count = 152291, No of partitions = 2
DF5: count = 481527, No of partitions = 8
DF6: count = 481518, No of partitions = 8
DF7: count = 7714, No of partitions = 1
DF8: count = 39521, No of partitions = 1
DF9: count = 4086, No of partitions = 1
DF10: count = 481527, No of partitions = 8
In addition, I am using SPARK_SQL_SHUFFLE_PARTITIONS = 2000.
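(For context, a minimal sketch of how that setting is applied, assuming SPARK_SQL_SHUFFLE_PARTITIONS is simply a constant fed into the standard spark.sql.shuffle.partitions property; the exact wiring in my launcher is elided:)

// Assumption: SPARK_SQL_SHUFFLE_PARTITIONS ends up in the standard Spark property.
val SPARK_SQL_SHUFFLE_PARTITIONS = 2000L
spark.conf.set("spark.sql.shuffle.partitions", SPARK_SQL_SHUFFLE_PARTITIONS)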
Code:
import org.apache.spark.sql.functions.broadcast

// Left-join each DataFrame in the list onto the (broadcast) smallest one, on the shared "key" column.
val joinedDfs = listContainingMultipleDfs.foldLeft(broadcast(smallerDf)) {
  (a, b) => a.join(b, Seq("key"), "left")
}

joinedDfs.write
  .option("header", "false")
  .option("quote", null) // intent: disable quoting
  .option("delimiter", Delimiter)
  .csv(tempPath) // AWS S3 path
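For clarity on what the fold does: with a two-element list it expands to the chained join below (df1 and df2 are hypothetical stand-ins, not names from my actual code):

// Illustrative expansion of the foldLeft for a two-element list.
val expanded = broadcast(smallerDf)
  .join(df1, Seq("key"), "left")
  .join(df2, Seq("key"), "left")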
How can I improve Spark's performance here?
Stack trace:
java.lang.OutOfMemoryError: Java heap space
at java.io.ObjectStreamClass.lookup(ObjectStreamClass.java:326)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1134)
at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1378)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1174)
at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
The joins take 4 hours, and then the job dies with the above error while trying to save to S3.
Specs:
TOTAL_EXECUTOR_CORES: 8
EXECUTOR_MEMORY: 4G
DRIVER_MEMORY: 4G
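(Assuming those specs map onto the usual Spark properties, the session would be built roughly as below; driver memory normally has to be set at launch time, e.g. via spark-submit, so this is only a sketch of the mapping:)

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .config("spark.cores.max", 8L)         // TOTAL_EXECUTOR_CORES (total cores across executors)
  .config("spark.executor.memory", "4g") // EXECUTOR_MEMORY
  .config("spark.driver.memory", "4g")   // DRIVER_MEMORY (only effective if set before the driver JVM starts)
  .getOrCreate()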