Spark memory leak when overwriting a DataFrame variable

Posted: 2017-05-25 17:09:45

Tags: python apache-spark memory-leaks pyspark apache-spark-sql

I'm running into a memory leak in the Spark driver that I can't figure out. My guess is that it has something to do with trying to overwrite a DataFrame variable, but I can't find any documentation or other questions about this.

This is on Spark 2.1.0 (PySpark).

from pyspark.sql import SparkSession
from pyspark.sql import SQLContext
from pyspark.sql.types import *

spark = SparkSession \
    .builder \
    .appName("Spark Leak") \
    .getOrCreate()

sc = spark.sparkContext
sqlContext = SQLContext.getOrCreate(sc)

items = 5000000
data = [str(x) for x in range(1,items)]

df = sqlContext.createDataFrame(data, StringType())
print(df.count())

for x in range(0,items):
    sub_df = sqlContext.createDataFrame([str(x)], StringType())
    df = df.subtract(sub_df)

    print(df.count())

This just keeps running until the driver runs out of memory and dies.

java.net.SocketException: Connection reset
    at java.net.SocketInputStream.read(SocketInputStream.java:210)
    at java.net.SocketInputStream.read(SocketInputStream.java:141)
    at java.net.SocketInputStream.read(SocketInputStream.java:224)
    at org.apache.spark.api.python.PythonAccumulatorV2.merge(PythonRDD.scala:917)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$updateAccumulators$1.apply(DAGScheduler.scala:1089)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$updateAccumulators$1.apply(DAGScheduler.scala:1081)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.updateAccumulators(DAGScheduler.scala:1081)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1184)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1717)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1675)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1664)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
17/05/25 16:55:40 ERROR DAGScheduler: Failed to update accumulators for task 13
java.net.SocketException: Broken pipe (Write failed)
    at java.net.SocketOutputStream.socketWrite0(Native Method)
    at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:111)
    at java.net.SocketOutputStream.write(SocketOutputStream.java:155)
    at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
    at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
    at java.io.DataOutputStream.flush(DataOutputStream.java:123)
    at org.apache.spark.api.python.PythonAccumulatorV2.merge(PythonRDD.scala:915)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$updateAccumulators$1.apply(DAGScheduler.scala:1089)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$updateAccumulators$1.apply(DAGScheduler.scala:1081)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.updateAccumulators(DAGScheduler.scala:1081)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1184)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1717)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1675)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1664)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
...

If anything, I would expect memory usage to shrink, since items are being removed from the DataFrame, but that is not the case.

Am I not understanding how Spark assigns DataFrames to Python variables, or something?

I also tried assigning df.subtract to a new temporary variable, then unpersisting df, then assigning the temporary variable back to df and dropping the temporary variable, but that also has the same problem.
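
Roughly, that attempt looked like the sketch below (reconstructed for illustration; the variable name tmp_df is mine):

for x in range(0, items):
    sub_df = sqlContext.createDataFrame([str(x)], StringType())
    tmp_df = df.subtract(sub_df)  # build the new plan in a temporary variable
    df.unpersist()                # no-op unless df was actually cached
    df = tmp_df                   # rebind the name; the new plan still references the old one
    del tmp_df                    # drop the temporary Python reference

    print(df.count())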

1 Answer:

Answer 0 (score: 1)

The fundamental problem here seems to be understanding what exactly a DataFrame is (this applies to Spark RDDs as well). A local DataFrame object effectively describes the computation to be performed when an action is executed on that object.
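
One way to see this is to inspect the execution plan, which grows with every iteration of a loop like the one above; a minimal sketch:

# Each subtract wraps the previous plan in another Except operator,
# so the logical plan of df keeps growing on the driver.
df = sqlContext.createDataFrame([str(x) for x in range(1, 100)], StringType())
for x in range(0, 5):
    sub_df = sqlContext.createDataFrame([str(x)], StringType())
    df = df.subtract(sub_df)

df.explain(True)  # the analyzed plan shows five nested Except (subtract) nodes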

It is therefore a recursive structure that captures all of its dependencies; each iteration effectively extends the execution plan. While Spark provides tools, such as checkpointing, that can be used to address this and cut the lineage, the code above doesn't make much sense in the first place.
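
For completeness, a minimal sketch of how checkpointing could be used inside such a loop to cut the lineage (the checkpoint directory is just an example path):

# Checkpointing materializes the data and truncates the logical plan,
# so the driver does not have to keep the whole lineage in memory.
sc.setCheckpointDir("/tmp/spark-checkpoints")  # example path

for x in range(0, items):
    sub_df = sqlContext.createDataFrame([str(x)], StringType())
    df = df.subtract(sub_df)
    if x % 100 == 0:
        df = df.checkpoint()  # DataFrame.checkpoint is available since Spark 2.1

    print(df.count())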

The distributed data structures available in Spark are designed for high-latency, IO-intensive jobs. Parallelizing individual objects and executing millions of Spark jobs over millions of distributed objects simply cannot work well.

Furthermore, Spark is not designed for efficient single-item operations. Each subtract is O(N), which makes the whole process O(N²) and effectively useless on any large dataset.

While fairly pointless by itself, the correct way to express this would be something like:

items = 5000000

df1 = spark.range(items).selectExpr("cast(id as string)")
df2 = spark.range(items).selectExpr("cast(id as string)")
df1.subtract(df2)
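
A single action on the result then runs as one Spark job rather than millions, for example:

# With identical ranges the subtraction is empty, so this prints 0.
print(df1.subtract(df2).count())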