How to broadcast a huge RDD in PySpark?

Asked: 2019-02-25 03:55:44

Tags: apache-spark pyspark

When I print the first element of the RDD, like so:

print("input = {}".format(input.take(1)[0]))

I get this result: (u'motor', [0.001,..., 0.9])

where [0.001,..., 0.9] is a Python list.

The input RDD contains 53,304,100 elements.

My problem arises when I try to broadcast the input RDD:

brod = sc.broadcast(input.collect())

The exception raised is shown below (I include only the first part of the traceback):

WARN TaskSetManager: Lost task 56.0 in stage 1.0 (TID 176, 172.16.140.144, executor 0): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/cvmfs/soft.computecanada.ca/easybuild/software/2017/Core/spark/2.3.0/python/lib/pyspark.zip/pyspark/worker.py", line 229, in main
    process()
  File "/cvmfs/soft.computecanada.ca/easybuild/software/2017/Core/spark/2.3.0/python/lib/pyspark.zip/pyspark/worker.py", line 224, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/cvmfs/soft.computecanada.ca/easybuild/software/2017/Core/spark/2.3.0/python/lib/pyspark.zip/pyspark/serializers.py", line 372, in dump_stream
    vs = list(itertools.islice(iterator, batch))
TypeError: <lambda>() missing 1 required positional argument: 'document'

1 Answer:

Answer 0 (score: 1)

If the RDD is too large, the application can run into an OutOfMemory error, because collect() pulls all of the data into the driver's memory, and the driver usually does not have enough memory to hold it.
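To make that concrete, here is a minimal sketch (reusing the input RDD from the question) of why take succeeds where collect can fail:

first = input.take(1)          # take(n) scans only enough partitions to return n elements
everything = input.collect()   # collect() ships all 53,304,100 elements to the driver at once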

So you can try increasing the driver's memory:

pyspark --driver-memory 4g
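The same flag works when submitting a script (spark-submit --driver-memory 4g your_script.py).

If the collected data does fit on the driver, it can also be convenient to collect the pair RDD as a dictionary before broadcasting it, so that executors get O(1) lookups by key. A minimal sketch, assuming input is the (word, vector) pair RDD from the question; the helper name vector_for is only illustrative:

brod = sc.broadcast(input.collectAsMap())  # collectAsMap() builds a Python dict on the driver

def vector_for(word):
    # executors read the broadcast value through .value
    return brod.value.get(word)

print(vector_for(u'motor'))  # e.g. [0.001,..., 0.9]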