Reading a large sequence file in PySpark

Date: 2017-03-27 22:34:51

Tags: apache-spark pyspark

I generated a sequence file of about 900 MB that contains 8 records; 6 of those records are roughly 128 MB each (the HDFS block size).

In PySpark I read it as follows (both the key and the value are custom Java classes):

rdd = sc.sequenceFile("hdfs:///Test.seq", keyClass="ChunkID", valueClass="ChunkData", keyConverter="KeyToChunkConverter", valueConverter="DataToChunkConverter")

rdd.getNumPartitions() reports 7 partitions. I try to process it as follows:

import logging

def open_map():
    def open_map_nested(key_value):
        try:
            # key is a ChunkID, data is a ChunkData (after conversion)
            key, data = key_value
            if key[0] == 0:
                return [['if', 'if', 'if']]
            else:
                return [["else", "else", "else"]]
        except Exception as e:
            logging.exception(e)
            return [["None", "None", "None"], ["None", "None", "None"]]
    return open_map_nested

result = rdd.flatMap(open_map()).collect()

However, a MemoryError occurs, as shown below:

  File "/home/wong/spark_install/spark-2.0.2-bin-hadoop2.7/python/pyspark/worker.py", line 167, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/home/wong/spark_install/spark-2.0.2-bin-hadoop2.7/python/pyspark/serializers.py", line 263, in dump_stream
    vs = list(itertools.islice(iterator, batch))
  File "/home/wong/spark_install/spark-2.0.2-bin-hadoop2.7/python/pyspark/serializers.py", line 139, in load_stream
    yield self._read_with_length(stream)
  File "/home/wong/spark_install/spark-2.0.2-bin-hadoop2.7/python/pyspark/serializers.py", line 164, in _read_with_length
    return self.loads(obj)
  File "/home/wong/spark_install/spark-2.0.2-bin-hadoop2.7/python/pyspark/serializers.py", line 422, in loads
    return pickle.loads(obj)
MemoryError

There seems to be a problem when deserializing the objects, and the program above takes a very long time to run. (By the way, the executor memory on my system is 3 GB.)
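
One hypothetical way to narrow this down (the helper below is illustrative, not part of the job above) is to count the records in each partition, so the task that throws the error identifies the offending input split:

def count_records(index, records):
    # Hypothetical debugging aid: emit (partition index, record count) so the
    # failing split can be identified from whichever task raises the error.
    yield (index, sum(1 for _ in records))

print(rdd.mapPartitionsWithIndex(count_records).collect())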

  • Update:

Thanks for your help! I tried using count instead, but the problem persists. I believe the error occurs on the executor nodes while each input split is being read, so I checked my executor configuration again to try to pin down the cause. It is set to --executor-memory 2500M and --conf spark.yarn.executor.memoryOverhead=512. According to the calculation in this article, the effective MemoryStore capacity is roughly 1.2 GB, which is also what the log file shows:

17/03/30 17:15:57 INFO memory.MemoryStore: MemoryStore started with capacity 1153.3 MB
17/03/30 17:15:58 INFO executor.CoarseGrainedExecutorBackend: Connecting to driver: spark://CoarseGrainedScheduler@192.168.100.5:34875
17/03/30 17:15:58 INFO executor.CoarseGrainedExecutorBackend: Successfully registered with driver
.....
17/03/30 17:16:26 INFO memory.MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 42.2 KB, free 1153.3 MB)
17/03/30 17:16:26 INFO broadcast.TorrentBroadcast: Reading broadcast variable 0 took 18 ms
17/03/30 17:16:26 INFO memory.MemoryStore: Block broadcast_0 stored as values in memory (estimated size 545.5 KB, free 1152.8 MB)
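
For reference, the 1153.3 MB figure is roughly what Spark 2.x's unified memory model predicts. The sketch below assumes the default settings and an assumed JVM-reported heap of about 2222 MB for -Xmx2500m (Runtime.maxMemory() normally excludes one survivor space, so it is below the nominal 2500 MB):

# Back-of-envelope sketch of the Spark 2.x unified memory pool (assumptions:
# 300 MB reserved memory, default spark.memory.fraction = 0.6, and a
# JVM-reported max heap of ~2222 MB for --executor-memory 2500M).
usable_heap_mb = 2222
reserved_mb = 300
memory_fraction = 0.6
print((usable_heap_mb - reserved_mb) * memory_fraction)  # ~1153 MB, as in the log above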

I also found a discussion about the PySpark serializers here, but changing the batch size to unlimited did not help.
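
Assuming the batch size in question is the batchSize argument of sc.sequenceFile (default 0, which lets PySpark choose a batch size automatically), the attempt presumably looked something like the sketch below; exactly how "unlimited" maps onto this argument is an assumption on my part:

# Illustrative only: re-read the file with an explicit batchSize; batchSize=1
# effectively disables batching (one pickled object per record), while the
# default 0 chooses the batch size automatically.
rdd = sc.sequenceFile("hdfs:///Test.seq",
                      keyClass="ChunkID",
                      valueClass="ChunkData",
                      keyConverter="KeyToChunkConverter",
                      valueConverter="DataToChunkConverter",
                      batchSize=1)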

My questions are:

  • Why does a MemoryError occur when the MemoryStore capacity is about 1.2 GB and each input split (record) is only 128 MB?
  • Is there a recommended way to read large sequence files in PySpark (to reduce the time spent reading the file, or to avoid the memory error)?

Any suggestions would be highly appreciated!!

0 Answers:

There are no answers.