I generated a sequence file of about 900 MB containing 8 records, 6 of which are roughly 128 MB each (the HDFS block size).
In PySpark I read it as follows (both the key and the value are custom Java classes):
rdd = sc.sequenceFile("hdfs:///Test.seq", keyClass="ChunkID", valueClass="ChunkData", keyConverter="KeyToChunkConverter", valueConverter="DataToChunkConverter")
rdd.getNumPartitions()
shows that there are 7 partitions. I tried to process it as follows:
import logging

def open_map():
    def open_map_nested(key_value):
        try:
            # each element is a (ChunkID, ChunkData) pair
            key, data = key_value
            if key[0] == 0:
                return [['if', 'if', 'if']]
            else:
                return [["else", "else", "else"]]
        except Exception as e:
            logging.exception(e)
            return [["None", "None", "None"], ["None", "None", "None"]]  # ["None"]
    return open_map_nested

result = rdd.flatMap(open_map()).collect()
However, a MemoryError occurs, as follows:
File "/home/wong/spark_install/spark-2.0.2-bin-hadoop2.7/python/pyspark/worker.py", line 167, in process
serializer.dump_stream(func(split_index, iterator), outfile)
File "/home/wong/spark_install/spark-2.0.2-bin-hadoop2.7/python/pyspark/serializers.py", line 263, in dump_stream
vs = list(itertools.islice(iterator, batch))
File "/home/wong/spark_install/spark-2.0.2-bin-hadoop2.7/python/pyspark/serializers.py", line 139, in load_stream
yield self._read_with_length(stream)
File "/home/wong/spark_install/spark-2.0.2-bin-hadoop2.7/python/pyspark/serializers.py", line 164, in _read_with_length
return self.loads(obj)
File "/home/wong/spark_install/spark-2.0.2-bin-hadoop2.7/python/pyspark/serializers.py", line 422, in loads
return pickle.loads(obj)
MemoryError
There seems to be some problem when serializing the objects, and the processing time of the above program is very long. (By the way, the executor memory in my system is 3 GB.)
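A rough back-of-envelope (my own assumption, not measured: a converted ChunkData value stays roughly the size of the on-disk record) suggests why the Python worker might run out of memory while unpickling:

# Back-of-envelope only; record_mb and batch are assumed values.
record_mb = 128                       # one large record, per the file layout above
batch = 4                             # the pickle serializer batches several records at once
in_flight_mb = record_mb * batch * 2  # pickled bytes plus the unpickled copies
print(in_flight_mb)                   # ~1024 MB held at once in one Python worker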
Thanks for your help! I tried using count instead, but the problem persists. I think the problem occurs in the executor nodes while each input split is being read, so I checked my executor configuration again to figure out the cause. The settings are --executor-memory 2500M and --conf spark.yarn.executor.memoryOverhead=512. According to the calculation in this article, the effective MemoryStore capacity is about 1.2 GB, which is also shown in the log file as follows:
17/03/30 17:15:57 INFO memory.MemoryStore: MemoryStore started with capacity 1153.3 MB
17/03/30 17:15:58 INFO executor.CoarseGrainedExecutorBackend: Connecting to driver: spark://CoarseGrainedScheduler@192.168.100.5:34875
17/03/30 17:15:58 INFO executor.CoarseGrainedExecutorBackend: Successfully registered with driver
.....
17/03/30 17:16:26 INFO memory.MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 42.2 KB, free 1153.3 MB)
17/03/30 17:16:26 INFO broadcast.TorrentBroadcast: Reading broadcast variable 0 took 18 ms
17/03/30 17:16:26 INFO memory.MemoryStore: Block broadcast_0 stored as values in memory (estimated size 545.5 KB, free 1152.8 MB)
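For reference, a minimal sketch of the unified-memory arithmetic from that article that lands near this figure (assumptions: the default spark.memory.fraction of 0.6, 300 MB reserved memory, and the JVM reporting a max heap somewhat below -Xmx2500m):

# Spark 2.x unified memory: (maxHeap - reserved) * spark.memory.fraction
jvm_max_heap_mb = 2200     # assumption: what the JVM actually reports for -Xmx2500m
reserved_mb = 300          # fixed reserved memory in Spark 2.x
memory_fraction = 0.6      # default spark.memory.fraction
memory_store_mb = (jvm_max_heap_mb - reserved_mb) * memory_fraction
print(memory_store_mb)     # ~1140 MB, close to the 1153.3 MB reported in the log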
I also found a discussion about the PySpark serializer here, but changing the batch size to unlimited did not help.
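For completeness, the batch size can be passed directly when reading the file through sc.sequenceFile's batchSize parameter; below is a minimal sketch of that call (the specific batchSize and minSplits values are only illustrative, not the settings I ended up with):

rdd = sc.sequenceFile(
    "hdfs:///Test.seq",
    keyClass="ChunkID",
    valueClass="ChunkData",
    keyConverter="KeyToChunkConverter",
    valueConverter="DataToChunkConverter",
    minSplits=16,    # illustrative: ask for more, smaller input splits
    batchSize=1)     # one record per pickle batch instead of automatic batching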
My question is:
Any suggestions would be highly appreciated!!