I am reading a large zip file through PySpark by pulling the file in batches and processing the contents chunk by chunk. After each chunk is post-processed, main memory usage does not go back down.
from pyspark.sql import SparkSession
import io
import zipfile

spark = SparkSession.builder.getOrCreate()

buffer = io.BytesIO(<zip file from s3>)
z = zipfile.ZipFile(buffer)

data = []
with z.open(z.infolist()[0]) as f:
    line_counter = 0
    for line in f:
        # Append file contents to list
        data.append(line)
        line_counter = line_counter + 1
        # Create spark dataframes once the record count
        # hits the max-data-length threshold
        if not line_counter % 100000:
            main(spark, data)
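For reference, this is roughly what "create spark dataframes" means inside main(spark, data); the body below is only a sketch of that step (the column name and the output path are placeholders, not the actual code):

def main(spark, data):
    # Hypothetical body: turn the accumulated raw lines into a DataFrame
    # and push it downstream; the real post-processing is more involved.
    rows = [(line.decode("utf-8").rstrip("\n"),) for line in data]
    df = spark.createDataFrame(rows, ["value"])
    df.write.mode("append").parquet("s3://bucket/processed/")  # placeholder sink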
Initial memory usage on master:
             total       used       free     shared    buffers     cached
Mem:        127802      20115     107687          0        118      10424
-/+ buffers/cache:       9571     118230
Swap:            0          0          0
Memory usage after a few runs:
# free -m
             total       used       free     shared    buffers     cached
Mem:        127802      65449      62353          0        119      10491
-/+ buffers/cache:      54838      72963
Swap:            0          0          0

# free -m
             total       used       free     shared    buffers     cached
Mem:        127802      92898      34904          0        119      10501
-/+ buffers/cache:      82276      45526
Swap:            0          0          0
I expected memory usage to return to its initial state after each processing cycle, but that does not happen, and on some iteration the cluster ends up throwing an out-of-memory exception.
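To make "return to its initial state" concrete: after each call to main() I would expect the driver-side batch to be droppable along the lines below (flush_batch is just an illustrative name, not code I am actually running):

import gc

def flush_batch(spark, data):
    # Illustration only: hand the batch to main(), then drop every
    # reference the driver holds so its memory can be reclaimed.
    main(spark, data)
    data.clear()   # release the buffered lines (keeps the same list object)
    gc.collect()   # encourage CPython to free them promptly

data.clear() reuses the same list object, so the read loop can keep appending to it; yet even with this expectation, the free -m numbers above keep growing run after run.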