Question

Short version: in Python 3.5, something I expected to take about 5 GB of memory instead took 15 GB and then crashed for lack of resources.

import pickle
from collections import namedtuple

Hdr1 = namedtuple("Hdr1", "id hash source options elements locations")
Hdr2 = namedtuple("Hdr2", "id hash source options stream locations")
Hdr3 = namedtuple("Hdr3", "id hash source options series locations")
Identifier = namedtuple("Identifier", "id hash")
Location = namedtuple("Location", "index tell")
IndexData = namedtuple("IndexData", "filenames packet1 packet2 packet3")

filenames = [] # filled by other code, but it's a list, with 10 items
packet1_d = {}
packet2_d = {}
packet3_d = {}

index_data = IndexData(filenames, packet1_d, packet2_d, packet3_d)

# For each file, read all the packets in it and get the tell() location each time.
# The `is` tests below are pseudocode for the real packet-type checks.
if packet is header:
    if packet is packet1_header:
        # Hdr1 has six fields, so `elements` must be passed as well
        packet1_d[Identifier(id, hash)] = Hdr1(id, hash, source, options, elements, [])
    elif packet is packet2_header:
        packet2_d[Identifier(id, hash)] = Hdr2(id, hash, source, options, stream, [])
    else:
        packet3_d[Identifier(id, hash)] = Hdr3(id, hash, source, options, series, [])
else:
    loc = Location(index, tell)
    # This part below is deadly
    if packet is packet1:
        packet1_d[Identifier(id, hash)].locations.append(loc)
    elif packet is packet2:
        packet2_d[Identifier(id, hash)].locations.append(loc)
    elif packet is packet3:
        packet3_d[Identifier(id, hash)].locations.append(loc)

# Close the output file deterministically once the index is written
with open("index_data.p", "wb") as f:
    pickle.dump(index_data, f)

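Details: this is obviously not all the code. I left out the parts that open and parse the files, and since you don't have the files you can't reproduce the problem anyway. The `is` statements are pseudocode, but they are logically equivalent. This is a faithful representation of how I set up the data structures, so estimates of memory usage should be accurate, and it shows exactly how my variables are used, so it should be representative for finding a memory leak.

When I comment out the 6 lines I marked as "deadly", after running over 10 GB of data and about 100M packets, the pickle file (which then contains only the filenames and the dictionaries of packet headers) is between 5 and 10 MB. I know pickle compresses somewhat, but this still means the "base data" is less than 50 MB.

In total there are 91,116,480 packets. To make the math easy, let's call it 100M. Each Location is just an index into the file list and the offset returned by file.tell(). Empirical testing in the interactive shell shows that each Location is 64 bytes: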

>>> import sys
>>> from collections import namedtuple
>>> Location = namedtuple("Location", "idx tell")
>>> fobj = open("/really/big/data.file", "rb")
>>> fobj.seek(1000000000)
1000000000
>>> tell = fobj.tell()
>>> loc = Location(9, tell)
>>> sys.getsizeof(loc)
64

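So total memory usage should be no more than 6.4 GB.

Why does this take more than 15 GB of memory? Is there a more memory-efficient way to set up this data?

I worked around the problem by putting all the data in a sqlite database file. The whole file is 2.1 GB, so it seems the raw data should be no more than 2.1 GB. I can understand Python overhead pushing that into the 6 GB range, but it should not reach 15 GB. Even though I have worked around it, I would like to know how to avoid this next time.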

Answer 1

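I think you are essentially asking about an implementation detail. It is hard to say when a Python object will use 2 KB instead of 500 bytes, and even if you fix the exact issue you are tracking, it will only hold until your data doubles in size again.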

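You need to switch to a streaming approach, in which you read, process, and write data as you go. That means changing the output format. It could even still be a "pickle file", but instead of a single dictionary you would pickle smaller objects (perhaps a series of small dictionaries that are applied as "updates" on top of one another when read back).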
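The streaming idea above can be sketched with the standard pickle module, which allows several objects to be dumped one after another into the same file and loaded back one at a time. This is only an illustration: the batch contents, the key names, and the helper names `write_batches` and `read_batches` are assumptions, and plain tuples stand in for the Location namedtuple.

```python
import os
import pickle
import tempfile


def write_batches(path, batches):
    """Write each batch to the file as its own pickle record."""
    with open(path, "wb") as f:
        for batch in batches:
            pickle.dump(batch, f, protocol=pickle.HIGHEST_PROTOCOL)


def read_batches(path):
    """Yield the pickled batches back one at a time."""
    with open(path, "rb") as f:
        while True:
            try:
                yield pickle.load(f)
            except EOFError:
                return  # no more records in the file


# Hypothetical data: each batch maps a packet key to (file_index, tell) pairs.
batches = [
    {"pkt-a": [(0, 100), (0, 228)]},
    {"pkt-b": [(1, 64)]},
    {"pkt-a": [(2, 4096)]},
]

path = os.path.join(tempfile.mkdtemp(), "index_stream.p")
write_batches(path, batches)

# Reading applies each small dictionary as an update to the running result.
merged = {}
for batch in read_batches(path):
    for key, locs in batch.items():
        merged.setdefault(key, []).extend(locs)
```

In a real run the writer would flush a batch every so many packets and then clear it, so only one batch is in memory at a time; the merge step here exists only to show that nothing is lost.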

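But if you switch your output to a sqlite database (you can even pickle the objects you need and store them as column data), you get data you can query, and room for a lot more of it.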
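A minimal sketch of the sqlite route, using the standard sqlite3 module. The table layout, column names, and the `store` helper are assumptions made for illustration; the asker's actual schema is not shown in the question.

```python
import sqlite3

# ":memory:" keeps the example self-contained; use a file path for real data.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE locations ("
    " packet_type INTEGER, id INTEGER, hash TEXT,"
    " file_index INTEGER, tell INTEGER)"
)
conn.execute("CREATE INDEX idx_ident ON locations (packet_type, id, hash)")


def store(rows):
    """Insert a batch of location rows in a single transaction."""
    with conn:
        conn.executemany("INSERT INTO locations VALUES (?, ?, ?, ?, ?)", rows)


# Hypothetical rows: (packet_type, id, hash, file_index, tell)
store([(1, 7, "abc", 0, 100), (1, 7, "abc", 0, 228), (2, 9, "def", 1, 64)])

# Look up every location recorded for one identifier.
found = conn.execute(
    "SELECT file_index, tell FROM locations"
    " WHERE packet_type = ? AND id = ? AND hash = ? ORDER BY tell",
    (1, 7, "abc"),
).fetchall()
```

Inserting in batches keeps memory use flat regardless of how many packets are indexed, since the rows live on disk rather than in Python objects.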

Answer 2

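Try turning locations into a typed array instead of a list of objects. The array is represented in memory as an efficient C-style array, so a list of N 32-bit numbers takes only N * 4 bytes of memory.

Your Location type has just an index and a tell, so if both are 32-bit integers you can use the 'i' type code like this (showing only the packet1 case, for brevity):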

import array
from collections import namedtuple

# One array per field instead of one Location object per packet
LocationArray = namedtuple("LocationArray", "index tell")
if packet is header:
    locations = LocationArray(index=array.array('i'), tell=array.array('i'))
    # Hdr1 has six fields, so `elements` must be passed as well
    packet1_d[Identifier(id, hash)] = Hdr1(id, hash, source, options, elements, locations)
else:
    loc = packet1_d[Identifier(id, hash)].locations
    loc.index.append(index)
    loc.tell.append(tell)

(Edited to use a namedtuple LocationArray instead of a plain tuple.)
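A self-contained variant of the typed-array idea, for checking the per-record cost. One assumption differs from the answer: the tell column uses the 'q' type code (signed 8-byte integer) rather than 'i', because a file offset above 2**31 - 1 does not fit in a 4-byte int and appending it to an 'i' array raises OverflowError. The helper name `new_locations` and the sample values are made up.

```python
import array
from collections import namedtuple

LocationArray = namedtuple("LocationArray", "index tell")


def new_locations():
    """Create an empty pair of parallel arrays for one header."""
    # 'i' is a C int (4 bytes on common platforms); 'q' is a signed 8-byte int.
    return LocationArray(index=array.array("i"), tell=array.array("q"))


locs = new_locations()
for file_index, offset in [(0, 100), (0, 228), (9, 5000000000)]:
    locs.index.append(file_index)
    locs.tell.append(offset)

# Raw payload per location, ignoring the arrays' own growth headroom.
bytes_per_location = locs.index.itemsize + locs.tell.itemsize

# The two columns can be zipped back into (index, tell) pairs when needed.
pairs = list(zip(locs.index, locs.tell))
```

Arrays over-allocate somewhat as they grow, so real usage sits a little above N times `bytes_per_location`, but it stays far below one Python object per field.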

Answer 3

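As I said in another answer, this is a case where you are better off keeping the data on disk, managed by a database system.

The problem you are facing is that, compact as it is, each field in a namedtuple, including fields that hold only a numeric value, is a full Python object. An integer in Python uses about 30 bytes, and that cost applies to each field, on top of the size of the namedtuple object itself, which is about 64 bytes.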
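The point above can be checked directly. `sys.getsizeof` reports only the tuple itself, not the int objects it refers to, which is why the 64-byte measurement in the question underestimates the real cost. The sketch below adds the fields back in; `deep_size` is a helper written for this illustration, and exact numbers depend on the Python build. Note that CPython caches small integers (-5 to 256), so a small file index is usually shared, while each distinct tell offset is typically its own object.

```python
import sys
from collections import namedtuple

Location = namedtuple("Location", "index tell")


def deep_size(loc):
    """Size of the tuple plus the int objects it points to."""
    return sys.getsizeof(loc) + sum(sys.getsizeof(field) for field in loc)


loc = Location(9, 1000000000)
shallow = sys.getsizeof(loc)  # the tuple alone, as measured in the question
deep = deep_size(loc)         # the tuple plus its two ints

print("tuple only:", shallow, "bytes; with fields:", deep, "bytes")
```

The list that holds the Locations also stores one pointer per entry and over-allocates as it grows, so the per-packet cost is higher again than `deep` suggests.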

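In the standard library, the ctypes module has a "Structure" base type from which you can create arrays of records, where each record uses only the bytes its data needs. That is, if you create a structure with one 4-byte integer and one 8-byte integer, each record takes 12 bytes (provided the structure is packed; with default alignment ctypes will typically pad it to 16). Add about a hundred bytes for the array itself. The problem with ctypes.Structure arrays is that you have to create them with a fixed size: you cannot simply append more records at the end. And if you create an independent Structure object for each record, the overhead is about 100 bytes per record.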
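A small sketch of the ctypes approach described above. The field names and the array length are assumptions. `_pack_ = 1` is set so the record really is 12 bytes; without it, the 8-byte field is normally aligned and the structure grows to 16 bytes on common 64-bit platforms.

```python
import ctypes


class Location(ctypes.Structure):
    _pack_ = 1  # no padding between the 4-byte and 8-byte fields
    _fields_ = [("index", ctypes.c_int32), ("tell", ctypes.c_int64)]


n = 1000                      # the length is fixed when the array type is made
LocationBlock = Location * n  # a new array type holding n records
block = LocationBlock()       # zero-initialized storage for all n records

block[0].index = 9
block[0].tell = 5000000000

record_size = ctypes.sizeof(Location)
block_size = ctypes.sizeof(block)
```

Because the length is part of the type, growing the data means allocating a larger block and copying, which is the limitation the answer points out.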

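Numpy, the de facto Python library for working with large numeric data and the engine underneath Pandas (which may be a better, higher-level solution to your problem), lets you create arrays of records with named fields, specifying the byte-level type of each field. But plain numpy arrays have the same fixed-size problem: you cannot just append arbitrary records to an array.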
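As a rough illustration of the numpy option, the sketch below defines a 12-byte record dtype and works around the fixed size by doubling the backing array when it fills up. It assumes numpy is installed; the `LocationTable` class, its method names, and the starting capacity are invented for this example.

```python
import numpy as np

# 4-byte index + 8-byte tell, unaligned by default, so 12 bytes per record.
LOCATION_DTYPE = np.dtype([("index", "<i4"), ("tell", "<i8")])


class LocationTable:
    """Growable table of (index, tell) records backed by a structured array."""

    def __init__(self, capacity=4):
        self._data = np.zeros(capacity, dtype=LOCATION_DTYPE)
        self._count = 0

    def append(self, index, tell):
        if self._count == len(self._data):
            # Array is full: allocate one twice as large and copy records over.
            grown = np.zeros(2 * len(self._data), dtype=LOCATION_DTYPE)
            grown[: self._count] = self._data
            self._data = grown
        self._data[self._count] = (index, tell)
        self._count += 1

    def view(self):
        """Return only the filled part of the array."""
        return self._data[: self._count]


table = LocationTable()
for i in range(10):
    table.append(i % 3, 1000 * i)
```

Doubling keeps the amortized cost of an append constant, at the price of up to 2x headroom in the backing array.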

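Pandas (http://pandas.pydata.org/) is probably what you should be using here.

But if you would rather not, I have put together a small utility that uses Python's stdlib "struct" module just to lay out the data in memory, so that each 12-byte record uses exactly 12 bytes and no more. It is also pickleable.

You can use the file at https://github.com/jsbueno/extralist/blob/master/extralist/structsequence.py. Each "StructSequence" object is created more or less like a namedtuple, plus the record structure information, as described at https://docs.python.org/3/library/struct.html#format-strings. In your code, just use StructSequence instances where you are currently creating lists. You can even append (field-compatible) namedtuple objects to these sequences; they store only the raw data in memory. Pickle works with them.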
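To show the general technique of packing records with the stdlib struct module, here is an independent sketch. It is not the StructSequence class from the linked file and does not reproduce its API; the class name `PackedLocations` and its methods are made up for illustration.

```python
import struct


class PackedLocations:
    """Append-only sequence of (index, tell) records, 12 bytes each."""

    # "<" selects standard sizes with no padding: 4-byte int + 8-byte int.
    _record = struct.Struct("<iq")

    def __init__(self):
        self._buf = bytearray()

    def append(self, index, tell):
        self._buf += self._record.pack(index, tell)

    def __len__(self):
        return len(self._buf) // self._record.size

    def __getitem__(self, i):
        if not 0 <= i < len(self):
            raise IndexError(i)
        return self._record.unpack_from(self._buf, i * self._record.size)

    def nbytes(self):
        """Bytes of record data currently stored."""
        return len(self._buf)


packed = PackedLocations()
packed.append(0, 100)
packed.append(9, 5000000000)
```

Each record is unpacked into a tuple only when it is read, so the long-lived storage is a single bytearray rather than millions of small objects.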

Massive memory bloat when appending namedtuples to lists in a dictionary

3 answers: