Question

I'm looking for some tips on how to reduce Python memory usage. I use this code as the main structure to hold my data:

http://stevehanov.ca/blog/index.php?id=114

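I need it for approximate word matching behind a Flask server. I need to store more than 20 million distinct strings (and that number will grow). Right now I get a MemoryError when I try to put about 14 million of them into the Trie.

I just added a dictionary to hold some values with quick access (I need it, but it can be considered a kind of appearance ID; it is not directly related to the word).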

class TrieNode:
    values = {}
    def __init__(self):
        self.word = None
        self.children = {}

        global NodeCount
        NodeCount += 1

    def insert( self, word, value):
        node = self
        for letter in word:
            if letter not in node.children: 
                node.children[letter] = TrieNode()

            node = node.children[letter]
        TrieNode.values[word] = value
        node.word = word

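I'm not familiar with Python optimization. Is there any way to make the "letter" objects smaller to save some memory?

Note that my difficulty comes from the fact that a letter is not only [a-z]: I need to handle the whole "Unicode range" (accented characters, but not only those). By the way, each one is a single character, so it should have a very light memory footprint. How can I use code points instead of string objects (and would that be more memory efficient)?

Edit: adding some more information after the reply from @juanpa-arrivillaga

So, first, I found that using the slots construct makes no difference on my computer: with or without __slots__ I see the same memory usage.

With __slots__: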

>>> class TrieNode:
    NodeCount = 0
    __slots__ = "word", "children"
    def __init__(self):

        self.word = None
        self.children = {}

        #global NodeCount # my goal is to encapsulate NodeCount in the class itself
        TrieNode.NodeCount += 1


>>> tn = TrieNode()
>>> sys.getsizeof(tn) + sys.getsizeof(tn.__dict__)
176

Without __slots__:

>>> class TrieNode:
    NodeCount = 0
    def __init__(self):

        self.word = None
        self.children = {}

        #global NodeCount
        TrieNode.NodeCount += 1


>>> tn = TrieNode()
>>> sys.getsizeof(tn) + sys.getsizeof(tn.__dict__)
176

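So I don't understand why. Where am I going wrong?

Here is something else I tried, using the "intern" built-in, because this value is a string that holds an "id" (so it is unrelated to Unicode and unrelated to the letters):

(By the way, my goal was to use values and NodeCount as the equivalent of class/static variables, so that each of them is shared by all instances of the small objects being created. I thought that would save memory and avoid duplication, but I may be wrong, as I'm new to the "static-like" concept in Python.)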

class TrieNode:
    values = {}    # shared among all instances, so only one structure?
    NodeCount = 0
    __slots__ = "word", "children"
    def __init__(self):

        self.word = None
        self.children = {}

        #global NodeCount
        TrieNode.NodeCount += 1

    def insert( self, word, value = None):
        # value is a string id like "XYZ999999999"
        node = self
        for letter in word:
            codepoint = ord(letter)
            if codepoint not in node.children:
                node.children[codepoint] = TrieNode()

            node = node.children[codepoint]  # descend one level per letter

        node.word = word
        if value is not None:
            # intern() so that repeated ids share a single string object
            TrieNode.values.setdefault(word, []).append(intern(str(value)))

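Added: finally, I should point out that I am using the Python 2.7.x series.

I wonder whether there are fixed-length data types from a library such as numpy that could help me save some memory; again, being new to this, I don't know where to look. By the way, the "words" are not real "natural-language words" but "arbitrary-length sequences of characters", and they can also be very long.

From your reply, I agree that avoiding storing the word in each node would be efficient, but you need to look at the linked article/code snippet. The main goal is not to rebuild the word but to be able to do efficient, very fast approximate string matching using the word, and then get the "value" associated with each of the closest matches. I'm not sure I understand what the purpose of the path through the tree would be (without walking the full tree?); when there is a match we only need the original word that matched (but my understanding may be wrong at this point).

So I need to have this huge dict somewhere, and I wanted to encapsulate it in the class for convenience. But it may be too expensive in terms of memory "weight". Opinions?

I also noticed that my memory usage is already lower than in your sample (I don't know why yet), but here are sample values for a "letter" contained in the structure.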

>>> s = u"\u266f"
>>> ord(s)
9839
>>> sys.getsizeof(s)
28
>>> sys.getsizeof(ord(s))
12
>>> print s
♯
>>> repr(s)
"u'\\u266f'"

Answer 1

Low-hanging fruit: use __slots__ in your node class; otherwise, every TrieNode object carries around a dict.

class TrieNode:
    __slots__ = "word", "children"
    def __init__(self):
        self.word = None
        self.children = {}

Now each TrieNode object will not carry an attribute dict. Compare the sizes:

>>> class TrieNode:
...     def __init__(self):
...         self.word = None
...         self.children = {}
...
>>> tn = TrieNode()
>>> sys.getsizeof(tn) + sys.getsizeof(tn.__dict__)
168

Vs:

>>> class TrieNode:
...     __slots__ = "word", "children"
...     def __init__(self):
...         self.word = None
...         self.children = {}
...
>>> tn = TrieNode()
>>> sys.getsizeof(tn)
56
>>> tn.__dict__
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'TrieNode' object has no attribute '__dict__'
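
One caveat, since the question mentions Python 2.7: on Python 2, `__slots__` is only honored by new-style classes, that is, classes that inherit from `object`. A plain `class TrieNode:` there is an old-style class and still gets a per-instance `__dict__`, which would explain seeing the same size with and without `__slots__`. A minimal sketch of the check (the helper name `has_instance_dict` is made up for illustration):

```python
import sys


class TrieNode(object):  # inheriting from object is what matters on Python 2
    __slots__ = ("word", "children")

    def __init__(self):
        self.word = None
        self.children = {}


def has_instance_dict(obj):
    """Return True if obj carries a per-instance attribute dict."""
    return hasattr(obj, "__dict__")


tn = TrieNode()
print(has_instance_dict(tn))  # False when __slots__ is in effect
print(sys.getsizeof(tn))      # size of the bare instance; varies by build
```

If `has_instance_dict` returns True for your node class, the slots declaration is not being applied and each node still pays for a dict.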

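Another optimization: use int objects. Small int objects are cached, and most of your characters are likely to be in that range anyway; but even if they aren't, an int, while still beefy in Python, is smaller than even a single-character string: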

>>> 'ñ'
'ñ'
>>> ord('ñ')
241
>>> sys.getsizeof('ñ')
74
>>> sys.getsizeof(ord('ñ'))
28

So you can do something like:

def insert( self, word, value):
    node = self
    for letter in word:
        code_point = ord(letter)
        if code_point not in node.children: 
            node.children[code_point] = TrieNode()

        node = node.children[code_point]
    node.is_word = True #Don't save the word, simply a reference to a singleton

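Also, you are keeping around a huge class-variable values dict, but that information is redundant. You say:

I just added a dictionary to hold some values with quick access (I need it)

You can rebuild the words from the path. It should be relatively fast, and I would seriously consider doing away with this dict. Look at how much memory (in GB) it takes just to hold a million short strings as dict keys: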

>>> from sys import getsizeof as sizeof
>>> d = {str(i):i for i in range(1000000)}
>>> (sum(sizeof(k)+sizeof(v) for k,v in d.items()) + sizeof(d)) * 1e-9
0.12483203000000001
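
To make "rebuild the words from the path" concrete, here is a minimal sketch under a few assumptions: nodes only carry an `is_word` flag, children are keyed by code point, and `iter_words` is a hypothetical helper, not something from the linked article. It is written for Python 3; on Python 2.7, `unichr` would take the place of `chr`.

```python
class TrieNode(object):
    __slots__ = ("is_word", "children")

    def __init__(self):
        self.is_word = False
        self.children = {}

    def insert(self, word):
        node = self
        for letter in word:
            code_point = ord(letter)
            if code_point not in node.children:
                node.children[code_point] = TrieNode()
            node = node.children[code_point]
        node.is_word = True  # mark the end of a stored word


def iter_words(root):
    """Yield every stored word by walking the trie depth-first.

    The word is never stored; it is rebuilt from the code points
    collected along the path from the root to each flagged node.
    """
    stack = [(root, [])]
    while stack:
        node, path = stack.pop()
        if node.is_word:
            yield u"".join(path)
        for code_point, child in node.children.items():
            stack.append((child, path + [chr(code_point)]))


root = TrieNode()
for w in (u"a", u"a\u00f1", u"b\u266f"):
    root.insert(w)
print(sorted(iter_words(root)))
```

An explicit stack is used instead of recursion so that very long "words" do not run into the interpreter's recursion limit.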

You could do something like this:

class TrieNode:
    __slots__ = "value", "children"
    def __init__(self):
        self.value = None
        self.children = {}

    def insert( self, word, value):
        node = self
        for letter in word:
            code_point = ord(letter)
            if code_point not in node.children: 
                node.children[code_point] = TrieNode()

            node = node.children[code_point]
        node.value = value #this serves as a signal that it is a word


    def get(self, word, default=None):
        val = self._get_value(word)
        if val is None:
            return default
        else:
            return val

    def _get_value(self, word):
        node = self
        for letter in word:
            code_point = ord(letter)
            try:
                node = node.children[code_point]
            except KeyError:
                return None
        return node.value
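
Since the goal in the question is approximate matching rather than exact lookup, the same idea can be carried over to a Levenshtein-style search over the trie: keep the value on the terminal node and rebuild the matched word from the path only when a match is reported. The following is a rough sketch, not the linked article's code; the names `search` and `_search` and the `(word, value, distance)` result shape are assumptions made for illustration, and it targets Python 3 (use `unichr` instead of `chr` on Python 2.7).

```python
class TrieNode(object):
    __slots__ = ("value", "children")

    def __init__(self):
        self.value = None
        self.children = {}

    def insert(self, word, value):
        node = self
        for letter in word:
            code_point = ord(letter)
            if code_point not in node.children:
                node.children[code_point] = TrieNode()
            node = node.children[code_point]
        node.value = value  # a non-None value marks the end of a word


def search(root, word, max_cost):
    """Return (matched_word, value, distance) for entries within max_cost edits."""
    targets = [ord(c) for c in word]
    first_row = list(range(len(targets) + 1))
    results = []
    for code_point, child in root.children.items():
        _search(child, code_point, [code_point], targets,
                first_row, results, max_cost)
    return results


def _search(node, code_point, path, targets, previous_row, results, max_cost):
    # Build one row of the edit-distance table for this trie level.
    current_row = [previous_row[0] + 1]
    for column in range(1, len(targets) + 1):
        insert_cost = current_row[column - 1] + 1
        delete_cost = previous_row[column] + 1
        mismatch = 0 if targets[column - 1] == code_point else 1
        replace_cost = previous_row[column - 1] + mismatch
        current_row.append(min(insert_cost, delete_cost, replace_cost))

    if current_row[-1] <= max_cost and node.value is not None:
        matched = u"".join(chr(cp) for cp in path)  # rebuilt from the path
        results.append((matched, node.value, current_row[-1]))

    # Only descend while some entry in the row can still be within budget.
    if min(current_row) <= max_cost:
        for next_cp, child in node.children.items():
            _search(child, next_cp, path + [next_cp], targets,
                    current_row, results, max_cost)


root = TrieNode()
root.insert(u"cat", "ID1")
root.insert(u"cart", "ID2")
root.insert(u"dog", "ID3")
print(sorted(search(root, u"cat", 1)))
```

The search is recursive, one call per trie level, so very long entries could hit Python's recursion limit; an explicit stack would avoid that at the cost of slightly more bookkeeping.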
