Question

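I have a large text file of 335 MB. The whole text is tokenized, and the tokens are separated by single spaces. I want to represent each sentence as a list of words and the whole text as a list of sentences, so I end up with a list of lists.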

I use this simple piece of code to load the text into main memory:

def get_tokenized_text(file_name):
    tokens = list()
    with open(file_name,'rt') as f:
        sentences = f.readlines()

    return [sent.strip().split(' ') for sent in sentences]

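Unfortunately, this approach consumes so much memory that my laptop always crashes. I have 4 GB of RAM, but it fills up after about five seconds.

Why? The text should take up about 335 MB. Even if I were generous and allowed, say, four times that much memory just for administrative overhead, there would be no reason for memory to fill up. Is there a memory leak that I am overlooking?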

Answer 1

Lists and strings are objects, and objects carry attributes that take up memory. You can check the size and overhead of an object with sys.getsizeof:

>>> import sys
>>> sys.getsizeof('')
49
>>> sys.getsizeof('abcd')
53
>>> sys.getsizeof([])
64
>>> sys.getsizeof(['a'])
72
>>> sys.getsizeof(['a', 'b'])
80
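
To see how these per-object costs add up for one tokenized sentence, the following sketch sums the size of the list and of each string object in it and compares the total with the raw text length. The helper name sentence_footprint and the sample sentence are made up for illustration, and the exact numbers depend on the Python version and build.

```python
import sys

def sentence_footprint(words):
    """Approximate bytes used by a list of words: the list plus each distinct string object."""
    # Key by id() so a string object that appears twice is only counted once
    distinct = {id(w): w for w in words}
    return sys.getsizeof(words) + sum(sys.getsizeof(w) for w in distinct.values())

line = "the quick brown fox jumps over the lazy dog"
words = line.split(' ')

raw_bytes = len(line.encode('utf-8'))   # size of the sentence on disk
in_memory = sentence_footprint(words)   # size of the list-of-words representation

print(raw_bytes, in_memory, round(in_memory / raw_bytes, 1))
```

On a typical 64-bit CPython the ratio comes out well above the factor of four assumed in the question.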

Answer 2

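"Why? The text should take up about 335 MB."

Assuming the text is encoded in UTF-8 or one of the various single-byte encodings (which is likely), the text itself takes up slightly more than 335 MB in Python 2, but at least twice that and possibly four times that in Python 3, depending on your implementation. This is because Python 3 strings are Unicode strings by default, and they are represented internally with two or four bytes per character. (CPython 3.3 and later are an exception: they store strings whose characters all fit in Latin-1 with one byte per character.)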
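
The per-character cost can be measured directly. This is a small sketch, assuming CPython 3.3 or later, where the width of a string depends on the widest character it contains; bytes_per_char is a hypothetical helper, and other implementations or older versions may report different values.

```python
import sys

def bytes_per_char(ch, n=1000):
    """Measure how many bytes each additional copy of ch adds to a string."""
    # Compare two freshly built strings of the same kind so fixed headers cancel out
    return (sys.getsizeof(ch * (2 * n)) - sys.getsizeof(ch * n)) // n

ascii_width = bytes_per_char('a')            # plain ASCII letter
bmp_width = bytes_per_char('\u0394')         # Greek capital delta, outside Latin-1
astral_width = bytes_per_char('\U0001F600')  # outside the Basic Multilingual Plane

print(ascii_width, bmp_width, astral_width)
```

For mostly ASCII text on such a build, the character data is therefore not the main cost; the per-object overhead is.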

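"Even if I were generous and allowed, say, four times that much memory just for administrative overhead, there would be no reason for memory to fill up."

But there is. Every Python object has a relatively large overhead. In CPython 3.4, for example, there is a reference count, a pointer to the type object, a couple of additional pointers that link objects together into a doubly linked list, and type-specific additional data. Almost all of this is overhead. Ignoring the type-specific data, the three pointers and the refcount alone amount to 32 bytes of overhead per object on a 64-bit build.

Strings additionally have a length, a hash code, a data pointer, and flags, which adds about 24 more bytes per object (again assuming a 64-bit build).

If your words average 6 characters, each word takes about 6 bytes in your text file but about 68 bytes as a Python object (perhaps as little as 40 bytes in a 32-bit Python). This does not count the overhead of the lists, which probably add at least 8 bytes per word and 8 bytes per sentence.

So yes, an expansion by a factor of 12 or more does not seem at all unlikely.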
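
The arithmetic behind that factor can be written out as follows. This is only a back-of-the-envelope sketch using the figures quoted above for a 64-bit build; the constant names are made up, and the per-sentence list overhead is ignored.

```python
# Figures taken from the estimate above (64-bit CPython, 6-character words)
BYTES_IN_FILE = 6      # one word on disk, ignoring the separating space
BYTES_AS_OBJECT = 68   # the same word as a Python string object
BYTES_LIST_SLOT = 8    # one pointer to that word in the sentence list

expansion = (BYTES_AS_OBJECT + BYTES_LIST_SLOT) / BYTES_IN_FILE
estimated_mb = 335 * expansion   # projected footprint of the 335 MB file

print(round(expansion, 1), round(estimated_mb))
```

Under these assumptions the projected footprint is more than 4 GB, which is consistent with the laptop running out of memory.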

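"Is there a memory leak that I am overlooking?"

Unlikely. Python does a very good job of tracking objects and collecting garbage. You do not usually see memory leaks in pure Python code.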

Answer 3

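You are keeping several representations of the data in memory at the same time: the file buffer inside readlines(), then sentences, and then the list you build to return. To reduce memory, process the file one line at a time. That way only words ever holds the entire contents of the file.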

def get_tokenized_text(file_name):
    words = []
    with open(file_name, 'rt') as f:
        # Iterating over the file object reads one line at a time
        for line in f:
            # Keep one list of tokens per sentence
            words.append(line.strip().split(' '))
    return words

words = get_tokenized_text('book.txt')
print(words)
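
If the sentences can be consumed one at a time rather than kept around, a generator avoids building the full list at all. The following is a minimal sketch under that assumption; iter_sentences is a hypothetical name, and the demonstration writes a small temporary file instead of using the real data.

```python
import os
import tempfile

def iter_sentences(file_name):
    """Yield one sentence at a time as a list of tokens."""
    with open(file_name, 'rt') as f:
        for line in f:
            yield line.strip().split(' ')

# Small demonstration with a temporary two-line file
with tempfile.NamedTemporaryFile('wt', suffix='.txt', delete=False) as tmp:
    tmp.write('a b c\nd e\n')
    demo_path = tmp.name

# Only one sentence is alive at any moment while max() runs
longest = max(len(sentence) for sentence in iter_sentences(demo_path))
os.remove(demo_path)

print(longest)
```

This only helps when the processing is a single pass; if random access to all sentences is needed, the data still has to be stored in some compact form.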

Answer 4

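My first answer tried to reduce memory usage by not keeping the intermediate lists in memory at the same time. That still did not manage to squeeze the whole data structure into 4 GB of RAM.

With the approach below, using a 40 MB text file made up of Project Gutenberg books as test data, the memory requirement drops from 270 MB to 55 MB. A 335 MB input file would then take roughly 500 MB of memory, which should hopefully fit.

This approach builds a dictionary of the unique words and assigns each word a unique integer token (word_dict). The list of sentences, word_tokens, then stores the integer tokens instead of the words themselves. Finally, the keys and values of word_dict are swapped so that the integer tokens in word_tokens can be used to look up the corresponding words.

I used 32-bit Python, which uses less memory than 64-bit Python because pointers are only half the size.

To get the total size of containers such as list and dictionary, I used Raymond Hettinger's code from http://code.activestate.com/recipes/577504/. It counts not only the container itself but also its sub-containers and the underlying items they point to.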

import sys

# Original approach
def get_tokenized_text(file_name):
    words = []
    f = open(file_name,'rt')
    for line in f:
        words.append( line.strip().split(' ') )
    return words

# Two step approach
# 1. Build a dictionary of unique words in the file indexed with an integer

def build_dict(file_name):
    word_index = {}   # named word_index to avoid shadowing the builtin dict
    n = 0
    with open(file_name, 'rt') as f:
        for line in f:
            words = line.strip().split(' ')
            for w in words:
                if w not in word_index:
                    word_index[w] = n
                    n = n + 1
    return word_index

# 2. Read the file again and build list of sentence-words but using the integer indexes instead of the word itself

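def read_with_dict(file_name):
    # Relies on the module-level word_dict returned by build_dict,
    # which must exist before this function is called
    tokens = []
    with open(file_name, 'rt') as f:
        for line in f:
            words = line.strip().split(' ')
            # Append a real list; a bare generator expression would not store the tokens
            tokens.append([word_dict[w] for w in words])
    return tokens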


# Adapted from http://code.activestate.com/recipes/577504/ by Raymond Hettinger 
from itertools import chain
from collections import deque

def total_size(o, handlers={}):
    """ Returns the approximate memory footprint of an object and all of its contents.

    Automatically finds the contents of the following builtin containers and
    their subclasses:  tuple, list, deque, dict, set and frozenset.
    To search other containers, add handlers to iterate over their contents:

        handlers = {SomeContainerClass: iter,
                    OtherContainerClass: OtherContainerClass.get_elements}

    """
    dict_handler = lambda d: chain.from_iterable(d.items())
    all_handlers = {tuple: iter,
                    list: iter,
                    deque: iter,
                    dict: dict_handler,
                    set: iter,
                    frozenset: iter,
                   }
    all_handlers.update(handlers)     # user handlers take precedence
    seen = set()                      # track which object id's have already been seen
    default_size = sys.getsizeof(0)       # estimate sizeof object without __sizeof__

    def sizeof(o):
        if id(o) in seen:       # do not double count the same object
            return 0
        seen.add(id(o))
        s = sys.getsizeof(o, default_size)

        for typ, handler in all_handlers.items():
            if isinstance(o, typ):
                s += sum(map(sizeof, handler(o)))
                break
        return s
    return sizeof(o)

# Display your Python configuration. 32-bit Python takes about half the memory of 64-bit
import platform
print(platform.architecture(), sys.maxsize)         # ('32bit', 'WindowsPE') 2147483647

file_name = 'LargeTextTest40.txt'                   # 41,573,429 bytes

# I ran this only for a size comparison - don't run it on your machine
# words = get_tokenized_text(file_name)
# print(len(words), total_size(words))              # 962,632  268,314,991

word_dict = build_dict(file_name)
print(len(word_dict), total_size(word_dict))        # 185,980  13,885,970

word_tokens = read_with_dict(file_name)
print(len(word_tokens), total_size(word_tokens))    # 962,632  42,370,804

# Reverse the dictionary by swapping key and value so the integer token can be used to lookup corresponding word
word_dict.update( dict((word_dict[k], k) for k in word_dict) )
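
To illustrate how the integer tokens map back to words, here is a small self-contained sketch of the same idea on two in-memory sentences. The names word_to_id, id_to_word and token_sentences are made up for this example; it uses a plain list for the reverse lookup, which is one possible choice and not necessarily what the code above does.

```python
sample = ["the cat sat", "the dog sat"]

# Same idea as build_dict: one integer per unique word, in order of first appearance
word_to_id = {}
for line in sample:
    for w in line.split(' '):
        word_to_id.setdefault(w, len(word_to_id))

# Same idea as read_with_dict: each sentence becomes a list of integers
token_sentences = [[word_to_id[w] for w in line.split(' ')] for line in sample]

# Reverse lookup: ids run from 0 to n-1, so a list indexed by id is sufficient
id_to_word = [None] * len(word_to_id)
for w, i in word_to_id.items():
    id_to_word[i] = w

# Decode the integer sentences back into text
decoded = [' '.join(id_to_word[i] for i in sent) for sent in token_sentences]

print(token_sentences)
print(decoded)
```

Each repeated word is stored only once as a string; the sentences themselves hold small integers, which is where the memory saving comes from.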

Why does representing the text as a list consume so much memory?