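Python 3.1 - MemoryError while sampling a large list

Asked: 2011-01-16 15:22:39

Tags: python

The input list can contain more than 1 million numbers. When I run the following code with a small number of 'repeats', it works fine;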

def sample(x):
    length = 1000000 
    new_array = random.sample((list(x)),length)
    return (new_array)

def repeat_sample(x):    
    i = 0
    repeats = 100
    list_of_samples = []
    for i in range(repeats):
       list_of_samples.append(sample(x))
    return(list_of_samples)

repeat_sample(large_array)

However, using a high number of repeats (such as the 100 above) raises a MemoryError. The traceback is as follows;

Traceback (most recent call last):
  File "C:\Python31\rnd.py", line 221, in <module>
    STORED_REPEAT_SAMPLE = repeat_sample(STORED_ARRAY)
  File "C:\Python31\rnd.py", line 129, in repeat_sample
    list_of_samples.append(sample(x))
  File "C:\Python31\rnd.py", line 121, in sample
    new_array = random.sample((list(x)),length)
  File "C:\Python31\lib\random.py", line 309, in sample
    result = [None] * k
MemoryError

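I assume I'm running out of memory. I don't know how to solve this.

Thanks for your time!

5 answers:

Answer 0 (score: 5)

Expanding on my comment:

Suppose the processing you do on each sample is computing its mean.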

def calc_means(samplelists):
    # Mean of each sample; assumes every sample has the same length.
    means = []
    n = float(len(samplelists[0]))
    for sample in samplelists:
        means.append(sum(sample) / n)
    return means

calc_means(repeat_sample(large_array))

This makes you keep all of those lists in memory. You can make it much lighter like this:

import random

def mean(sample, n):
    # n is the sample size; float() guards against integer division.
    return sum(sample) / float(n)

def sample(x, length):
    return random.sample(x, length)

def repeat_means(x):
    length = 1000000
    repeats = 100
    list_of_means = []
    for i in range(repeats):
        # Each sample is discarded as soon as its mean has been computed.
        list_of_means.append(mean(sample(x, length), length))
    return list_of_means

repeat_means(large_array)

But that's still not good enough... you can do all of this while building nothing but your list of results:

import random

def sampling_mean(population, k, times):
    # Part of this is lifted straight from random.py
    _int = int
    _random = random.random

    n = len(population)
    kf = float(k)
    result = []

    if not 0 <= k <= n:
        raise ValueError("sample larger than population")

    for t in range(times):
        selected = set()
        sum_ = 0
        selected_add = selected.add

        for i in range(k):
            j = _int(_random() * n)
            while j in selected:
                j = _int(_random() * n)
            selected_add(j)
            sum_ += population[j]

        mean = sum_/kf
        result.append(mean)
    return result

sampling_mean(x, 1000000, 100)

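Now, can your algorithm be streamlined like this?

Answer 1 (score: 4)

Two answers:

  1. Unless you are using an old machine, it is unlikely that you are actually running out of memory. You get a MemoryError because you are probably using a 32-bit build of Python, which cannot allocate more than 2 GB of memory.

  2. You are going about this the wrong way. You should use a random sample generator instead of building a list of samples.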
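
The second point can be sketched roughly as follows. This is only an illustration, not the answerer's code: the names `iter_samples` and `sample_means` are made up here, and it assumes each sample is needed only long enough to reduce it to one number (its mean).

```python
import random


def iter_samples(population, k, repeats):
    """Yield `repeats` random samples of size k, one at a time.

    Nothing here keeps a reference to earlier samples, so each one can be
    freed once the caller has moved on to the next.
    """
    for _ in range(repeats):
        yield random.sample(population, k)


def sample_means(population, k, repeats):
    """Return the mean of each sample without storing the samples."""
    return [sum(s) / float(k) for s in iter_samples(population, k, repeats)]


if __name__ == "__main__":
    data = list(range(10000))            # small stand-in for the large input
    means = sample_means(data, 1000, 5)
    print(len(means))                    # one mean per repeat
```

The result list holds one float per repeat rather than one million-item list per repeat, which is where the saving comes from.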

Answer 2 (score: 1)

A generator version of random.sample() would also help:

from random import random
from math import ceil as _ceil, log as _log

def xsample(population, k):
    """A generator version of random.sample"""
    n = len(population)
    if not 0 <= k <= n:
        raise ValueError("sample larger than population")
    _int = int
    setsize = 21        # size of a small set minus size of an empty list
    if k > 5:
        setsize += 4 ** _ceil(_log(k * 3, 4)) # table size for big sets
    if n <= setsize or hasattr(population, "keys"):
        # An n-length list is smaller than a k-length set, or this is a
        # mapping type so the other algorithm wouldn't work.
        pool = list(population)
        for i in range(k):         # invariant:  non-selected at [0,n-i)
            j = _int(random() * (n-i))
            yield pool[j]
            pool[j] = pool[n-i-1]   # move non-selected item into vacancy
    else:
        try:
            selected = set()
            selected_add = selected.add
            for i in range(k):
                j = _int(random() * n)
                while j in selected:
                    j = _int(random() * n)
                selected_add(j)
                yield population[j]
        except (TypeError, KeyError):   # handle (at least) sets
            if isinstance(population, list):
                raise
            for x in xsample(tuple(population), k):
                yield x

Answer 3 (score: 0)

The only improvement you can make is to change your code to:

list_of_samples = [random.sample(x, length) for _ in range(repeats)]

However, that does not change the fact that you cannot create lists of arbitrary length in the real world.
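
To put rough numbers on that limit, here is a back-of-the-envelope sketch. It assumes CPython, where a list stores one pointer per element (4 bytes on a 32-bit build, 8 bytes on a 64-bit build). The helper name `pointer_bytes` is made up for this illustration, and the figures ignore the number objects themselves as well as any temporary copies such as `list(x)`.

```python
def pointer_bytes(length, repeats, pointer_size):
    """Bytes needed just for the element pointers of `repeats` lists."""
    return length * repeats * pointer_size


# 100 samples of 1,000,000 items each, as in the question.
for bits, pointer_size in ((32, 4), (64, 8)):
    total = pointer_bytes(1000000, 100, pointer_size)
    print("%d-bit build: about %d MB of list pointers" % (bits, total // 10**6))
```

This is a lower bound only; the real footprint is larger once the stored objects and temporary copies are counted.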

Answer 4 (score: 0)

You could try using array objects: http://docs.python.org/py3k/library/array.html. They should be more memory-efficient than lists, but may be harder to use.
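
A minimal sketch of that suggestion, assuming the data are floats that fit a C double. The stand-in data and variable names are invented for the example, and the byte counts are estimates of the payload rather than exact process memory.

```python
import random
import sys
from array import array

# Hypothetical stand-in data: 100,000 floats.
values = [random.random() for _ in range(100000)]
as_array = array("d", values)    # "d" = C double, stored inline

# A list holds pointers to separate float objects, so count both parts.
list_bytes = sys.getsizeof(values) + sum(sys.getsizeof(v) for v in values)
# An array stores the raw values back to back (payload only, no header).
array_bytes = as_array.itemsize * len(as_array)

print(as_array.itemsize)         # bytes per stored item
print(array_bytes < list_bytes)

# Sample positions rather than values, then index into the array.
picked = [as_array[i] for i in random.sample(range(len(as_array)), 10)]
```

Sampling from `range(len(as_array))` avoids copying the population and works the same way for lists.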