Question

My starting point was a problem with NumPy's loadtxt function:

X = np.loadtxt(filename, delimiter=",")

This raised a MemoryError inside np.loadtxt(..). I googled it and came to this question on StackOverflow, which gave the following solution:

def iter_loadtxt(filename, delimiter=',', skiprows=0, dtype=float):
    def iter_func():
        with open(filename, 'r') as infile:
            for _ in range(skiprows):
                next(infile)
            for line in infile:
                line = line.rstrip().split(delimiter)
                for item in line:
                    yield dtype(item)
        iter_loadtxt.rowlength = len(line)

    data = np.fromiter(iter_func(), dtype=dtype)
    data = data.reshape((-1, iter_loadtxt.rowlength))
    return data

data = iter_loadtxt('your_file.ext')

So I tried that, but then ran into the following error message:

> data = data.reshape((-1, iter_loadtxt.rowlength))
> ValueError: total size of new array must be unchanged

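I then tried adding the number of rows and the maximum number of columns to the code, using a snippet that I partly took from another question and partly wrote myself: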

num_rows = 0
max_cols = 0
with open(filename, 'r') as infile:
    for line in infile:
        num_rows += 1
        tmp = line.split(",")
        if len(tmp) > max_cols:
            max_cols = len(tmp)

def iter_func():
    #didn't change

data = np.fromiter(iter_func(), dtype=dtype, count=num_rows)
data = data.reshape((num_rows, max_cols))

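But this still gave the same error message, although I thought it should have been solved. On the other hand, I'm not sure whether I'm calling data.reshape(..) in the correct manner.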
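One detail in the snippet above is worth checking, shown here as a small sketch with made-up toy sizes: the `count` argument of `np.fromiter` is the number of items to read from the iterator, not the number of rows. With `count=num_rows` only `num_rows` values are read, so reshaping to `(num_rows, max_cols)` fails unless `max_cols` is 1.

```python
import numpy as np

num_rows, num_cols = 4, 3  # toy sizes, purely illustrative
values = (float(i) for i in range(num_rows * num_cols))

# count is the total number of items, so pass rows * cols.
data = np.fromiter(values, dtype=float, count=num_rows * num_cols)
data = data.reshape((num_rows, num_cols))

# Passing count=num_rows reads only 4 items, which cannot fill a 4 x 3 array.
short = np.fromiter((float(i) for i in range(12)), dtype=float, count=num_rows)
try:
    short.reshape((num_rows, num_cols))
except ValueError as exc:
    print("reshape failed: %s" % exc)
```

Even with the corrected count, the reshape only succeeds if every line really contains exactly `num_cols` values.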

I commented out the line where data.reshape(..) is called to see what would happen. That gave this error message:

> ValueError: need more than 1 value to unpack

This happened at the first point where something is done with X, the variable this whole problem is about.

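I know this code can handle the input files I have, because I have seen it used with them. But I can't work out why I am unable to solve this problem. My reasoning is that because I'm using a 32-bit Python version (on a 64-bit Windows machine), something goes wrong with memory that doesn't happen on other computers. But I'm not sure. For info: I have 8GB of RAM for a 1.2GB file, but according to Task Manager my RAM is not full.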

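What I want to solve is this: I'm using open-source code that needs to read and parse the given file just like np.loadtxt(filename, delimiter=","), but within the memory I have available. I know the code originally worked on Mac OS X and Linux, to be more precise: "MacOsx 10.9.2 and Linux (version 2.6.18-194.26.1.el5 (brewbuilder@norob.fnal.gov) (gcc version 4.1.2 20080704 (Red Hat 4.1.2-48)) 1 SMP Tue Nov 9 12:46:16 EST 2010)."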

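I don't care that much about time. My file contains roughly 200,000 lines, each with 100 or 1000 items (depending on the input file: one always has 100, the other always 1000). Each item is a floating-point number with 3 decimals, either negative or not, and the items are separated by a comma and a space. E.g.: [..] 0.194, -0.007, 0.004, 0.243, [..], with 100 or 1000 of those items, of which you see 4 here, on each of the roughly 200,000 lines.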

I'm using Python 2.7 because the open-source code requires it.

Does anyone have a solution for this? Thanks in advance.

Answer 1

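On Windows, a 32-bit process only gets a maximum of 2GB (or GiB?) of memory, and numpy.loadtxt is notorious for being heavy on memory, so that explains why the first approach doesn't work.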
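A quick way to check both points is to look at the interpreter's pointer size and estimate how much memory the final array alone would need. This is a minimal sketch: the row and column counts are the approximate figures from the question (about 200,000 lines of 1000 floats), and `interpreter_bits` and `estimate_array_bytes` are hypothetical helper names, not NumPy functions.

```python
import struct

import numpy as np


def interpreter_bits():
    """Return 32 or 64 depending on the pointer size of this Python build."""
    return struct.calcsize("P") * 8


def estimate_array_bytes(num_rows, num_cols, dtype=float):
    """Bytes needed for a dense (num_rows, num_cols) array of the given dtype."""
    return num_rows * num_cols * np.dtype(dtype).itemsize


if __name__ == "__main__":
    print("Python build: %d-bit" % interpreter_bits())
    needed = estimate_array_bytes(200000, 1000)
    print("float64 array: %.2f GiB" % (needed / float(2 ** 30)))
```

For 200,000 x 1000 float64 values this comes to 1.6e9 bytes (about 1.49 GiB), which leaves very little headroom under a 2 GiB limit even before any temporary copies are made. Using dtype=np.float32 would halve that figure, at the cost of precision.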

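The second problem you are running into appears to be that the particular file you are testing with has missing data, i.e. not all lines have the same number of values. This is easy to check, for example: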

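import numpy as np

filename = 'your_file.ext'  # placeholder path, as in the question
delimiter = ','

# Count the values on each line without parsing them
numbers_per_line = []
with open(filename) as infile:
    for line in infile:
        numbers_per_line.append(line.count(delimiter) + 1)

# Check where there might be problems
numbers_per_line = np.array(numbers_per_line)
expected_number = 100  # or 1000, depending on the input file
# Prints the 0-based indices of the lines with an unexpected number of values
print(np.where(numbers_per_line != expected_number))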
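If the check above does turn up short lines, one option is a two-pass loader that preallocates the final array and fills it line by line, so only one line at a time is held as Python objects. The following is a sketch under assumptions, not the code from the linked answer: `load_ragged_csv` is a hypothetical name, it needs a floating-point dtype and a NumPy version that provides `np.full`, and padding missing values with NaN is my assumption about what you would want. Raising an error instead may be more appropriate for your data.

```python
import numpy as np


def load_ragged_csv(filename, delimiter=",", skiprows=0, dtype=float):
    """Load a delimited text file into a 2-D array, padding short rows with NaN.

    Hypothetical helper: two passes over the file, one preallocated array.
    """
    # Pass 1: count the data rows and find the widest row.
    num_rows = 0
    max_cols = 0
    with open(filename, "r") as infile:
        for _ in range(skiprows):
            next(infile)
        for line in infile:
            if not line.strip():
                continue  # ignore blank lines
            num_rows += 1
            max_cols = max(max_cols, line.count(delimiter) + 1)

    # Preallocate once; NaN marks values that are missing from short rows.
    data = np.full((num_rows, max_cols), np.nan, dtype=dtype)

    # Pass 2: parse each line directly into its row of the array.
    row = 0
    with open(filename, "r") as infile:
        for _ in range(skiprows):
            next(infile)
        for line in infile:
            if not line.strip():
                continue
            items = line.strip().split(delimiter)
            # Empty fields (e.g. after a trailing delimiter) also become NaN.
            data[row, :len(items)] = [
                float(item) if item.strip() else np.nan for item in items
            ]
            row += 1
    return data
```

Calling `data = load_ragged_csv('your_file.ext')` and then `np.isnan(data).any(axis=1)` would flag the rows that had to be padded.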

Python MemoryError or ValueError in np.loadtxt and iter_loadtxt
