Question

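I am fairly new to coding, and I am taking a large (~2.0 GB) amount of data from an output file and turning it into a readable, sorted list in Python. My main problem is creating a test file of that size. The input file will be one long array, about 2.56 * 10^8 (rows) by 1 (column), and the final result is an array of about 6.4 * 10^7 (rows) by 4 (columns), which is then displayed. To create a sample array I have been using the code below. (Note that the size in this code is not the final one; it is as large as I could get it while increasing the size by powers of 2.)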

import numpy as np
import subprocess as subp
from array import array

keepData = 1

if(not keepData):
  subp.call(['rm', 'Bertha.DAT']) #removes previous file if present

girth = int(8e6) #number of final rows

girthier = girth*4
bigger_tim = np.zeros(girthier) #initial array

File = 'Bertha.DAT'
bid = open(File, 'wb')
for ii in range(0,girth):
    tiny_tim = 100*(2*np.random.rand(1,3)-1)
    bigger_tim[ii*4]=4
    bigger_tim[ii*4+1]=tiny_tim[0,0]
    bigger_tim[ii*4+2]=tiny_tim[0,1]
    bigger_tim[ii*4+3]=tiny_tim[0,2]
    #for loop that inputs values in the style of the input result

bigger_tim.tofile(bid) #writes the whole array into the file
bid.close()

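This code works for creating files of 250 MB, but it cannot create files larger than that. Any help is greatly appreciated.

EDIT:

I am also adding the second piece of code, to see whether it has a problem with heavy memory usage.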

import numpy as np
import pandas as pd

girth = int(24e6)

Matrix = np.zeros((girth,4))

Bertha = np.fromfile('Bertha.DAT',dtype = float,count = -1, sep = "")

for jj in range(0,girth):
    Matrix[jj,0] = Bertha[jj*4]
    Matrix[jj,1] = Bertha[jj*4+1]
    Matrix[jj,2] = Bertha[jj*4+2]
    Matrix[jj,3] = Bertha[jj*4+3] 

Table = pd.DataFrame({'Atomic Number':Matrix[:,0], 'X Position':Matrix[:,1], 'Y Position':Matrix[:,2], 'Z Position':Matrix[:,3]})
print Table

EDIT: The first code runs with up to 24e6 as the "girth" value, but with 32e6 I get the following error:

Traceback (most recent call last):

  File "<ipython-input-1-cb13d37b70b9>", line 1, in <module>
    runfile('D:/WinPython-32bit-2.7.6.3/Big_Bertha.py', wdir='D:/WinPython-32bit-2.7.6.3')

   File "D:\WinPython-32bit-2.7.6.3\python-2.7.6\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 540, in runfile
    execfile(filename, namespace)

  File "D:/WinPython-32bit-2.7.6.3/Big_Bertha.py", line 19, in <module>
    bigger_tim = np.zeros(girthier) #initial array

MemoryError

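It looks like I cannot create the initial dummy matrix to store the values because there is not enough memory.

The second one has a very similar problem, but fails with a different error at 24e6 as the "girth" value.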

Traceback (most recent call last):

  File "<ipython-input-1-268052dcc4e8>", line 1, in <module>
    runfile('D:/WinPython-32bit-2.7.6.3/binaryReader.py', wdir='D:/WinPython-32bit-2.7.6.3')

   File "D:\WinPython-32bit-2.7.6.3\python-2.7.6\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 540, in runfile
    execfile(filename, namespace)

  File "D:/WinPython-32bit-2.7.6.3/binaryReader.py", line 14, in <module>
    Bertha = np.fromfile('Bertha.DAT',dtype = float,count = -1, sep = "")

MemoryError

Answer 1

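The errors you are getting come from the fact that Python cannot allocate any more memory.

In the second example, you allocate a numpy table with 32 million rows and 4 columns. With the usual double-precision floats, that alone is about 1 GiB. The line with np.fromfile then has to load a very large file, because the file length should match that of Matrix, i.e. you need at least 1 GiB of data from the file.

That gives 1 GiB + 1 GiB = 2 GiB, which is the maximum for a 32-bit Python. (That is why 24 million rows were OK.) It is also why the error is thrown while the data is being loaded from the file. Note that the limit is not 2 GiB for user data; it is 2 GiB in total, and in practice it may be much less.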
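The arithmetic behind these figures can be checked quickly. The following is only a small sketch: the helper name table_bytes is made up for illustration, and 8 bytes per element assumes numpy's default float64.

```python
import numpy as np

GIB = 2 ** 30

def table_bytes(rows, cols=4, dtype=np.float64):
    """Return the size in bytes of a rows x cols array of the given dtype."""
    return rows * cols * np.dtype(dtype).itemsize

for rows in (24000000, 32000000):
    size = table_bytes(rows)
    # The reader script holds two such buffers at once: Matrix and Bertha.
    print("%d rows: %.2f GiB per array, %.2f GiB for both"
          % (rows, size / float(GIB), 2 * size / float(GIB)))
```

For 32 million rows this comes to roughly 0.95 GiB per array, so two of them sit right at the edge of what a 32-bit process can address.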

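There are several things you can do:

Do not create the empty table. Load the data from the file and reshape it into the shape you want (four columns and as many rows as needed):

m = np.fromfile("Bertha.DAT").reshape(-1,4)

Use something other than float as the data type (float is an 8-byte double). If you do not run into precision problems, use 'float32' (or 'f4'). However, you cannot simply change the data type passed to np.fromfile, because it determines the type and order of the data expected in the file; the file itself has to be written in that type.

Use 64-bit Python. If you handle big data, that is the way to go. It will consume slightly more memory in some cases (and a lot more in some others, but not with numpy), but if your computer has plenty of memory, even very large tables will work nicely.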
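The first two suggestions can be tried on a tiny stand-in file. This is a minimal sketch, assuming the file holds float64 values in groups of four; the temporary file name small_bertha.dat is invented for the example.

```python
import os
import tempfile
import numpy as np

# Write a tiny stand-in for Bertha.DAT: 5 records of 4 float64 values each.
path = os.path.join(tempfile.mkdtemp(), "small_bertha.dat")
np.arange(20, dtype=np.float64).tofile(path)

# The file stores float64, so it has to be read back as float64.
flat = np.fromfile(path, dtype=np.float64)

# reshape returns a view on the same buffer, so no second copy is made.
table = flat.reshape(-1, 4)
print(table.shape)                    # (5, 4)
print(np.shares_memory(flat, table))  # True

# Converting afterwards halves the footprint, but needs a temporary copy.
table32 = table.astype(np.float32)
print(table32.nbytes, table.nbytes)   # 80 160
```

Because the conversion to float32 briefly needs both arrays in memory, writing the file as float32 in the first place is the cheaper option when the precision is sufficient.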

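If you are interested in how much memory an object takes, the sys module has a nice function for that, sys.getsizeof, e.g. sys.getsizeof(Bertha).

There are a few style points in your code you may want to fix. One concerns variable names: they should be lowercase (class names are capitalized). For this kind of information, reading the PEP 8 recommendations is very useful. (In any case, the name Matrix is a bit unfortunate, because there is something called numpy.matrix.)

Another thing that caught my eye is that you are iterating over a numpy array with a for loop. That is usually a warning sign that something is being done in a very slow way. In rare cases you do need it, but usually there are very concise and fast ways to manipulate arrays.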
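For numpy arrays specifically, the nbytes attribute is a convenient companion to sys.getsizeof. The sketch below is an illustration on a small array, not a measurement of the original data; how sys.getsizeof treats views is described as typically observed with recent numpy versions.

```python
import sys
import numpy as np

data = np.zeros(1000)        # 1000 float64 values, owns its buffer
view = data.reshape(-1, 4)   # a view that does not own the buffer

print(data.nbytes)           # size of the data buffer in bytes
print(view.nbytes)           # same buffer, so the same number
print(sys.getsizeof(data))   # typically the buffer plus object overhead
print(sys.getsizeof(view))   # typically just the small array object
```

The difference between the last two lines is worth keeping in mind: a view looks cheap to sys.getsizeof, yet it keeps the underlying buffer alive.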
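As an illustration of replacing the element-by-element loop, the test file from the question could be produced in vectorized blocks. This is a sketch under assumptions: the function name write_test_file and the chunk_rows parameter are invented here, and writing in chunks is an addition that keeps only one block in memory at a time.

```python
import os
import tempfile
import numpy as np

def write_test_file(path, rows, chunk_rows=1000000):
    """Write `rows` records of (4, x, y, z) as float64, one chunk at a time."""
    with open(path, "wb") as out:
        remaining = rows
        while remaining > 0:
            n = min(chunk_rows, remaining)
            block = np.empty((n, 4))
            block[:, 0] = 4                                      # atomic number column
            block[:, 1:] = 100 * (2 * np.random.rand(n, 3) - 1)  # x, y, z in [-100, 100)
            block.tofile(out)                                    # row-major: 4, x, y, z, 4, x, ...
            remaining -= n

path = os.path.join(tempfile.mkdtemp(), "bertha_small.dat")
write_test_file(path, rows=2500, chunk_rows=1000)
print(os.path.getsize(path))  # 2500 * 4 * 8 = 80000 bytes
```

The file layout matches what the original loop writes, so the reader script does not need to change.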

Answer 2

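The biggest problem is that you are trying to keep everything in memory, when the most data you ever need to hold in memory is 4 rows. This quick-and-dirty code uses little more memory than a freshly loaded Python 2.7 interpreter.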

#!python2

import sqlite3

def make_narrow_file(rows, path):
    """make a text data file in path with rows of elements"""
    with open(path, 'w') as outf:
        for i in xrange(rows):
            outf.write(hex(i) + '\n')

def widen_file(inpath, outpath):
    """transforms the single column in inpath to four columns in outpath"""
    inf = open(inpath)
    compose = []
    with open(outpath, 'w') as outf:
        for line in inf:
            compose.append(line.rstrip())
            if len(compose) == 4:
                outf.write(' '.join(compose))
                outf.write('\n')
                compose = []
    inf.close()

# But flat files are an inconvenient way for dealing with massive data.
# Put another way, flat ascii files are degenerate databases, so we'll use
# the sqlite database which is built into Python.

def create_database(db_path):
    """creates a database schema to hold 4 strings per row"""
    conn = sqlite3.connect(db_path)
    c = conn.cursor()
    c.execute('drop table if exists wide')
    c.execute('create table wide (a, b, c, d)')
    conn.close()

def fill_database(inpath, db_path):
    """transforms the single column of data in inpath to four columns in
       db_path"""
    inf = open(inpath)
    conn = sqlite3.connect(db_path, isolation_level='DEFERRED')
    cur = conn.cursor()
    compose = []
    for line in inf:
        compose.append(line.rstrip())
        if len(compose) == 4:
            cur.execute('insert into wide values(?, ?, ?, ?)', compose)
            compose = []
    conn.commit()
    inf.close()

if __name__ == '__main__':
    make_narrow_file(int(2e8), 'bertha.dat')
    widen_file('bertha.dat', 'berthaw.dat')

    create_database('berthaw.db')
    fill_database('bertha.dat', 'berthaw.db')
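The same streaming idea carries over to the binary file from the question. The following is a sketch, not part of the answer's code: the generator name iter_records is hypothetical, and it assumes the file contains float64 values whose count is a multiple of four.

```python
import os
import tempfile
import numpy as np

def iter_records(path, chunk_rows=100000, cols=4):
    """Yield (n, cols) float64 blocks from a flat binary file."""
    with open(path, "rb") as f:
        while True:
            # Read at most chunk_rows records; an empty result means end of file.
            flat = np.fromfile(f, dtype=np.float64, count=chunk_rows * cols)
            if flat.size == 0:
                break
            yield flat.reshape(-1, cols)

# Tiny demonstration file: 10 records of 4 values each.
path = os.path.join(tempfile.mkdtemp(), "bertha_stream.dat")
np.arange(40, dtype=np.float64).tofile(path)

total = 0
for block in iter_records(path, chunk_rows=3):
    total += block.shape[0]   # process each block here instead of counting
print(total)                  # 10
```

Only one block is held in memory at a time, so peak usage depends on chunk_rows rather than on the file size.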

MemoryError while creating and writing to a file
