Question

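I have some very large txt files (about 1.5 GB each) that I want to load into Python as an array. The problem is that the data uses a comma as the decimal separator. For smaller files I came up with this solution: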

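import numpy as np

# Load every field as a string first (np.str was an alias for str and was removed in NumPy 1.24)
data = np.loadtxt(file, dtype=str, delimiter='\t', skiprows=1)
data = np.char.replace(data, ',', '.')   # decimal comma -> decimal point
data = np.char.replace(data, '\'', '')   # strip quote characters left over from the bytes repr
data = np.char.replace(data, 'b', '').astype(np.float64)  # strip the b prefix, then convert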

But for the large files, Python runs into a MemoryError. Is there a more memory-efficient way to load this data?

Answer 1

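Your 1.5 GB file probably needs more than 1.5 GB of RAM once it is loaded.

Try splitting it up and processing it line by line.

More information: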

http://stupidpythonideas.blogspot.ch/2014/09/why-does-my-100mb-file-take-2gb-of.html#!/2014/09/why-does-my-100mb-file-take-2gb-of.html
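To make that overhead concrete, here is a rough sketch (not taken from the linked post) that compares the memory used by the same numbers held as Python strings and as packed 64-bit floats. The sample values and the variable names are made up for illustration, and the exact string sizes depend on the Python implementation.

```python
import sys
from array import array

# Hypothetical sample: 100,000 values written with a decimal comma
fields = [f"{i},5" for i in range(100_000)]   # '0,5', '1,5', ...
values = array('d', (float(s.replace(',', '.')) for s in fields))

# Every str is a full Python object with its own header, and the list
# additionally stores one pointer per element.
bytes_as_strings = sys.getsizeof(fields) + sum(sys.getsizeof(s) for s in fields)

# array('d') stores the raw doubles contiguously, itemsize bytes each.
bytes_as_doubles = values.itemsize * len(values)

print(f"as list of str: {bytes_as_strings / 1e6:.1f} MB")
print(f"as array('d'):  {bytes_as_doubles / 1e6:.1f} MB")
```

The gap between the two figures is the reason an intermediate all-strings representation of a 1.5 GB file can exceed the available RAM.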

Answer 2

The problem with calling np.loadtxt with a string dtype, as in the question, is that it stores Python strings rather than float64 values, which is very memory-inefficient. You can use pandas read_table instead:

http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_table.html#pandas.read_table

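Read your file with it and set decimal=',' to change the default behavior. This reads the file and converts your strings to floats in one step. After loading the pandas DataFrame, use df.values to get the NumPy array. If it is still too large for your memory, read the file in chunks: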

http://pandas.pydata.org/pandas-docs/stable/io.html#io-chunking

If that still does not work, try the np.float32 format to reduce the memory footprint further.
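The suggestions above can be combined roughly as follows. This is only a sketch: the in-memory sample, its column names, and the chunk size are invented for illustration, and pandas.read_csv with sep='\t' is used as the equivalent of read_table.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical stand-in for the large file: tab-separated, one header row,
# decimal commas. In practice, pass the file path instead of this buffer.
sample = io.StringIO("a\tb\n1,5\t2,25\n3,0\t4,75\n5,5\t6,125\n")

chunks = []
# chunksize makes read_csv return an iterator of DataFrames instead of
# parsing the whole file at once; 2 rows per chunk is only for the demo.
for chunk in pd.read_csv(sample, sep='\t', decimal=',', chunksize=2):
    # Convert each chunk to float32 right away to halve its footprint.
    chunks.append(chunk.to_numpy(dtype=np.float32))

data = np.concatenate(chunks)
print(data.dtype, data.shape)
```

The final array still has to fit in memory, but parsing chunk by chunk keeps the temporary overhead of the text-to-float conversion small. If even the float32 array is too large, each chunk would have to be processed and discarded instead of collected.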

Answer 3

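You should try parsing it yourself, iterating over each line (so that you implicitly use a generator that does not read the whole file into memory). Also, for data of that size, I would use the Python standard array library, which uses about as much memory as a C array: the values sit next to each other in memory (NumPy arrays are also very memory-efficient in this respect).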

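import array

def convert(s):
    # Convert a string with a decimal comma (e.g. '1,5') to a float
    return float(s.strip().replace(',', '.'))

data = array.array('d')  # array of C doubles (64-bit floats)

with open(filename, 'r') as f:
    next(f, None)  # skip the header row, as skiprows=1 does in the question
    for line in f:
        fields = line.split('\t')
        # Generator expression: values are converted one at a time.
        # s.strip() also filters the empty field left by a blank line or a trailing tab.
        data.extend(convert(s) for s in fields if s.strip())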

I am sure similar code (with a similar memory footprint) can be written by replacing array.array with numpy.array, especially if you need a two-dimensional array.
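One possible NumPy variant of this idea is sketched below. It is not the answerer's code: it swaps array.array for np.fromiter, and the helper name iter_values and the in-memory sample are hypothetical. It assumes every data row has the same number of columns as the header.

```python
import io

import numpy as np

def iter_values(lines):
    """Yield floats one at a time from tab-separated lines with decimal commas."""
    for line in lines:
        for field in line.split('\t'):
            field = field.strip()
            if field:  # skip empty fields from blank lines or trailing tabs
                yield float(field.replace(',', '.'))

# Hypothetical stand-in for the real file: one header row, two columns
f = io.StringIO("a\tb\n1,5\t2,25\n3,0\t4,75\n")
n_cols = len(next(f).split('\t'))  # read the header to learn the column count

# np.fromiter fills a 1-D float64 array directly from the generator,
# so no intermediate list of Python objects is built.
flat = np.fromiter(iter_values(f), dtype=np.float64)
data = flat.reshape(-1, n_cols)    # 2-D view, one row per line
print(data)
```

If the number of values is known in advance, passing it as the count argument of np.fromiter lets NumPy allocate the output once instead of growing it while reading.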

Python: loading data that uses a comma as the decimal separator
