I have a 4GB CSV file with strictly integer data that I want to read into a pandas DataFrame. The native read_csv consumes all RAM (64GB) and fails with a MemoryError. With an explicit dtype it just takes forever (I tried both int and float types).
So I wrote my own reader:
def read_csv(fname):
    import csv
    import numpy as np
    import pandas as pd

    reader = csv.reader(open(fname))
    names = next(reader)[1:]  # first row holds the column tags
    dftype = np.float32
    df = pd.DataFrame(0, dtype=dftype, columns=names, index=names)
    for row in reader:
        tag = row[0]
        df.loc[tag] = np.array(row[1:], dtype=dftype)
    return df
The problem: if dftype is np.int32 (~20 s per row), the line

    df.loc[tag] = np.array(row[1:], dtype=dftype)

is about 1000x slower, so I ended up using np.float64 plus return df.astype(np.int32) (~4 minutes total). I also tried a pure-Python conversion ([int(v) for v in row[1:]], and the float equivalent) with the same result.
Why is this happening?
UPD: I see the same behavior on both Python 2.7 and 3.5.
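The pattern from the question can be reproduced on a small scale like this. It is a minimal sketch with a hypothetical 100x100 tag set; the 1000x timing gap depends on pandas version and frame size, so it only checks that the float32-then-astype workaround produces the same values as filling the int32 frame directly:

```python
import numpy as np
import pandas as pd

names = ["c%d" % i for i in range(100)]  # hypothetical small tag set

def fill(dftype):
    # same pattern as the question: one .loc assignment per row
    df = pd.DataFrame(0, dtype=dftype, columns=names, index=names)
    row = np.arange(len(names), dtype=dftype)
    for tag in names:
        df.loc[tag] = row
    return df

df_int = fill(np.int32)                        # the slow path
df_float = fill(np.float32).astype(np.int32)   # the workaround from the question
print((df_int.values == df_float.values).all())
```

Wrapping the two variants in `%timeit fill(np.int32)` / `%timeit fill(np.float32)` in IPython reproduces the slowdown described above.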
Answer 0 (score: 1)
Update: my notebook has 16GB of RAM, so I'll test it with a proportionally smaller DF (64GB / 16GB = 4):
Setup:
In [1]: df = pd.DataFrame(np.random.randint(0, 10*6, (12000, 47395)), dtype=np.int32)
In [2]: df.shape
Out[2]: (12000, 47395)
In [3]: %timeit -n 1 -r 1 df.to_csv('c:/tmp/big.csv', chunksize=1000)
1 loop, best of 1: 5min 34s per loop
Let's save this DF in Feather format:
In [4]: import feather
In [6]: df = df.copy()
In [7]: %timeit -n 1 -r 1 feather.write_dataframe(df, 'c:/tmp/big.feather')
1 loop, best of 1: 8.41 s per loop # yay, it's a bit faster...
In [8]: df.shape
Out[8]: (12000, 47395)
In [9]: del df
And read it back:
In [10]: %timeit -n 1 -r 1 df = feather.read_dataframe('c:/tmp/big.feather')
1 loop, best of 1: 17.4 s per loop # reading is reasonably fast as well
Reading the CSV file in chunks is much slower, but it still doesn't give me a MemoryError:
In [2]: %%timeit -n 1 -r 1
...: df = pd.DataFrame()
...: for chunk in pd.read_csv('c:/tmp/big.csv', index_col=0, chunksize=1000):
...: df = pd.concat([df, chunk])
...: print(df.shape)
...: print(df.dtypes.unique())
...:
(1000, 47395)
(2000, 47395)
(3000, 47395)
(4000, 47395)
(5000, 47395)
(6000, 47395)
(7000, 47395)
(8000, 47395)
(9000, 47395)
(10000, 47395)
(11000, 47395)
(12000, 47395)
[dtype('int64')]
1 loop, best of 1: 9min 25s per loop
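A side note on the loop above: calling pd.concat inside the loop copies the accumulated frame on every iteration, so the total cost grows quadratically with the number of chunks. The usual idiom is to collect the chunks in a list and concatenate once. A self-contained sketch, using a small in-memory CSV as a stand-in for 'c:/tmp/big.csv':

```python
import numpy as np
import pandas as pd
from io import StringIO

# stand-in for the big CSV file; integer tags so dtype=np.int32 applies cleanly
csv_data = StringIO("tag,a,b\n10,1,2\n20,3,4\n30,5,6\n")

chunks = list(pd.read_csv(csv_data, index_col=0, chunksize=2, dtype=np.int32))
df = pd.concat(chunks)  # one concatenation instead of one per chunk

print(df.shape)  # (3, 2)
print(df.dtypes.unique())
```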
Now let's specify dtype=np.int32 explicitly:
In [1]: %%timeit -n 1 -r 1
...: df = pd.DataFrame()
...: for chunk in pd.read_csv('c:/tmp/big.csv', index_col=0, chunksize=1000, dtype=np.int32):
...: df = pd.concat([df, chunk])
...: print(df.shape)
...: print(df.dtypes.unique())
...:
(1000, 47395)
(2000, 47395)
(3000, 47395)
(4000, 47395)
(5000, 47395)
(6000, 47395)
(7000, 47395)
(8000, 47395)
(9000, 47395)
(10000, 47395)
(11000, 47395)
(12000, 47395)
[dtype('int32')]
1 loop, best of 1: 10min 38s per loop
Testing the HDF store:
In [10]: %timeit -n 1 -r 1 df.to_hdf('c:/tmp/big.h5', 'test')
1 loop, best of 1: 22.5 s per loop
In [11]: del df
In [12]: %timeit -n 1 -r 1 df = pd.read_hdf('c:/tmp/big.h5', 'test')
1 loop, best of 1: 1.04 s per loop
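Unlike CSV, an HDF5 round-trip keeps the exact dtype, which can be confirmed with a small self-contained sketch. Note that pd.DataFrame.to_hdf needs the optional PyTables package, so the sketch degrades gracefully if it is missing:

```python
import os
import tempfile
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(6, dtype=np.int32).reshape(2, 3),
                  columns=["a", "b", "c"])
path = os.path.join(tempfile.mkdtemp(), "small.h5")
try:
    df.to_hdf(path, key="test", mode="w")
    back = pd.read_hdf(path, "test")
    roundtrip_ok = back.equals(df)  # values, index and dtypes all match
except ImportError:                 # PyTables not installed
    roundtrip_ok = None
print(roundtrip_ok)
```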
If you have a chance to change the storage file format, then by all means don't use CSV files; use HDF5 (.h5) or Feather format instead...
OLD answer:
I would just use the native Pandas read_csv() method:
chunksize = 10**6
reader = pd.read_csv(filename, index_col=0, chunksize=chunksize)
df = pd.concat([chunk for chunk in reader])
From your code:
tag = row[0]
df.loc[tag] = np.array(row[1:], dtype=dftype)
you want to use the first column of the CSV file as the index, hence: index_col=0.
Answer 1 (score: 1)
I'd suggest using a plain numpy array instead, e.g.:
def read_csv(fname):
    import csv
    import numpy as np

    reader = csv.reader(open(fname))
    names = next(reader)[1:]  # first row holds the column tags
    n = len(names)
    data = np.empty((n, n), np.int32)
    tag_map = {name: i for i, name in enumerate(names)}
    for row in reader:
        tag = row[0]
        data[tag_map[tag], :] = row[1:]
    return names, data
I don't know why int32 is slower than float32, but a DataFrame stores its data column by column, so an assignment like df.loc[tag] = ... sets one element in each column's block, which is slow.
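To see the suggested approach end to end without a file, here is a small self-contained run on three hard-coded rows (stand-ins for what csv.reader would yield); numpy casts the strings itself while writing them into the int32 array:

```python
import numpy as np
import pandas as pd

names = ["x", "y", "z"]
rows = [["x", "1", "2", "3"],
        ["y", "4", "5", "6"],
        ["z", "7", "8", "9"]]  # what csv.reader would yield after the header

n = len(names)
data = np.empty((n, n), np.int32)
tag_map = {name: i for i, name in enumerate(names)}
for row in rows:
    data[tag_map[row[0]], :] = row[1:]  # string-to-int32 cast happens in numpy

# wrap the finished array once at the end; no per-row DataFrame writes
df = pd.DataFrame(data, index=names, columns=names)
print(df.loc["y", "z"])  # → 6
```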
If you want labeled access, you can use xarray:
import xarray
d = xarray.DataArray(data, [("r", names), ("c", names)])