Question

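Below are three screenshots. The first two simply show the difference in memory before and after entering a single command that reads a csv into a DataFrame (pandas.read_csv).

The third is the DataFrame's .info() output, which states how much memory the DataFrame is using.

The numbers don't add up.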

https://www.dropbox.com/s/9bda421ukwewoef/Screenshot%202014-12-08%2018.09.35.png?dl=0
https://www.dropbox.com/s/bxx0wczdz7sfhcn/Screenshot%202014-12-08%2018.13.11.png?dl=0
https://www.dropbox.com/s/qf20yhpn7w9fmld/Screenshot%202014-12-08%2018.13.44.png?dl=0

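Specifically, the df.info() command says the DataFrame uses ~200 MB. The difference in available memory is about 700 MB (following the well-known linuxatemyram.com site, I am looking at the middle row).

That is alarming! It is reproducible. Is this a bug? Or is it something that pandas.read_csv does not release when it finishes?

Thanks.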

Answer 1

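Create a simple frame of float and object dtypes, and a similar one using Categoricals. (The session assumes import numpy as np, import pandas as pd, and from pandas import DataFrame, Series.)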

In [1]: df_object = DataFrame({'A' : np.random.randn(5), 'B' : Series(['a','foo','bar','a really long string','baz'])})

In [2]: df_cat = df_object.copy()

In [3]: df_cat['B'] = df_cat['B'].astype('category')

In [4]: df_object = pd.concat([df_object]*100000,ignore_index=True)

In [5]: df_cat = pd.concat([df_cat]*100000,ignore_index=True)

Here is what .info() shows in 0.15.1. Note the '+'.

In [10]: df_object.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 500000 entries, 0 to 499999
Data columns (total 2 columns):
A    500000 non-null float64
B    500000 non-null object
dtypes: float64(1), object(1)
memory usage: 11.4+ MB

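The '+' means the figure includes the memory for the object pointers (8 bytes each, the same as an int64), but not the actual string storage.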
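The reported figure can be checked by hand. The sketch below assumes 8 bytes per row for each of the float64 column, the object-pointer column, and the Int64Index; that per-row breakdown is an assumption that happens to reproduce the number shown, not something stated by .info() itself.

```python
# Reproduce the "11.4+ MB" figure from .info() by hand.
# Assumption: 8 bytes per row for the float64 value, 8 for the object
# pointer, and 8 for the int64 index label.
n_rows = 500000
bytes_per_row = 8 + 8 + 8

def as_mb(v):
    # Same helper as in the answer: bytes to mebibytes, one decimal place.
    return "%.1f MB" % (v / (1024.0 * 1024))

shallow_total = n_rows * bytes_per_row
print(as_mb(shallow_total))
```

Anything the strings themselves occupy comes on top of this total, which is what the trailing '+' signals.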

In [6]: def as_mb(v):
   ...:         return "%.1f MB" % (v/(1024.0*1024))
   ...:

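Here is the memory actually used by the Python string objects. This is in addition to the usage above; IOW, the total is the frame PLUS the object storage. (Python 3 might actually use less, as it may optimize this a bit.)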

In [13]: import sys

In [14]: as_mb(sum(map(sys.getsizeof,df_object['B'].values)))
Out[14]: '20.5 MB'
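More recent pandas releases can produce a comparable tally directly through the deep=True option of memory_usage. The following is a minimal sketch, assuming a pandas version that supports that option; the variable names are illustrative, and the byte counts depend on the Python version and platform, so they will not match the 20.5 MB shown above.

```python
import pandas as pd

strings = ["a", "foo", "bar", "a really long string", "baz"]

# dtype=object is forced so the column stores pointers to Python str objects,
# as in the frame from the answer.
col = pd.Series(strings * 1000, dtype=object)

shallow = col.memory_usage(index=False)          # pointer array only
deep = col.memory_usage(index=False, deep=True)  # pointers plus the str objects

print("shallow:", shallow, "bytes")
print("deep:   ", deep, "bytes")
```

The gap between the two numbers corresponds to the storage hidden behind the '+' in the .info() output.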

If this were represented as variable-length strings (not currently possible, but instructive):

In [16]: as_mb(sum([ len(b) for b in df_object['B'] ]))
Out[16]: '2.9 MB'

If you convert it to a numpy fixed-length dtype (pandas will convert it back to object, so this is not currently possible in pandas):

In [17]: df_object['B'].values.astype(str).dtype
Out[17]: dtype('S20')

# note that this is marginal (i.e. in addition to the above). I have subtracted out
# the int64 pointers to avoid double counting
In [19]: as_mb(df_object['B'].values.astype(str).nbytes - 8*len(df_object['B']))
Out[19]: '5.7 MB'
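Both hypothetical layouts can be worked out with plain arithmetic, without numpy. This sketch assumes one byte per character (the strings are ASCII) and reuses the five sample strings from the frame above.

```python
strings = ["a", "foo", "bar", "a really long string", "baz"]
repeats = 100000
n_rows = len(strings) * repeats

def as_mb(v):
    # Same helper as in the answer: bytes to mebibytes, one decimal place.
    return "%.1f MB" % (v / (1024.0 * 1024))

# Variable-length layout: only the characters themselves are stored.
variable_bytes = sum(len(s) for s in strings) * repeats

# Fixed-length layout: every row is padded to the longest string (20 bytes).
width = max(len(s) for s in strings)
fixed_bytes = width * n_rows

# Subtract the 8-byte pointers already counted by .info() to get the
# marginal cost, as done in the answer.
fixed_marginal = fixed_bytes - 8 * n_rows

print(as_mb(variable_bytes))
print(as_mb(fixed_marginal))
```

The fixed-length layout costs more here because one 20-character string forces every row to 20 bytes, even though most values are 1 to 3 characters long.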

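If you convert to a categorical type, you get the following. Note that the memory usage is a function of the number of categories; IOW, if you have completely unique values, this will not help.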

In [11]: df_cat.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 500000 entries, 0 to 499999
Data columns (total 2 columns):
A    500000 non-null float64
B    500000 non-null category
dtypes: category(1), float64(1)
memory usage: 8.1 MB
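The idea behind the saving can be sketched in plain Python: store each distinct string once and keep only a small integer code per row. This illustrates the encoding only and is not pandas' actual implementation; the names categories, code_of, and codes are chosen for the example.

```python
from array import array

values = ["a", "foo", "bar", "a really long string", "baz"] * 100000

# Each distinct string is stored once...
categories = sorted(set(values))
code_of = {cat: i for i, cat in enumerate(categories)}

# ...and every row holds only a small integer code ('b' = signed 8-bit).
codes = array("b", (code_of[v] for v in values))

print(len(categories), "categories")
print(codes.itemsize * len(codes), "bytes of codes")

# Decoding a row is a lookup into the categories list.
first_rows = [categories[c] for c in codes[:5]]
print(first_rows)
```

With 1 byte of codes per row in place of an 8-byte pointer, plus the 8-byte float column and the 8-byte index, each row takes about 17 bytes, which is in line with the 8.1 MB reported above for 500000 rows. If every value were unique, the categories list would be as long as the column itself and nothing would be saved.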
