Pickling data to disk

Question

I have a 51K x 8.5K DataFrame containing only binary (1 or 0) values.

I wrote the following code:

# Pickling data to disk

outfile=open("df_preference.p", "wb")
pickle.dump(df_preference,outfile)
outfile.close()

It raises a MemoryError, as shown below:

MemoryError                               Traceback (most recent call last)
<ipython-input-48-de66e880aacb> in <module>()
      2 
      3 outfile=open("df_preference.p", "wb")
----> 4 pickle.dump(df_preference,outfile)
      5 outfile.close()

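I assume this means the data is too large to be pickled? But it only contains binary values.

Before this, I created this dataset from another DataFrame that held ordinary counts and a lot of zeros, using the following code: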

df_preference=df_recommender.applymap(lambda x: np.where(x >0, 1, 0))

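Creating df_preference this way takes some time in itself. The resulting matrix has the same dimensions.

My concern is that (i) creating the DataFrame with applymap already takes time, and (ii) the DataFrame cannot even be pickled because of the memory error. Going forward, I need to run matrix factorization on df_preference using SVD and alternating least squares. Will that be even slower? How can I speed this up and resolve the memory error?

Thanks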

Answer 1

UPDATE:

For values of 1 and 0 you can use the int8 (1-byte) dtype, which reduces memory usage by at least a factor of 4.

(df_recommender > 0).astype(np.int8).to_pickle('/path/to/file.pickle')

Here is an example with a 51K x 9K DataFrame:

In [1]: df = pd.DataFrame(np.random.randint(0, 10, size=(51000, 9000)))

In [2]: df.shape
Out[2]: (51000, 9000)

In [3]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51000 entries, 0 to 50999
Columns: 9000 entries, 0 to 8999
dtypes: int32(9000)
memory usage: 1.7 GB

The source DF needs 1.7 GB of memory.

In [6]: df_preference = (df>0).astype(int)

In [7]: df_preference.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51000 entries, 0 to 50999
Columns: 9000 entries, 0 to 8999
dtypes: int32(9000)
memory usage: 1.7 GB

The resulting DF again needs 1.7 GB of memory.

In [4]: df_preference = (df>0).astype(np.int8)

In [5]: df_preference.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51000 entries, 0 to 50999
Columns: 9000 entries, 0 to 8999
dtypes: int8(9000)
memory usage: 437.7 MB

With the int8 dtype it needs only 438 MB.

Now let's save it as a pickle file:
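The size difference can be checked on a much smaller frame. The following is a minimal sketch, not part of the original session: the 1000 x 50 shape and the variable names are assumptions chosen only for illustration, and int64 is used as the wide dtype explicitly because the default integer width depends on the platform.

```python
import numpy as np
import pandas as pd

# Small stand-in for the 51K x 9K frame (shape is an assumption for illustration).
rng = np.random.default_rng(0)
counts = pd.DataFrame(rng.integers(0, 10, size=(1000, 50)))

# Same 0/1 values, stored with two different integer widths.
as_int64 = (counts > 0).astype(np.int64)
as_int8 = (counts > 0).astype(np.int8)

# Bytes used by the data only (index excluded).
bytes_int64 = int(as_int64.memory_usage(index=False).sum())
bytes_int8 = int(as_int8.memory_usage(index=False).sum())
print(bytes_int64, bytes_int8)

# Same arithmetic for the full-size frame: one byte per cell, expressed in MiB.
full_size_mib = 51000 * 9000 * 1 / 2**20
print(round(full_size_mib, 1))
```

One byte per cell is where the 437.7 MB figure reported by df.info() comes from.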

In [10]: df_preference.to_pickle('d:/temp/df_pref.pickle')

File size:

{ temp }  » ls -lh df_pref.pickle
-rw-r--r-- 1 Max None 438M May 28 09:20 df_pref.pickle
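To confirm that the int8 dtype survives the round trip, the file can be read back with pd.read_pickle. This is only a sketch on a small frame; the temporary directory, the 200 x 30 shape, and the variable names are assumptions and are not taken from the session above.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Small illustrative frame of counts, converted to 0/1 stored as int8.
df_small = pd.DataFrame(np.random.randint(0, 10, size=(200, 30)))
pref = (df_small > 0).astype(np.int8)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "df_pref.pickle")  # hypothetical file name
    pref.to_pickle(path)
    size_on_disk = os.path.getsize(path)
    restored = pd.read_pickle(path)

# Check that every column came back as int8.
dtypes_preserved = bool((restored.dtypes == np.int8).all())
print(size_on_disk, dtypes_preserved)
```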

OLD answer:

Try this:

(df_recommender > 0).astype(int).to_pickle('/path/to/file.pickle')

Explanation:

In [200]: df
Out[200]:
   a  b  c
0  4  3  3
1  1  2  1
2  2  1  0
3  2  0  1
4  2  0  4

In [201]: (df>0).astype(int)
Out[201]:
   a  b  c
0  1  1  1
1  1  1  1
2  1  1  0
3  1  0  1
4  1  0  1

PS: you may also want to save your DF as an HDF5 file instead of a pickle. See this comparison for details.

Answer 2

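I ran into a MemoryError when saving a DataFrame of about 8.5 GB to a pickle. The cause was insufficient RAM. All of this was in a Jupyter Notebook with Python 3.7.6.

I tried df.to_pickle() with default parameters and df.to_hdf( ..., mode="w").

Both gave me a MemoryError, because the process allocates additional memory when saving to these formats (HDF apparently also uses pickle internally).

I finally managed to save it as CSV with df.to_csv(), since that does not take up much additional memory.

Here is the df.info() output for the DataFrame I was working with: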
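For reference, a CSV written this way can also be read back in pieces instead of all at once. The following is a minimal sketch on a tiny frame; the column names, chunk sizes, and file name are assumptions for illustration, and whether chunking actually avoids a MemoryError depends on the data and the machine.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Tiny illustrative frame with a string index, loosely shaped like the one below.
frame = pd.DataFrame(
    {"id": [f"row{i}" for i in range(1000)],
     "value": np.arange(1000, dtype=np.float64)}
).set_index("id")

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "frame.csv")  # hypothetical file name

    # chunksize controls how many rows to_csv writes at a time.
    frame.to_csv(path, chunksize=100)

    # Read the file back 250 rows at a time rather than in one go.
    rows_read = 0
    total = 0.0
    with pd.read_csv(path, index_col="id", chunksize=250) as reader:
        for chunk in reader:
            rows_read += len(chunk)
            total += float(chunk["value"].sum())

print(rows_read, total)
```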

<class 'pandas.core.frame.DataFrame'>
Index: 141516896 entries, 1eedd4a85d23 to 1c0088d397a3
Data columns (total 7 columns):
 #   Column         Dtype  
---  ------         -----  
 0   id             object 
 1   value          float64
 2   value2         object 
 3   value3         object 
 4   value4         object 
 5   value5         object 
 6   value6         object 
dtypes: float64(1), object(6)
memory usage: 8.4+ GB

The resulting file is about 15 GB, but at least I have not lost the data.

Hope this helps someone.
