Question

I have a pandas DataFrame, say data.

On a 32-bit laptop with 2 GB of RAM, I run the following:

>>>data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000209 entries, 0 to 1000208
Data columns (total 5 columns):
UserID        1000209 non-null int32
MovieID       1000209 non-null int32
Ratings       1000209 non-null int32
Age           1000209 non-null int32
Occupation    1000209 non-null int32
dtypes: int32(5)
memory usage: 58.7 MB

On this DataFrame I am training a RandomForest classifier:

>>>X = data.drop('Ratings', axis = 1)
>>>y = data['Ratings']

>>>from sklearn.model_selection import train_test_split
>>>Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, random_state=1)

>>>from sklearn.ensemble import RandomForestClassifier
>>>model = RandomForestClassifier(n_estimators=100, random_state=0)
>>>model.fit(Xtrain, ytrain)
>>>model.predict(Xtest)

But it throws the following error:

MemoryError: could not allocate 50331648 bytes

I suspect this is related to the specs of my laptop, but I still don't understand why it happens. Is there any way I can work around it?

Answer 1

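The best approach is to profile the script's memory usage. To do so:

Install memory_profiler: pip install --user memory_profiler

Put all of your code inside a function so that it can be profiled line by line. Something like this: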

from memory_profiler import profile

@profile
def main_model_training():
    # put all the code in here
    pass

Then start profiling like this:

python -m memory_profiler script_name.py

Here is an example.

Given the following script, saved as memory_profiling_test.py:

from memory_profiler import profile
import pandas as pd
import numpy as np

@profile
def something_to_profile():
    df = pd.DataFrame(np.random.randn(1000, 4), columns=list('ABCD'))
    df.count()

something_to_profile()

Run the profiler as follows:

python -m memory_profiler memory_profiling_test.py

This gives the following line-by-line memory profile:

Line #    Mem usage    Increment   Line Contents
================================================
     5     64.3 MiB     64.3 MiB   @profile
     6                             def something_to_profile():
     7     64.3 MiB      0.0 MiB       df = pd.DataFrame(np.random.randn(1000, 4), columns=list('ABCD'))
     8     64.3 MiB      0.0 MiB       df.count()
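
If installing memory_profiler is not an option, Python's standard-library tracemalloc module can give a rough peak-memory figure for a block of code. This is a different technique from the one above: it reports a single peak rather than a line-by-line table, and it only tracks allocations made through Python's memory allocators, so its numbers will not match memory_profiler's process-level figures. The sketch below is a minimal illustration; peak_memory and build_list are hypothetical names, and build_list merely stands in for the model training code.

```python
import tracemalloc


def peak_memory(func, *args, **kwargs):
    """Run func and return (result, peak_bytes) as traced by tracemalloc.

    peak_memory is a hypothetical helper, not part of any library.
    """
    tracemalloc.start()
    try:
        result = func(*args, **kwargs)
        # get_traced_memory() returns (current_bytes, peak_bytes)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak


def build_list(n):
    # Stand-in workload (an assumption); replace with the training code
    return [float(i) for i in range(n)]


if __name__ == "__main__":
    _, peak = peak_memory(build_list, 100_000)
    print(f"peak traced memory: {peak / 2**20:.1f} MiB")
```

Wrapping the training call this way shows whether peak usage approaches what a 32-bit process can address before the MemoryError is raised.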
