Out of memory when training a machine learning model

时间:2021-07-24 23:55:35

标签: python memory scikit-learn out-of-memory random-forest

My machine has limited memory, and training this model is using too much of it:

import sklearn
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np



clf = RandomForestClassifier(n_estimators=10)
print("Created Random Forest classifier\n")

data = pd.read_csv("House_2_ALL.csv")
print("Finished reading data\n")

data = data.drop("UnixTimeStamp", axis=1)
predict = "Aggregate_Power"
print("Dropped UnixTimeStamp\n")

X = np.array(data.drop([predict], axis=1))
Y = np.array(data[predict])
print("Created numpy Arrays\n")

X_train, X_test, Y_train, Y_test = sklearn.model_selection.train_test_split(X, Y, test_size = 0.1)
print("Assigned Testing/Training Variables\n")

clf.fit(X_train, Y_train)
print("Fit model\n")

print("Attempting to predict\n")
print(clf.predict(X_test))

When I run this program, my computer reports that it is out of memory and that I need to quit some applications. Any ideas on how to manage memory better, or is reducing the size of the training dataset the only solution?

I have found that the program runs smoothly until it reaches the "clf.fit(X_train, Y_train)" line, so I don't know whether the problem is pandas' memory-hungry DataFrames or sklearn itself.
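A quick way to see whether pandas is the culprit (a rough diagnostic sketch, reusing the file and column names from the code above) is to print how much memory the DataFrame and the numpy copies take before fit is called:

import pandas as pd
import numpy as np

data = pd.read_csv("House_2_ALL.csv")

# Memory held by the DataFrame itself (deep=True also counts string/object columns)
print("DataFrame: %.1f MB" % (data.memory_usage(deep=True).sum() / 1e6))

# Memory held by the numpy copies that are actually passed to fit
X = np.array(data.drop(["UnixTimeStamp", "Aggregate_Power"], axis=1))
Y = np.array(data["Aggregate_Power"])
print("X: %.1f MB, Y: %.1f MB" % (X.nbytes / 1e6, Y.nbytes / 1e6))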

2 Answers:

Answer 0 (score: 1):

It seems to me that your dataset is quite large, so you should load it and train the model in parts. I'll share an example (dataset_path, model_path, preprocess_df and target_column below are placeholders you would fill in for your own data):

import os
import joblib
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

reader = pd.read_csv(dataset_path, chunksize=10000)
# This will load only 10000 rows at a time (you can tune this for your RAM)

# reader is now an iterator, so you can do something like this
for part_df in reader:
    # Treat "part_df" as your original df: do all the preprocessing on it and
    # train the model on it. After training on this part_df, save the model
    # and reload it in the next iteration.
    part_df = preprocess_df(part_df)  # some preprocessing function of your own
    X = part_df.drop(target_column, axis=1)  # target_column: placeholder for your label column
    y = part_df[target_column]
    xtrain, xvalid, ytrain, yvalid = train_test_split(X, y)  # some split

    if os.path.exists(model_path):  # you won't have a saved model on the first iteration
        model = joblib.load(model_path)  # load the model saved by the previous chunk
    else:
        model = RandomForestClassifier(n_estimators=10)  # define the model for the first chunk

    model.fit(xtrain, ytrain)  # train the model on this chunk

    # Now save the model for the next iteration
    joblib.dump(model, model_path)
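One caveat on the loop above: a plain RandomForestClassifier does not learn incrementally, so refitting the reloaded model simply retrains it from scratch on the new chunk. If the goal is to accumulate trees across chunks, a rough sketch (an assumption on my part, not part of the example above) is scikit-learn's warm_start flag, which keeps the trees already fitted and only trains the newly added ones:

import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# warm_start=True keeps existing trees; raising n_estimators before each fit adds new ones
clf = RandomForestClassifier(warm_start=True, n_estimators=0)

for part_df in pd.read_csv(dataset_path, chunksize=10000):
    part_df = preprocess_df(part_df)  # same placeholder preprocessing as above
    X = part_df.drop(target_column, axis=1)  # target_column: placeholder label column
    y = part_df[target_column]
    clf.n_estimators += 5  # grow 5 extra trees on this chunk
    clf.fit(X, y)  # assumes every chunk contains the same set of target classes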

Answer 1 (score: 0):

There are two possible scenarios here which could be causing the memory error:

1. Pandas.read_csv() with chunksize

You can use the chunksize parameter and load a smaller chunk of data at a time (it returns an object you can iterate over):

chunk_size = 50000
reader = pd.read_csv('big_file.csv', chunksize=chunk_size)
for data_chunk in reader:
    ...  # process each chunk here

2. Random Forest Classifier/Regressor

It has the default parameters max_depth=None and min_samples_leaf=1, which means trees are grown to full depth. If the dataset is large, the RandomForest can grow very deep trees with many nodes, leading to rapid memory consumption. With

clf = RandomForestClassifier()
clf.fit(X_train, y_train)

you can then check a couple of things:

import joblib

print(clf.estimators_[0].tree_.max_depth)  # actual depth reached on a chunk of data
joblib.dump(clf.estimators_[0], "first_tree_clf.joblib")  # dump one tree to see its size on disk

Now you can try a definite value for the max_depth hyperparameter and fit the model again. Tuning the Random Forest hyperparameters this way keeps the trees built on each chunk shallow and avoids excessive memory consumption.
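For example, a minimal sketch of that tuning step (the specific values for max_depth and min_samples_leaf are just placeholder values to tune; X_train and Y_train are the arrays from the question):

from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(n_estimators=10, max_depth=15, min_samples_leaf=5)
clf.fit(X_train, Y_train)
print(clf.estimators_[0].tree_.max_depth)  # now capped at 15 instead of growing to full depth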