I'm training a RandomForestClassifier on a 1.3 GB dataset. The data has only a few columns (< 10) and about 30 million rows. The model fails to fit because it runs out of memory (32 GB) partway through training.
I saw here that the memory usage of RandomForestClassifier should scale with 2 * n_jobs * size(X), which in my case works out to roughly 2 GB since I've limited n_jobs to 1. Yet this code grabs all 32 GB on my remote instance, which doesn't add up. Any help figuring this out would be much appreciated.
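For reference, here is the back-of-the-envelope behind that ~2 GB figure (just a sketch; it assumes the eight feature columns used below end up as float64 in memory, which is what pandas typically infers for them):

import numpy as np

n_rows = 30 * 10 ** 6                      # ~30 million entries
n_features = 8                             # the columns in `features` below
itemsize = np.dtype('float64').itemsize    # 8 bytes per value

print('size(X) ~ %.1f GB' % (n_rows * n_features * itemsize / 1024.0 ** 3))
# -> on the order of 2 GB, nowhere near the 32 GB the process actually consumes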
I'm running Python 2.7.6 with scikit-learn 0.17. Here's the code:
import pandas as pd
from sklearn.cross_validation import train_test_split
from sklearn.ensemble import RandomForestClassifier
def run():
    print 'Loading DataFrame'
    df_train = pd.read_csv('data/train.csv')

    print 'Splitting train and test data'
    train, test = train_test_split(df_train, test_size=0.2)
    del df_train
    df = train

    features = ['x', 'y', 'accuracy', 'hour', 'day', 'week', 'month', 'year']

    # Derive time-based features from the raw 'time' column (given in minutes)
    df.loc[:, 'hours'] = df.time / float(60)
    df.loc[:, 'hour'] = df.hours % 24
    df.loc[:, 'days'] = df.time / float(60*24)
    df.loc[:, 'day'] = df.days % 7
    df.loc[:, 'weeks'] = df.time / float(60*24*7)
    df.loc[:, 'week'] = df.weeks % 52
    df.loc[:, 'months'] = df.time / float(60*24*30)
    df.loc[:, 'month'] = df.months % 12
    df.loc[:, 'year'] = df.time / float(60*24*365)

    model = RandomForestClassifier(n_jobs=1, warm_start=True)
    train_df = df.loc[:, features]
    values = df.loc[:, 'val']

    print 'Fitting Model'
    model.fit(train_df, values)    # <-- memory blows up here

    # Score the held-out split
    wdf = test.sort_values('row_id').set_index('row_id')
    expected = wdf.place_id
    wdf = wdf.loc[:, features]
    predictions = model.predict(wdf)
    actual = predictions
    print dict(zip(wdf.index, predictions))

    expect = pd.Series(expected)
    actual = pd.Series(actual)
    print (sum(expect == actual)/float(len(expected))) * 100
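In case it helps narrow this down, a small check like the following could be dropped in right before model.fit (a debugging sketch only, not part of the script above; it reuses the train_df and values names):

# Hypothetical debugging snippet: confirm what actually goes into the forest.
print(train_df.dtypes)
print('X in memory: %.2f GB' % (train_df.memory_usage().sum() / 1024.0 ** 3))
print('n_samples=%d  n_features=%d  n_classes=%d' % (
    train_df.shape[0], train_df.shape[1], values.nunique()))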