Excessive RAM usage from scikit-learn RandomForest

Asked: 2016-06-23 18:07:00

Tags: python machine-learning scikit-learn classification random-forest

I'm training a RandomForestClassifier on a 1.3 GB dataset. The data has only a few columns (< 10) but about 30 million rows. Training exhausts the machine's memory (32 GB) partway through, so I am unable to fit the model.

I saw here that RandomForestClassifier's memory usage should be proportional to 2 * n_jobs * size(X), which in my case should come to about 2 GB, since I have limited n_jobs to 1. Yet this code consumes all 32 GB on my remote instance, which doesn't add up. Any help sorting this out would be much appreciated.
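
As a sanity check on that estimate, it can help to compare the DataFrame's actual in-memory size against the process's peak resident memory. A minimal sketch, assuming a Unix host (resource.getrusage reports ru_maxrss in kilobytes on Linux) and the data/train.csv path from the code below:

import resource
import pandas as pd

df = pd.read_csv('data/train.csv')

# pandas parses numeric CSV columns into 64-bit dtypes, so a 1.3 GB
# file can occupy noticeably more than 1.3 GB once loaded.
print('DataFrame size: %.2f GB' % (df.memory_usage().sum() / 1e9))

# Peak resident set size of this process so far (kilobytes on Linux).
peak_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
print('Peak RSS: %.2f GB' % (peak_kb / 1e6))

If the peak only blows up inside fit, the DataFrame itself is not the culprit.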

I'm running Python 2.7.6 with scikit-learn 0.17. Here is the code:

import pandas as pd
from sklearn.cross_validation import train_test_split
from sklearn.ensemble import RandomForestClassifier

features = ['x', 'y', 'accuracy', 'hour', 'day', 'week', 'month', 'year']

def add_time_features(df):
    # Derive calendar features from the raw 'time' column (minutes).
    df.loc[:, 'hour'] = (df.time / float(60)) % 24
    df.loc[:, 'day'] = (df.time / float(60 * 24)) % 7
    df.loc[:, 'week'] = (df.time / float(60 * 24 * 7)) % 52
    df.loc[:, 'month'] = (df.time / float(60 * 24 * 30)) % 12
    df.loc[:, 'year'] = df.time / float(60 * 24 * 365)

def run():
    print 'Loading DataFrame'
    df_train = pd.read_csv('data/train.csv')

    # Add the derived features before splitting so that both the train
    # and the test frames carry them.
    add_time_features(df_train)

    print 'Splitting train and test data'
    train, test = train_test_split(df_train, test_size=0.2)
    del df_train

    model = RandomForestClassifier(n_jobs=1, warm_start=True)

    print 'Fitting Model'
    model.fit(train.loc[:, features], train.loc[:, 'place_id'])

    wdf = test.sort_values('row_id').set_index('row_id')
    expected = wdf.place_id

    predictions = model.predict(wdf.loc[:, features])
    print dict(zip(wdf.index, predictions))

    # Compare raw values so the differing Series indexes don't matter.
    accuracy = (expected.values == predictions).mean() * 100
    print accuracy

if __name__ == '__main__':
    run()
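
One detail about the fit call above: scikit-learn's tree ensembles convert the feature matrix to float32 internally, so handing fit() the default float64 columns forces an extra converted copy of the whole training matrix. Beyond that, trees grown without depth limits over ~24 million rows (80% of 30 million) become very large, since node counts scale with the number of training samples. A hedged sketch of both levers; the dtype map and the particular max_depth/min_samples_leaf values are illustrative assumptions, not values from the question:

import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Hypothetical dtype map for the raw columns used above; 32-bit types
# halve the footprint of pandas' default 64-bit parsing.
dtypes = {'x': 'float32', 'y': 'float32', 'accuracy': 'int32',
          'time': 'int32', 'place_id': 'int64', 'row_id': 'int64'}
df = pd.read_csv('data/train.csv', dtype=dtypes)

# Capping tree growth bounds each tree's node count, which would
# otherwise scale with the full training-set size.
model = RandomForestClassifier(n_jobs=1,
                               max_depth=16,          # illustrative
                               min_samples_leaf=100)  # illustrative

The derived time features come out of the divisions as float64 regardless, so casting the final feature matrix with .astype('float32') before fit also avoids the internal conversion copy. Whether this brings the run under 32 GB would have to be measured, but it targets the two largest consumers: the converted copy of the training matrix and the unbounded trees.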

0 Answers:

No answers yet