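I'm using this code to build training and test sets, fit a classifier on them, and report several metrics. However, the scores I get are very good. Am I overfitting, or is the model simply doing that well?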
#! /usr/bin/env python
'''
@author: nelson-liu
'''
import pandas as pd
import numpy as np
import scipy as sp
from sklearn.ensemble import RandomForestClassifier
from sklearn import preprocessing
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.metrics import mean_squared_error
alldata = pd.read_csv('alldata60.csv')
cols = [col for col in alldata.columns if col not in ['Survival months', 'Survived']]
X = alldata[cols].values
y = alldata["Survived"].values
Xr, Xt, yr, yt = train_test_split(X, y, random_state=6131997)
rfc = RandomForestClassifier(n_estimators=2000, oob_score=True)
rfc.fit(Xr, yr)
ypred = rfc.predict(Xt)
acc = rfc.score(Xt, yt)
scores = cross_val_score(rfc, Xr, yr, cv=10)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std()*2))
print(mean_squared_error(yt, ypred))
print(rfc.oob_score_)
print(accuracy_score(yt, ypred))
print(acc)
The values returned are:
Accuracy: 0.98 (+/- 0.00)
0.0245367883996 (MSE)
0.975742385929 (oob_score)
0.9754632116 (accuracy score)
0.9754632116 (RandomForestClassifier's own score method)
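I made sure to use a held-out test set, so that I would see terrible results if I really were overfitting. However, the results still look very good. As an ML newbie, I'd really appreciate a second pair of eyes on this.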
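For reference, here is a minimal sketch of an extra sanity check that could sit alongside the held-out test set: compare training accuracy, test accuracy, and a majority-class baseline. It runs on synthetic data from make_classification as a stand-in, since alldata60.csv is not shown here. The helper name overfit_report, the 90/10 class balance, and n_estimators=100 are assumptions made for illustration only, not part of my actual setup.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split


def overfit_report(X, y, random_state=0):
    """Return train, test, and majority-class baseline accuracy.

    Hypothetical helper: a large gap between 'train' and 'test' hints at
    overfitting, while a 'test' score close to 'baseline' hints that class
    imbalance, not the model, explains the high accuracy.
    """
    Xr, Xt, yr, yt = train_test_split(
        X, y, random_state=random_state, stratify=y
    )
    # Fewer trees than the 2000 used above, only to keep the sketch fast.
    rfc = RandomForestClassifier(n_estimators=100, random_state=random_state)
    rfc.fit(Xr, yr)
    # Baseline that always predicts the most frequent training class.
    baseline = DummyClassifier(strategy="most_frequent").fit(Xr, yr)
    return {
        "train": rfc.score(Xr, yr),
        "test": rfc.score(Xt, yt),
        "baseline": baseline.score(Xt, yt),
    }


# Synthetic, imbalanced stand-in data (assumption: roughly 90% / 10% classes).
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, weights=[0.9, 0.1], random_state=0
)
report = overfit_report(X_demo, y_demo)
print(report)
```

On the real data this would be called as overfit_report(X, y). If the baseline alone already lands near 0.97, the accuracy above would say little about the model itself.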
Thanks in advance!