I'm stuck on a data science problem.
I'm trying to use a random forest to predict some classes in the future.
My features are categorical and numerical.
My classes are imbalanced.
When I run my fit, the score looks very good, but cross-validation looks terrible.
My model must be overfitting.
Here is my code:
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, cross_val_score

features_cat = ["area", "country", "id", "company", "unit"]
features_num = ["year", "week"]
classes = ["type"]
print("Data",len(data_forest))
print(data_forest["type"].value_counts(normalize=True))
# One-hot encode the categorical features
X_cat = pd.get_dummies(data_forest[features_cat])
print("Cat features dummies", len(X_cat.columns))
X_num = data_forest[features_num]
X = pd.concat([X_cat,X_num],axis=1)
X.index = range(1,len(X) + 1)
y = data_forest[classes].values.ravel()
test_size = 0.2
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=42)
forest = RandomForestClassifier(n_estimators=50, n_jobs=4, oob_score=True, max_features="log2", criterion="entropy")
forest.fit(X_train, y_train)
score = forest.score(X_test, y_test)
print("Score on Random Test Sample:",score)
# Score again on only the minority-class rows (B and C) of the whole dataset,
# training rows included
X_BC = X[y != "A"]
y_BC = y[y != "A"]
score = forest.score(X_BC, y_BC)
print("Score on only Bs, Cs rows of all dataset:", score)
Here is the output:
Data 768296
A 0.845970
B 0.098916
C 0.055114
Name: type, dtype: float64
Cat features dummies 725
Score on Random Test Sample: 0.961434335546
Score on only Bs, Cs rows of all dataset: 0.959194193052
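For context, class A alone accounts for about 84.6% of the rows, so a per-class breakdown says more than a single accuracy number; a minimal sketch using scikit-learn's classification_report on the held-out split (these calls are not in the original post):

from sklearn.metrics import classification_report, confusion_matrix

# Per-class precision/recall/F1 on the held-out test split; this shows how
# much of the 0.96 accuracy is carried by the majority class A.
y_pred = forest.predict(X_test)
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred, labels=["A", "B", "C"]))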
So far, I'm happy with the model...
But when I try to predict future dates, it gives roughly the same result every time.
So I checked with cross-validation:
rf = RandomForestClassifier(n_estimators=50, n_jobs=4, oob_score=True, max_features="log2", criterion="entropy")
scores = cross_val_score(rf, X, y, cv=5, n_jobs=4)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
It gives me poor results...
Accuracy: 0.55 (+/- 0.57)
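With classes this imbalanced, plain accuracy is also hard to interpret across folds; one option is to score the same cross-validation with a class-balanced metric. A minimal sketch (assuming a scikit-learn version that ships the balanced_accuracy scorer; this is not in the original post):

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Balanced accuracy averages recall over the three classes, so a model that
# only ever predicts class A scores about 0.33 instead of about 0.85.
rf = RandomForestClassifier(n_estimators=50, n_jobs=4, max_features="log2", criterion="entropy")
scores = cross_val_score(rf, X, y, cv=5, scoring="balanced_accuracy", n_jobs=4)
print("Balanced accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))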
What am I missing?
Answer 0 (score: 0):
What if you change (or remove) random_state? By default, train_test_split is not stratified, so it could be that your classifier is simply predicting the most frequent class A, and the test split of that particular partition happens to contain only A's.
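To illustrate the answer's suggestion: train_test_split accepts a stratify argument, and the cross-validation folds can be made stratified and shuffled explicitly. A minimal sketch of that variant (a suggested change, not the original poster's code):

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score

# Stratified split: each class keeps roughly its overall proportion in both
# the training and the test set, so the test set cannot end up as all A's.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

forest = RandomForestClassifier(n_estimators=50, n_jobs=4, max_features="log2", criterion="entropy")
forest.fit(X_train, y_train)
print("Stratified test score:", forest.score(X_test, y_test))

# Stratified, shuffled folds for the cross-validation as well.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(forest, X, y, cv=cv, n_jobs=4)
print("CV accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))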