我正在运行一些机器学习代码,部分代码如下:
classifiers = [XGBClassifier(), DecisionTreeClassifier(max_depth=5),
RandomForestClassifier(max_depth=5, n_estimators=10, max_features=1),
MLPClassifier(alpha=1),
AdaBoostClassifier(),
GaussianNB(),
QuadraticDiscriminantAnalysis()]
print("Original data")
print("=============")
print(features.shape)
for name, clf in zip(names, classifiers):
print(name)
X_train, X_test, y_train, y_test = train_test_split(features, loan_status, test_size = 0.2, random_state = 0)
result = train_predict(clf, len(y_train), X_train, y_train, X_test, y_test)
print(result)
print('-----------------------------------')
print("PCA data")
print("=============")
for pca_comp in range(1,6):
print("PCA component size: " + str(pca_comp))
pca = decomposition.PCA(n_components=pca_comp)
pca.fit(features)
features_pca = pca.transform(features)
for name, clf in zip(names, classifiers):
X_train, X_test, y_train, y_test = train_test_split(features_pca, loan_status, test_size = 0.2, random_state = 0)
result = train_predict(clf, len(y_train), X_train, y_train, X_test, y_test)
print(result)
print('-----------------------------------')
有效地,我正在遍历多个分类器并打印其结果。 然后,我遍历不同的n_component大小进行PCA分解,然后再次在所有分类器上运行。
我发现,一旦我开始进行PCA,无论使用什么分类器或选择n_component的值是多少,准确性(acc_test和acc_train)都保持不变。
这是此部分代码的输出。 请注意,一旦PCA启动,“ acc_test”始终为0.8079021551332182。
很遗憾,我无法共享数据。 但是,我正在寻找代码中明显错误的地方。
谢谢
Original data
=============
(769790, 207)
XGBoost
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0,
max_depth=3, min_child_weight=1, missing=None, n_estimators=100,
n_jobs=1, nthread=None, objective='binary:logistic', random_state=0,
reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
silent=True, subsample=1)
XGBClassifier trained on 615832 samples.
{'train_time': 273.7087504863739, 'pred_time': 4.388766288757324, 'acc_train': 0.848625923953286, 'acc_test': 0.8481793735953962, 'f_train': 0.877928251001055, 'f_test': 0.8775348027423189}
-----------------------------------
Decision Tree
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=5,
max_features=None, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, presort=False, random_state=None,
splitter='best')
DecisionTreeClassifier trained on 615832 samples.
{'train_time': 11.388459920883179, 'pred_time': 0.38187479972839355, 'acc_train': 0.8347195338988556, 'acc_test': 0.8338183140856598, 'f_train': 0.8735138626721308, 'f_test': 0.8728762797972536}
-----------------------------------
Random Forest
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
max_depth=5, max_features=1, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
oob_score=False, random_state=None, verbose=0,
warm_start=False)
RandomForestClassifier trained on 615832 samples.
{'train_time': 1.3620502948760986, 'pred_time': 0.8454875946044922, 'acc_train': 0.8074556047753283, 'acc_test': 0.8079021551332182, 'f_train': 0.83979517561563, 'f_test': 0.8401815688685045}
-----------------------------------
Neural Net
MLPClassifier(activation='relu', alpha=1, batch_size='auto', beta_1=0.9,
beta_2=0.999, early_stopping=False, epsilon=1e-08,
hidden_layer_sizes=(100,), learning_rate='constant',
learning_rate_init=0.001, max_iter=200, momentum=0.9,
nesterovs_momentum=True, power_t=0.5, random_state=None,
shuffle=True, solver='adam', tol=0.0001, validation_fraction=0.1,
verbose=False, warm_start=False)
MLPClassifier trained on 615832 samples.
{'train_time': 130.09251832962036, 'pred_time': 8.788004636764526, 'acc_train': 0.810022863378324, 'acc_test': 0.8106106860312553, 'f_train': 0.8429408284567822, 'f_test': 0.84336348394109}
-----------------------------------
AdaBoost
AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None,
learning_rate=1.0, n_estimators=50, random_state=None)
AdaBoostClassifier trained on 615832 samples.
{'train_time': 114.49720454216003, 'pred_time': 6.846264839172363, 'acc_train': 0.8319898933475364, 'acc_test': 0.830836981514439, 'f_train': 0.8676524880554248, 'f_test': 0.866917350579005}
-----------------------------------
Naive Bayes
GaussianNB(priors=None)
GaussianNB trained on 615832 samples.
{'train_time': 2.338545322418213, 'pred_time': 2.913602828979492, 'acc_train': 0.696707868379688, 'acc_test': 0.6979565855622962, 'f_train': 0.8374139063372146, 'f_test': 0.8381986507744102}
-----------------------------------
QDA
QuadraticDiscriminantAnalysis(priors=None, reg_param=0.0,
store_covariance=False, store_covariances=None, tol=0.0001)
QuadraticDiscriminantAnalysis trained on 615832 samples.
{'train_time': 17.64940857887268, 'pred_time': 6.382497072219849, 'acc_train': 0.5545554631782694, 'acc_test': 0.5551124332610192, 'f_train': 0.7616845459479327, 'f_test': 0.7619965387905216}
-----------------------------------
PCA data
=============
PCA component size: 1
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0,
max_depth=3, min_child_weight=1, missing=None, n_estimators=100,
n_jobs=1, nthread=None, objective='binary:logistic', random_state=0,
reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
silent=True, subsample=1)
XGBClassifier trained on 615832 samples.
{'train_time': 12.907331943511963, 'pred_time': 2.0308330059051514, 'acc_train': 0.8074556047753283, 'acc_test': 0.8079021551332182, 'f_train': 0.83979517561563, 'f_test': 0.8401815688685045}
-----------------------------------
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=5,
max_features=None, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, presort=False, random_state=None,
splitter='best')
DecisionTreeClassifier trained on 615832 samples.
{'train_time': 0.6030781269073486, 'pred_time': 0.03420734405517578, 'acc_train': 0.8074718429701607, 'acc_test': 0.8079021551332182, 'f_train': 0.8398076830188118, 'f_test': 0.8401815688685045}
-----------------------------------
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
max_depth=5, max_features=1, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
oob_score=False, random_state=None, verbose=0,
warm_start=False)
RandomForestClassifier trained on 615832 samples.
{'train_time': 4.2026519775390625, 'pred_time': 0.5144689083099365, 'acc_train': 0.8074556047753283, 'acc_test': 0.8079021551332182, 'f_train': 0.83979517561563, 'f_test': 0.8401815688685045}
-----------------------------------
MLPClassifier(activation='relu', alpha=1, batch_size='auto', beta_1=0.9,
beta_2=0.999, early_stopping=False, epsilon=1e-08,
hidden_layer_sizes=(100,), learning_rate='constant',
learning_rate_init=0.001, max_iter=200, momentum=0.9,
nesterovs_momentum=True, power_t=0.5, random_state=None,
shuffle=True, solver='adam', tol=0.0001, validation_fraction=0.1,
verbose=False, warm_start=False)
MLPClassifier trained on 615832 samples.
{'train_time': 13.960830450057983, 'pred_time': 0.7337024211883545, 'acc_train': 0.8074556047753283, 'acc_test': 0.8079021551332182, 'f_train': 0.83979517561563, 'f_test': 0.8401815688685045}
-----------------------------------
AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None,
learning_rate=1.0, n_estimators=50, random_state=None)
AdaBoostClassifier trained on 615832 samples.
{'train_time': 9.310431957244873, 'pred_time': 2.949209451675415, 'acc_train': 0.807460476233778, 'acc_test': 0.8078956598552852, 'f_train': 0.8398003208188749, 'f_test': 0.8401793542652027}
-----------------------------------
GaussianNB(priors=None)
GaussianNB trained on 615832 samples.
{'train_time': 0.028026819229125977, 'pred_time': 0.019958019256591797, 'acc_train': 0.8074556047753283, 'acc_test': 0.8079021551332182, 'f_train': 0.83979517561563, 'f_test': 0.8401815688685045}
-----------------------------------
QuadraticDiscriminantAnalysis(priors=None, reg_param=0.0,
store_covariance=False, store_covariances=None, tol=0.0001)
QuadraticDiscriminantAnalysis trained on 615832 samples.
{'train_time': 0.039576053619384766, 'pred_time': 0.021703481674194336, 'acc_train': 0.8074556047753283, 'acc_test': 0.8079021551332182, 'f_train': 0.83979517561563, 'f_test': 0.8401815688685045}
-----------------------------------
PCA component size: 2
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0,
max_depth=3, min_child_weight=1, missing=None, n_estimators=100,
n_jobs=1, nthread=None, objective='binary:logistic', random_state=0,
reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
silent=True, subsample=1)
XGBClassifier trained on 615832 samples.
{'train_time': 17.529640436172485, 'pred_time': 2.1811327934265137, 'acc_train': 0.8074556047753283, 'acc_test': 0.8079021551332182, 'f_train': 0.83979517561563, 'f_test': 0.8401815688685045}
-----------------------------------
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=5,
max_features=None, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, presort=False, random_state=None,
splitter='best')
DecisionTreeClassifier trained on 615832 samples.
{'train_time': 0.9235944747924805, 'pred_time': 0.03514695167541504, 'acc_train': 0.8074588524142948, 'acc_test': 0.8079021551332182, 'f_train': 0.8397974448899658, 'f_test': 0.8401815688685045}
-----------------------------------
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
max_depth=5, max_features=1, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
oob_score=False, random_state=None, verbose=0,
warm_start=False)
RandomForestClassifier trained on 615832 samples.
{'train_time': 3.8425581455230713, 'pred_time': 0.519752025604248, 'acc_train': 0.8074556047753283, 'acc_test': 0.8079021551332182, 'f_train': 0.83979517561563, 'f_test': 0.8401815688685045}
-----------------------------------
MLPClassifier(activation='relu', alpha=1, batch_size='auto', beta_1=0.9,
beta_2=0.999, early_stopping=False, epsilon=1e-08,
hidden_layer_sizes=(100,), learning_rate='constant',
learning_rate_init=0.001, max_iter=200, momentum=0.9,
nesterovs_momentum=True, power_t=0.5, random_state=None,
shuffle=True, solver='adam', tol=0.0001, validation_fraction=0.1,
verbose=False, warm_start=False)
MLPClassifier trained on 615832 samples.
{'train_time': 17.796229362487793, 'pred_time': 1.4105899333953857, 'acc_train': 0.8074556047753283, 'acc_test': 0.8079021551332182, 'f_train': 0.83979517561563, 'f_test': 0.8401815688685045}
-----------------------------------
AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None,
learning_rate=1.0, n_estimators=50, random_state=None)
AdaBoostClassifier trained on 615832 samples.
{'train_time': 14.433330059051514, 'pred_time': 2.9874980449676514, 'acc_train': 0.8074556047753283, 'acc_test': 0.8079021551332182, 'f_train': 0.83979517561563, 'f_test': 0.8401815688685045}
-----------------------------------
GaussianNB(priors=None)
GaussianNB trained on 615832 samples.
{'train_time': 0.09282994270324707, 'pred_time': 0.06884241104125977, 'acc_train': 0.8074556047753283, 'acc_test': 0.8079021551332182, 'f_train': 0.83979517561563, 'f_test': 0.8401815688685045}
-----------------------------------
QuadraticDiscriminantAnalysis(priors=None, reg_param=0.0,
store_covariance=False, store_covariances=None, tol=0.0001)
QuadraticDiscriminantAnalysis trained on 615832 samples.
{'train_time': 0.06534266471862793, 'pred_time': 0.06316208839416504, 'acc_train': 0.8074556047753283, 'acc_test': 0.8079021551332182, 'f_train': 0.83979517561563, 'f_test': 0.8401815688685045}
-----------------------------------
PCA component size: 3
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0,
max_depth=3, min_child_weight=1, missing=None, n_estimators=100,
n_jobs=1, nthread=None, objective='binary:logistic', random_state=0,
reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
silent=True, subsample=1)
XGBClassifier trained on 615832 samples.
{'train_time': 22.586288690567017, 'pred_time': 2.132150650024414, 'acc_train': 0.8074556047753283, 'acc_test': 0.8079021551332182, 'f_train': 0.83979517561563, 'f_test': 0.8401815688685045}
-----------------------------------
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=5,
max_features=None, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, presort=False, random_state=None,
splitter='best')
DecisionTreeClassifier trained on 615832 samples.
{'train_time': 1.3756062984466553, 'pred_time': 0.0391697883605957, 'acc_train': 0.8074556047753283, 'acc_test': 0.8079021551332182, 'f_train': 0.83979517561563, 'f_test': 0.8401815688685045}
-----------------------------------
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
max_depth=5, max_features=1, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
oob_score=False, random_state=None, verbose=0,
warm_start=False)
RandomForestClassifier trained on 615832 samples.
{'train_time': 3.6991543769836426, 'pred_time': 0.5463252067565918, 'acc_train': 0.8074556047753283, 'acc_test': 0.8079021551332182, 'f_train': 0.83979517561563, 'f_test': 0.8401815688685045}
-----------------------------------
MLPClassifier(activation='relu', alpha=1, batch_size='auto', beta_1=0.9,
beta_2=0.999, early_stopping=False, epsilon=1e-08,
hidden_layer_sizes=(100,), learning_rate='constant',
learning_rate_init=0.001, max_iter=200, momentum=0.9,
nesterovs_momentum=True, power_t=0.5, random_state=None,
shuffle=True, solver='adam', tol=0.0001, validation_fraction=0.1,
verbose=False, warm_start=False)
MLPClassifier trained on 615832 samples.
{'train_time': 13.745409488677979, 'pred_time': 1.617872714996338, 'acc_train': 0.8074556047753283, 'acc_test': 0.8079021551332182, 'f_train': 0.83979517561563, 'f_test': 0.8401815688685045}
-----------------------------------
AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None,
learning_rate=1.0, n_estimators=50, random_state=None)
AdaBoostClassifier trained on 615832 samples.
{'train_time': 18.745909929275513, 'pred_time': 3.02945613861084, 'acc_train': 0.8074539809558451, 'acc_test': 0.8078956598552852, 'f_train': 0.8397946213935711, 'f_test': 0.8401793542652027}
-----------------------------------
GaussianNB(priors=None)
GaussianNB trained on 615832 samples.
{'train_time': 0.09948086738586426, 'pred_time': 0.07936644554138184, 'acc_train': 0.8074556047753283, 'acc_test': 0.8079021551332182, 'f_train': 0.83979517561563, 'f_test': 0.8401815688685045}
-----------------------------------
QuadraticDiscriminantAnalysis(priors=None, reg_param=0.0,
store_covariance=False, store_covariances=None, tol=0.0001)
QuadraticDiscriminantAnalysis trained on 615832 samples.
{'train_time': 0.07803058624267578, 'pred_time': 0.07502388954162598, 'acc_train': 0.8074556047753283, 'acc_test': 0.8079021551332182, 'f_train': 0.83979517561563, 'f_test': 0.8401815688685045}
-----------------------------------
PCA component size: 4
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0,
max_depth=3, min_child_weight=1, missing=None, n_estimators=100,
n_jobs=1, nthread=None, objective='binary:logistic', random_state=0,
reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
silent=True, subsample=1)
XGBClassifier trained on 615832 samples.
{'train_time': 28.096595287322998, 'pred_time': 2.079728364944458, 'acc_train': 0.8074556047753283, 'acc_test': 0.8079021551332182, 'f_train': 0.83979517561563, 'f_test': 0.8401815688685045}
-----------------------------------
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=5,
max_features=None, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, presort=False, random_state=None,
splitter='best')
DecisionTreeClassifier trained on 615832 samples.
{'train_time': 1.9280765056610107, 'pred_time': 0.04021263122558594, 'acc_train': 0.8074556047753283, 'acc_test': 0.8079021551332182, 'f_train': 0.83979517561563, 'f_test': 0.8401815688685045}
-----------------------------------
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
max_depth=5, max_features=1, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
oob_score=False, random_state=None, verbose=0,
warm_start=False)
RandomForestClassifier trained on 615832 samples.
{'train_time': 4.067602872848511, 'pred_time': 0.5436885356903076, 'acc_train': 0.8074556047753283, 'acc_test': 0.8079021551332182, 'f_train': 0.83979517561563, 'f_test': 0.8401815688685045}
-----------------------------------
MLPClassifier(activation='relu', alpha=1, batch_size='auto', beta_1=0.9,
beta_2=0.999, early_stopping=False, epsilon=1e-08,
hidden_layer_sizes=(100,), learning_rate='constant',
learning_rate_init=0.001, max_iter=200, momentum=0.9,
nesterovs_momentum=True, power_t=0.5, random_state=None,
shuffle=True, solver='adam', tol=0.0001, validation_fraction=0.1,
verbose=False, warm_start=False)
MLPClassifier trained on 615832 samples.
{'train_time': 18.260048389434814, 'pred_time': 2.397339344024658, 'acc_train': 0.8074556047753283, 'acc_test': 0.8079021551332182, 'f_train': 0.83979517561563, 'f_test': 0.8401815688685045}
-----------------------------------
AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None,
learning_rate=1.0, n_estimators=50, random_state=None)
AdaBoostClassifier trained on 615832 samples.
{'train_time': 24.486289501190186, 'pred_time': 3.059351921081543, 'acc_train': 0.8074556047753283, 'acc_test': 0.8079021551332182, 'f_train': 0.83979517561563, 'f_test': 0.8401815688685045}
-----------------------------------
GaussianNB(priors=None)
GaussianNB trained on 615832 samples.
{'train_time': 0.10924768447875977, 'pred_time': 0.08964681625366211, 'acc_train': 0.8074556047753283, 'acc_test': 0.8079021551332182, 'f_train': 0.83979517561563, 'f_test': 0.8401815688685045}
-----------------------------------
QuadraticDiscriminantAnalysis(priors=None, reg_param=0.0,
store_covariance=False, store_covariances=None, tol=0.0001)
QuadraticDiscriminantAnalysis trained on 615832 samples.
{'train_time': 0.09738326072692871, 'pred_time': 0.08622312545776367, 'acc_train': 0.8074556047753283, 'acc_test': 0.8079021551332182, 'f_train': 0.83979517561563, 'f_test': 0.8401815688685045}
-----------------------------------
答案 0 :(得分:0)
我在您的代码中没有发现任何明显错误。
一些想法:
我希望当您将n_components
降低到1时,分类器将越来越相似,但与您观察到的不一样。
您仅在(1,6)
个PCA组件之间循环。通过遍历(1,10,20,30,100)
个组件,验证分类器是否正确训练。如果分类器仍然具有相同的性能,则说明您做错了-
也许还可以查看并手动验证在PCA features
期间transform
没有发生古怪的事情。只需执行相同的代码,然后查看新功能直方图,就可能会发生一些奇怪的事情。
检查解释的差异,并确保其他组件正在添加信息。 print(pca.explained_variance_ratio_)
鉴于所有207个features
的分类器有多么相似,一旦运行PCA
,它们可能会看到相同的东西。
使用默认参数(即非常简单的分类器),分类器在(1,6)
组件上的行为可能相同,但可能性不大。
还请确保您正确循环(看起来像您一样)并坚持进行一些健全性检查。祝你好运!