All classifiers give me exactly the same accuracy after PCA decomposition

Date: 2018-11-27 20:51:47

Tags: machine-learning pca

I am running some machine learning code, part of which is shown below:

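from sklearn import decomposition
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from xgboost import XGBClassifier

# names, features, loan_status and train_predict are defined elsewhere (not shown)
classifiers = [
    XGBClassifier(),
    DecisionTreeClassifier(max_depth=5),
    RandomForestClassifier(max_depth=5, n_estimators=10, max_features=1),
    MLPClassifier(alpha=1),
    AdaBoostClassifier(),
    GaussianNB(),
    QuadraticDiscriminantAnalysis(),
]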

print("Original data")
print("=============")
print(features.shape)
for name, clf in zip(names, classifiers):
    print(name)
    X_train, X_test, y_train, y_test = train_test_split(features, loan_status, test_size = 0.2, random_state = 0)
    result = train_predict(clf, len(y_train), X_train, y_train, X_test, y_test)
    print(result)
    print('-----------------------------------')

print("PCA data")
print("=============")
for pca_comp in range(1,6):
    print("PCA component size: " + str(pca_comp))
    pca = decomposition.PCA(n_components=pca_comp)
    pca.fit(features)
    features_pca = pca.transform(features)
    for name, clf in zip(names, classifiers):
        X_train, X_test, y_train, y_test = train_test_split(features_pca, loan_status, test_size = 0.2, random_state = 0)
        result = train_predict(clf, len(y_train), X_train, y_train, X_test, y_test)
        print(result)
        print('-----------------------------------')

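In effect, I loop over several classifiers and print their results. I then loop over different n_components values for the PCA decomposition and run all the classifiers again.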

I found that once I start using PCA, the accuracy (acc_test and acc_train) stays the same regardless of which classifier I use or which n_components value I choose.

Here is the output of this part of the code. Note that once PCA starts, "acc_test" is always 0.8079021551332182.

Unfortunately, I cannot share the data. However, I am looking for anything obviously wrong in the code.

Thanks

Original data
=============
(769790, 207)
XGBoost
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0,
       max_depth=3, min_child_weight=1, missing=None, n_estimators=100,
       n_jobs=1, nthread=None, objective='binary:logistic', random_state=0,
       reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
       silent=True, subsample=1)
XGBClassifier trained on 615832 samples.
{'train_time': 273.7087504863739, 'pred_time': 4.388766288757324, 'acc_train': 0.848625923953286, 'acc_test': 0.8481793735953962, 'f_train': 0.877928251001055, 'f_test': 0.8775348027423189}
-----------------------------------
Decision Tree
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=5,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')
DecisionTreeClassifier trained on 615832 samples.
{'train_time': 11.388459920883179, 'pred_time': 0.38187479972839355, 'acc_train': 0.8347195338988556, 'acc_test': 0.8338183140856598, 'f_train': 0.8735138626721308, 'f_test': 0.8728762797972536}
-----------------------------------
Random Forest
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=5, max_features=1, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)
RandomForestClassifier trained on 615832 samples.
{'train_time': 1.3620502948760986, 'pred_time': 0.8454875946044922, 'acc_train': 0.8074556047753283, 'acc_test': 0.8079021551332182, 'f_train': 0.83979517561563, 'f_test': 0.8401815688685045}
-----------------------------------
Neural Net
MLPClassifier(activation='relu', alpha=1, batch_size='auto', beta_1=0.9,
       beta_2=0.999, early_stopping=False, epsilon=1e-08,
       hidden_layer_sizes=(100,), learning_rate='constant',
       learning_rate_init=0.001, max_iter=200, momentum=0.9,
       nesterovs_momentum=True, power_t=0.5, random_state=None,
       shuffle=True, solver='adam', tol=0.0001, validation_fraction=0.1,
       verbose=False, warm_start=False)
MLPClassifier trained on 615832 samples.
{'train_time': 130.09251832962036, 'pred_time': 8.788004636764526, 'acc_train': 0.810022863378324, 'acc_test': 0.8106106860312553, 'f_train': 0.8429408284567822, 'f_test': 0.84336348394109}
-----------------------------------
AdaBoost
AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None,
          learning_rate=1.0, n_estimators=50, random_state=None)
AdaBoostClassifier trained on 615832 samples.
{'train_time': 114.49720454216003, 'pred_time': 6.846264839172363, 'acc_train': 0.8319898933475364, 'acc_test': 0.830836981514439, 'f_train': 0.8676524880554248, 'f_test': 0.866917350579005}
-----------------------------------
Naive Bayes
GaussianNB(priors=None)
GaussianNB trained on 615832 samples.
{'train_time': 2.338545322418213, 'pred_time': 2.913602828979492, 'acc_train': 0.696707868379688, 'acc_test': 0.6979565855622962, 'f_train': 0.8374139063372146, 'f_test': 0.8381986507744102}
-----------------------------------
QDA
QuadraticDiscriminantAnalysis(priors=None, reg_param=0.0,
               store_covariance=False, store_covariances=None, tol=0.0001)
QuadraticDiscriminantAnalysis trained on 615832 samples.
{'train_time': 17.64940857887268, 'pred_time': 6.382497072219849, 'acc_train': 0.5545554631782694, 'acc_test': 0.5551124332610192, 'f_train': 0.7616845459479327, 'f_test': 0.7619965387905216}
-----------------------------------
PCA data
=============
PCA component size: 1
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0,
       max_depth=3, min_child_weight=1, missing=None, n_estimators=100,
       n_jobs=1, nthread=None, objective='binary:logistic', random_state=0,
       reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
       silent=True, subsample=1)
XGBClassifier trained on 615832 samples.
{'train_time': 12.907331943511963, 'pred_time': 2.0308330059051514, 'acc_train': 0.8074556047753283, 'acc_test': 0.8079021551332182, 'f_train': 0.83979517561563, 'f_test': 0.8401815688685045}
-----------------------------------
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=5,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')
DecisionTreeClassifier trained on 615832 samples.
{'train_time': 0.6030781269073486, 'pred_time': 0.03420734405517578, 'acc_train': 0.8074718429701607, 'acc_test': 0.8079021551332182, 'f_train': 0.8398076830188118, 'f_test': 0.8401815688685045}
-----------------------------------
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=5, max_features=1, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)
RandomForestClassifier trained on 615832 samples.
{'train_time': 4.2026519775390625, 'pred_time': 0.5144689083099365, 'acc_train': 0.8074556047753283, 'acc_test': 0.8079021551332182, 'f_train': 0.83979517561563, 'f_test': 0.8401815688685045}
-----------------------------------
MLPClassifier(activation='relu', alpha=1, batch_size='auto', beta_1=0.9,
       beta_2=0.999, early_stopping=False, epsilon=1e-08,
       hidden_layer_sizes=(100,), learning_rate='constant',
       learning_rate_init=0.001, max_iter=200, momentum=0.9,
       nesterovs_momentum=True, power_t=0.5, random_state=None,
       shuffle=True, solver='adam', tol=0.0001, validation_fraction=0.1,
       verbose=False, warm_start=False)
MLPClassifier trained on 615832 samples.
{'train_time': 13.960830450057983, 'pred_time': 0.7337024211883545, 'acc_train': 0.8074556047753283, 'acc_test': 0.8079021551332182, 'f_train': 0.83979517561563, 'f_test': 0.8401815688685045}
-----------------------------------
AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None,
          learning_rate=1.0, n_estimators=50, random_state=None)
AdaBoostClassifier trained on 615832 samples.
{'train_time': 9.310431957244873, 'pred_time': 2.949209451675415, 'acc_train': 0.807460476233778, 'acc_test': 0.8078956598552852, 'f_train': 0.8398003208188749, 'f_test': 0.8401793542652027}
-----------------------------------
GaussianNB(priors=None)
GaussianNB trained on 615832 samples.
{'train_time': 0.028026819229125977, 'pred_time': 0.019958019256591797, 'acc_train': 0.8074556047753283, 'acc_test': 0.8079021551332182, 'f_train': 0.83979517561563, 'f_test': 0.8401815688685045}
-----------------------------------
QuadraticDiscriminantAnalysis(priors=None, reg_param=0.0,
               store_covariance=False, store_covariances=None, tol=0.0001)
QuadraticDiscriminantAnalysis trained on 615832 samples.
{'train_time': 0.039576053619384766, 'pred_time': 0.021703481674194336, 'acc_train': 0.8074556047753283, 'acc_test': 0.8079021551332182, 'f_train': 0.83979517561563, 'f_test': 0.8401815688685045}
-----------------------------------
PCA component size: 2
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0,
       max_depth=3, min_child_weight=1, missing=None, n_estimators=100,
       n_jobs=1, nthread=None, objective='binary:logistic', random_state=0,
       reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
       silent=True, subsample=1)
XGBClassifier trained on 615832 samples.
{'train_time': 17.529640436172485, 'pred_time': 2.1811327934265137, 'acc_train': 0.8074556047753283, 'acc_test': 0.8079021551332182, 'f_train': 0.83979517561563, 'f_test': 0.8401815688685045}
-----------------------------------
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=5,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')
DecisionTreeClassifier trained on 615832 samples.
{'train_time': 0.9235944747924805, 'pred_time': 0.03514695167541504, 'acc_train': 0.8074588524142948, 'acc_test': 0.8079021551332182, 'f_train': 0.8397974448899658, 'f_test': 0.8401815688685045}
-----------------------------------
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=5, max_features=1, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)
RandomForestClassifier trained on 615832 samples.
{'train_time': 3.8425581455230713, 'pred_time': 0.519752025604248, 'acc_train': 0.8074556047753283, 'acc_test': 0.8079021551332182, 'f_train': 0.83979517561563, 'f_test': 0.8401815688685045}
-----------------------------------
MLPClassifier(activation='relu', alpha=1, batch_size='auto', beta_1=0.9,
       beta_2=0.999, early_stopping=False, epsilon=1e-08,
       hidden_layer_sizes=(100,), learning_rate='constant',
       learning_rate_init=0.001, max_iter=200, momentum=0.9,
       nesterovs_momentum=True, power_t=0.5, random_state=None,
       shuffle=True, solver='adam', tol=0.0001, validation_fraction=0.1,
       verbose=False, warm_start=False)
MLPClassifier trained on 615832 samples.
{'train_time': 17.796229362487793, 'pred_time': 1.4105899333953857, 'acc_train': 0.8074556047753283, 'acc_test': 0.8079021551332182, 'f_train': 0.83979517561563, 'f_test': 0.8401815688685045}
-----------------------------------
AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None,
          learning_rate=1.0, n_estimators=50, random_state=None)
AdaBoostClassifier trained on 615832 samples.
{'train_time': 14.433330059051514, 'pred_time': 2.9874980449676514, 'acc_train': 0.8074556047753283, 'acc_test': 0.8079021551332182, 'f_train': 0.83979517561563, 'f_test': 0.8401815688685045}
-----------------------------------
GaussianNB(priors=None)
GaussianNB trained on 615832 samples.
{'train_time': 0.09282994270324707, 'pred_time': 0.06884241104125977, 'acc_train': 0.8074556047753283, 'acc_test': 0.8079021551332182, 'f_train': 0.83979517561563, 'f_test': 0.8401815688685045}
-----------------------------------
QuadraticDiscriminantAnalysis(priors=None, reg_param=0.0,
               store_covariance=False, store_covariances=None, tol=0.0001)
QuadraticDiscriminantAnalysis trained on 615832 samples.
{'train_time': 0.06534266471862793, 'pred_time': 0.06316208839416504, 'acc_train': 0.8074556047753283, 'acc_test': 0.8079021551332182, 'f_train': 0.83979517561563, 'f_test': 0.8401815688685045}
-----------------------------------
PCA component size: 3
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0,
       max_depth=3, min_child_weight=1, missing=None, n_estimators=100,
       n_jobs=1, nthread=None, objective='binary:logistic', random_state=0,
       reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
       silent=True, subsample=1)
XGBClassifier trained on 615832 samples.
{'train_time': 22.586288690567017, 'pred_time': 2.132150650024414, 'acc_train': 0.8074556047753283, 'acc_test': 0.8079021551332182, 'f_train': 0.83979517561563, 'f_test': 0.8401815688685045}
-----------------------------------
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=5,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')
DecisionTreeClassifier trained on 615832 samples.
{'train_time': 1.3756062984466553, 'pred_time': 0.0391697883605957, 'acc_train': 0.8074556047753283, 'acc_test': 0.8079021551332182, 'f_train': 0.83979517561563, 'f_test': 0.8401815688685045}
-----------------------------------
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=5, max_features=1, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)
RandomForestClassifier trained on 615832 samples.
{'train_time': 3.6991543769836426, 'pred_time': 0.5463252067565918, 'acc_train': 0.8074556047753283, 'acc_test': 0.8079021551332182, 'f_train': 0.83979517561563, 'f_test': 0.8401815688685045}
-----------------------------------
MLPClassifier(activation='relu', alpha=1, batch_size='auto', beta_1=0.9,
       beta_2=0.999, early_stopping=False, epsilon=1e-08,
       hidden_layer_sizes=(100,), learning_rate='constant',
       learning_rate_init=0.001, max_iter=200, momentum=0.9,
       nesterovs_momentum=True, power_t=0.5, random_state=None,
       shuffle=True, solver='adam', tol=0.0001, validation_fraction=0.1,
       verbose=False, warm_start=False)
MLPClassifier trained on 615832 samples.
{'train_time': 13.745409488677979, 'pred_time': 1.617872714996338, 'acc_train': 0.8074556047753283, 'acc_test': 0.8079021551332182, 'f_train': 0.83979517561563, 'f_test': 0.8401815688685045}
-----------------------------------
AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None,
          learning_rate=1.0, n_estimators=50, random_state=None)
AdaBoostClassifier trained on 615832 samples.
{'train_time': 18.745909929275513, 'pred_time': 3.02945613861084, 'acc_train': 0.8074539809558451, 'acc_test': 0.8078956598552852, 'f_train': 0.8397946213935711, 'f_test': 0.8401793542652027}
-----------------------------------
GaussianNB(priors=None)
GaussianNB trained on 615832 samples.
{'train_time': 0.09948086738586426, 'pred_time': 0.07936644554138184, 'acc_train': 0.8074556047753283, 'acc_test': 0.8079021551332182, 'f_train': 0.83979517561563, 'f_test': 0.8401815688685045}
-----------------------------------
QuadraticDiscriminantAnalysis(priors=None, reg_param=0.0,
               store_covariance=False, store_covariances=None, tol=0.0001)
QuadraticDiscriminantAnalysis trained on 615832 samples.
{'train_time': 0.07803058624267578, 'pred_time': 0.07502388954162598, 'acc_train': 0.8074556047753283, 'acc_test': 0.8079021551332182, 'f_train': 0.83979517561563, 'f_test': 0.8401815688685045}
-----------------------------------
PCA component size: 4
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0,
       max_depth=3, min_child_weight=1, missing=None, n_estimators=100,
       n_jobs=1, nthread=None, objective='binary:logistic', random_state=0,
       reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
       silent=True, subsample=1)
XGBClassifier trained on 615832 samples.
{'train_time': 28.096595287322998, 'pred_time': 2.079728364944458, 'acc_train': 0.8074556047753283, 'acc_test': 0.8079021551332182, 'f_train': 0.83979517561563, 'f_test': 0.8401815688685045}
-----------------------------------
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=5,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')
DecisionTreeClassifier trained on 615832 samples.
{'train_time': 1.9280765056610107, 'pred_time': 0.04021263122558594, 'acc_train': 0.8074556047753283, 'acc_test': 0.8079021551332182, 'f_train': 0.83979517561563, 'f_test': 0.8401815688685045}
-----------------------------------
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=5, max_features=1, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)
RandomForestClassifier trained on 615832 samples.
{'train_time': 4.067602872848511, 'pred_time': 0.5436885356903076, 'acc_train': 0.8074556047753283, 'acc_test': 0.8079021551332182, 'f_train': 0.83979517561563, 'f_test': 0.8401815688685045}
-----------------------------------
MLPClassifier(activation='relu', alpha=1, batch_size='auto', beta_1=0.9,
       beta_2=0.999, early_stopping=False, epsilon=1e-08,
       hidden_layer_sizes=(100,), learning_rate='constant',
       learning_rate_init=0.001, max_iter=200, momentum=0.9,
       nesterovs_momentum=True, power_t=0.5, random_state=None,
       shuffle=True, solver='adam', tol=0.0001, validation_fraction=0.1,
       verbose=False, warm_start=False)
MLPClassifier trained on 615832 samples.
{'train_time': 18.260048389434814, 'pred_time': 2.397339344024658, 'acc_train': 0.8074556047753283, 'acc_test': 0.8079021551332182, 'f_train': 0.83979517561563, 'f_test': 0.8401815688685045}
-----------------------------------
AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None,
          learning_rate=1.0, n_estimators=50, random_state=None)
AdaBoostClassifier trained on 615832 samples.
{'train_time': 24.486289501190186, 'pred_time': 3.059351921081543, 'acc_train': 0.8074556047753283, 'acc_test': 0.8079021551332182, 'f_train': 0.83979517561563, 'f_test': 0.8401815688685045}
-----------------------------------
GaussianNB(priors=None)
GaussianNB trained on 615832 samples.
{'train_time': 0.10924768447875977, 'pred_time': 0.08964681625366211, 'acc_train': 0.8074556047753283, 'acc_test': 0.8079021551332182, 'f_train': 0.83979517561563, 'f_test': 0.8401815688685045}
-----------------------------------
QuadraticDiscriminantAnalysis(priors=None, reg_param=0.0,
               store_covariance=False, store_covariances=None, tol=0.0001)
QuadraticDiscriminantAnalysis trained on 615832 samples.
{'train_time': 0.09738326072692871, 'pred_time': 0.08622312545776367, 'acc_train': 0.8074556047753283, 'acc_test': 0.8079021551332182, 'f_train': 0.83979517561563, 'f_test': 0.8401815688685045}
-----------------------------------

1 Answer:

Answer 0 (score: 0)

I don't see anything obviously wrong with your code.

Some ideas:

I would expect the classifiers to become more and more similar as you decrease n_components towards 1, but not as similar as what you observe.

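You only loop over 1 to 5 PCA components (range(1, 6)). Verify that the classifiers are training properly by looping over (1, 10, 20, 30, 100) components. If the classifiers still all have the same performance, something is wrong.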
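A minimal sketch of that wider sweep follows. It uses synthetic data from `make_classification` as a stand-in, since the real data was not shared, and a single decision tree instead of the full classifier list. It also fits PCA on the training split only, which is a deliberate change from the code in the question.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the real data (assumption: the real data is a
# binary classification problem with many numeric features).
X, y = make_classification(n_samples=2000, n_features=120, n_informative=30,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

scores = {}
for n in (1, 10, 20, 30, 100):
    # Fit PCA on the training split only, then apply it to both splits.
    pca = PCA(n_components=n).fit(X_train)
    clf = DecisionTreeClassifier(max_depth=5, random_state=0)
    clf.fit(pca.transform(X_train), y_train)
    scores[n] = clf.score(pca.transform(X_test), y_test)
    print(n, round(scores[n], 4))
```

If the test accuracy does not move at all across such a wide range of component counts, the classifiers are probably not using the transformed features.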

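Maybe also look at the output of the PCA transform of features and manually verify that nothing wacky is happening. Just run the same code and look at a histogram of each new feature; something strange might be going on.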
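One way to do that inspection without plotting is to print a coarse histogram per component. The sketch below uses hypothetical random data in which one column has a much larger scale than the others; that is an assumption made for illustration, not something known about the original data.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Hypothetical stand-in data: 5 columns, one on a much larger scale.
X = rng.normal(size=(1000, 5))
X[:, 0] *= 1000.0

X_pca = PCA(n_components=2).fit_transform(X)

for i in range(X_pca.shape[1]):
    # Ten equal-width bins over the range of this component.
    counts, edges = np.histogram(X_pca[:, i], bins=10)
    print(f"component {i}: min={edges[0]:.2f} max={edges[-1]:.2f}")
    print("  counts:", counts.tolist())
```

Components whose ranges differ by orders of magnitude, or histograms with almost all of the mass in a single bin, are worth a closer look.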

Check the explained variance and make sure the additional components are adding information: print(pca.explained_variance_ratio_)
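PCA is sensitive to feature scale, so one thing the explained variance ratio can reveal is a single large-scale column dominating the first component. The sketch below shows this on hypothetical data and compares it with the same data after `StandardScaler`. Whether the original features were scaled is not stated in the post, so this is only a possibility to rule out.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))
X[:, 0] *= 1000.0  # hypothetical: one column on a much larger scale

# PCA on the raw data: the large-scale column dominates the variance.
raw_ratio = PCA(n_components=5).fit(X).explained_variance_ratio_

# PCA after standardizing each column to zero mean and unit variance.
X_scaled = StandardScaler().fit_transform(X)
scaled_ratio = PCA(n_components=5).fit(X_scaled).explained_variance_ratio_

print("raw:   ", np.round(raw_ratio, 4))
print("scaled:", np.round(scaled_ratio, 4))
```

On the raw data almost all of the variance lands in the first component, so components 2 to 5 add very little. After scaling, the variance is spread across components.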

Given how similar the classifiers are on all 207 features, they might be seeing the same thing once you run PCA.

With default parameters (i.e. very simple classifiers), the classifiers might behave the same over 1 to 5 components, but it's unlikely.

Also make sure you are looping correctly (it looks like you are) and put in some sanity checks. Good luck!
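One such sanity check is a comparison against a majority-class baseline. The sketch below assumes that f_test is an F-beta score with beta = 0.5; the post does not show train_predict, so this is a guess. Under that assumption, a model that predicts the positive class for every row, on data with a positive rate of 0.8079, would score about 0.8402, which can be compared with the f_test values in the output above. The labels in the second half are hypothetical.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, fbeta_score

# Arithmetic check against the numbers reported in the post.
# Assumption: f_test is F-beta with beta=0.5 (train_predict is not shown).
pos_rate = 0.8079021551332182   # reported acc_test
beta2 = 0.5 ** 2
# Predicting positive everywhere gives precision = pos_rate and recall = 1.
f_all_positive = (1 + beta2) * pos_rate / (beta2 * pos_rate + 1)
print(f_all_positive)           # compare with the reported f_test

# The same check with scikit-learn on hypothetical labels.
rng = np.random.default_rng(0)
y = (rng.random(10_000) < 0.808).astype(int)
X = np.zeros((len(y), 1))       # DummyClassifier ignores the features

baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
pred = baseline.predict(X)
acc = accuracy_score(y, pred)
f_half = fbeta_score(y, pred, beta=0.5)
print(acc, f_half)
```

If a real classifier's accuracy and F-score both equal the baseline's, it is most likely predicting a single class for every sample. Printing `np.unique(clf.predict(X_test), return_counts=True)` for each classifier would confirm that.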