PCA performance collapses when applied to new data

Date: 2017-12-12 10:57:35

Tags: python scikit-learn pca

I'm using PCA for dimensionality reduction. My training data has 1200 records with 335 dimensions each. Here is the code I use to train the model:

from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import (accuracy_score, auc, precision_score,
                             recall_score, roc_curve)

# Load the training and validation sets (load_data is my own helper).
X, y = load_data(f_file1)
valid_X, valid_y = load_data(f_file2)

# Fit PCA on the training data, then apply the same projection
# to the validation data.
pca = PCA(n_components=n_compo, whiten=True)
X = pca.fit_transform(X)
valid_input = pca.transform(valid_X)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)
clf = DecisionTreeClassifier(criterion='entropy', max_depth=30,
          min_samples_leaf=2, class_weight={0: 10, 1: 1})  # imbalanced classes
clf.fit(X_train, y_train)

# Accuracy, recall, precision, and AUC on the train/test split.
print(clf.score(X_train, y_train)*100,
      clf.score(X_test, y_test)*100,
      recall_score(y_train, clf.predict(X_train))*100,
      recall_score(y_test, clf.predict(X_test))*100,
      precision_score(y_train, clf.predict(X_train))*100,
      precision_score(y_test, clf.predict(X_test))*100,
      auc(*roc_curve(y_train, clf.predict_proba(X_train)[:, 1], pos_label=1)[:-1])*100,
      auc(*roc_curve(y_test, clf.predict_proba(X_test)[:, 1], pos_label=1)[:-1])*100)

# The same metrics on the held-out validation set.
print(precision_score(valid_y, clf.predict(valid_input))*100,
      recall_score(valid_y, clf.predict(valid_input))*100,
      accuracy_score(valid_y, clf.predict(valid_input))*100,
      auc(*roc_curve(valid_y, clf.predict_proba(valid_input)[:, 1], pos_label=1)[:-1])*100)

Output:

99.80, 99.32, 99.87, 99.88, 99.74, 98.78, 99.99, 99.46
0.00, 0.00, 97.13, 49.98, 700.69

So recall and precision on the validation set are 0. Why does PCA seem to fail on the validation data, and is the model overfitting?

1 Answer:

Answer 0 (score: 1)

It is probably overfitting, because

max_depth=30

is far too deep.
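
A minimal sketch of how you could cross-validate the depth instead of fixing it (this assumes the X_train/y_train from your question, and the candidate depths are purely illustrative):

from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Search over a few candidate depths instead of hard-coding max_depth=30.
param_grid = {'max_depth': [3, 5, 10, 20, 30]}
search = GridSearchCV(
    DecisionTreeClassifier(criterion='entropy', min_samples_leaf=2,
                           class_weight={0: 10, 1: 1}),
    param_grid, scoring='roc_auc', cv=5)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)

A shallower tree that scores almost as well in cross-validation will usually generalize better to the held-out validation file.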

How did you choose the number of PCA components? You can find the optimal value via the eigenvector/eigenvalue approach:

import numpy as np
import matplotlib.pyplot as plt

data = data.values                  # pandas DataFrame -> numpy array
mean = np.mean(data.T, axis=1)      # per-feature mean
demeaned = data - mean              # center the data
evals, evecs = np.linalg.eig(np.cov(demeaned.T))  # eigendecomposition of the covariance matrix
order = evals.argsort()[::-1]       # sort eigenvalues in descending order

evals = evals[order]

plt.plot(evals)                     # scree plot of the eigenvalue spectrum
plt.grid(True)
plt.savefig('_!pca.png')

The optimal value is the x value at which the curve drops to nearly zero.
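
Scikit-learn's PCA exposes the same information via explained_variance_ratio_, so here is a minimal sketch of the same idea (the 95% variance threshold is an illustrative choice, and X is the training matrix from your question):

import numpy as np
from sklearn.decomposition import PCA

# Fit with all components, then find the smallest k that explains 95% of the variance.
pca = PCA().fit(X)
cumulative = np.cumsum(pca.explained_variance_ratio_)
n_compo = int(np.searchsorted(cumulative, 0.95)) + 1
print(n_compo, cumulative[n_compo - 1])

Passing a float such as PCA(n_components=0.95) asks scikit-learn to do this selection for you.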