Question

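I'm working on a basketball project. I have a lot of data on players' past performance, with 54 features. I've only just learned about PCA and z-scores (and I'm still fuzzy on both).

Can I use PCA to do feature selection on these features?

Thanks!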

Answer 1

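Well, running PCA and computing z-scores might get you there, but there are better ways to approach this kind of problem. Consider feature selection (one part of feature engineering): identify the features most relevant to your target (the dependent variable), and drop the irrelevant or minor ones that contribute little to it. That usually gives the model better overall accuracy. The example below ranks features by random forest importance on a sample credit dataset.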

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt


df = pd.read_csv("https://rodeo-tutorials.s3.amazonaws.com/data/credit-data-trainingset.csv")
df.head()

from sklearn.ensemble import RandomForestClassifier

features = np.array(['revolving_utilization_of_unsecured_lines',
                     'age', 'number_of_time30-59_days_past_due_not_worse',
                     'debt_ratio', 'monthly_income','number_of_open_credit_lines_and_loans', 
                     'number_of_times90_days_late', 'number_real_estate_loans_or_lines',
                     'number_of_time60-89_days_past_due_not_worse', 'number_of_dependents'])
clf = RandomForestClassifier()
clf.fit(df[features], df['serious_dlqin2yrs'])

# from the calculated importances, order them from most to least important
# and make a barplot so we can visualize what is/isn't important
importances = clf.feature_importances_
sorted_idx = np.argsort(importances)

padding = np.arange(len(features)) + 0.5
plt.barh(padding, importances[sorted_idx], align='center')
plt.yticks(padding, features[sorted_idx])
plt.xlabel("Relative Importance")
plt.title("Variable Importance")
plt.show()

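Just make whatever changes are needed (they are fairly obvious) to adapt this code to your own dataset.

Here are a couple of links that explain feature selection in more detail.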

https://github.com/WillKoehrsen/feature-selector/blob/master/Feature%20Selector%20Usage.ipynb

https://towardsdatascience.com/feature-selection-techniques-in-machine-learning-with-python-f24e7da3f36e

For reference, here is a good link for getting a better understanding of PCA.

https://scikit-learn.org/stable/auto_examples/decomposition/plot_pca_iris.html
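To make the difference between PCA and feature selection concrete, here is a minimal sketch. It assumes a synthetic 54-column dataset from `make_classification` as a stand-in for the player table (the data and the 90% variance threshold are illustrative assumptions, not from the question). Each principal component is a weighted mix of all original features, so PCA reduces dimensionality but does not tell you which original columns to keep.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a 54-feature table (assumption: not real player data).
X, y = make_classification(n_samples=500, n_features=54,
                           n_informative=8, random_state=0)

# PCA is scale-sensitive, so standardize each column first (z-scores).
X_scaled = StandardScaler().fit_transform(X)

# Keep as many components as needed to explain 90% of the variance.
pca = PCA(n_components=0.90, svd_solver="full")
X_reduced = pca.fit_transform(X_scaled)

n_components = pca.n_components_
loadings = pca.components_  # shape: (n_components, 54)

print("components kept:", n_components)
print("variance explained:", pca.explained_variance_ratio_.sum())

# Every component has a weight for every original feature, so no
# original column is actually dropped by PCA.
print("features with non-zero weight in component 0:",
      int(np.count_nonzero(loadings[0])))
```

If you need to keep interpretable original columns (for example, to say which stats matter), an importance-based approach like the random forest one above is usually a better fit than PCA.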

Also, here is a good link for getting a better understanding of z-scores.

Pandas - Compute z-score for all columns
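As a quick illustration of what a z-score is, here is a small sketch that standardizes every column of a DataFrame. The column names and values are made up for the example. A z-score rescales a column to mean 0 and standard deviation 1; on its own it does not rank or select features.

```python
import pandas as pd

# Hypothetical player stats (illustrative values only).
stats = pd.DataFrame({
    "points": [10.0, 20.0, 30.0, 25.0],
    "rebounds": [2.0, 4.0, 9.0, 5.0],
})

# z = (x - column mean) / column standard deviation.
# ddof=0 uses the population standard deviation, which matches
# what sklearn's StandardScaler does.
z = (stats - stats.mean()) / stats.std(ddof=0)

print(z)
```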

Answer 2

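Well, it depends on the feature importances and on the scores you get (e.g., accuracy, F1 score, ROC AUC). If the model is overfitting, you can drop the less important features.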

https://en.wikipedia.org/wiki/Curse_of_dimensionality
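One way to check whether dropping the less important features helps is to compare cross-validated scores with and without selection. The sketch below is only an illustration: it assumes synthetic data, a random forest, the F1 metric, and a "median importance" cutoff, none of which come from the question.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic stand-in for a 54-feature table (assumption: not real data).
X, y = make_classification(n_samples=400, n_features=54,
                           n_informative=8, random_state=0)

# Baseline: a model trained on all 54 features.
full_model = RandomForestClassifier(n_estimators=100, random_state=0)

# Alternative: keep only features whose importance is at or above the
# median, then train on those. Putting the selector inside the pipeline
# means selection is refit on each training fold (no leakage).
selected_model = make_pipeline(
    SelectFromModel(
        RandomForestClassifier(n_estimators=100, random_state=0),
        threshold="median",
    ),
    RandomForestClassifier(n_estimators=100, random_state=0),
)

full_scores = cross_val_score(full_model, X, y, cv=5, scoring="f1")
selected_scores = cross_val_score(selected_model, X, y, cv=5, scoring="f1")

print("all features, mean F1:     ", full_scores.mean())
print("selected features, mean F1:", selected_scores.mean())
```

If the score with fewer features is about the same or better, the dropped features were probably not contributing much.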

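PCA isn't necessary for this. Besides ASH's answer, you can also use other tree-based models to find feature importances. Just don't forget to scale the features before modeling whenever the method is scale-sensitive (PCA, or coefficient-based importances from linear models); without scaling, the importance results can be distorted. Tree-based importances themselves are largely unaffected by feature scaling.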
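As a sketch of using a different tree-based model, the example below fits a gradient boosting classifier and ranks features two ways: by the model's built-in impurity-based importances, and by permutation importance on held-out data. The synthetic dataset and the `feature_i` names are assumptions for illustration only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data with hypothetical feature names (not real player stats).
X, y = make_classification(n_samples=400, n_features=10, n_informative=3,
                           n_redundant=0, random_state=0)
feature_names = np.array([f"feature_{i}" for i in range(X.shape[1])])

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)

# Built-in importances, sorted from most to least important.
impurity_rank = feature_names[np.argsort(model.feature_importances_)[::-1]]

# Permutation importance: how much the test score drops when one
# column is shuffled. Computed on data the model did not train on.
perm = permutation_importance(model, X_test, y_test,
                              n_repeats=10, random_state=0)
perm_rank = feature_names[np.argsort(perm.importances_mean)[::-1]]

print("impurity-based ranking:", impurity_rank[:5])
print("permutation ranking:   ", perm_rank[:5])
```

When the two rankings broadly agree, that is some reassurance that the top features really matter rather than being an artifact of one importance measure.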
