Question

我正在使用RandomForestClassifier实现具有二进制结果的分类任务，并且我知道进行数据预处理以提高准确性得分的重要性。特别是，我的数据集包含100多个特征和将近4000个实例，并且我想执行降维技术以避免过度拟合，因为数据中存在大量噪声。

对于这些任务，我通常使用经典的特征选择方法（过滤器，包装器，特征重要性），但最近我阅读了有关结合主成分分析（PCA）（第一步），然后在转换后的数据集中进行特征选择的方法。 / p>

我的问题如下：对数据执行PCA之后，是否应该使用一种特定的功能选择方法？特别是，我想了解的是在数据上使用PCA是否会使某些特定的特征选择技术无效或效率较低。

Answer 1

让我们从何时开始使用PCA开始。

当您不确定数据的哪一部分会影响准确性时，PCA最为有用。

让我们考虑一下面部识别任务。一眼就能说出最关键的像素吗？

例如：Olivetti面对。 40个人，深色均匀背景，变化的光照，面部表情（睁眼/闭眼，微笑/不微笑）和面部细节（眼镜/不戴眼镜）。

所以，如果我们看一下像素之间的相关性：

from sklearn.datasets import fetch_olivetti_faces
from numpy import corrcoef
from numpy import zeros_like
from numpy import triu_indices_from
from matplotlib.pyplot import figure
from matplotlib.pyplot import get_cmap
from matplotlib.pyplot import plot
from matplotlib.pyplot import colorbar
from matplotlib.pyplot import subplots
from matplotlib.pyplot import suptitle
from matplotlib.pyplot import imshow
from matplotlib.pyplot import xlabel
from matplotlib.pyplot import ylabel
from matplotlib.pyplot import savefig
from matplotlib.image import imread
import seaborn


olivetti = fetch_olivetti_faces()

X = olivetti.images  # Train
y = olivetti.target  # Labels

X = X.reshape((X.shape[0], X.shape[1] * X.shape[2]))

seaborn.set(font_scale=1.2)
seaborn.set_style("darkgrid")
mask = zeros_like(corrcoef(X_resp))
mask[triu_indices_from(mask)] = True
with seaborn.axes_style("white"):
    f, ax = subplots(figsize=(20, 15))
    ax = seaborn.heatmap(corrcoef(X), 
                         annot=True, 
                         mask=mask, 
                         vmax=1,
                         vmin=0,
                         square=True, 
                         cmap="YlGnBu",
                         annot_kws={"size": 1})
    
savefig('heatmap.png')

从上面您可以告诉我哪些像素对分类最重要？

但是，如果我问你，“能否请你告诉我慢性肾脏病的最重要特征？”

您可以一目了然地告诉我：

如果我们从人脸识别任务中恢复过来，我们真的需要所有像素进行分类吗？

不，我们不。

上面您只能看到63像素，足以将一张脸识别为人。

请注意，识别一张脸而不是识别脸足以满足63像素的要求。您需要更多像素才能区分人脸。

所以我们要做的是减小尺寸。您可能想了解有关Curse of dimensionality

的更多信息

好，所以我们决定使用PCA，因为我们不需要面部图像的每个像素。我们必须减小尺寸。

为了使视觉上易于理解，我使用2维。

def projection(obj, x, x_label, y_label, title, class_num=40, sample_num=10, dpi=300):
    x_obj = obj.transform(x)
    idx_range = class_num * sample_num
    fig = figure(figsize=(6, 3), dpi=dpi)
    ax = fig.add_subplot(1, 1, 1)
    c_map = get_cmap(name='jet', lut=class_num)
    scatter = ax.scatter(x_obj[:idx_range, 0], x_obj[:idx_range, 1], c=y[:idx_range],
                         s=10, cmap=c_map)
    ax.set_xlabel(x_label)
    ax.set_ylabel(y_label)
    ax.set_title(title.format(class_num))
    colorbar(mappable=scatter)
    


pca_obj = PCA(n_components=2).fit(X)
x_label = "First Principle Component"
y_label = "Second Principle Component"
title = "PCA Projection of {} people"
projection(obj=pca_obj, x=X, x_label=x_label, y_label=y_label, title=title)

如您所见，具有2个组件的PCA不足以进行区分。

那么您需要多少个组件？

def display_n_components(obj):
    figure(1, figsize=(6,3), dpi=300)
    plot(obj.explained_variance_, linewidth=2)
    xlabel('Components')
    ylabel('Explained Variaces')


pca_obj2 = PCA().fit(X)
display_n_components(pca_obj2)

您需要100个组件来进行良好的区分。

现在我们需要拆分训练和测试集。

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.decomposition import PCA
from sklearn.metrics import accuracy_score


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

X_train = X_train.reshape((X_train.shape[0], X.shape[1] * X.shape[2])) 
X_test = X_test.reshape((X_test.shape[0], X.shape[1] * X.shape[2]))

pca = PCA(n_components=100).fit(X)
X_pca_tr = pca.transform(X_train)
X_pca_te = pca.transform(X_test)

forest1 = RandomForestClassifier(random_state=42)
forest1.fit(X_pca_tr, y_train)
y_pred = forest1.predict(X_pca_te)
print("\nAccuracy:{:,.2f}%".format(accuracy_score(y_true=y_test, y_pred=y_pred_)*100))

精度将是：

您可能想知道，PCA是否会提高准确性？

答案是肯定的。

没有PCA：

PCA之后的最佳特征选择技术？

1 个答案: