Question

I am learning about chi2 for feature selection and came across code like this.

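However, my understanding of chi2 was that a higher score means the feature is more independent (and therefore less useful to the model), so we would be interested in the features with the lowest scores. Yet with scikit-learn's SelectKBest, the selector returns the features with the highest chi2 scores. Is my understanding of the chi2 test incorrect? Or does the chi2 score in sklearn produce something other than the chi2 statistic?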

See the code below for what I mean (mostly copied from the link above, except for the end).

import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

# Load iris data
iris = load_iris()

# Create features and target
X = iris.data
y = iris.target

# Convert to categorical data by converting data to integers
X = X.astype(int)

# Select two features with highest chi-squared statistics
chi2_selector = SelectKBest(chi2, k=2)
chi2_selector.fit(X, y)

# Look at scores returned from the selector for each feature
chi2_scores = pd.DataFrame(list(zip(iris.feature_names, chi2_selector.scores_, chi2_selector.pvalues_)), columns=['ftr', 'score', 'pval'])
chi2_scores

# You can see that the k best features returned by SelectKBest
# are the two features with the _highest_ scores
kbest = np.asarray(iris.feature_names)[chi2_selector.get_support()]
kbest

Answer 1

You have it backwards.

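The null hypothesis of the chi2 test is that "the two categorical variables are independent". A higher value of the chi2 statistic is therefore stronger evidence against independence, meaning "the two categorical variables are dependent", and the feature is more useful for classification.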
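To make the direction of the statistic concrete, here is a minimal sketch using scipy.stats.chi2_contingency on two made-up 2x2 contingency tables (the counts are illustrative assumptions, not data from the question). Note that sklearn's chi2 computes its statistic from feature sums per class rather than from a contingency table like this, so the sketch only illustrates the general test of independence.

```python
from scipy.stats import chi2_contingency

# Hypothetical counts. Rows: value of a categorical feature; columns: class label.
independent = [[25, 25], [25, 25]]  # feature value tells us nothing about the class
dependent = [[45, 5], [5, 45]]      # feature value strongly predicts the class

# correction=False turns off Yates' continuity correction for 2x2 tables
stat_ind, p_ind, _, _ = chi2_contingency(independent, correction=False)
stat_dep, p_dep, _, _ = chi2_contingency(dependent, correction=False)

print(f"independent: chi2={stat_ind:.2f}, p={p_ind:.3g}")
print(f"dependent:   chi2={stat_dep:.2f}, p={p_dep:.3g}")
```

The table in which the feature and the class are related gets the larger statistic and the smaller p-value, which is why a selector keeps the highest scores.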

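SelectKBest gives you the best two (k = 2) features based on the highest chi2 values. So you should take the features it returns, rather than the "other features" left out by the chi2 selector.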

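You are correct to get the chi2 statistics from chi2_selector.scores_ and the best features from chi2_selector.get_support(). Based on the chi2 test of independence, it will give you "petal length (cm)" and "petal width (cm)" as the top two features. Hope this clarifies the algorithm.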
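As a quick check of this, the following sketch reuses the setup from the question (iris data truncated to integers, k=2) and compares the ranking by score with the ranking by p-value. The variable names are my own; the only assumption is that scores_ and pvalues_ behave as in current scikit-learn releases.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

iris = load_iris()
X = iris.data.astype(int)  # same integer conversion as in the question
y = iris.target

selector = SelectKBest(chi2, k=2).fit(X, y)

# Rank features from most to least dependent on the class
order_by_score = np.argsort(selector.scores_)[::-1]  # highest score first
order_by_pval = np.argsort(selector.pvalues_)        # lowest p-value first

selected = np.asarray(iris.feature_names)[selector.get_support()]
print(selected)
print(order_by_score, order_by_pval)
```

Both rankings agree, because every feature is tested with the same degrees of freedom, so the highest score always corresponds to the lowest p-value.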
