Sklearn feature selection in pandas

Date: 2016-11-15 01:46:26

Tags: python pandas machine-learning scikit-learn

data:

   children     pet    salary
0    4.0        cat     90
1    6.0        dog     24
2    3.0        dog     44
3    3.0        fish    27
4    2.0        cat     32
5    3.0        dog     59
6    5.0        cat     36
7    4.0        fish    27

code:

 from sklearn_pandas import DataFrameMapper, cross_val_score
 from sklearn.feature_selection import SelectKBest, chi2
 mapper_fs = DataFrameMapper([(['children','salary'], SelectKBest(chi2, k=2))])
 mapper_fs.fit_transform(data[['children','salary']], data['pet'])

result:

 array([[ 90.],
   [ 24.],
   [ 44.],
   [ 27.],
   [ 32.],
   [ 59.],
   [ 36.],
   [ 27.]])

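I am trying to apply sklearn feature selection to some test pandas data, but I cannot interpret the result. I took the code from the official documentation. Please suggest how I should represent the result. If I have n columns in a pandas DataFrame, how do I select the best k out of all the columns?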

1 answer:

Answer 0 (score: 1)

If you are trying to select the k best features of a dataset, I am fairly sure you are going about it the wrong way, for several reasons:

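  • DataFrameMapper is completely useless here
  • You ask for the k=2 best features of a dataset that only has 2 features, so nothing is actually filtered out
  • You need to encode the categorical column data['pet'] before passing it to the fit function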
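
For the last point, one possible way to encode the pet column is scikit-learn's LabelEncoder. This is only a sketch: the pet values are copied from the question, and the variable names pet, encoder and y are chosen here for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# The 'pet' column from the question's data
pet = pd.Series(
    ["cat", "dog", "dog", "fish", "cat", "dog", "cat", "fish"], name="pet"
)

# LabelEncoder maps each distinct label to an integer code;
# the classes are stored in sorted order in encoder.classes_
encoder = LabelEncoder()
y = encoder.fit_transform(pet)

print(list(encoder.classes_))  # the labels, in the order of their codes
print(y.tolist())              # one integer code per row
```

encoder.inverse_transform(y) maps the integer codes back to the original labels if you need them later.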

Here is how you should do it:

from sklearn.feature_selection import SelectKBest, chi2

X = ...  # your DataFrame with n columns (chi2 needs non-negative numeric features)
y = ...  # target values, encoded if categorical
# Instantiate your selector
selector = SelectKBest(chi2, k=...)  # k < n, try something like int(round(n / 10.))
# Fit it to your data
selector.fit(X, y)  # returns the selector itself, now fitted
# To get the reduced data, call selector.transform(X),
# or use fit_transform(X, y) to fit and transform in one step

# At this point you have reduced the dimensionality of your feature space and can go on to perform a classification
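
To make the result easier to read, the fitted selector's get_support() mask can be mapped back to the DataFrame's column names. The sketch below rebuilds the data from the question; k=1 and the names mask, selected_columns and X_selected are assumptions made for illustration, not part of the original code.

```python
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.preprocessing import LabelEncoder

# Data copied from the question
data = pd.DataFrame({
    "children": [4.0, 6.0, 3.0, 3.0, 2.0, 3.0, 5.0, 4.0],
    "pet": ["cat", "dog", "dog", "fish", "cat", "dog", "cat", "fish"],
    "salary": [90, 24, 44, 27, 32, 59, 36, 27],
})

X = data[["children", "salary"]]               # numeric, non-negative features
y = LabelEncoder().fit_transform(data["pet"])  # encoded categorical target

# k=1 is an illustrative choice: keep the single best-scoring feature
selector = SelectKBest(chi2, k=1)
selector.fit(X, y)

# get_support() returns a boolean mask over the input columns
mask = selector.get_support()
selected_columns = X.columns[mask]

# Keep the result as a DataFrame with readable column names
X_selected = X.loc[:, selected_columns]

print(dict(zip(X.columns, selector.scores_)))  # chi2 score per column
print(selected_columns.tolist())
print(X_selected.head())
```

On this data the salary column scores higher than children, which is consistent with the array shown in the question. With only 8 rows, the scores should not be read as meaningful evidence about either feature.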

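A word of advice: if you do not know how something works, try reading the documentation or looking up some tutorials online. I have never seen feature selection done with DataFrameMapper anywhere online, apart from yours...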