data:
   children   pet  salary
0       4.0   cat      90
1       6.0   dog      24
2       3.0   dog      44
3       3.0  fish      27
4       2.0   cat      32
5       3.0   dog      59
6       5.0   cat      36
7       4.0  fish      27
code:
from sklearn_pandas import DataFrameMapper, cross_val_score
from sklearn.feature_selection import SelectKBest, chi2
mapper_fs = DataFrameMapper([(['children','salary'], SelectKBest(chi2, k=2))])
mapper_fs.fit_transform(data[['children','salary']], data['pet'])
result:
array([[ 90.],
       [ 24.],
       [ 44.],
       [ 27.],
       [ 32.],
       [ 59.],
       [ 36.],
       [ 27.]])
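I am trying to apply sklearn feature selection to some test pandas data, but I cannot make sense of the result. I took the code from the official documentation. Please suggest how I should interpret the result. Also, if I have n columns in a pandas DataFrame, how do I select the best k out of all the columns?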
Answer 0 (score: 1)
If you are trying to select the k best features of a dataset, I am sure you are going about it the wrong way, for several reasons:
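- DataFrameMapper is completely unnecessary here.
- With k=2 you are asking for the 2 best features out of only 2 columns, so nothing is actually filtered out.
- You should encode data['pet'] before feeding it to the fit function.
Here is what you should do instead: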
from sklearn.feature_selection import SelectKBest, chi2
X = ...  # your dataframe with n columns (chi2 needs non-negative values)
y = ...  # target values, encoded if categorical
# Instantiate your selector
selector = SelectKBest(chi2, k=...)  # k < n, try something like int(round(n/10.))
# Fit it to your data
selector.fit(X, y)  # returns the selector itself, now fitted
# Call selector.transform(X) to get the reduced data, or do both steps at once with fit_transform(X, y)
# Once the data is transformed, the feature space has k columns and you can train a classifier on it
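To make the steps above concrete, here is a minimal sketch run on the sample data from the question. The choice of k=1, the use of LabelEncoder for the target, and the variable names (selected_columns, X_selected, scores) are my own assumptions for illustration, not something the question prescribes. It also shows one way to map the selector's output back to column names through get_support(), which is probably what you want when asking how to represent the result.

```python
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.preprocessing import LabelEncoder

# Sample data copied from the question
data = pd.DataFrame({
    "children": [4.0, 6.0, 3.0, 3.0, 2.0, 3.0, 5.0, 4.0],
    "pet": ["cat", "dog", "dog", "fish", "cat", "dog", "cat", "fish"],
    "salary": [90, 24, 44, 27, 32, 59, 36, 27],
})

X = data[["children", "salary"]]               # chi2 requires non-negative features
y = LabelEncoder().fit_transform(data["pet"])  # encode the categorical target as integers

# k=1 is an illustrative choice: keep the single best of the two columns
selector = SelectKBest(chi2, k=1)
selector.fit(X, y)

# get_support() returns a boolean mask over the input columns
selected_columns = X.columns[selector.get_support()]

# Rebuild a labelled DataFrame from the plain array returned by transform()
X_selected = pd.DataFrame(
    selector.transform(X), columns=selected_columns, index=X.index
)

# Per-column chi2 scores, useful for seeing why a column was kept
scores = pd.Series(selector.scores_, index=X.columns)

print(scores)
print(X_selected)
```

On this data the salary column gets the higher chi2 score, so it is the one that survives, which matches the single-column array shown in the question.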
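A word of advice:
If you do not know how something works, try reading the documentation or looking for tutorials online. I have never seen feature selection done with DataFrameMapper anywhere online, except in your post...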