How to get the names of the selected variables in feature selection

Time: 2017-08-05 17:22:03

Tags: python pandas scikit-learn feature-selection

I am trying to find out the names of the variables chosen by feature selection. I have a dataset like this:

import pandas as pd
df = pd.DataFrame({'a': [1, 0, 1, 0, 1, 0, 1, 0, 1, 0],
                   'b': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'bar'],
                   'c': ['foo', 'bar', 'bar', 'foo', 'foo', 'bar', 'bar', 'foo', 'foo', 'bar'],
                   'd': ['d', 'd', 'b', 'a', 'd', 'd', 'a', 'b', 'd', 'a']})

Then:

# .ix has been removed from pandas; .iloc selects by position
X, y = df.iloc[:, 1:], df.iloc[:, 0]
X_dummy = pd.get_dummies(X)

from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
X_new = SelectKBest(chi2, k=4).fit_transform(X_dummy, y)
X_new

array([[0, 1, 0, 1],
       [1, 0, 0, 1],
       [0, 1, 0, 0],
       [1, 0, 1, 0],
       [0, 1, 0, 1],
       [1, 0, 0, 1],
       [0, 1, 1, 0],
       [1, 0, 0, 0],
       [0, 1, 0, 1],
       [1, 0, 1, 0]], dtype=uint8)

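I get an array back, but I want to know which variables (b, c, d, or rather their dummy columns) should go into the model. How can I find this out? Thanks!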

1 answer:

Answer 0: (score: 1)

You can use the scores_ attribute of the fitted selector:

>>> kbest = SelectKBest(chi2, k=4)
>>> X_new = kbest.fit_transform(X_dummy, y)
>>> X_dummy.columns[kbest.scores_.argsort()[::-1][:4]]
Index(['b_foo', 'b_bar', 'd_a', 'd_d'], dtype='object')
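
As a complement to the scores_ approach, here is a minimal sketch that reads the selection directly from the fitted selector through its get_support() method. It returns a boolean mask over the input columns, so the names come back in the same order as the columns of X_new, and the mask reflects the selector's own choice when several scores tie. The names mask, selected_columns, X_selected, and scores are illustrative, and dtype=int is passed to get_dummies only to keep the dummy columns numeric across pandas versions.

```python
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

df = pd.DataFrame({'a': [1, 0, 1, 0, 1, 0, 1, 0, 1, 0],
                   'b': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'bar'],
                   'c': ['foo', 'bar', 'bar', 'foo', 'foo', 'bar', 'bar', 'foo', 'foo', 'bar'],
                   'd': ['d', 'd', 'b', 'a', 'd', 'd', 'a', 'b', 'd', 'a']})

# Target is the first column; the remaining columns are one-hot encoded.
y = df.iloc[:, 0]
X_dummy = pd.get_dummies(df.iloc[:, 1:], dtype=int)

kbest = SelectKBest(chi2, k=4).fit(X_dummy, y)

# Boolean mask with one entry per input column: True means "kept".
mask = kbest.get_support()
selected_columns = X_dummy.columns[mask]

# Same data as fit_transform would return, but with column labels attached.
X_selected = X_dummy.loc[:, mask]

# Pair every score with its column name to inspect the full ranking.
scores = pd.Series(kbest.scores_, index=X_dummy.columns).sort_values(ascending=False)

print(list(selected_columns))
print(scores)
```

Keeping X_selected as a DataFrame means the labels travel with the data into the next modelling step, and the scores Series shows how close the features that were not selected came to the cutoff.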