Simplest way to get feature names after running SelectKBest in Scikit Learn

Asked: 2016-10-03 19:35:15

Tags: python pandas scikit-learn feature-selection

I want to do supervised learning.

So far I know how to run supervised learning on all of the features.

However, I would also like to experiment with only the K best features.

I read the documentation and found that Scikit Learn has a SelectKBest method.

Unfortunately, I am not sure how to create a new dataframe after finding those best features:

Let's assume I want to experiment with the 5 best features:

from sklearn.feature_selection import SelectKBest, f_classif
select_k_best_classifier = SelectKBest(score_func=f_classif, k=5).fit_transform(features_dataframe, targeted_class)

Now, if I add the next line:

dataframe = pd.DataFrame(select_k_best_classifier)

I will get a new dataframe without feature names (only indices from 0 to 4).

Should I replace it with:

dataframe = pd.DataFrame(fit_transformed_features, columns=feature_names)

My question is: how do I create the feature_names list?

I know that I should use select_k_best_classifier.get_support(), which returns an array of booleans.

The True values in the array mark the indices of the selected columns.

How should I combine this boolean array with the array of all feature names, which I can get via:

feature_names = list(features_dataframe.columns.values)

6 Answers:

Answer 0 (score: 33)

This worked for me and doesn't require looping.

# Create and fit selector
selector = SelectKBest(f_classif, k=5)
selector.fit(features_df, target)
# Get the integer indices of the columns to keep
cols = selector.get_support(indices=True)
# Create a new dataframe with only the desired columns, or overwrite the existing one
features_df_new = features_df.iloc[:, cols]
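To make this concrete, here is a minimal end-to-end sketch on the iris dataset (the dataset and k=2 are illustrative choices, not part of the original answer):

import pandas as pd
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

iris = load_iris()
features_df = pd.DataFrame(iris.data, columns=iris.feature_names)
target = iris.target

selector = SelectKBest(f_classif, k=2)
selector.fit(features_df, target)
cols = selector.get_support(indices=True)    # integer positions of the kept columns
features_df_new = features_df.iloc[:, cols]  # column names are preserved
print(features_df_new.columns.tolist())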

Answer 1 (score: 17)

For me this code works fine and is more "pythonic":

mask = select_k_best_classifier.get_support()    # boolean mask over all features
new_features = features_dataframe.columns[mask]  # names of the selected features
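Note the naming: this assumes select_k_best_classifier is the fitted SelectKBest instance, not the array returned by fit_transform as in the question's snippet. With the fitted selector, the mask can also be applied back to the dataframe in one step (a minimal sketch, not from the original answer):

dataframe = features_dataframe.loc[:, select_k_best_classifier.get_support()]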

Answer 2 (score: 16)

You can do the following:

mask = select_k_best_classifier.get_support()  # list of booleans
new_features = []  # the list of your K best features

for is_selected, feature in zip(mask, feature_names):
    if is_selected:
        new_features.append(feature)

Then change the feature names:

dataframe = pd.DataFrame(fit_transformed_features, columns=new_features)
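The loop is equivalent to a one-line list comprehension (a compact variant, not part of the original answer):

new_features = [feature for is_selected, feature in zip(mask, feature_names) if is_selected]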

Answer 3 (score: 5)

The following code will help you find the top K features together with their F-scores. Let X be a pandas dataframe whose columns are all the features, and y be the list of class labels.

import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif
# Suppose we select the 5 features with the top 5 F-scores
selector = SelectKBest(f_classif, k=5)
# New dataframe with the selected features, for later use in the classifier.
# fit() alone also works, if you only want the feature names and their scores.
X_new = selector.fit_transform(X, y)
names = X.columns.values[selector.get_support()]
scores = selector.scores_[selector.get_support()]
names_scores = list(zip(names, scores))
ns_df = pd.DataFrame(data = names_scores, columns=['Feat_names', 'F_Scores'])
#Sort the dataframe for better visualization
ns_df_sorted = ns_df.sort_values(['F_Scores', 'Feat_names'], ascending = [False, True])
print(ns_df_sorted)
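For comparison, the same ranking can be extracted more compactly with a pandas Series built from the fitted selector above (a sketch, not part of the original answer):

scores = pd.Series(selector.scores_, index=X.columns)
print(scores.nlargest(5))  # top 5 features by F-score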

Answer 4 (score: 0)

There is another alternative approach, but it is not as fast as the solutions above: inverse_transform rebuilds the full-width feature matrix with the dropped columns zero-filled, so the kept columns are the ones whose variance is non-zero.

# Use the selector to retrieve the best features
X_new = select_k_best_classifier.fit_transform(train[feature_cols],train['is_attributed'])

# Get back the kept features as a DataFrame with dropped columns as all 0s
selected_features = pd.DataFrame(select_k_best_classifier.inverse_transform(X_new),
                                 index=train.index,
                                 columns=feature_cols)
selected_columns = selected_features.columns[selected_features.var() != 0]

Answer 5 (score: 0)

Select the best 10 features according to chi2:

from sklearn.feature_selection import SelectKBest, chi2

KBest = SelectKBest(chi2, k=10).fit(X, y) 

Get the selected features via get_support():

f = KBest.get_support(indices=True)  # indices of the most important features

Create a new df called X_new:

X_new = X[X.columns[f]]  # final features
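Note that chi2 requires non-negative feature values (e.g. counts or frequencies). A self-contained sketch with synthetic count data (the data and column names are illustrative, not from the original answer):

import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.integers(0, 10, size=(100, 20)),
                 columns=[f"feat_{i}" for i in range(20)])
y = rng.integers(0, 2, size=100)

KBest = SelectKBest(chi2, k=10).fit(X, y)
f = KBest.get_support(indices=True)  # indices of the selected columns
X_new = X[X.columns[f]]              # final features
print(X_new.columns.tolist())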