SelectKBest: varying computation time

Time: 2018-06-06 11:29:01

Tags: machine-learning scikit-learn

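I am trying to do feature selection on a synthetic multilabel dataset. I observed that passing the complete dataset to SelectKBest takes much longer than passing one feature at a time. In the example below, only one label (target variable) is considered.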

import numpy as np
import pandas as pd
from sklearn.datasets import make_multilabel_classification
from sklearn.feature_selection import chi2, SelectKBest, f_classif

# Generate a multilabel dataset
x, y = make_multilabel_classification(n_samples=40000, n_features = 1000, sparse = False, n_labels = 4, n_classes = 9,
  return_indicator = 'dense', allow_unlabeled = True, random_state = 1000)

X_df = pd.DataFrame(x)
y_df = pd.DataFrame(y)

%%time
# Approach 1: fit the selector on one feature at a time
selected_features2 = []
for label in y_df.columns.tolist()[0:1]:
    selector = SelectKBest(f_classif, k='all')
    selected_features = [] 
    for ftr in X_df.columns.tolist():
        selector.fit(X_df[[ftr]], y_df[label])
        selected_features.extend(np.round(selector.scores_,4))
  

CPU times: user 3.2 s, sys: 0 ns, total: 3.2 s
Wall time: 3.18 s

%%time
# Approach 2: fit the selector once on the full DataFrame
sel_features = []
for label in y_df.columns.tolist()[0:1]:
    selector = SelectKBest(f_classif, k='all')
    selector.fit(X_df, y_df[label])
    sel_features.extend(np.round(selector.scores_,4))
  

CPU times: user 208 ms, sys: 37.2 s, total: 37.4 s
Wall time: 37.4 s

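%%time
# Approach 3: same as above, but pass plain NumPy arrays instead of pandas objects.
# .values is used because DataFrame.as_matrix() was removed in pandas 1.0.
sel_features = []
for label in y_df.columns.tolist()[0:1]:
    selector = SelectKBest(f_classif, k='all')
    selector.fit(X_df.values, y_df[label].values)
    sel_features.extend(np.round(selector.scores_,4))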
  

CPU times: user 220 ms, sys: 35.4 s, total: 35.7 s
Wall time: 35.6 s

Why is there such a large difference in computation time?
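
To make the comparison easier to reproduce outside a notebook, here is a minimal sketch. It assumes a much smaller dataset (2000 samples and 50 features, sizes picked only so it runs quickly) and uses `time.perf_counter` in place of `%%time`. The names `x_small`, `y_small` and `target` are made up for this sketch. It checks that both approaches return the same F-scores and prints how long each takes. It does not explain the gap, and timings at this size may look nothing like the ones above.

```python
import time

import numpy as np
from sklearn.datasets import make_multilabel_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Assumption: a small dataset so the sketch runs quickly; the timings in the
# question were measured with 40000 samples and 1000 features.
x_small, y_small = make_multilabel_classification(
    n_samples=2000, n_features=50, n_classes=9, n_labels=4,
    allow_unlabeled=True, random_state=1000)
target = y_small[:, 0]  # first label only, as in the question

# Approach 1: fit the selector on one feature (column) at a time.
start = time.perf_counter()
per_feature_scores = []
for j in range(x_small.shape[1]):
    selector = SelectKBest(f_classif, k='all')
    selector.fit(x_small[:, [j]], target)
    per_feature_scores.extend(selector.scores_)
per_feature_time = time.perf_counter() - start

# Approach 2: fit the selector once on the full feature matrix.
start = time.perf_counter()
selector = SelectKBest(f_classif, k='all')
selector.fit(x_small, target)
full_scores = selector.scores_
full_time = time.perf_counter() - start

# f_classif scores each feature independently, so both should agree.
scores_match = np.allclose(per_feature_scores, full_scores, equal_nan=True)
print(f"scores match: {scores_match}")
print(f"per-feature: {per_feature_time:.3f} s, full matrix: {full_time:.3f} s")
```

Raising `n_samples` and `n_features` step by step toward the original sizes should show at what point the full-matrix fit starts to fall behind.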

0 answers:

No answers yet.