Question

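I am trying to perform feature selection on a synthetic multilabel dataset. I observed that passing the complete dataset to SelectKBest takes much longer than fitting it on one feature at a time. In the examples below, only a single label (target variable) is considered.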

import numpy as np
import pandas as pd
from sklearn.datasets import make_multilabel_classification
from sklearn.feature_selection import chi2, SelectKBest, f_classif

# Generate a multilabel dataset
x, y = make_multilabel_classification(n_samples=40000, n_features = 1000, sparse = False, n_labels = 4, n_classes = 9,
  return_indicator = 'dense', allow_unlabeled = True, random_state = 1000)

X_df = pd.DataFrame(x)
y_df = pd.DataFrame(y)

%%time
selected_features2 = [] 
for label in y_df.columns.tolist()[0:1]:
    selector = SelectKBest(f_classif, k='all')
    selected_features = [] 
    for ftr in X_df.columns.tolist():
        selector.fit(X_df[[ftr]], y_df[label])
        selected_features.extend(np.round(selector.scores_,4))

CPU times: user 3.2 s, sys: 0 ns, total: 3.2 s
Wall time: 3.18 s

%%time
sel_features = []
for label in y_df.columns.tolist()[0:1]:
    selector = SelectKBest(f_classif, k='all')
    selector.fit(X_df, y_df[label])
    sel_features.extend(np.round(selector.scores_,4))

CPU times: user 208 ms, sys: 37.2 s, total: 37.4 s
Wall time: 37.4 s

%%time
sel_features = []
for label in y_df.columns.tolist()[0:1]:
    selector = SelectKBest(f_classif, k='all')
    selector.fit(X_df.values, y_df[label].values)  # .as_matrix() was removed in pandas 1.0; .values is equivalent
    sel_features.extend(np.round(selector.scores_,4))

CPU times: user 220 ms, sys: 35.4 s, total: 35.7 s
Wall time: 35.6 s
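
For anyone who wants to reproduce this outside a notebook, here is a minimal sketch of the same comparison as a plain script. It uses `time.perf_counter` instead of `%%time`, and the sizes (2000 samples, 50 features) are my own smaller, hypothetical values chosen so it finishes quickly, so the timings will not match the numbers above. It also checks that both variants produce the same scores, which I would expect because `f_classif` scores each feature independently.

```python
import time

import numpy as np
from sklearn.datasets import make_multilabel_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Smaller, hypothetical sizes (not the 40000 x 1000 from the question).
X, Y = make_multilabel_classification(
    n_samples=2000, n_features=50, n_classes=9, n_labels=4,
    allow_unlabeled=True, random_state=1000,
)
y = Y[:, 0]  # first label only, as in the question

# Variant 1: fit the selector on one feature column at a time.
start = time.perf_counter()
per_feature_scores = []
for j in range(X.shape[1]):
    selector = SelectKBest(f_classif, k="all").fit(X[:, [j]], y)
    per_feature_scores.append(selector.scores_[0])
per_feature_scores = np.asarray(per_feature_scores)
per_feature_seconds = time.perf_counter() - start

# Variant 2: fit the selector once on the full feature matrix.
start = time.perf_counter()
full_scores = SelectKBest(f_classif, k="all").fit(X, y).scores_
full_seconds = time.perf_counter() - start

print(f"per-feature: {per_feature_seconds:.3f} s, full matrix: {full_seconds:.3f} s")
print("scores agree:", np.allclose(per_feature_scores, full_scores, equal_nan=True))
```

Scaling `n_samples` and `n_features` back up toward the original values should show at which size the full-matrix fit starts to take disproportionately long.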

Why is there such a large difference in computation time?

SelectKBest gives varying computation time

0 answers: