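I am trying to do feature selection on a synthetic multilabel dataset. I observed that passing the complete dataset to SelectKBest takes much longer than passing it one feature at a time. In the examples below, only one label (target variable) is considered.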
import numpy as np
import pandas as pd
from sklearn.datasets import make_multilabel_classification
from sklearn.feature_selection import chi2, SelectKBest, f_classif
# Generate a multilabel dataset
x, y = make_multilabel_classification(n_samples=40000, n_features=1000, sparse=False, n_labels=4, n_classes=9,
                                      return_indicator='dense', allow_unlabeled=True, random_state=1000)
X_df = pd.DataFrame(x)
y_df = pd.DataFrame(y)
%%time
selected_features2 = []
for label in y_df.columns.tolist()[0:1]:
    selector = SelectKBest(f_classif, k='all')
    selected_features = []
    for ftr in X_df.columns.tolist():
        selector.fit(X_df[[ftr]], y_df[label])
        selected_features.extend(np.round(selector.scores_, 4))
CPU times: user 3.2 s, sys: 0 ns, total: 3.2 s
Wall time: 3.18 s
%%time
sel_features = []
for label in y_df.columns.tolist()[0:1]:
    selector = SelectKBest(f_classif, k='all')
    selector.fit(X_df, y_df[label])
    sel_features.extend(np.round(selector.scores_, 4))
CPU times: user 208 ms, sys: 37.2 s, total: 37.4 s
Wall time: 37.4 s
%%time
sel_features = []
for label in y_df.columns.tolist()[0:1]:
    selector = SelectKBest(f_classif, k='all')
    # .values is used here in place of as_matrix(), which was removed in pandas 1.0
    selector.fit(X_df.values, y_df[label].values)
    sel_features.extend(np.round(selector.scores_, 4))
CPU times: user 220 ms, sys: 35.4 s, total: 35.7 s
Wall time: 35.6 s
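For anyone who wants to reproduce this outside a notebook, here is a minimal sketch of the same comparison as a plain script. It is only an illustration under a few assumptions: the dataset is shrunk to 2000 samples and 50 features so it finishes quickly, `time.perf_counter` stands in for `%%time`, and the helper names `scores_per_feature` and `scores_full` are made up for this example. It also checks that both approaches produce the same F-scores. On data this small the timing gap may look quite different from the numbers above.

```python
import time

import numpy as np
from sklearn.datasets import make_multilabel_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Assumption: a much smaller dataset than in the question, chosen only for speed.
x, y = make_multilabel_classification(n_samples=2000, n_features=50,
                                      n_classes=9, n_labels=4,
                                      allow_unlabeled=True, random_state=1000)
target = y[:, 0]  # consider only the first label, as in the question


def scores_per_feature(x, target):
    """Fit SelectKBest on one column at a time (hypothetical helper)."""
    scores = []
    for j in range(x.shape[1]):
        selector = SelectKBest(f_classif, k="all")
        selector.fit(x[:, [j]], target)  # x[:, [j]] keeps the 2-D shape
        scores.append(selector.scores_[0])
    return np.asarray(scores)


def scores_full(x, target):
    """Fit SelectKBest once on the whole feature matrix (hypothetical helper)."""
    selector = SelectKBest(f_classif, k="all")
    selector.fit(x, target)
    return selector.scores_


start = time.perf_counter()
per_feature = scores_per_feature(x, target)
t_per_feature = time.perf_counter() - start

start = time.perf_counter()
full = scores_full(x, target)
t_full = time.perf_counter() - start

# Both approaches score each feature independently, so the results should agree.
same_scores = np.allclose(per_feature, full, equal_nan=True)

print(f"one feature at a time: {t_per_feature:.4f} s")
print(f"full matrix at once:   {t_full:.4f} s")
print(f"scores match:          {same_scores}")
```

Raising `n_samples` and `n_features` toward the values in the question should show whether the gap appears on a given machine.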
Why is there such a big difference in computation time?