Question

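I am using sklearn with Python 3.6, and I noticed that predicting a single sample passed as a 1D numpy array with a Random Forest takes the same running time (~0.1 seconds) as predicting n samples passed as a 2D numpy array. It looks as if sklearn spends a fixed amount of time setting up the trees at each prediction step and then predicts almost instantly. Would that explain why the running time for a large 2D array is the same as for a 1D array?

This is the code I use to train the model: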

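import pickle  # cPickle does not exist on Python 3; pickle replaces it

from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(n_estimators=1, #or > 1 
        n_jobs=-1,
        random_state=2,
        max_depth=15,
        min_samples_leaf=1,
        verbose=0,
        max_features='auto'
        )

clf.fit(X_train, y_train)

# Persist the fitted model so the prediction script can load it later
with open('classifier.pkl', 'wb') as fid:
    pickle.dump(clf, fid)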

In my case, I have to predict the samples one at a time, in real time, inside a loop:

import pickle  # cPickle does not exist on Python 3; pickle replaces it

with open('classifier.pkl', 'rb') as fid:
    clf = pickle.load(fid)

for s in samples:
   #my feature extraction method
   pred = clf.predict(feature) #feature is a 1D np array containing features 
                               #computed for the sample s

Is this because I am using it the wrong way? Or is sklearn simply not optimized for predicting samples one by one?

Answer 1

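You are right: sklearn is heavily optimized for vectorized operations. You are using it correctly, but if you collect the features into one 2D array and call predict once, as follows, you should see a significant speedup:

features = np.zeros((len(samples), n_features))
for i, s in enumerate(samples):
    features[i] = feature_extraction(s)
preds = clf.predict(features)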
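To check the effect on your own machine, a rough timing comparison along the following lines may help. This is only a sketch: the data comes from `make_classification` as a stand-in for the real features, and the sample counts and `n_estimators=10` are arbitrary assumptions, not values from the question.

```python
import time

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the asker's data (an assumption, not the real features).
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, y_train, X_new = X[:1500], y[:1500], X[1500:1550]

clf = RandomForestClassifier(n_estimators=10, max_depth=15,
                             n_jobs=-1, random_state=2)
clf.fit(X_train, y_train)

# One predict() call per sample: any fixed per-call cost is paid 50 times.
start = time.perf_counter()
loop_preds = np.array([clf.predict(row.reshape(1, -1))[0] for row in X_new])
loop_seconds = time.perf_counter() - start

# One predict() call for the whole batch: the fixed cost is paid once.
start = time.perf_counter()
batch_preds = clf.predict(X_new)
batch_seconds = time.perf_counter() - start

print(f"loop:  {loop_seconds:.4f} s for {len(X_new)} samples")
print(f"batch: {batch_seconds:.4f} s for {len(X_new)} samples")
```

If the samples really must be predicted one at a time, it may also be worth measuring the loop again after `clf.set_params(n_jobs=1)`, since dispatching work to parallel workers adds overhead to every call. That is a suggestion to test, not something the answer above claims.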

