Question

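I've just started using this library and am having some trouble with RandomForestClassifier (I have read the docs but couldn't work it out).

My question is pretty simple. Say I have a training data set like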

A B C

1 2 3

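where A is the dependent variable (y) and B and C are the independent variables (x). Let's say the test set looks the same, but the order is

B A C

1 2 3

When I call forest.fit(train_data[0:,1:],train_data[0:,0]), do I then need to reorder the test set to match this order before running predict? (Ignore the fact that I need to remove the already-predicted y value (A), so let's just say B and C are out of order...)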

Answer 1

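Yes, you need to reorder them. Imagine a simpler case, linear regression. The algorithm computes a weight for each feature, so if, for example, feature 1 is unimportant, it will be assigned a weight close to 0.

If the order is different at prediction time, an important feature will be multiplied by that near-zero weight, and the prediction will be completely off.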
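To make the linear-regression argument concrete, here is a minimal sketch. It uses NumPy least squares rather than a scikit-learn estimator, and the toy data and variable names are assumptions chosen for illustration, not part of the original answer. The target depends only on the second column, so the first column ends up with a weight near 0, and swapping the columns at prediction time changes the result.

```python
import numpy as np

# Toy training data (hypothetical): y = 3 * second column, first column is irrelevant.
X_train = np.array([[5.0, 1.0],
                    [2.0, 2.0],
                    [7.0, 3.0],
                    [1.0, 4.0]])
y_train = 3.0 * X_train[:, 1]

# Fit one weight per column, by position (no intercept, to keep the sketch short).
weights, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)

sample_ok = np.array([10.0, 2.0])       # columns in the training order
sample_swapped = np.array([2.0, 10.0])  # same values, columns swapped

pred_ok = sample_ok @ weights            # uses the important value 2.0 -> about 6
pred_swapped = sample_swapped @ weights  # uses 10.0 in its place -> about 30

print(weights, pred_ok, pred_swapped)
```

The weights are tied to column positions, not to column names, which is why the swapped sample silently produces a different number instead of an error.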

Answer 2

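The answer above is correct. scikit-learn simply takes the data in whatever order you give it, so you have to make sure the columns are in the same order during training and prediction.

Here is a simple illustrative example:

Training time: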

import pandas as pd
from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier()
x = pd.DataFrame({
    'feature_1': [0, 0, 1, 1],
    'feature_2': [0, 1, 0, 1]
})
y = [0, 0, 1, 1]  # the label depends only on feature_1
model.fit(x, y)
# we now have a model that
# (i)  predicts 0 when x = [0, 0] or [0, 1], and
# (ii) predicts 1 when x = [1, 0] or [1, 1]

Prediction time:

# positive example
http_request_payload = {
    'feature_1': 0,
    'feature_2': 1
}
input_features = pd.DataFrame([http_request_payload])
model.predict(input_features) # this returns 0, as expected


# negative example
http_request_payload = {
    'feature_2': 1,    # notice that the order is jumbled up
    'feature_1': 0
}
input_features = pd.DataFrame([http_request_payload])
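model.predict(input_features) # this returns 1, when it should have returned 0.
# scikit-learn doesn't care about the key-value mapping of the features.
# it simply vectorizes the dataframe in whatever order it comes in.
# (this is the behaviour the answer was written against; recent scikit-learn
# releases that record the column names seen in fit may reject a reordered
# dataframe here instead. plain arrays and lists get no such check.)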

This is how I cache the column order during training so that it can be used at prediction time.

# training
import joblib  # you can also use pickle if you don't want to pip install joblib
import pandas as pd

x = pd.DataFrame([...])
column_order = list(x.columns)
model = SomeModel().fit(x, y)  # train model

# save the things that we need at prediction time
joblib.dump(model, 'my_model.joblib')
joblib.dump(column_order, 'column_order.txt')

# prediction: load the artifacts from disk
model = joblib.load('my_model.joblib')
column_order = joblib.load('column_order.txt')

# imaginary http request payload
request_payload = { 'feature_1': ..., 'feature_2': ... }

# build a one-row dataframe, then force the cached column order
# (DataFrame.append was removed in pandas 2.0, so reindex is used instead)
input_features = pd.DataFrame([request_payload]).reindex(columns=column_order)
input_features = input_features.fillna(0)  # handle any missing data however you like

model.predict(input_features.values.tolist())
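If pandas is not wanted at serving time, the same reordering can be sketched in plain Python. The helper name order_features, its fill_value parameter, and the sample payload below are hypothetical, introduced only for illustration; the idea is the one described above, arranging incoming values by the column order cached at training time.

```python
def order_features(payload, column_order, fill_value=0):
    """Return the payload's values as a list arranged in the training column order."""
    # Reject keys the model never saw, rather than silently dropping them.
    unknown = set(payload) - set(column_order)
    if unknown:
        raise KeyError(f"unexpected feature(s): {sorted(unknown)}")
    # Missing features fall back to fill_value (an assumption; pick what suits your data).
    return [payload.get(name, fill_value) for name in column_order]


column_order = ["feature_1", "feature_2"]   # hypothetical: as cached at training time
payload = {"feature_2": 1, "feature_1": 0}  # keys arrive in a jumbled order

row = order_features(payload, column_order)
print(row)  # [0, 1]
```

The resulting row could then be passed as model.predict([row]). Raising on unknown keys is a design choice of this sketch; the reindex approach above drops them instead.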

Scikitlearn - order of fit and predict inputs, does it matter?

2 answers: