scikit-learn: using a trained model to predict on new input

Date: 2020-02-11 14:02:22

Tags: python machine-learning scikit-learn

I have a dataset that looks like this:

| "Consignor Code" | "Consignee Code" | "Origin" | "Destination" | "Carrier Code" | 
|------------------|------------------|----------|---------------|----------------| 
| "6402106844"     | "66903717"       | "DKCPH"  | "CNPVG"       | "6402746387"   | 
| "6402106844"     | "66903717"       | "DKCPH"  | "CNPVG"       | "6402746387"   | 
| "6402106844"     | "6404814143"     | "DKCPH"  | "CNPVG"       | "6402746387"   | 
| "6402107662"     | "66974631"       | "DKCPH"  | "VNSGN"       | "6402746393"   | 
| "6402107662"     | "6404518090"     | "DKCPH"  | "THBKK"       | "6402746393"   | 
| "6402107662"     | "6404518090"     | "DKBLL"  | "THBKK"       | "6402746393"   | 
| "6408507648"     | "6403601344"     | "DKCPH"  | "USTPA"       | "66565231"     | 


I am trying to build my first ML model on it, using scikit-learn. Here is my code:

#Import the dependencies
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import make_scorer, accuracy_score
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.externals import joblib
from sklearn import preprocessing
import pandas as pd

#Import the dataset (A CSV file)
dataset = pd.read_csv('shipments.csv', header=0, skip_blank_lines=True)
#Drop any rows containing NaN values
dataset.dropna(subset=['Consignor Code', 'Consignee Code',
                       'Origin', 'Destination', 'Carrier Code'], inplace=True)

#Make sure the numeric-only code columns are stored as integers
dataset['Consignor Code'] = dataset['Consignor Code'].astype('int64')
dataset['Consignee Code'] = dataset['Consignee Code'].astype('int64')
dataset['Carrier Code'] = dataset['Carrier Code'].astype('int64')

#Define our target (What we want to be able to predict)
target = dataset.pop('Destination')

#Convert all our data to numeric values, so we can use the .fit function.
#For that, we use LabelEncoder
le = preprocessing.LabelEncoder()
target = le.fit_transform(list(target))
dataset['Origin'] = le.fit_transform(list(dataset['Origin']))
dataset['Consignor Code'] = le.fit_transform(list(dataset['Consignor Code']))
dataset['Consignee Code'] = le.fit_transform(list(dataset['Consignee Code']))
dataset['Carrier Code'] = le.fit_transform(list(dataset['Carrier Code']))

#Prepare the dataset.
X_train, X_test, y_train, y_test = train_test_split(
    dataset, target, test_size=0.3, random_state=0)


#Prepare the model and .fit it.
model = RandomForestClassifier()
model.fit(X_train, y_train)

#Make a prediction on the test set.
predictions = model.predict(X_test)

#Print the accuracy score.
print("Accuracy score: {}".format(accuracy_score(y_test, predictions)))

The code above currently returns:

Accuracy score: 0.7172413793103448

Now, my question may be silly, but how do I use my model to show me what it predicts for new data?

Consider the new input below. I want the model to predict its Destination:

"6408507648","6403601344","DKCPH","","66565231"

How do I query my model with this data and get the predicted Destination back?

2 Answers:

Answer 0 (score: 2)

Here is a complete example, including the prediction. The most important part is to define a separate label encoder for each feature, so that you can apply the same encoding to new data; otherwise you will run into errors (they may not be obvious at first, but you will notice them when you compute the accuracy):

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
import numpy as np
import pandas as pd

#Build the example dataset in memory (same rows as the question's CSV)
dataset = pd.DataFrame({'Consignor Code': ["6402106844","6402106844","6402106844","6402107662","6402107662","6402107662","6408507648"],
                        'Consignee Code': ["66903717","66903717","6404814143","66974631","6404518090","6404518090","6403601344"],
                        'Origin': ["DKCPH","DKCPH","DKCPH","DKCPH","DKCPH","DKBLL","DKCPH"],
                        'Destination': ["CNPVG","CNPVG","CNPVG","VNSGN","THBKK","THBKK","USTPA"],
                        'Carrier Code': ["6402746387","6402746387","6402746387","6402746393","6402746393","6402746393","66565231"]})

#Drop any rows containing NaN values
dataset.dropna(subset=['Consignor Code', 'Consignee Code',
                       'Origin', 'Destination', 'Carrier Code'], inplace=True)


#Define our target (What we want to be able to predict)
target = dataset.pop('Destination')

#Convert all our data to numeric values, so we can use the .fit function.
#For that, we use LabelEncoder
le_origin = preprocessing.LabelEncoder()
le_consignor = preprocessing.LabelEncoder()
le_consignee = preprocessing.LabelEncoder()
le_carrier = preprocessing.LabelEncoder()
le_target = preprocessing.LabelEncoder()
target = le_target.fit_transform(list(target))
dataset['Origin'] = le_origin.fit_transform(list(dataset['Origin']))
dataset['Consignor Code'] = le_consignor.fit_transform(list(dataset['Consignor Code']))
dataset['Consignee Code'] = le_consignee.fit_transform(list(dataset['Consignee Code']))
dataset['Carrier Code'] = le_carrier.fit_transform(list(dataset['Carrier Code']))

#Prepare the dataset.
X_train, X_test, y_train, y_test = train_test_split(
    dataset, target, test_size=0.3, random_state=42)


#Prepare the model and .fit it.
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)

#Make a prediction on the test set.
predictions = model.predict(X_test)

#Print the accuracy score.
print("Accuracy score: {}".format(accuracy_score(y_test, predictions)))

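#Encode a new raw observation with the same per-feature encoders that were fitted
#above (transform, not fit_transform), keeping the original column order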
new_input = ["6408507648","6403601344","DKCPH","66565231"]
fitted_new_input = np.array([le_consignor.transform([new_input[0]])[0],
                                le_consignee.transform([new_input[1]])[0],
                                le_origin.transform([new_input[2]])[0],
                                le_carrier.transform([new_input[3]])[0]])
new_predictions = model.predict(fitted_new_input.reshape(1,-1))

print(le_target.inverse_transform(new_predictions))

In the end, your tree predicts:

['THBKK']
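
If you need to score more than one ad-hoc row this way, the encoding and decoding steps can be wrapped in a small helper. This is a minimal sketch, not part of the original answer: it assumes the fitted encoders (le_consignor, le_consignee, le_origin, le_carrier, le_target) and model from the snippet above are in scope, the predict_destination name is my own, and transform will still raise a ValueError for any value the encoders never saw during fit.

def predict_destination(consignor, consignee, origin, carrier):
    #Feature order must match the training DataFrame:
    #Consignor Code, Consignee Code, Origin, Carrier Code
    encoded = np.array([le_consignor.transform([consignor])[0],
                        le_consignee.transform([consignee])[0],
                        le_origin.transform([origin])[0],
                        le_carrier.transform([carrier])[0]]).reshape(1, -1)
    #Predict the encoded class and map it back to the original airport code
    return le_target.inverse_transform(model.predict(encoded))[0]

print(predict_destination("6408507648", "6403601344", "DKCPH", "66565231"))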

Answer 1 (score: 1)

Here is a quick way to illustrate the idea. I wouldn't do it exactly this way in practice, and there may be some mistakes; for example, I think this will fail if the test set contains classes that were not seen during training.

#Prepare the dataset.
X_train, X_test, y_train, y_test = train_test_split(
    dataset, target, test_size=0.3, random_state=0)

#Convert all our data to numeric values, so we can use the .fit function.
#For that, we use LabelEncoder
le_target = preprocessing.LabelEncoder()
y_train = le_target.fit_transform(y_train)
y_test = le_target.transform(y_test)

# Now create a separate encoder for each of your features:
encoders = {}
for feature in ["Origin", "Consignor Code", "Consignee Code", "Carrier Code"]:
    # NOTE: The LabelEncoder docs state clearly that you shouldn't use it on your
    # inputs. I'm not going to get into that here, but be aware that it's not a
    # good encoding.
    encoders[feature] = preprocessing.LabelEncoder()
    X_train[feature] = encoders[feature].fit_transform(X_train[feature])
    X_test[feature] = encoders[feature].transform(X_test[feature])

#Prepare the model and .fit it.
model = RandomForestClassifier()
model.fit(X_train, y_train)

#Make a prediction on the test set.
predictions = model.predict(X_test)

le_target.inverse_transform(predictions)

The key concept here is to use a separate encoder for each feature, because each encoder object remembers how to encode that feature; that happens during the fit stage. You then call transform on any new data so that it is encoded consistently.
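
As the NOTE inside the loop points out, LabelEncoder is documented for target labels rather than input features, and transform raises an error on unseen categories. A hedged alternative sketch, not part of the original answer: it uses OrdinalEncoder with handle_unknown="use_encoded_value" (available in scikit-learn 0.24+) for the features, and assumes dataset and target are the raw, un-encoded pandas objects from the question (i.e. right after target = dataset.pop('Destination') and before any LabelEncoder calls).

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

#Split on the raw values; all encoding happens after the split
X_train, X_test, y_train, y_test = train_test_split(
    dataset, target, test_size=0.3, random_state=0)

#LabelEncoder for the target (its intended use)
le_target = LabelEncoder()
y_train_enc = le_target.fit_transform(y_train)

#One OrdinalEncoder for all feature columns; categories not seen during fit
#are mapped to -1 at transform time instead of raising an error
enc = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
X_train_enc = enc.fit_transform(X_train)
X_test_enc = enc.transform(X_test)

model = RandomForestClassifier(random_state=0)
model.fit(X_train_enc, y_train_enc)

#Decode predictions back to airport codes before scoring against the raw y_test
predictions = le_target.inverse_transform(model.predict(X_test_enc))
print("Accuracy score: {}".format(accuracy_score(y_test, predictions)))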
