如何使用precision_score(或其他建议的功能)测量xgboost回归器的准确性

时间:2019-12-03 22:25:14

标签: python xgboost training-data inventory-management k-fold

我正在编写代码来解决一个简单的问题,即预测库存中物品丢失的可能性。

我使用的是 XGBoost 预测模型。

我将数据分为两个.csv文件,一个包含 Train Data ,另一个包含 Test Data

代码如下:

    import pandas as pd
    import numpy as np


    train = pd.read_csv('C:/Users/pedro/Documents/Pedro/UFMG/8o periodo/Python/Trabalho Final/train.csv', index_col='sku').fillna(-1)
    test = pd.read_csv('C:/Users/pedro/Documents/Pedro/UFMG/8o periodo/Python/Trabalho Final/test.csv', index_col='sku').fillna(-1)


    X_train, y_train = train.drop('isBackorder', axis=1), train['isBackorder']

    import xgboost as xgb
    xg_reg = xgb.XGBRegressor(objective ='reg:linear', colsample_bytree = 0.3, learning_rate = 0.1,
                    max_depth = 10, alpha = 10, n_estimators = 10)
    xg_reg.fit(X_train,y_train)


    y_pred = xg_reg.predict(test)

    # Create file for the competition submission
    test['isBackorder'] = y_pred
    pred = test['isBackorder'].reset_index()
    pred.to_csv('competitionsubmission.csv',index=False)

这是我尝试测量问题准确性的函数(使用RMSE和precision_scores函数并进行KFold交叉验证

#RMSE
from sklearn.metrics import mean_squared_error

rmse = np.sqrt(mean_squared_error(y_train, y_pred))
print("RMSE: %f" % (rmse))


#Accuracy
from sklearn.metrics import accuracy_score

# make predictions for test data
predictions = [round(value) for value in y_pred]

# evaluate predictions
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy: %.2f%%" % (accuracy * 100.0))


#KFold
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score

# CV model
kfold = KFold(n_splits=10, random_state=7)
results = cross_val_score(xg_reg, X_train, y_train, cv=kfold)
print("Accuracy: %.2f%% (%.2f%%)" % (results.mean()*100, results.std()*100))

但是我遇到了一些问题。

以上准确性测试均无效。

使用 RMSE 函数和 Accuracy 函数时,出现以下错误: ValueError:找到样本数量不一致的输入变量:[1350955,578982]

我猜我正在使用的训练和测试数据拆分结构不正确。

由于我没有y_test(并且我不知道如何在问题中创建它),因此无法在函数的上述参数中使用它。

K折验证也不起作用。

有人可以帮我吗?

1 个答案:

答案 0 :(得分:0)

您唯一的问题是您需要验证数据。您无法测量predict(x_test)与不存在的y_test之间的准确性。使用sklearn.model_selection.train_test_split根据您的训练数据进行验证。您将获得培训,验证和测试集。您可以在验证集上评估模型的性能。

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(x, y)

其他备注:

这里的精度没有意义,因为您试图预测连续值。仅将精度用于分类变量。

至少可以这样做:

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
import xgboost as xgb
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score

train = pd.read_csv('C:/Users/pedro/Documents/Pedro/UFMG/8o periodo/Python/Trabalho Final/train.csv', index_col='sku').fillna(-1)
test_data = pd.read_csv('C:/Users/pedro/Documents/Pedro/UFMG/8o '
                    'periodo/Python/Trabalho Final/test.csv', index_col='sku').fillna(-1)

x, y = train.drop('isBackorder', axis=1), train['isBackorder']
X_train, X_test, y_train, y_test = train_test_split(x, y)

xg_reg = xgb.XGBRegressor(objective ='reg:linear', colsample_bytree = 0.3, learning_rate = 0.1,
                max_depth = 10, alpha = 10, n_estimators = 10)

xg_reg.fit(X_train,y_train)

kfold = KFold(n_splits=10, random_state=7)
results = cross_val_score(xg_reg, X_train, y_train, cv=kfold)
y_test_pred = xg_reg.predict(X_test)

mse = mean_squared_error(y_test_pred, y_test)

y_pred = xg_reg.predict(X_test)

pd.DataFrame(y_pred).to_csv('competitionsubmission.csv',index=False)