I am trying to reuse a saved XGBClassifier. Right after training, its .predict_proba() method produces the following predictions:
ID-Score
1-0.072253475
2-0.165827038
3-0.098182471
4-0.148302316
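However, after pickling the object, or saving it with sklearn's joblib module, and then loading it back, the predictions are completely off, even on exactly the same test set:
ID-Score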
1-0.46986327
2-0.63513994
3-0.45958066
4-0.8958819
Here is the classifier:
XGBClassifier(base_score=0.5, booster='gbtree',colsample_bylevel=1,
colsample_bytree=0.8, gamma=1, learning_rate=0.01, max_delta_step=0,
max_depth=4, min_child_weight=1, missing=nan, n_estimators=1500,
n_jobs=-1, nthread=None, objective='binary:logistic',
random_state=777, reg_alpha=2, reg_lambda=1,
scale_pos_weight=0.971637216356233, seed=777, silent=True,
subsample=0.6, verbose=2)
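I pickle the object in two different ways: with joblib as provided in the sklearn package, and with the standard pickle.dump through the following functions: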
def save_as_pickled_object(obj, filepath):
    """
    This is a defensive way to write pickle.write, allowing for very large files on all platforms
    """
    import pickle

    max_bytes = 2**31 - 1
    # protocol=4 is passed to pickle.dumps because it allows for serializing data greater than 4GB
    # reference link: https://stackoverflow.com/questions/29704139/pickle-in-python3-doesnt-work-for-large-data-saving
    bytes_out = pickle.dumps(obj, protocol=4)
    # len() is the payload size; sys.getsizeof() would also count the bytes object's own overhead
    n_bytes = len(bytes_out)
    with open(filepath, 'wb') as f_out:
        # Write in slices of at most max_bytes per call
        for idx in range(0, n_bytes, max_bytes):
            f_out.write(bytes_out[idx:idx + max_bytes])
def try_to_load_as_pickled_object_or_None(filepath):
    """
    This is a defensive way to write pickle.load, allowing for very large files on all platforms
    """
    import os
    import pickle

    max_bytes = 2**31 - 1
    try:
        input_size = os.path.getsize(filepath)
        bytes_in = bytearray(0)
        with open(filepath, 'rb') as f_in:
            # Read in slices of at most max_bytes per call
            for _ in range(0, input_size, max_bytes):
                bytes_in += f_in.read(max_bytes)
        obj = pickle.loads(bytes_in)
    except Exception:
        # Any failure to read or unpickle the file is reported as None
        return None
    return obj
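One way to rule out the chunked read/write helpers as the source of the discrepancy is a byte-level round-trip check. The sketch below is not part of the original code: `write_chunked`, `read_chunked`, `roundtrip_is_lossless` and the tiny `chunk_size` are hypothetical stand-ins that mirror the helpers above, with a chunk size small enough that the chunking path is actually exercised.

```python
import os
import pickle
import tempfile


def write_chunked(obj, filepath, chunk_size):
    """Pickle obj and write the bytes in slices of at most chunk_size."""
    data = pickle.dumps(obj, protocol=4)
    with open(filepath, "wb") as f_out:
        for start in range(0, len(data), chunk_size):
            f_out.write(data[start:start + chunk_size])
    return data


def read_chunked(filepath, chunk_size):
    """Read the file back in slices of at most chunk_size; return the raw bytes."""
    buffer = bytearray()
    with open(filepath, "rb") as f_in:
        for _ in range(0, os.path.getsize(filepath), chunk_size):
            buffer += f_in.read(chunk_size)
    return bytes(buffer)


def roundtrip_is_lossless(obj, chunk_size=7):
    """True if the bytes read back equal the bytes written and unpickle to obj."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        path = os.path.join(tmp_dir, "obj.pkl")
        written = write_chunked(obj, path, chunk_size)
        restored = read_chunked(path, chunk_size)
    return written == restored and pickle.loads(restored) == obj


# Hypothetical sample payload; a real check would pass the trained model's pickle instead.
sample = {"scores": [0.072253475, 0.165827038], "n_estimators": 1500}
print(roundtrip_is_lossless(sample))
```

If this check passes for the real file as well, the file I/O is probably not what changes the probabilities, and the problem more likely lies in how the object is restored in a fresh kernel.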
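No matter how I save the file (joblib or pickle), the result is the same: the predict_proba scores are completely different from the ones I get when calling the method on the XGBClassifier object right after training.
On the other hand, in the same kernel I do the same thing with an SGDClassifier, and I do not run into this problem there.
One strange thing I noticed: if I save the classifier after training and then load it in the same kernel session (using JupyterLab), the probabilities are identical. But if I restart the kernel and load the object with either of the two methods above, I no longer get the same probabilities.
My expectation is that I should get the same, or nearly the same, probabilities from the predict_proba method of the XGBClassifier.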
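To make "the same or nearly the same" concrete, here is a minimal sketch that compares the two sets of scores listed above. The helper names and the `abs_tol` value are assumptions chosen for illustration, not anything prescribed by xgboost.

```python
import math

# Scores copied from the two tables in the question, keyed by ID.
before_reload = {1: 0.072253475, 2: 0.165827038, 3: 0.098182471, 4: 0.148302316}
after_reload = {1: 0.46986327, 2: 0.63513994, 3: 0.45958066, 4: 0.8958819}


def max_abs_difference(expected, actual):
    """Largest absolute gap between two score mappings that share the same IDs."""
    return max(abs(expected[key] - actual[key]) for key in expected)


def scores_match(expected, actual, abs_tol=1e-6):
    """True if every score agrees within abs_tol (a hypothetical tolerance)."""
    return all(
        math.isclose(expected[key], actual[key], abs_tol=abs_tol)
        for key in expected
    )


print(max_abs_difference(before_reload, after_reload))
print(scores_match(before_reload, after_reload))
```

The largest gap here is roughly 0.75 (ID 4), which is far beyond anything floating-point noise could explain.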
Thanks
Answer 0 (score: 0)
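I suspect there is something odd in the way you dump and read the model, or it may be a quirk of the xgboost version you are using.
I can reproduce the predicted probabilities exactly by simply loading the persisted XGB model with the following code in a notebook (after a kernel restart, rerun the initial imports and the "Load the model" section):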
import pickle

import joblib
import numpy as np
import xgboost as xgb

## Training a model
np.random.seed(312)
train_X = np.random.random((10000, 10))
train_y = np.random.randint(0, 2, train_X.shape[0])
val_X = np.random.random((10000, 10))
val_y = np.random.randint(0, 2, val_X.shape[0])
xgb_model_mpg = xgb.XGBClassifier(max_depth=3)
_ = xgb_model_mpg.fit(train_X, train_y)
print(xgb_model_mpg.predict_proba(val_X))

## Save the model
with open('m.pkl', 'wb') as fout:
    pickle.dump(xgb_model_mpg, fout)
joblib.dump(xgb_model_mpg, 'm.jlib')

## Load the model
m_jlb = joblib.load('m.jlib')
with open('m.pkl', 'rb') as fin:
    m_pkl = pickle.load(fin)
print(m_jlb.predict_proba(val_X))
print(m_pkl.predict_proba(val_X))
I am on xgboost 0.71, joblib 0.12.4 and Python 3.5.5, in a plain Jupyter notebook.
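Since a version difference between the saving and the loading environment is one of the suspects, it may help to store the versions next to the model and compare them on load. The following is only a sketch under stated assumptions: `environment_snapshot`, `dump_with_environment` and `load_with_environment` are hypothetical helpers, and `importlib.metadata` needs Python 3.8 or newer, so it would not run as-is on the Python 3.5.5 setup mentioned above.

```python
import importlib.metadata
import os
import pickle
import sys
import tempfile


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return importlib.metadata.version(package)
    except importlib.metadata.PackageNotFoundError:
        return None


def environment_snapshot(packages=("xgboost", "joblib", "numpy")):
    """Collect the Python version and the versions of the given packages."""
    snapshot = {"python": "{}.{}.{}".format(*sys.version_info[:3])}
    for name in packages:
        snapshot[name] = installed_version(name)
    return snapshot


def dump_with_environment(obj, filepath):
    """Pickle obj together with a snapshot of the current environment."""
    bundle = {"environment": environment_snapshot(), "payload": obj}
    with open(filepath, "wb") as f_out:
        pickle.dump(bundle, f_out, protocol=4)


def load_with_environment(filepath):
    """Return (payload, mismatches); mismatches maps name -> (saved, current)."""
    with open(filepath, "rb") as f_in:
        bundle = pickle.load(f_in)
    current = environment_snapshot()
    mismatches = {
        name: (saved, current.get(name))
        for name, saved in bundle["environment"].items()
        if current.get(name) != saved
    }
    return bundle["payload"], mismatches


# Demo with a plain dict standing in for the trained model.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "bundle.pkl")
    dump_with_environment({"max_depth": 3}, demo_path)
    payload, mismatches = load_with_environment(demo_path)
print(payload, mismatches)
```

A non-empty `mismatches` dictionary after a kernel restart would point to the environment rather than to the pickling code.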