CatBoost: evaluation/test set with observation weights

Question

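I am working with a dataset containing a list of people, indexed by fiscal code. The target variable is binary (1: buys a book, 0: otherwise). All predictors are categorical (for example: nationality, city, street, source of income, and so on). A fiscal code can appear twice, and every instance/observation has a weight (1 if the code is not repeated, a value between 0 and 1 if it is).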

For example, the dataset looks like this:

Fiscal code | Weight | Target | Categorical info

AAAAA1 | 0.98 | 0 | ......

AAAAA1 | 0.02 | 1 | ........

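I have two datasets with the same variables: one for train (X_train = matrix of categorical variables, y_train = target variable, train_weight = weight of each observation in the training set) and one for test (same variables and meaning: X_test, y_test and test_weight).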

I tried using a CatBoost model, CatBoostClassifier.

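# Initializing booster and hyperparameters
import numpy as np
from catboost import CatBoostClassifier

# Column positions of the categorical predictors (pandas 'category' dtype)
categorical_features_indices = np.where(X_train.dtypes == 'category')[0]

model = CatBoostClassifier(iterations=5000, learning_rate=0.1, depth=7, loss_function='Logloss', eval_metric='AUC')

# Fitting model
model.fit(X_train,
          y_train,
          eval_set=(X_test, y_test),
          cat_features=categorical_features_indices,
          use_best_model=True,
          verbose=True,
          sample_weight=train_weight)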

The question is: how can I take into account that the observations in the TEST set also have weights (test_weight)? Do you have any ideas?

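I read the documentation at https://tech.yandex.com/catboost/doc/dg/concepts/python-reference_catboostregressor_fit-docpage/ but did not find anything useful there, unlike in the lightgbm documentation (in case you are thinking of suggesting another boosting model).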

Answer 1

My understanding is that this is a case where you need to use Pool, i.e.:

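from catboost import Pool

# Weights and categorical feature indices are attached to each Pool
train_pool = Pool(X_train, y_train, cat_features=categorical_features_indices, weight=train_weight)
test_pool = Pool(X_test, y_test, cat_features=categorical_features_indices, weight=test_weight)

model.fit(train_pool,
          eval_set=test_pool,
          use_best_model=True,
          verbose=True)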
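To see why weights on the evaluation set matter, here is a minimal pure-Python sketch of a weighted log loss, written as the weighted average of the per-observation losses. The function name `weighted_logloss` and the toy numbers are assumptions made for illustration only; this is not CatBoost's internal code, just the general form a weighted metric takes.

```python
import math

def weighted_logloss(y_true, p_pred, weights):
    """Weighted average of the per-observation binary log loss."""
    total = 0.0
    for y, p, w in zip(y_true, p_pred, weights):
        total += -w * (y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / sum(weights)

# Toy data (hypothetical): the first two rows share one fiscal code,
# with weights 0.98 and 0.02 as in the question's example.
y_test = [0, 1, 1]
p_test = [0.1, 0.1, 0.8]        # assumed predicted probabilities of class 1
test_weight = [0.98, 0.02, 1.0]

unweighted = weighted_logloss(y_test, p_test, [1.0] * len(y_test))
weighted = weighted_logloss(y_test, p_test, test_weight)
print(unweighted, weighted)
```

In this toy case the badly predicted row carries a weight of only 0.02, so the weighted value comes out much lower than the unweighted one. An evaluation that ignores test_weight would therefore judge the model differently from one that uses it.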
