Question

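I am running Isolation Forest clustering on the mulcross dataset, which has 2 classes. I split the data into a training set and a test set and tried to compute accuracy_score, roc_auc_score and confusion_matrix on the test set. But there are two problems. First, with a clustering method I should not use the labels during training, which means "y_train" should not be mentioned, but I have not found another way to evaluate my model. Second, the results I get are not correct. My question is how to evaluate a clustering model such as Isolation Forest. Here is my code: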

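import pandas as pd
from sklearn.ensemble import IsolationForest
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split

# Load the data and separate the features from the label column
df = pd.read_csv('db.csv')
y_true = df['Target']
df_data = df.drop(columns='Target')

X_train, X_test, y_train, y_test = train_test_split(df_data, y_true, test_size=0.3, random_state=42)

alg = IsolationForest(n_estimators=100, max_samples=256, contamination=0.1, max_features=1.0, bootstrap=False, n_jobs=-1, random_state=42, verbose=0)
model = alg.fit(X_train, y_train)  # y_train is accepted but ignored by IsolationForest
preds = alg.predict(X_test)  # +1 for inliers, -1 for outliers

# Compare the predictions with the test labels
print("#############################\n#############################")
print(accuracy_score(y_test, preds))
print(roc_auc_score(y_test, preds))
cm = confusion_matrix(y_test, preds)
print(cm)
print("#############################\n#############################")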

Answer 1

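I don't understand why you are clustering and also splitting into train/test sets. It looks to me like you are mixing up classification and clustering, or something similar. If you have labels, try a supervised approach. Easy wins are xgboost, random forest, GLM, logistic regression, etc.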
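As a rough illustration of the supervised route, here is a minimal sketch with a random forest. It assumes scikit-learn is available and uses synthetic data in place of the mulcross file; the label coding (1 = anomaly, 0 = normal) and all variable names are assumptions made for the example, not something taken from the question.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the labelled data (assumption: 1 = anomaly, 0 = normal)
rng = np.random.RandomState(42)
X_normal = rng.normal(loc=0.0, scale=1.0, size=(900, 4))
X_anomaly = rng.normal(loc=4.0, scale=1.0, size=(100, 4))
X = np.vstack([X_normal, X_anomaly])
y = np.concatenate([np.zeros(900, dtype=int), np.ones(100, dtype=int)])

# Stratify so both classes appear in the train and test splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# Supervised model: the labels ARE used during training here
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)

# Score with the predicted probability of the anomaly class
proba = clf.predict_proba(X_test)[:, 1]
supervised_auc = roc_auc_score(y_test, proba)
print("ROC AUC:", supervised_auc)
```

With real labels available, a baseline like this gives a reference number to compare any unsupervised detector against.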

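If you want to evaluate a clustering method, you can look at the inter-cluster and intra-cluster distances. In the end, what you want are compact, well-separated clusters. You can also look at a metric called the silhouette.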
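For the silhouette idea, a small sketch using scikit-learn's silhouette_score might look like the following. KMeans and the three synthetic blobs are stand-ins chosen for the example (assumptions, not the asker's setup); the score ranges from -1 to 1, with higher values indicating tighter, better separated clusters.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Three well-separated synthetic blobs (hypothetical data for illustration)
X, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [5, 5], [-5, 5]],
    cluster_std=0.5,
    random_state=42,
)

# Cluster without using any labels
labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)

# Silhouette compares each point's mean intra-cluster distance
# with its mean distance to the nearest other cluster
sil = silhouette_score(X, labels)
print("Silhouette:", sil)
```

No ground-truth labels are needed for this metric, which is why it suits a purely unsupervised evaluation.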

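You can also try the following, where y_pred_test is presumably the output of predict on the test set. Note that it reports the share of test samples predicted as inliers (+1), which matches accuracy only if every test sample is actually normal: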

print("Accuracy:", list(y_pred_test).count(1)/y_pred_test.shape[0])

Also, see here for more details.
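To tie this back to the question, one possible way to keep the labels out of training and still use them for scoring is sketched below. It runs on synthetic data rather than the mulcross file, and it assumes the labels are coded 1 = anomaly and 0 = normal; that coding, the data and the variable names are assumptions for the example. The two points it tries to show are that predict returns +1/-1 and has to be mapped onto the label coding before calling accuracy_score or confusion_matrix, and that roc_auc_score is better fed a continuous score than hard predictions.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic data (assumption): Gaussian normals plus scattered anomalies
rng = np.random.RandomState(42)
X_normal = rng.normal(loc=0.0, scale=1.0, size=(900, 2))
X_anomaly = rng.uniform(low=-8.0, high=8.0, size=(100, 2))
X = np.vstack([X_normal, X_anomaly])
y = np.concatenate([np.zeros(900, dtype=int), np.ones(100, dtype=int)])  # 1 = anomaly

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# Unsupervised fit: y_train is never passed to the model
iso = IsolationForest(n_estimators=100, max_samples=256, contamination=0.1, random_state=42)
iso.fit(X_train)

# predict returns +1 (inlier) / -1 (outlier); map to the assumed 0/1 coding
y_pred = (iso.predict(X_test) == -1).astype(int)

# score_samples is lower for more abnormal points, so negate it
# to get a score where higher means "more anomalous"
anomaly_score = -iso.score_samples(X_test)

acc = accuracy_score(y_test, y_pred)
auc = roc_auc_score(y_test, anomaly_score)
cm = confusion_matrix(y_test, y_pred)
print("Accuracy:", acc)
print("ROC AUC:", auc)
print(cm)
```

The labels here are only used after fitting, as a held-out check. If the real Target column uses a different coding, the mapping line would need to change accordingly.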

Python: Evaluating Isolation Forest
