Isolation Forest

Time: 2017-07-06 14:20:48

Tags: python scikit-learn outliers anomaly-detection

I'm currently using the IsolationForest method in Python to identify outliers in my dataset, but I don't fully understand the example on sklearn:

http://scikit-learn.org/stable/auto_examples/ensemble/plot_isolation_forest.html#sphx-glr-auto-examples-ensemble-plot-isolation-forest-py

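Specifically, what is the plot actually showing us? The observations have already been labeled as normal or outliers, so I assume the shading of the contour plot indicates whether an observation really is an outlier (e.g. do observations with higher anomaly scores lie in the darker shaded regions?).

Finally, how is the following snippet actually used (specifically the y_pred variables)?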

# fit the model
clf = IsolationForest(max_samples=100, random_state=rng)
clf.fit(X_train)
y_pred_train = clf.predict(X_train)
y_pred_test = clf.predict(X_test)
y_pred_outliers = clf.predict(X_outliers) 

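I guess it is only provided for completeness, in case someone wants to print the output?

Thanks in advance for any help!

1 Answer:

Answer 0: (score: 2)

Using your code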

Right after your code, print y_pred_outliers:

# fit the model
clf = IsolationForest(max_samples=100, random_state=rng)
clf.fit(X_train)
y_pred_train = clf.predict(X_train)
y_pred_test = clf.predict(X_test)
y_pred_outliers = clf.predict(X_outliers)

print(y_pred_outliers)

So, for each observation, predict tells you whether the fitted model considers it an inlier (+1) or an outlier (-1).
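On the question about the plot: as far as I can tell, the shading in the linked example is drawn from clf.decision_function evaluated on a grid, not from predict. The sketch below uses made-up synthetic data (the cluster and the uniform noise are my own assumptions, not part of your dataset) to show how the continuous score relates to the +1/-1 labels in recent scikit-learn versions: lower scores are more abnormal, and negative scores are the ones labeled -1.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)

# Synthetic data, assumed purely for illustration:
# a tight cluster for training, then 5 similar points plus 5 uniform-noise points.
X_train = 0.3 * rng.randn(200, 2)
X_new = np.r_[0.3 * rng.randn(5, 2), rng.uniform(low=-4, high=4, size=(5, 2))]

clf = IsolationForest(max_samples=100, random_state=rng)
clf.fit(X_train)

labels = clf.predict(X_new)            # +1 = inlier, -1 = outlier
scores = clf.decision_function(X_new)  # lower = more abnormal

# Print each label next to the score it was derived from.
for label, score in zip(labels, scores):
    print(f"label={int(label):+d}  score={float(score):+.3f}")
```

So the y_pred arrays give the hard decision, while decision_function gives the graded score that a contour plot can shade.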

使用Iris数据的简单示例

# fit the model
clf = IsolationForest(max_samples=100, random_state=rng)
clf.fit(X_train)
y_pred_train = clf.predict(X_train)
y_pred_test = clf.predict(X_test)
y_pred_outliers = clf.predict(X_outliers) 

print(y_pred_outliers)

<强>结果:

from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)
data = load_iris()

X=data.data
y=data.target
X_outliers = rng.uniform(low=-4, high=4, size=(X.shape[0], X.shape[1]))

X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.3, random_state=0)

clf = IsolationForest()
clf.fit(X_train)

y_pred_train = clf.predict(X_train)
y_pred_test = clf.predict(X_test)
y_pred_outliers = clf.predict(X_outliers)

print(y_pred_test)
print(y_pred_outliers)

解读:

[ 1 1 -1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1] [-1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1] 仅返回 1 。这意味着X_test 的所有样本都不是异常值。

另一方面,print(y_pred_test)仅返回 -1 。这意味着X_outliers 的所有样本(虹膜数据总共150个)都是异常值。
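In practice you would usually do more than print the labels. A minimal sketch, assuming you want to count and pull out the flagged rows: compare the predictions with -1 to get a boolean mask and index the data with it. The names outlier_mask, flagged and kept are just illustrative, and random_state=0 is my own addition so the run is repeatable.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import IsolationForest
from sklearn.model_selection import train_test_split

X = load_iris().data
X_train, X_test = train_test_split(X, test_size=0.3, random_state=0)

# random_state is fixed here (an assumption) so repeated runs give the same labels
clf = IsolationForest(random_state=0).fit(X_train)
y_pred_test = clf.predict(X_test)

# Boolean mask: True where the model labeled the sample as an outlier (-1)
outlier_mask = y_pred_test == -1
n_outliers = int(outlier_mask.sum())

flagged = X_test[outlier_mask]   # rows considered outliers
kept = X_test[~outlier_mask]     # rows considered inliers

print(f"{n_outliers} of {len(X_test)} test samples flagged as outliers")
print(flagged)
```

The same mask works on a pandas DataFrame or on any array aligned with X_test, for example the target labels.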

Hope this helps.