Understanding the LocalOutlierFactor algorithm through an example

Date: 2018-08-07 14:41:33

Tags: python machine-learning scikit-learn data-analysis

So I worked through the sklearn example for LocalOutlier detection and tried to apply it to an example dataset I have. But somehow the results don't make any sense to me.

What I have implemented is the following (imports not included):

import numpy as np
import matplotlib.pyplot as plt
import pandas
from sklearn.neighbors import LocalOutlierFactor


# import file
url = ".../Python/outliner.csv"
names = ['R1', 'P1', 'T1', 'P2', 'Flag']
dataset = pandas.read_csv(url, names=names)    

array = dataset.values
X = array[:,0:2] 
rng = np.random.RandomState(42)


# fit the model
clf = LocalOutlierFactor(n_neighbors=50, algorithm='auto', leaf_size=30)
y_pred = clf.fit_predict(X)
y_pred_outliers = y_pred[500:]

# plot the level sets of the decision function
xx, yy = np.meshgrid(np.linspace(0, 1000, 50), np.linspace(0, 200, 50))
Z = clf._decision_function(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)

plt.title("Local Outlier Factor (LOF)")
plt.contourf(xx, yy, Z, cmap=plt.cm.Blues_r)

a = plt.scatter(X[:200, 0], X[:200, 1], c='white',
                edgecolor='k', s=20)
b = plt.scatter(X[200:, 0], X[200:, 1], c='red',
                edgecolor='k', s=20)
plt.axis('tight')
plt.xlim((0, 1000))
plt.ylim((0, 200))
plt.legend([a, b],
           ["normal observations",
            "abnormal observations"],
           loc="upper left")
plt.show()

I get something like this: (plot: LOF Outlier Detection)

Can someone tell me why the detection fails?

I have played around with the parameters and ranges, but that didn't change much about the outlier detection itself.

It would be great if someone could point me in the right direction on this. Thanks.

Edit: added the imports.

1 Answer:

Answer 0 (score: 1)

I assume you followed this example. That example tries to compare the actual/observed data (scatter plot) against the decision function learned from it (contour plot). Because the data there is known/made up (200 inliers + 20 outliers), we can simply select the outliers with X[200:] (index 200 onward) and the inliers with X[:200] (indices 0 to 199).
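For context, the referenced sklearn example builds its data roughly like this (a sketch of the example's construction, not your CSV data):

```python
import numpy as np

rng = np.random.RandomState(42)

# 200 "normal" observations in two Gaussian clusters
X_inliers = 0.3 * rng.randn(100, 2)
X_inliers = np.r_[X_inliers + 2, X_inliers - 2]

# 20 uniformly scattered outliers
X_outliers = rng.uniform(low=-4, high=4, size=(20, 2))

# ground truth is known by construction:
# X[:200] are inliers, X[200:] are outliers
X = np.r_[X_inliers, X_outliers]
```

With real data like your CSV, no such index split exists, which is why positional slices like X[:200] and X[200:] do not separate actual inliers from outliers.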

So if you want to plot the prediction results (as a scatter plot) instead of the actual/observed data, you need code along the following lines. Essentially, you split X according to y_pred (1: inlier, -1: outlier) and then use the two subsets in the scatter plot:

import numpy as np
import matplotlib.pyplot as plt
import pandas
from sklearn.neighbors import LocalOutlierFactor

# import file
url = ".../Python/outliner.csv"
names = ['R1', 'P1', 'T1', 'P2', 'Flag']
dataset = pandas.read_csv(url, names=names)
X = dataset.values[:, 0:2]

# fit the model
clf = LocalOutlierFactor(n_neighbors=50, algorithm='auto', leaf_size=30)
y_pred = clf.fit_predict(X)

# map results
X_normals = X[y_pred == 1]
X_outliers = X[y_pred == -1]

# plot the level sets of the decision function
xx, yy = np.meshgrid(np.linspace(0, 1000, 50), np.linspace(0, 200, 50))
# note: _decision_function is a private sklearn method and may change
# between releases
Z = clf._decision_function(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)

plt.title("Local Outlier Factor (LOF)")
plt.contourf(xx, yy, Z, cmap=plt.cm.Blues_r)

a = plt.scatter(X_normals[:, 0], X_normals[:, 1], c='white', edgecolor='k', s=20)
b = plt.scatter(X_outliers[:, 0], X_outliers[:, 1], c='red', edgecolor='k', s=20)
plt.axis('tight')
plt.xlim((0, 1000))
plt.ylim((0, 200))
plt.legend([a, b], ["normal predictions", "abnormal predictions"], loc="upper left")
plt.show()

As you can see, the scatter plot of the normal data now follows the contour plot:

(resulting plot: predicted normal/abnormal points over the LOF decision-function contours)
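A side note: `_decision_function` is a private method. From scikit-learn 0.20 on, you can fit with `novelty=True` and call the public `decision_function` instead. A minimal sketch, using made-up stand-in data in place of the CSV (which isn't available here):

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.RandomState(42)
# stand-in for the two CSV columns, spanning the plot ranges used above
X = rng.uniform(low=[0, 0], high=[1000, 200], size=(220, 2))

# novelty=True exposes the public decision_function / score_samples API
clf = LocalOutlierFactor(n_neighbors=50, novelty=True)
clf.fit(X)

# evaluate the decision function on a grid, as in the contour plot
xx, yy = np.meshgrid(np.linspace(0, 1000, 50), np.linspace(0, 200, 50))
Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
# negative values of Z indicate outlier-like regions
```

Note that with `novelty=True` you call `fit` and then `predict`/`decision_function` on new points; `fit_predict` is only available in the default (novelty=False) mode.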