Question

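In cluster analysis, the outliers of a dataset can easily be identified with the single-linkage method. Now I would like to remove the outliers automatically. My idea is to remove the data that exceed a specified distance value. Here is my code, using the mtcars example data: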

library(cluster)
library(dendextend)
cluster<-agnes(mtcars,stand=FALSE,method="single")
dend = as.dendrogram(cluster)

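The plot shows the resulting dendrogram. The last 4 cars ("Duster 360", "Camaro Z28", "Ford Pantera L", "Maserati Bora") are identified as outliers, so I want to remove their rows from the mtcars dataset. How can I do this automatically? For example, by removing the rows that join the tree at a height above 70? I have already tried many ways of removing outliers, but none of them seem to work for my data.

Thanks a lot!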

Answer 1

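If your "rule" is the linkage distance, then you have basically re-created nearest-neighbor outlier detection, one of the older outlier detection methods in data mining.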

Ramaswamy, Sridhar, Rajeev Rastogi, and Kyuseok Shim. "Efficient algorithms for mining outliers from large data sets." ACM SIGMOD Record. Vol. 29. No. 2. ACM, 2000.

The difference is that single-linkage with AGNES needs O(n³) time, whereas with an index, kNN outlier detection can run in O(n log n).
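
As a rough illustration of the nearest-neighbor idea, here is a minimal base-R sketch. The helper name `knn_distance`, the choice `k = 1`, and the cut-off of 70 (borrowed from the question) are assumptions for illustration only. It uses a brute-force distance matrix rather than an index, so it does not achieve the O(n log n) running time mentioned above.

```r
# Hypothetical helper: distance from each row to its k-th nearest neighbour
# (Euclidean distance on the unscaled data, brute force via dist()).
knn_distance <- function(data, k = 1) {
  d <- as.matrix(dist(data))
  diag(d) <- Inf  # ignore each point's zero distance to itself
  apply(d, 1, function(row) sort(row)[k])
}

knn_dist <- knn_distance(mtcars, k = 1)

threshold <- 70  # assumed cut-off, taken from the question
knn_outliers <- mtcars[knn_dist > threshold, , drop = FALSE]
knn_cleaned  <- mtcars[knn_dist <= threshold, , drop = FALSE]
```

The flagged rows need not match the dendrogram cut: points that form a small, tight group far away from the rest have a small 1-nearest-neighbor distance, so a larger `k` may be needed to catch them.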

Answer 2

Try this:

# your code
library(cluster)
cluster<-agnes(mtcars,stand=FALSE,method="single")
dend = as.dendrogram(cluster)
plot(dend)

# new code
hclu <- as.hclust(cluster) # convert to an hclust object that cutree() understands
groupindexes <- cutree(hclu, h = 70) # cut at height 70 - creates 3 groups/branches
mtcars[groupindexes != 1,] # "outliers" - not in group 1 but in groups 2 and 3
mtcars[groupindexes == 1,] # all but the 4 "outliers"

Result 1 - "outliers":

                mpg cyl disp  hp drat   wt  qsec vs am gear carb
Duster 360     14.3   8  360 245 3.21 3.57 15.84  0  0    3    4
Camaro Z28     13.3   8  350 245 3.73 3.84 15.41  0  0    3    4
Ford Pantera L 15.8   8  351 264 4.22 3.17 14.50  0  1    5    4
Maserati Bora  15.0   8  301 335 3.54 3.57 14.60  0  1    5    8

Result 2:

                     mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
(....and ~30 other rows ....)
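
The code above relies on the main group being numbered 1, which holds here because `cutree()` numbers groups in order of their first appearance in the data. A slightly more general sketch keeps whichever group is largest. The function name `drop_small_clusters` and its arguments are hypothetical, and treating everything outside the largest group as an outlier is an assumption that may not suit every dataset.

```r
library(cluster)

# Hypothetical helper: cut the tree at height h and keep only the rows
# that fall into the largest resulting group.
drop_small_clusters <- function(data, h, method = "single") {
  fit <- agnes(data, stand = FALSE, method = method)
  groups <- cutree(as.hclust(fit), h = h)
  # label of the group with the most members
  main <- as.integer(names(which.max(table(groups))))
  data[groups == main, , drop = FALSE]
}

mtcars_clean <- drop_small_clusters(mtcars, h = 70)
```

Here `h = 70` is again the height suggested in the question; in practice it would be read off the dendrogram for the data at hand.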

Automatically remove outliers from data clustered with agglomerative hierarchical clustering
