Question

I want to remove the outliers from each cluster of a dataset. The dataset contains 3 columns with different variables and one column indicating the cluster each point is assigned to. If even one of the 3 variables is an outlier, the whole row is removed. Outliers are identified by determining the interval spanning the mean plus/minus three standard deviations, but I could also use the outlier function.

I can remove the outliers without taking the clusters into account, using:

outlier

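However, I cannot work out how to detect the outliers per cluster, so that the interval is determined within each cluster rather than across the whole population. I thought of nesting another loop, but I find it hard to code. Any help would be greatly appreciated.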

Answer 1

Here is a dplyr approach:

library(dplyr)
dat %>% 
  group_by(k) %>% 
  filter_all(all_vars((abs(mean(.) - .) < 3*sd(.))))

# # A tibble: 100 x 4
# # Groups:   k [5]
#       v1    v2    v3     k
#    <int> <int> <int> <int>
#  1     9    20    30     1
#  2     5    24    35     2
#  3     8    20    30     3
#  4     8    23    32     4
#  5     6    23    35     5
#  6     9    24    32     1
#  7     9    22    33     2
#  8     9    23    31     3
#  9     7    21    35     4
# 10     9    23    32     5
# # ... with 90 more rows
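
In recent dplyr versions the scoped verbs such as filter_all() are superseded. The following is a minimal sketch of the same per-cluster filter written with filter() and if_all(), assuming dplyr 1.0.4 or later. The sample data follows the construction used in Answer 2; the seed and the name cleaned are only illustrative assumptions.

```r
library(dplyr)

set.seed(1)  # arbitrary seed, used only to make the sketch reproducible

# Sample data: 100 ordinary rows plus 5 rows with unusually low values
dat <- data.frame(
  v1 = c(sample(5:10, 100, replace = TRUE), sample(1:5, 5)),
  v2 = c(sample(20:25, 100, replace = TRUE), sample(5:10, 5)),
  v3 = c(sample(30:35, 100, replace = TRUE), sample(10:20, 5)),
  k  = rep(1:5, 21)
)

# Keep a row only if all three variables lie within
# mean +/- 3 sd of their own cluster
cleaned <- dat %>%
  group_by(k) %>%
  filter(if_all(c(v1, v2, v3), ~ abs(.x - mean(.x)) < 3 * sd(.x))) %>%
  ungroup()
```

Because the data frame is grouped by k, mean() and sd() inside if_all() are evaluated within each cluster, which is what makes the interval cluster-specific.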

Answer 2

Base R:

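# Sample data: 100 ordinary rows plus 5 rows with unusually low values
dat <- cbind.data.frame(v1 = c(sample(5:10, 100, replace = TRUE), sample(1:5, 5)),
                        v2 = c(sample(20:25, 100, replace = TRUE), sample(5:10, 5)),
                        v3 = c(sample(30:35, 100, replace = TRUE), sample(10:20, 5)),
                        k = c(rep(1:5, 21)))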

# Flag values lying outside mean +/- a * sd of their own cluster.
# ave() returns the group statistic for every element in the original
# row order, so the result has one logical value per row of the data.
get_remove <- function(x, index, a = 3) {
  x <- as.numeric(x)
  center <- ave(x, index, FUN = mean)
  spread <- ave(x, index, FUN = sd)
  abs(x - center) > a * spread
}
mask <- apply(do.call(cbind, 
                      lapply(dat[ , c("v1", "v2", "v3")], 
                             get_remove, dat$k)),
              MARGIN = 1, any)
dat[!mask, ] 
print("removed:")
dat[mask, ]
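
The question mentions nesting another loop. The following is a minimal sketch of that idea in base R, assuming the same column names (v1, v2, v3, k) and the mean plus/minus three standard deviations rule. The seed and the names is_outlier, cleaned and removed are illustrative only.

```r
set.seed(42)  # arbitrary seed, used only to make the sketch reproducible

# Sample data built the same way as above
dat <- data.frame(
  v1 = c(sample(5:10, 100, replace = TRUE), sample(1:5, 5)),
  v2 = c(sample(20:25, 100, replace = TRUE), sample(5:10, 5)),
  v3 = c(sample(30:35, 100, replace = TRUE), sample(10:20, 5)),
  k  = rep(1:5, 21)
)

vars <- c("v1", "v2", "v3")
is_outlier <- rep(FALSE, nrow(dat))  # one flag per row

# Outer loop over clusters, inner loop over variables
for (cl in unique(dat$k)) {
  rows <- which(dat$k == cl)         # row positions of this cluster
  for (v in vars) {
    x <- dat[rows, v]
    out <- abs(x - mean(x)) > 3 * sd(x)
    # A row is flagged as soon as one of its variables is an outlier
    is_outlier[rows[which(out)]] <- TRUE
  }
}

cleaned <- dat[!is_outlier, ]
removed <- dat[is_outlier, ]
```

Working with row positions (rows) keeps the flags aligned with the original data frame regardless of how the clusters are interleaved or how many rows each one has.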

Nested loop in R for detecting outliers
