I am trying to write a loop that removes outliers:
for (i in 1:ncol(df)){if (class(df[i,])=="numeric"){df[i,] <- df[df[i,] > quantile(df[i,],3/4)*3/2,]}}
I get this error:
Error in Ops.data.frame(df[i, ], quantile(df[i, ], 3/4) * 3/2) :
‘>’ only defined for equally-sized data frames
Answer 0 (score: 3)
We can do this more compactly with lapply, which applies a function to each column of the data.frame.
c1 = rnorm(10)
c2 = rnorm(10)
c3 = LETTERS[1:10]
df = cbind.data.frame(c1, c2, c3)
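myfun = function(x, probs){
  # Only numeric columns are altered; anything else is returned unchanged
  if(is.numeric(x)){
    # Replace values above the requested quantile with NA
    x[x > quantile(x, probs)] = NA
  }
  return(x)
}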
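As a quick sanity check, the function can be tried on a small vector whose 75% quantile is easy to work out by hand. This is only a sketch: it repeats a definition with the same logic as myfun (assuming is.numeric as the type check) so that it runs on its own, and the test vectors are made up for illustration.

```r
# Same logic as myfun, repeated here so this block runs on its own
myfun = function(x, probs){
  if(is.numeric(x)){
    x[x > quantile(x, probs)] = NA
  }
  return(x)
}

x = c(1, 2, 3, 4, 100)          # made-up values; their 75% quantile is 4
res_num = myfun(x, 3/4)         # only 100 exceeds the quantile and becomes NA
res_chr = myfun(c("a", "b"), 3/4)  # non-numeric input comes back unchanged
res_num
res_chr
```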
The example data.frame is:
> df
c1 c2 c3
1 -0.21304047 0.34942938 A
2 0.12141663 -1.41734891 B
3 -0.09297657 0.57998739 C
4 -0.70925140 -0.52620644 D
5 1.02440427 0.02377832 E
6 0.43631554 0.19125312 F
7 0.53268566 2.25430880 G
8 -0.37624920 0.14218233 H
9 0.03863661 -0.44441846 I
10 1.26889396 -0.12077335 J
Next, I record the quantiles beforehand so that we can confirm the function works:
> quantile(df$c1, 3/4)
75%
0.5085931
> quantile(df$c2, 3/4)
75%
0.3098853
Now apply myfun to every column and bind the results back into a data.frame:
df = do.call(cbind.data.frame, lapply(df, myfun, 3/4))
> df
c1 c2 c3
1 -0.21304047 NA A
2 0.12141663 -1.41734891 B
3 -0.09297657 NA C
4 -0.70925140 -0.52620644 D
5 NA 0.02377832 E
6 0.43631554 0.19125312 F
7 NA NA G
8 -0.37624920 0.14218233 H
9 0.03863661 -0.44441846 I
10 NA -0.12077335 J
So we do get the output we expect: every value above its column's 75% quantile is now NA.
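The do.call(cbind.data.frame, ...) step can also be written by assigning into df[], which keeps the data.frame and replaces only its columns. The following is a sketch under assumptions: mask_above is a hypothetical helper with the same logic as myfun, and the columns are made up so the result can be checked by hand.

```r
# Made-up columns; the 75% quantile of each numeric column is 7.75
df = data.frame(c1 = c(1:9, 100), c2 = c(100, 1:9), c3 = LETTERS[1:10])

# Hypothetical helper with the same logic as myfun
mask_above = function(x, probs){
  if(is.numeric(x)) x[x > quantile(x, probs)] = NA
  x
}

# df[] keeps the data.frame structure; only the column contents are replaced
df[] = lapply(df, mask_above, 3/4)
df
```

Each numeric column ends up with three NA values (the three entries above 7.75), and c3 is untouched.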
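To do this with a for loop instead, we can run the following on the original df (before the lapply step, since quantile stops with an error when the column already contains NA):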
for(i in seq_len(ncol(df))) if(is.numeric(df[, i])) df[, i][df[, i] > quantile(df[, i], 3/4)] = NA
This gives us the same result:
> df
c1 c2 c3
1 -0.21304047 NA A
2 0.12141663 -1.41734891 B
3 -0.09297657 NA C
4 -0.70925140 -0.52620644 D
5 NA 0.02377832 E
6 0.43631554 0.19125312 F
7 NA NA G
8 -0.37624920 0.14218233 H
9 0.03863661 -0.44441846 I
10 NA -0.12077335 J
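The loop above indexes columns with df[, i], whereas the loop in the question used df[i, ], which selects a row. A small sketch of the difference, using a made-up data.frame, may help explain the original error:

```r
df = data.frame(c1 = c(0.5, 1.5, 2.5), c3 = c("A", "B", "C"))  # made-up data

row1 = df[1, ]   # a one-row data.frame spanning every column
col1 = df[, 1]   # a plain numeric vector holding column c1

class(row1)      # "data.frame"
class(col1)      # "numeric"

# The comparison is well defined on the column vector
above = col1 > quantile(col1, 3/4)
above
```

Because df[i, ] is still a data.frame that mixes numeric and character columns, neither the class check nor the comparison against a quantile behaves like it does on a single numeric column.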
Then, if we only want to keep the rows that contain no NA at all, we can run:
df = df[complete.cases(df), ]
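To illustrate what complete.cases does, here is a minimal sketch on a made-up data.frame that already contains NA values. It also shows na.omit, a base R function that drops the same rows.

```r
# Made-up data with NA in rows 1 and 2
df = data.frame(c1 = c(1, NA, 3, 4),
                c2 = c(NA, 2, 3, 4),
                c3 = c("A", "B", "C", "D"))

keep = complete.cases(df)   # TRUE only for rows without any NA
kept = df[keep, ]           # rows 3 and 4 remain
kept

kept2 = na.omit(df)         # drops the same rows as the subsetting above
```

Note that the surviving rows keep their original row labels (3 and 4 here).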