I am trying to remove outliers from a dataset, one group at a time, using the following data frame:
set.seed(1234)
library('mvoutlier')
x <- rnorm(10) # standard normal
x[1] <- x[1] * 10 # introduce outlier
y <- rnorm(10) # standard normal
y[4] <- y[4] * 10 # introduce outlier
w <- rnorm(10) # standard normal
w[9] <- w[9] * 10 # introduce outlier
grp = c(rep('a',3), rep('b',4), rep('c',3)) #Introduce groups
df = data.frame(grp, x,y,w)
The data frame looks like this:
> df
grp x y w
1 a -12.0706575 -0.4771927 0.1340882
2 a 0.2774292 -0.9983864 -0.4906859
3 a 1.0844412 -0.7762539 -0.4405479
4 b -2.3456977 0.6445882 0.4595894
5 b 0.4291247 0.9594941 -0.6937202
6 b 0.5060559 -0.1102855 -1.4482049
7 b -0.5747400 -0.5110095 0.5747557
8 c -0.5466319 -0.9111954 -1.0236557
9 c -0.5644520 -0.8371717 -0.1513830
10 c -0.8900378 2.4158352 -0.9359486
I wrote the following function to remove outliers from a data frame:
removeOutliers = function(data)
{
print("inside")
print(dim(data))
z = sign2(data[, colnames(data) != "grp"], makeplot=FALSE) #Score rows on all columns except grp
keep = z$wfinal01 != 0 #wfinal01 == 0 marks an outlier
return(data[keep, ]) #Logical index keeps all rows when no outliers are found
}
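I want to remove the outlier rows separately for each group (i.e. a, b, and c). That is, I need to pass the sub-data-frame for group a to the function above, collect the result, and then do the same for groups b and c.
I know the aggregate function can be used here, but I don't know how to make it work.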
aggregate( . ~ grp, data=df, removeOutliers)
Any help is appreciated. Thanks.
Answer 0 (score: 1)
Here is a quick way. .SD refers to all variables except the by variable (grp in this example).
#Set data as data.table object
require(data.table)
setDT(df)
#Apply sign2 within each group and get the row numbers where wfinal01 is 0 (the outliers)
tokeep <- df[ , sign2(.SD), by=grp][wfinal01==0,which=TRUE]
#Get rid of outliers
df[-tokeep]
The resulting dataset has no outliers:
grp x y w
1: a 0.2774292 -0.9983864 -0.4906859
2: a 1.0844412 -0.7762539 -0.4405479
3: b 0.4291247 0.9594941 -0.6937202
4: b 0.5060559 -0.1102855 -1.4482049
5: b -0.5747400 -0.5110095 0.5747557
6: c -0.5466319 -0.9111954 -1.0236557
7: c -0.5644520 -0.8371717 -0.1513830
If you want the outliers instead:
df[tokeep]
grp x y w
1: a -12.0706575 -0.4771927 0.1340882
2: b -2.3456977 0.6445882 0.4595894
3: c -0.8900378 2.4158352 -0.9359486
Answer 1 (score: 0)
Try:
for(i in unique(df$grp)) print(df[df$grp==i,])
grp x y w
1 a -12.0706575 -0.4771927 0.1340882
2 a 0.2774292 -0.9983864 -0.4906859
3 a 1.0844412 -0.7762539 -0.4405479
grp x y w
4 b -2.3456977 0.6445882 0.4595894
5 b 0.4291247 0.9594941 -0.6937202
6 b 0.5060559 -0.1102855 -1.4482049
7 b -0.5747400 -0.5110095 0.5747557
grp x y w
8 c -0.5466319 -0.9111954 -1.0236557
9 c -0.5644520 -0.8371717 -0.1513830
10 c -0.8900378 2.4158352 -0.9359486
for(i in unique(df$grp)) removeOutliers(df[df$grp==i,])
[1] "inside"
[1] 3 4
[1] "inside"
[1] 4 4
[1] "inside"
[1] 3 4
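The loop above calls removeOutliers for each group but does not keep what it returns, whereas the question asks for the results to be collected. One possible way to do that in base R is split, lapply and rbind, sketched below. The names filterByGroup, dropByMad and toy are hypothetical and exist only for this illustration. dropByMad is a simple median/MAD rule used as a stand-in so that the example runs without mvoutlier; it is not sign2 and will not necessarily flag the same rows.

```r
# Split-apply-combine: run a row-filtering function on each group and
# stack the surviving rows back into a single data frame.
# filterByGroup is a hypothetical helper, not part of any package.
filterByGroup <- function(data, group_col, filter_fn) {
  pieces <- split(data, data[[group_col]])  # one sub-data-frame per group
  kept <- lapply(pieces, filter_fn)         # filter each piece on its own
  result <- do.call(rbind, kept)            # recombine the filtered pieces
  rownames(result) <- NULL                  # drop the "a.1"-style row names
  result
}

# Hypothetical stand-in for removeOutliers (NOT sign2): drops a row when any
# numeric column lies more than `cutoff` MADs from that column's median.
dropByMad <- function(data, cutoff = 3) {
  num <- data[, sapply(data, is.numeric), drop = FALSE]
  flags <- lapply(num, function(v) abs(v - median(v)) > cutoff * mad(v))
  is_out <- Reduce(`|`, flags)              # TRUE if any column flags the row
  data[!is_out, , drop = FALSE]
}

# Made-up toy data with one obvious outlier per group
toy <- data.frame(
  grp = rep(c("a", "b"), each = 5),
  x = c(1, 2, 3, 4, 100, 10, 11, 12, 13, 14),
  y = c(5, 6, 7, 8, 9, 1, 2, 3, 4, -50)
)

cleaned <- filterByGroup(toy, "grp", dropByMad)
print(cleaned)
```

With the function from the question, the equivalent call would presumably be filterByGroup(df, "grp", removeOutliers), assuming df is still a plain data frame and has not been converted with setDT.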