R: Removing outliers from a data frame with two identifiers using ddply

Time: 2018-06-08 20:14:31

Tags: r plyr outliers

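First, I should say that I am not very familiar with R. I have a large long-format data frame, like df below, with 3 columns: Group, ID, and dat. I want to remove the outliers within each Group-ID combination (or, more precisely, replace them with the mean).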

Group = c("1","1","2","2","3","3","1","1","2","2","3","3","1","1","2","2","3","3","1","1","2","2","3","3")
ID = c("Eb","Eb","Eb","Eb","Eb","Eb","Sd","Sd","Sd","Sd","Sd","Sd","Re","Re","Re","Re","Re","Re","Tf","Tf","Tf","Tf","Tf","Tf")
dat = c(2,3,4,5,6,7,8,9,1010,11,12,13,1,2,3,-10000,5,6,4,3,2,7,6666,5)
df = data.frame(Group,ID,dat)

My basic approach, which does not work, is shown below (I have tried several iterations of this code):

library(outliers)
library(plyr)
# Function to remove outliers
RmOurliFUN = function(x){
                rm.outlier(x$dat, fill = TRUE)
}
# splitting data based on first Group, and then ID to apply the outlier removal
GroupSplit = function(x){ddply(x,"ID",RmOurliFUN)}
df2 = ddply(df1, "Group", GroupSplit)

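I get various error messages, but they usually say that the argument is not numeric or logical. I am fairly sure I am not referencing the dat column correctly inside the nested functions. How can I perform an operation like this? I am open to any suggestions.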

1 Answer:

Answer 0 (score: 1)

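To remove outliers within each unique Group+ID combination, you can pass the function directly in the call to ddply, grouping on both columns at once, and then reshape the result back to long format: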

library(outliers)
library(plyr)
library(reshape2)

#Make some new categories to have enough values for outlier detection
Group<-rep(c("a", "b"), each=12)
ID<-rep(c("c", "d"), each=6)
dat = c(2,3,4,5,6,7,8,9,1010,11,12,13,1,2,3,-10000,5,6,4,3,2,7,6666,5)
df1 = data.frame(Group,ID,dat)

df2<-ddply(df1, c("Group", "ID"), function(x) rm.outlier(x$dat, fill=TRUE))

#reshape and order the data
res<-melt(df2, id.vars=c("Group", "ID"), value.name = "dat")  
res<-arrange(res, Group, ID)[,-3]
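
For comparison, here is a minimal base R sketch that keeps the data in long format and avoids the reshape step. It is not the method from the answer: `replace_extreme` is a hypothetical helper written for this illustration, intended only to approximate the "replace the most extreme value with the mean of the rest" idea, and `dat_clean` is an assumed column name.

```r
# Rebuild the example data from the answer (ID is written out in full here
# instead of relying on recycling).
Group <- rep(c("a", "b"), each = 12)
ID    <- rep(rep(c("c", "d"), each = 6), times = 2)
dat   <- c(2,3,4,5,6,7,8,9,1010,11,12,13,1,2,3,-10000,5,6,4,3,2,7,6666,5)
df1   <- data.frame(Group, ID, dat)

# Hypothetical helper (not part of any package): replace the single value
# farthest from the group mean with the mean of the remaining values.
replace_extreme <- function(x) {
  if (length(x) < 3) return(x)        # too few values to judge; leave as-is
  i <- which.max(abs(x - mean(x)))    # index of the most extreme value
  x[i] <- mean(x[-i])                 # fill with the mean of the others
  x
}

# ave() applies the helper within each Group + ID combination and returns
# the results in the original row order, so no melt() or arrange() is needed.
df1$dat_clean <- ave(df1$dat, df1$Group, df1$ID, FUN = replace_extreme)
```

Because the helper always picks the most extreme value, it also changes one value in groups that have no obvious outlier (the a/c group here), so adding a threshold before replacing may be worth considering for real data.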