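Function in R that uses multiple data sources

Time: 2013-01-28 13:15:10

Tags: r function

I have many different tables, and I would like to write a function in R that works on them, where:

Table 1: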

          coordinates var1.pred  var1.var observed    residual      zscore fold
1  (2579410, 1079720)  5.057024 0.4325275    5.468  0.41097625  0.62489903    1
2  (2579330, 1079730)  5.329797 0.3945041    4.498 -0.83179667 -1.32431534    2
3  (2579260, 1079770)  4.788211 0.5576228    5.114  0.32578861  0.43628035    3
4  (2579930, 1080030)  5.067753 0.4972365    4.764 -0.30375347 -0.43076434    4
5  (2579700, 1079770)  5.116632 0.5792768    4.626 -0.49063190 -0.64463327    5
6  (2579540, 1079640)  4.865667 0.6122453    6.522  1.65633254  2.11682434    6
7  (2579860, 1079880)  5.139779 0.4655840    4.856 -0.28377887 -0.41589245    7

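If the value of 'observed' falls outside the tolerance defined by the two values below, it should be flagged as an outlier:

var1.pred+(1.96*sqrt(var1.var))
var1.pred-(1.96*sqrt(var1.var))
In other words,

      if   
   var1.pred-(1.96*sqrt(var1.var)) < 'observed' <  var1.pred+(1.96*sqrt(var1.var))

the result is normal; otherwise the result is an outlier.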

Also, the column names are the same in every table, and the tables are named 1, 2, 3, ...

 dat <- structure(list(coordinates = structure(c(3L, 2L, 1L, 7L, 5L,                                
     4L, 6L), .Label = c("(2579260, 1079770)", "(2579330, 1079730)",                                
     "(2579410, 1079720)", "(2579540, 1079640)", "(2579700, 1079770)",                              
     "(2579860, 1079880)", "(2579930, 1080030)"), class = "factor"),                                
         var1.pred = c(5.057024, 5.329797, 4.788211, 5.067753, 5.116632,                            
         4.865667, 5.139779), var1.var = c(0.4325275, 0.3945041, 0.5576228,                         
         0.4972365, 0.5792768, 0.6122453, 0.465584), observed = c(5.468,                            
         4.498, 5.114, 4.764, 4.626, 6.522, 4.856), residual = c(0.41097625,                        
         -0.83179667, 0.32578861, -0.30375347, -0.4906319, 1.65633254,                              
         -0.28377887), zscore = c(0.62489903, -1.32431534, 0.43628035,                              
         -0.43076434, -0.64463327, 2.11682434, -0.41589245), fold = 1:7), .Names = c("coordinates", 
     "var1.pred", "var1.var", "observed", "residual", "zscore", "fold"                              
     ), row.names = c(NA, -7L), class = "data.frame")  

3 answers:

Answer 0 (score: 5)

This should work:

dat$outlier = with(as.data.frame(dat), 
                   ifelse(observed > (var1.pred + (.95*var1.var)) | # | = OR
                          observed < (var1.pred - (.95*var1.var)),
             "outlier", "normal"))

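My code differs slightly from your description, because I check whether the value falls outside the range rather than inside it. Running the code above on the sample data gives: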

> dat
         coordinates var1.pred  var1.var observed   residual     zscore fold
1 (2579410, 1079720)  5.057024 0.4325275    5.468  0.4109762  0.6248990    1
2 (2579330, 1079730)  5.329797 0.3945041    4.498 -0.8317967 -1.3243153    2
3 (2579260, 1079770)  4.788211 0.5576228    5.114  0.3257886  0.4362803    3
4 (2579930, 1080030)  5.067753 0.4972365    4.764 -0.3037535 -0.4307643    4
5 (2579700, 1079770)  5.116632 0.5792768    4.626 -0.4906319 -0.6446333    5
6 (2579540, 1079640)  4.865667 0.6122453    6.522  1.6563325  2.1168243    6
7 (2579860, 1079880)  5.139779 0.4655840    4.856 -0.2837789 -0.4158925    7
  outlier
1 outlier                                                                   
2 outlier                                                                   
3  normal                                                                   
4  normal                                                                   
5  normal                                                                   
6 outlier                                                                   
7  normal 
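
Note that the code above uses 0.95 * var1.var as the tolerance, whereas the question specifies 1.96 * sqrt(var1.var), which is why rows 1 and 2 are flagged here. If the exact bounds from the question are needed, the same pattern can be adapted. A minimal sketch, assuming the column names shown above; the small `dat` is a cut-down stand-in built from rows 1, 2 and 6 of the example, and `half_width` is only an illustrative name:

```r
# Cut-down stand-in for the example data (rows 1, 2 and 6 only).
dat <- data.frame(
  var1.pred = c(5.057024, 5.329797, 4.865667),
  var1.var  = c(0.4325275, 0.3945041, 0.6122453),
  observed  = c(5.468, 4.498, 6.522)
)

dat$outlier <- with(dat, {
  # Half-width of the interval as written in the question.
  half_width <- 1.96 * sqrt(var1.var)
  ifelse(observed > var1.pred + half_width |
           observed < var1.pred - half_width,
         "outlier", "normal")
})

dat$outlier
```

With these bounds only the last of the three rows (row 6 of the original table) should come out as an outlier.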

Answer 1 (score: 5)

Paul's answer is fine; this is just a slightly different suggestion.

> dat
         coordinates var1.pred  var1.var observed   residual     zscore fold
1 (2579410, 1079720)  5.057024 0.4325275    5.468  0.4109762  0.6248990    1
2 (2579330, 1079730)  5.329797 0.3945041    4.498 -0.8317967 -1.3243153    2
3 (2579260, 1079770)  4.788211 0.5576228    5.114  0.3257886  0.4362803    3
4 (2579930, 1080030)  5.067753 0.4972365    4.764 -0.3037535 -0.4307643    4
5 (2579700, 1079770)  5.116632 0.5792768    4.626 -0.4906319 -0.6446333    5
6 (2579540, 1079640)  4.865667 0.6122453    6.522  1.6563325  2.1168243    6
7 (2579860, 1079880)  5.139779 0.4655840    4.856 -0.2837789 -0.4158925    7

> dat$label <- ifelse(dat$observed < dat$var1.pred-(1.96*sqrt(dat$var1.var)) |  dat$observed > dat$var1.pred+(1.96*sqrt(dat$var1.var)), "outlier", "normal" )

> dat
         coordinates var1.pred  var1.var observed   residual     zscore fold   label
1 (2579410, 1079720)  5.057024 0.4325275    5.468  0.4109762  0.6248990    1  normal
2 (2579330, 1079730)  5.329797 0.3945041    4.498 -0.8317967 -1.3243153    2  normal
3 (2579260, 1079770)  4.788211 0.5576228    5.114  0.3257886  0.4362803    3  normal
4 (2579930, 1080030)  5.067753 0.4972365    4.764 -0.3037535 -0.4307643    4  normal
5 (2579700, 1079770)  5.116632 0.5792768    4.626 -0.4906319 -0.6446333    5  normal
6 (2579540, 1079640)  4.865667 0.6122453    6.522  1.6563325  2.1168243    6 outlier
7 (2579860, 1079880)  5.139779 0.4655840    4.856 -0.2837789 -0.4158925    7  normal

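Update: by the way, if you are looking for a function that does this, and since, as you mentioned, the column names are always the same, you could write the function as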

checkRange <- function(dat) {
  dat$label <- ifelse(dat$observed < dat$var1.pred-(1.96*sqrt(dat$var1.var)) |  dat$observed > dat$var1.pred+(1.96*sqrt(dat$var1.var)), "outlier", "normal" )
  return(dat)
}
> dat <- checkRange(dat)

> dat
         coordinates var1.pred  var1.var observed   residual     zscore fold   label
1 (2579410, 1079720)  5.057024 0.4325275    5.468  0.4109762  0.6248990    1  normal
2 (2579330, 1079730)  5.329797 0.3945041    4.498 -0.8317967 -1.3243153    2  normal
3 (2579260, 1079770)  4.788211 0.5576228    5.114  0.3257886  0.4362803    3  normal
4 (2579930, 1080030)  5.067753 0.4972365    4.764 -0.3037535 -0.4307643    4  normal
5 (2579700, 1079770)  5.116632 0.5792768    4.626 -0.4906319 -0.6446333    5  normal
6 (2579540, 1079640)  4.865667 0.6122453    6.522  1.6563325  2.1168243    6 outlier
7 (2579860, 1079880)  5.139779 0.4655840    4.856 -0.2837789 -0.4158925    7  normal
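
Since the question mentions several tables with identical column names, one way to label all of them is to keep the tables in a list and use lapply. This is only a sketch: the list `tables` and its two small data frames are made-up stand-ins (how the tables are actually stored is an assumption), `half_width` is an illustrative name, and the function is repeated so the snippet runs on its own.

```r
# Same check as above, repeated so this snippet is self-contained.
checkRange <- function(dat) {
  half_width <- 1.96 * sqrt(dat$var1.var)
  dat$label <- ifelse(dat$observed < dat$var1.pred - half_width |
                        dat$observed > dat$var1.pred + half_width,
                      "outlier", "normal")
  dat
}

# Hypothetical stand-ins for tables 1, 2, ...
tables <- list(
  "1" = data.frame(var1.pred = c(5.057024, 4.865667),
                   var1.var  = c(0.4325275, 0.6122453),
                   observed  = c(5.468, 6.522)),
  "2" = data.frame(var1.pred = 5.329797,
                   var1.var  = 0.3945041,
                   observed  = 4.498)
)

# Apply the check to every table; the result is again a named list.
labelled <- lapply(tables, checkRange)
labelled[["1"]]$label
```

Keeping the tables in a named list means the same call works however many tables there are, and each labelled table can still be looked up by its name.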

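Answer 2 (score: 0)

Hamed, there is good news: packages already exist that make this kind of work very easy. My personal favorite for this kind of job is the 'plyr' package. To add a [tolerances] column to the data frame, you can use the 'ddply' function (the input is a data frame and the output is a data frame, hence the 'dd' prefix in front of 'ply').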

library(plyr)
ddply(dat, .(fold), mutate,
      tolerances = ifelse(observed < (var1.pred - 0.95 * var1.var) |
                          observed > (var1.pred + 0.95 * var1.var),
                          "Outlier", "Normal"))

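I will explain each function argument for posterity, so you can adjust them as needed. We tell 'ddply' to take its data from the data frame [dat], and we use the [fold] column as the key. We have to tell 'ddply' what to use as the key, because it automatically groups together the rows that share a key value. Using [fold] ensures that we get a result for every row. Next, we use the 'mutate' function, which keeps all of the original data but adds the new columns we specify. Finally, we specify a new column [tolerances] with an 'ifelse' statement that checks whether the [observed] value is below the lower tolerance or above the upper tolerance. If true, the value is "Outlier"; if false, the value is "Normal".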
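
Because 'ifelse' is vectorized and already works row by row, the grouping step is not strictly needed for this particular check. As a sketch of the same idea without plyr, base R's 'transform' can add the column in one call. The three-row `dat` is a cut-down stand-in for the example data (rows 1, 3 and 6), and the 0.95 * var1.var tolerance is kept from the call above:

```r
# Cut-down stand-in for the example data (rows 1, 3 and 6 only).
dat <- data.frame(
  var1.pred = c(5.057024, 4.788211, 4.865667),
  var1.var  = c(0.4325275, 0.5576228, 0.6122453),
  observed  = c(5.468, 5.114, 6.522),
  fold      = 1:3
)

# transform() evaluates the expression with the columns of dat in scope
# and returns the data frame with the new column appended.
dat <- transform(dat,
  tolerances = ifelse(observed < (var1.pred - 0.95 * var1.var) |
                        observed > (var1.pred + 0.95 * var1.var),
                      "Outlier", "Normal"))

dat$tolerances
```

Depending on the R version, the new column may be stored as a factor rather than a character vector, so wrap it in as.character() if plain strings are needed.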