I have many different tables, and I want to write a function in R that does the following.
Table 1:
coordinates var1.pred var1.var observed residual zscore fold
1 (2579410, 1079720) 5.057024 0.4325275 5.468 0.41097625 0.62489903 1
2 (2579330, 1079730) 5.329797 0.3945041 4.498 -0.83179667 -1.32431534 2
3 (2579260, 1079770) 4.788211 0.5576228 5.114 0.32578861 0.43628035 3
4 (2579930, 1080030) 5.067753 0.4972365 4.764 -0.30375347 -0.43076434 4
5 (2579700, 1079770) 5.116632 0.5792768 4.626 -0.49063190 -0.64463327 5
6 (2579540, 1079640) 4.865667 0.6122453 6.522 1.65633254 2.11682434 6
7 (2579860, 1079880) 5.139779 0.4655840 4.856 -0.28377887 -0.41589245 7
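If the value of 'observed' falls outside the tolerance bounded by the following two values, flag it as an outlier:
var1.pred+(1.96*sqrt(var1.var))
var1.pred-(1.96*sqrt(var1.var))
In other words:
if
var1.pred-(1.96*sqrt(var1.var)) < 'observed' < var1.pred+(1.96*sqrt(var1.var))
the result is normal; otherwise it is an outlier.
Also, the column names are the same in every table, and the tables are named 1, 2, 3, ...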
dat <- structure(list(coordinates = structure(c(3L, 2L, 1L, 7L, 5L,
4L, 6L), .Label = c("(2579260, 1079770)", "(2579330, 1079730)",
"(2579410, 1079720)", "(2579540, 1079640)", "(2579700, 1079770)",
"(2579860, 1079880)", "(2579930, 1080030)"), class = "factor"),
var1.pred = c(5.057024, 5.329797, 4.788211, 5.067753, 5.116632,
4.865667, 5.139779), var1.var = c(0.4325275, 0.3945041, 0.5576228,
0.4972365, 0.5792768, 0.6122453, 0.465584), observed = c(5.468,
4.498, 5.114, 4.764, 4.626, 6.522, 4.856), residual = c(0.41097625,
-0.83179667, 0.32578861, -0.30375347, -0.4906319, 1.65633254,
-0.28377887), zscore = c(0.62489903, -1.32431534, 0.43628035,
-0.43076434, -0.64463327, 2.11682434, -0.41589245), fold = 1:7), .Names = c("coordinates",
"var1.pred", "var1.var", "observed", "residual", "zscore", "fold"
), row.names = c(NA, -7L), class = "data.frame")
Answer 0 (score: 5)
This should work:
dat$outlier = with(as.data.frame(dat),
ifelse(observed > (var1.pred + (.95*var1.var)) | # | = OR
observed < (var1.pred - (.95*var1.var)),
"outlier", "normal"))
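My code differs slightly from your description, because I check whether the value is outside the range rather than inside it. Note that it also uses 0.95 * var1.var as the half-width instead of the 1.96 * sqrt(var1.var) given in the question, so the bounds are narrower. Running the above on the example data: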
> dat
coordinates var1.pred var1.var observed residual zscore fold
1 (2579410, 1079720) 5.057024 0.4325275 5.468 0.4109762 0.6248990 1
2 (2579330, 1079730) 5.329797 0.3945041 4.498 -0.8317967 -1.3243153 2
3 (2579260, 1079770) 4.788211 0.5576228 5.114 0.3257886 0.4362803 3
4 (2579930, 1080030) 5.067753 0.4972365 4.764 -0.3037535 -0.4307643 4
5 (2579700, 1079770) 5.116632 0.5792768 4.626 -0.4906319 -0.6446333 5
6 (2579540, 1079640) 4.865667 0.6122453 6.522 1.6563325 2.1168243 6
7 (2579860, 1079880) 5.139779 0.4655840 4.856 -0.2837789 -0.4158925 7
outlier
1 outlier
2 outlier
3 normal
4 normal
5 normal
6 outlier
7 normal
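The two half-widths can be compared side by side. The following is only a minimal sketch, assuming the three relevant columns from the question; the names `half_answer`, `half_question`, `dev`, `flag_answer` and `flag_question` are introduced here for illustration and do not appear in the original code.

```r
# Only the three columns the rule needs (values copied from the question).
dat <- data.frame(
  var1.pred = c(5.057024, 5.329797, 4.788211, 5.067753, 5.116632, 4.865667, 5.139779),
  var1.var  = c(0.4325275, 0.3945041, 0.5576228, 0.4972365, 0.5792768, 0.6122453, 0.465584),
  observed  = c(5.468, 4.498, 5.114, 4.764, 4.626, 6.522, 4.856)
)

# Half-width used in the answer above vs. the one described in the question.
half_answer   <- 0.95 * dat$var1.var
half_question <- 1.96 * sqrt(dat$var1.var)

# abs(observed - pred) > half-width is the same test as "outside the interval".
dev <- abs(dat$observed - dat$var1.pred)
flag_answer   <- ifelse(dev > half_answer,   "outlier", "normal")
flag_question <- ifelse(dev > half_question, "outlier", "normal")

print(data.frame(dev, half_answer, half_question, flag_answer, flag_question))
```

With the 0.95 * var1.var half-width, rows 1, 2 and 6 are flagged; with 1.96 * sqrt(var1.var), only row 6 is.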
Answer 1 (score: 5)
Paul's answer is fine; here is just a slightly different suggestion.
> dat
coordinates var1.pred var1.var observed residual zscore fold
1 (2579410, 1079720) 5.057024 0.4325275 5.468 0.4109762 0.6248990 1
2 (2579330, 1079730) 5.329797 0.3945041 4.498 -0.8317967 -1.3243153 2
3 (2579260, 1079770) 4.788211 0.5576228 5.114 0.3257886 0.4362803 3
4 (2579930, 1080030) 5.067753 0.4972365 4.764 -0.3037535 -0.4307643 4
5 (2579700, 1079770) 5.116632 0.5792768 4.626 -0.4906319 -0.6446333 5
6 (2579540, 1079640) 4.865667 0.6122453 6.522 1.6563325 2.1168243 6
7 (2579860, 1079880) 5.139779 0.4655840 4.856 -0.2837789 -0.4158925 7
> dat$label <- ifelse(dat$observed < dat$var1.pred-(1.96*sqrt(dat$var1.var)) | dat$observed > dat$var1.pred+(1.96*sqrt(dat$var1.var)), "outlier", "normal" )
> dat
coordinates var1.pred var1.var observed residual zscore fold label
1 (2579410, 1079720) 5.057024 0.4325275 5.468 0.4109762 0.6248990 1 normal
2 (2579330, 1079730) 5.329797 0.3945041 4.498 -0.8317967 -1.3243153 2 normal
3 (2579260, 1079770) 4.788211 0.5576228 5.114 0.3257886 0.4362803 3 normal
4 (2579930, 1080030) 5.067753 0.4972365 4.764 -0.3037535 -0.4307643 4 normal
5 (2579700, 1079770) 5.116632 0.5792768 4.626 -0.4906319 -0.6446333 5 normal
6 (2579540, 1079640) 4.865667 0.6122453 6.522 1.6563325 2.1168243 6 outlier
7 (2579860, 1079880) 5.139779 0.4655840 4.856 -0.2837789 -0.4158925 7 normal
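Update: By the way, if you are looking for a function that does this, and since, as you mentioned, the column names are always the same, you can write the function as
checkRange <- function(dat) {
  # Flag rows whose observed value lies outside var1.pred +/- 1.96 * sqrt(var1.var)
  dat$label <- ifelse(dat$observed < dat$var1.pred-(1.96*sqrt(dat$var1.var)) | dat$observed > dat$var1.pred+(1.96*sqrt(dat$var1.var)), "outlier", "normal" )
  return(dat)
}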
> dat <- checkRange(dat)
> dat
coordinates var1.pred var1.var observed residual zscore fold label
1 (2579410, 1079720) 5.057024 0.4325275 5.468 0.4109762 0.6248990 1 normal
2 (2579330, 1079730) 5.329797 0.3945041 4.498 -0.8317967 -1.3243153 2 normal
3 (2579260, 1079770) 4.788211 0.5576228 5.114 0.3257886 0.4362803 3 normal
4 (2579930, 1080030) 5.067753 0.4972365 4.764 -0.3037535 -0.4307643 4 normal
5 (2579700, 1079770) 5.116632 0.5792768 4.626 -0.4906319 -0.6446333 5 normal
6 (2579540, 1079640) 4.865667 0.6122453 6.522 1.6563325 2.1168243 6 outlier
7 (2579860, 1079880) 5.139779 0.4655840 4.856 -0.2837789 -0.4158925 7 normal
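Since the question mentions many tables with identical column names, one possible way to apply the function to all of them is to keep the tables in a named list and use `lapply`. This is only a sketch: the list `tables` and its two small data frames are hypothetical stand-ins built from a few rows of the example data, and `checkRange` is restated (with a helper variable `half`) so the block runs on its own.

```r
# checkRange() restated so this block is self-contained; same rule as above.
checkRange <- function(dat) {
  half <- 1.96 * sqrt(dat$var1.var)
  dat$label <- ifelse(dat$observed < dat$var1.pred - half |
                      dat$observed > dat$var1.pred + half,
                      "outlier", "normal")
  dat
}

# Hypothetical stand-ins for the real tables "1", "2", ... (assumption).
tables <- list(
  `1` = data.frame(var1.pred = c(5.057024, 4.865667),
                   var1.var  = c(0.4325275, 0.6122453),
                   observed  = c(5.468, 6.522)),
  `2` = data.frame(var1.pred = c(5.329797, 5.139779),
                   var1.var  = c(0.3945041, 0.465584),
                   observed  = c(4.498, 4.856))
)

# Apply the same check to every table; the list names are preserved.
labelled <- lapply(tables, checkRange)
print(labelled[["1"]])
```

Keeping the tables in a list avoids repeating the call once per table, and the result for any one table can be pulled out by name.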
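Answer 2 (score: 0)
Hamed, there is good news. Packages already exist that make this kind of work very easy. My personal favorite for this kind of job is the 'plyr' package. To add a [tolerances] column to your data frame, you can use the 'ddply' function (the input is a data frame and the output is a data frame, hence the 'dd' prefix to 'ply').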
library(plyr)
ddply(dat, .(fold), mutate, tolerances=ifelse(observed < (var1.pred - 0.95 * var1.var)|observed > (var1.pred + 0.95 * var1.var),"Outlier","Normal"))
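I'll explain each function argument for posterity, so you can adapt them as needed. We tell 'ddply' to take its data from the data frame [dat], and we use the [fold] column as the key. We have to tell 'ddply' what to use as the key because it automatically tries to aggregate the data that share a key. Using [fold] ensures we get a result for every row. Next, we use the 'mutate' function, which keeps all of the original data but adds the new column we specify. Finally, we specify a new column, [tolerances], with an 'ifelse' statement that checks whether the [observed] value is below the lower tolerance or above the upper tolerance. If true, the value is "Outlier"; if false, it is "Normal".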
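Because 'ifelse' is vectorized and [fold] is unique per row in the example data, the same column can also be produced without grouping. The following is a minimal base-R sketch using `transform`, offered as an alternative under that assumption rather than as what the 'plyr' call does internally; it keeps the 0.95 * var1.var half-width from the call above, and the result name `out` is introduced here for illustration.

```r
# Only the columns the rule needs, plus fold (values copied from the question).
dat <- data.frame(
  var1.pred = c(5.057024, 5.329797, 4.788211, 5.067753, 5.116632, 4.865667, 5.139779),
  var1.var  = c(0.4325275, 0.3945041, 0.5576228, 0.4972365, 0.5792768, 0.6122453, 0.465584),
  observed  = c(5.468, 4.498, 5.114, 4.764, 4.626, 6.522, 4.856),
  fold      = 1:7
)

# transform() evaluates the expression with dat's columns in scope and
# returns a copy of dat with the new column appended.
out <- transform(
  dat,
  tolerances = ifelse(observed < (var1.pred - 0.95 * var1.var) |
                      observed > (var1.pred + 0.95 * var1.var),
                      "Outlier", "Normal")
)
print(out)
```

To use the bounds from the question instead, replace `0.95 * var1.var` with `1.96 * sqrt(var1.var)` in both comparisons.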