Question

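This is definitely related to another question here, but that one got no response, I suspect because I made it too complicated. So I am asking again in a simplified form. If that is not acceptable, I am happy to be told off.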

My core problem is that I want to create data frames that contain only the outliers of each column. The data frame looks like this:

 chr   leftPos         TBGGT     12_try      324Gtt       AMN2
  1     24352           34         43          19         43
  1     53534           2          1           -1         -9
  2      34            -15         7           -9         -18
  3     3443           -100        -4          4          -9
  3     3445           -100        -1          6          -1
  3     3667            5          -5          9           5
  3     7882           -8          -9          1           3

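I want to compute the upper and lower limits for each column (starting from the third), exclude all rows that fall within the limits so that only the outliers remain, and end up with a data frame like the one below (one for each column). That data frame is then passed to the next bit of code (in a loop), but for simplicity I won't go into that.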

chr   leftPos         TBGGT
 2      34            -15        
 3     3443           -100       
 3     3445           -100

My code so far:

alpha = 1.5

 f1 <- function(df, ZCol){

  # Determine the UL and LL and then generate the Zoutliers
  UL = median(ZCol, na.rm = TRUE) + alpha*IQR(ZCol, na.rm = TRUE)
  LL = median(ZCol, na.rm = TRUE) - alpha*IQR(ZCol, na.rm = TRUE)
  Zoutliers <- which(ZCol > UL | ZCol < LL)}

But this only gives me the outliers, without the chr and leftPos values associated with them. How do I get those?

Answer 1

Perhaps this:

DF <- read.table(text=" chr   leftPos         TBGGT     12_try      324Gtt       AMN2
  1     24352           34         43          19         43
  1     53534           2          1           -1         -9
  2      34            -15         7           -9         -18
  3     3443           -100        -4          4          -9
  3     3445           -100        -1          6          -1
  3     3667            5          -5          9           5
  3     7882           -8          -9          1           3", header = TRUE)

#fix your function as explained by @Thilo
#also make alpha a parameter with default value
f1 <- function(ZCol, alpha = 1.5){  
  UL <- median(ZCol, na.rm = TRUE) + alpha*IQR(ZCol, na.rm = TRUE)
  LL <- median(ZCol, na.rm = TRUE) - alpha*IQR(ZCol, na.rm = TRUE)
  ZCol > UL | ZCol < LL
}

#loop over the columns and subset with the function's logical return values
outlist <- lapply(3:6, function(i, df) {
  df[f1(df[,i]), c(1:2, i)]  
}, df = DF)


#[[1]]
#  chr leftPos TBGGT
#4   3    3443  -100
#5   3    3445  -100
#
#[[2]]
#  chr leftPos X12_try
#1   1   24352      43
#
#[[3]]
#  chr leftPos X324Gtt
#1   1   24352      19
#3   2      34      -9
#
#[[4]]
#  chr leftPos AMN2
#1   1   24352   43
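
The X prefix in X12_try and X324Gtt appears because read.table, by default (check.names = TRUE), rewrites headers that are not syntactically valid R names. As a minimal sketch, assuming you want to keep the original headers and look results up by column name rather than by position, the same approach could be written as follows (value_cols is just an illustrative helper name, not part of the original answer):

```r
# Same sample data as above; check.names = FALSE keeps headers such as "12_try"
DF <- read.table(text = "chr leftPos TBGGT 12_try 324Gtt AMN2
1 24352 34 43 19 43
1 53534 2 1 -1 -9
2 34 -15 7 -9 -18
3 3443 -100 -4 4 -9
3 3445 -100 -1 6 -1
3 3667 5 -5 9 5
3 7882 -8 -9 1 3", header = TRUE, check.names = FALSE)

# Same outlier rule as f1 above: median +/- alpha * IQR
f1 <- function(ZCol, alpha = 1.5) {
  UL <- median(ZCol, na.rm = TRUE) + alpha * IQR(ZCol, na.rm = TRUE)
  LL <- median(ZCol, na.rm = TRUE) - alpha * IQR(ZCol, na.rm = TRUE)
  ZCol > UL | ZCol < LL
}

# Every column after chr and leftPos (illustrative helper name)
value_cols <- names(DF)[-(1:2)]

# One data frame per value column, with each list element named after its column
outlist <- setNames(
  lapply(value_cols, function(col) DF[f1(DF[[col]]), c("chr", "leftPos", col)]),
  value_cols
)

outlist[["TBGGT"]]
```

Naming the list elements means the later loop can refer to outlist[["AMN2"]] instead of remembering that AMN2 is the fourth element.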

Answer 2

You basically provided the answer yourself; you just missed the final link.

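Your function calculates the limits you defined for outliers. We change the result so that it returns a boolean vector that is TRUE wherever a value is an outlier: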

isOutlier <- function(values) {
  # Determine the UL and LL
  UL <- median(values, na.rm = TRUE) + alpha*IQR(values, na.rm = TRUE)
  LL <- median(values, na.rm = TRUE) - alpha*IQR(values, na.rm = TRUE)
  values > UL | values < LL  # Return a boolean vector that can be used as a filter later on. 
}
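
To see what the function returns, here is a small check on the TBGGT values from the question. It is only a sketch: it assumes alpha is defined globally as in the question, and the vector name flags is purely illustrative.

```r
alpha <- 1.5  # global, as in the question

isOutlier <- function(values) {
  UL <- median(values, na.rm = TRUE) + alpha * IQR(values, na.rm = TRUE)
  LL <- median(values, na.rm = TRUE) - alpha * IQR(values, na.rm = TRUE)
  values > UL | values < LL
}

# TBGGT column from the question's sample data
TBGGT <- c(34, 2, -15, -100, -100, 5, -8)

# median = -8 and IQR = 61, so the limits are -99.5 and 83.5
flags <- isOutlier(TBGGT)
flags
# FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE
```

Only the two -100 values fall outside the limits, so only those positions are TRUE.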

Now you can simply use this function to subset your data frame, i.e.

AMN2.outliers <- subset(df, isOutlier(AMN2))

or

AMN2.outliers <- df[isOutlier(df$AMN2), ]

whichever suits you better. Of course you could also include this line in the function, but for readability I prefer the solution above.
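
One difference between the two forms shows up when a column contains missing values, which the na.rm = TRUE arguments suggest may happen. subset() treats an NA condition as FALSE, whereas logical indexing with [ returns an all-NA row for each NA. A minimal sketch with made-up toy data (toy, with_na_row, and clean are hypothetical names, not from the question):

```r
alpha <- 1.5  # global, as in the question

isOutlier <- function(values) {
  UL <- median(values, na.rm = TRUE) + alpha * IQR(values, na.rm = TRUE)
  LL <- median(values, na.rm = TRUE) - alpha * IQR(values, na.rm = TRUE)
  values > UL | values < LL   # NA wherever values is NA
}

# Hypothetical toy data with one missing measurement
toy <- data.frame(chr     = c(1, 1, 2, 3, 3),
                  leftPos = c(10, 20, 30, 40, 50),
                  AMN2    = c(1, 2, NA, 3, 100))

with_na_row <- toy[isOutlier(toy$AMN2), ]         # keeps an extra all-NA row
clean       <- toy[which(isOutlier(toy$AMN2)), ]  # which() drops the NA
via_subset  <- subset(toy, isOutlier(AMN2))       # subset() also drops the NA
```

Wrapping the condition in which() is one way to make the bracket form behave like subset() here.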

Also: I suggest using the <- operator instead of = for assignment. See here.

Creating a data frame from outliers in a function
