Question

我的数据框如下：

 chr   leftPos         TBGGT     12_try      324Gtt       AMN2
  1     24352           34         43          19         43
  1     53534           2          1           -1         -9
  2      34            -15         7           -9         -18
  3     3443           -100        -4          4          -9
  3     3445           -100        -1          6          -1
  3     3667            5          -5          9           5
  3     7882           -8          -9          1           3

我必须创建一个循环：

a）从第三列开始计算每列的上限和下限（UL和LL） b）仅包括UL和LL（Zoutliers）之外的行 c）然后计算Zoutlier与前一个相同方向（即正或负）的行数，以及相同chr 的后续行。

因此输出为：

ZScore1 TBGGT 12_try 324Gtt AMN2 nrow 4 6 4 4

到目前为止，我的代码如下：

library(data.table)#v1.9.5 f1 <- function(df, ZCol){ #A) Determine the UL and LL and then generate the Zoutliers UL = median(ZCol, na.rm = TRUE) + alpha*IQR(ZCol, na.rm = TRUE) LL = median(ZCol, na.rm = TRUE) - alpha*IQR(ZCol, na.rm = TRUE) Zoutliers <- which(ZCol > UL | ZCol < LL) #B) Exclude Zoutliers per chr if same direction as previous or subsequent row na.omit(as.data.table(df)[, {tmp = sign(eval(as.name(ZCol))) .SD[tmp==shift(tmp) | tmp==shift(tmp, type='lead')]}, by=chr])[, list(.N)]} nm1 <- paste0(names(df) setnames(do.call(cbind,lapply(nm1, function(x) f1(df, x))), nm1)[]

代码从各个地方拼凑而成。我遇到的问题是组合代码的A）部分和B部分以获得我想要的输出

Answer 1

你能试试这个功能吗？我不确定alpha是什么，所以我无法重现预期的输出并将其作为变量包含在函数中。

# read your data per copy&paste
d <- read.table("clipboard",header = T)
# or as in Frank comment mentioned solution via fread
d <- data.table::fread("chr   leftPos         TBGGT     12_try      324Gtt       AMN2
                                     1     24352           34         43          19         43
                                     1     53534           2          1           -1         -9
                                     2      34            -15         7           -9         -18
                                     3     3443           -100        -4          4          -9
                                     3     3445           -100        -1          6          -1
                                     3     3667            5          -5          9           5
                                     3     7882           -8          -9          1           3")


# set up the function
foo <- function(x, alpha, chr){
  # your code for task a) and b)
  UL = median(x, na.rm = TRUE) + alpha*IQR(x, na.rm = TRUE)
  LL = median(x, na.rm = TRUE) - alpha*IQR(x, na.rm = TRUE)
  Zoutliers <- which(x > UL | x < LL)
  # part (c
  # factor which specifies the direction. 0 values are set as positives
  pos_neg <- ifelse(x[Zoutliers] >= 0, "positive", "negative")
  # count the occurrence per chromosome and direction.
  aggregate(x[Zoutliers], list(chr[Zoutliers], pos_neg), length)
}

# apply over the columns and get a list of dataframes with number of outliers per chr and direction.
apply(d[,3:ncol(d)], 2, foo, 0.95, d$chr)

仅包含数据框中每列的异常值

1 个答案: