Question

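I am working with some biological data, which is basically a long list (a few million values) of integers, each indicating how well that position in the genome performs. Here is a graphical example of a data set: [image]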

I would like to look for "valleys" in this data, that is, regions that are significantly lower than their surroundings.

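Note that the size of the valleys I am looking for is not really known; it may range from 50 bases to a few thousand. Defining what counts as a valley is of course one of the things I am struggling with, but the earlier examples are relatively easy for me: [image]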

What kind of paradigm would you recommend for finding those valleys? I mainly program in Perl and R.

Thanks!

Answer 1

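We do peak detection (and valley detection) using a running median and the median absolute deviation (MAD). You can specify how large a deviation from the running median counts as a peak.

In a next step, we use a binomial model to check which regions contain more "extreme" values than expected. This model (basically a score test) yields "peak regions" instead of single peaks. Turning it around to get "valley regions" is trivial.

The running median is calculated using the function weightedMedian from the package aroma.light. We use the embed() function to make a list of "windows" and apply a kernel function over it.

Application of the weighted median: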

center <- apply(embed(tmp,wdw),1,weightedMedian,w=weights,na.rm=T)

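Here tmp is the temporary data vector and wdw is the window size (it has to be odd). tmp is constructed by adding (wdw-1)/2 NA values at each side of the data vector. The weights are constructed using a custom function. For the MAD we use the same procedure, but on diff(data) instead of the data itself.

Running example code: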
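The padding and windowing step can be sketched as follows. This is only an illustration under stated assumptions: it uses the unweighted base-R median instead of weightedMedian so that it runs without extra packages, and the toy vector x and the window size 3 are made up.

```r
# Toy data; 'x' and the window size are illustrative assumptions.
x   <- c(5, 6, 5, 1, 1, 2, 6, 5, 7)
wdw <- 3                               # window size, must be odd
pad <- rep(NA, (wdw - 1) / 2)          # (wdw-1)/2 NA values per side
tmp <- c(pad, x, pad)                  # padded temporary vector

# embed() returns one row per window, so there are length(x) rows.
windows <- embed(tmp, wdw)

# Unweighted running median; the NA padding is ignored at the edges.
center <- apply(windows, 1, median, na.rm = TRUE)
center
```

Because of the padding, center has the same length as x, so it can be compared with the data position by position.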

library(aroma.light)  # provided weightedMedian() when this was written
# Assumption: on newer installs weightedMedian() may come from matrixStats instead.

# make.weights : function to make kernel weights (gaussian by default)
# n is the window size
make.weights <- function(n,
      type=c("gaussian","epanechnikov","biweight","triweight","cosinus")){
    type <- match.arg(type)
    x <- seq(-1,1,length.out=n)
    out <- switch(type,
          gaussian=(1/sqrt(2*pi)*exp(-0.5*(3*x)^2)),
          epanechnikov=0.75*(1-x^2),
          biweight=15/16*(1-x^2)^2,
          triweight=35/32*(1-x^2)^3,
          cosinus=pi/4*cos(x*pi/2)
          )
    out <- out/sum(out)*n   # rescale so the weights sum to n
    return(out)
}

# score.test : function to obtain a p-value based on the score test
# uses the normal approximation, but is still quite accurate when p0 is
# pretty small.
# This test is one-sided, and tests whether the observed proportion
# is bigger than the hypothesized proportion
score.test <- function(x,p0,w){
    if(missing(w)) w <- rep(1,length(x))
    keep <- !is.na(x)
    w <- w[keep]
    x <- x[keep]
    n <- length(x)          # count non-missing values only

    w <- w/sum(w)*n         # rescale so the weights sum to n

    phat <- sum(x*w)/n
    z <- (phat-p0)/sqrt(p0*(1-p0)/n)
    p <- pnorm(z,lower.tail=FALSE)
    return(p)
}

# embed.na is a modification of embed that pads x with n NA values
# at the beginning and at the end. Window size = 2n+1
embed.na <- function(x,n){
    extra <- rep(NA,n)
    x <- c(extra,x,extra)
    out <- embed(x,2*n+1)
    return(out)
}

# running.score : function to calculate the weighted p-value for the chance of being in
# a run of peaks. This chance is based on the weighted proportion of the neighbourhood
# the null hypothesis is calculated by taking the weighted proportion
# of detected peaks in the whole dataset.
# This lessens the need for adjusting parameters and makes the
# method more automatic.
# for a correct calculation, the weights have to sum up to n

running.score <- function(sel,n=20,w,p0){
    if(missing(w)) w<- rep(1,2*n+1)
    if(missing(p0)) p0 <- mean(sel,na.rm=TRUE)   # null hypothesis
    out <- apply(embed.na(sel,n),1,score.test,p0=p0,w=w)
    return(out)
}

# running.med : function to calculate the running median and mad
# for a dataset. Window size = 2n+1
running.med <- function(x,w,n,cte=1.4826){
    wdw <- 2*n+1
    if(missing(w)) w <- rep(1,wdw)

    center <- apply(embed.na(x,n),1,weightedMedian,w=w,na.rm=T)
    mad <- median(abs(x-center))*cte
    return(list(med=center,mad=mad))
}

##############################################
#
# Create series
set.seed(100)
n = 1000
series <- diffinv(rnorm(20000),lag=1)

peaks <- apply(embed.na(series,n),1,function(x) x[n+1] < quantile(x,probs=0.05,na.rm=T))

pweight <- make.weights(0.2*n+1)
p.val <- running.score(peaks,n=n/10,w=pweight)

plot(series,type="l")
points((1:length(series))[p.val<0.05],series[p.val<0.05],col="red")
points((1:length(series))[peaks],series[peaks],col="blue")

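The example code above is meant to find regions with large fluctuations rather than valleys. I adapted it a bit, but it is not optimal. On top of that, for series with more than 20000 values you need a lot of memory; I can no longer run it on my computer.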
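If memory is the bottleneck, one option is to compute the running median with stats::runmed(), which does not build the full window matrix that embed() creates. The following is a minimal sketch under stated assumptions, not the method used above: it is unweighted, it uses a single global MAD of the residuals, and the window size k and the cutoff of 2 MADs are arbitrary illustrative values.

```r
# Sketch: running median plus MAD threshold without a window matrix.
# 'k' and 'cutoff' are illustrative choices, not tuned values.
set.seed(100)
series <- diffinv(rnorm(20000), lag = 1)

k      <- 2001                                    # window size, must be odd
center <- as.numeric(runmed(series, k, endrule = "median"))
resid  <- series - center                         # deviation from running median
spread <- mad(resid)                              # scaled median absolute deviation
cutoff <- 2                                       # number of MADs below the median

# TRUE where the series lies clearly below its running median
low <- resid < -cutoff * spread
```

The flagged positions can be overlaid on the series with plot(series, type="l") followed by points(which(low), series[low], col="blue").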

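Alternatively, you could work with an approximation of the numerical first and second derivatives to define valleys. In your case this probably works even better. A pragmatic way of calculating the derivatives and the minima/maxima of the first derivative: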

#first derivative
f.deriv <- diff(lowess(series,f=n/length(series),delta=1)$y)
#second derivative
f.sec.deriv <- diff(f.deriv)
#minima and maxima defined by where f.sec.deriv changes sign :
minmax <- cumsum(rle(sign(f.sec.deriv))$lengths)

op <- par(mfrow=c(2,1))
plot(series,type="l")
plot(f.deriv,type="l")
points((1:length(f.deriv))[minmax],f.deriv[minmax],col="red")
par(op)
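To turn this into valley positions, one can look at where the first derivative of the smoothed curve changes from negative to positive, since those points are local minima of the curve itself. A minimal sketch, assuming the same simulated series and smoothing span as above; the names smooth and valley.idx are introduced here for illustration only:

```r
# Sketch: local minima of the smoothed curve are the points where the
# first derivative goes from negative to positive.
set.seed(100)
series <- diffinv(rnorm(20000), lag = 1)
n <- 1000                                   # smoothing span in data points

smooth  <- lowess(series, f = n / length(series), delta = 1)$y
f.deriv <- diff(smooth)                     # f.deriv[i] = smooth[i+1] - smooth[i]

s <- sign(f.deriv)
# s[i] < 0 and s[i+1] > 0 means smooth[i+1] is lower than both neighbours
valley.idx <- which(s[-length(s)] < 0 & s[-1] > 0) + 1
```

Exactly flat stretches (derivative equal to zero) are skipped by this test, and each valley bottom still has to be checked for depth and width before it is accepted.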

Answer 2

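You can define a valley by different criteria:

depth
width
volume (depth*width)

You may also have valleys on top of a big mountain; do you want those too?

For example, there is a valley in here: 1 2 3 4 1000 1000 800 800 800 1000 1000 500 200 3

Try to explain in more detail how you (or any expert in your field) would pick out valleys based on the data.

You may want to have a look at watershed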
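The three criteria can be computed for the example sequence with a short sketch. It assumes one particular definition, chosen here only for illustration: a valley is a run of equal values whose neighbouring runs are both higher, and its depth is measured against the lower of the two neighbours.

```r
x <- c(1, 2, 3, 4, 1000, 1000, 800, 800, 800, 1000, 1000, 500, 200, 3)

runs  <- rle(x)                         # runs of equal values
k     <- length(runs$values)            # assumes at least three runs
end   <- cumsum(runs$lengths)           # last index of each run
start <- end - runs$lengths + 1         # first index of each run

# Interior runs only: a valley needs a higher neighbour on both sides.
i     <- seq_len(k)[-c(1, k)]
left  <- runs$values[i - 1]
right <- runs$values[i + 1]
is.valley <- runs$values[i] < left & runs$values[i] < right

v <- i[is.valley]
valleys <- data.frame(
  start = start[v],
  end   = end[v],
  width = runs$lengths[v],
  depth = pmin(left, right)[is.valley] - runs$values[v]
)
valleys$volume <- valleys$depth * valleys$width
valleys
```

For this sequence the sketch reports a single valley at positions 7 to 9, with width 3, depth 200 and volume 600. Real data would need smoothing or a tolerance first, because noisy values rarely form runs of exactly equal numbers.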

Answer 3

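You may want to try a peak detection function to identify the regions of interest. The minimum required valley width can be specified with the span parameter.

It is probably a good idea to smooth the data first, to get rid of noise peaks like the one in the "valley" at the right of the blue graph. A simple stats::filter would suffice.

The final step is to check the depth of the "valleys" found. That really depends on your requirements. As a first approximation, you could simply compare the peak values with the median of the data.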
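The three steps can be sketched together as follows. find.valleys is a hypothetical helper written for this illustration, not the peak detection function referred to above; its span argument only mimics the idea of a minimum valley width, and the default values are arbitrary.

```r
# find.valleys: hypothetical helper, written for illustration only.
# x        : numeric vector
# span     : odd window width in which a point must be the minimum
# smooth.k : odd width of the moving-average filter
find.valleys <- function(x, span = 51, smooth.k = 11) {
  stopifnot(span %% 2 == 1, smooth.k %% 2 == 1)
  # Step 1: smooth with a centred moving average (NA at both ends)
  sm <- as.numeric(stats::filter(x, rep(1 / smooth.k, smooth.k), sides = 2))
  # Step 2: a point is a local minimum if it is the lowest in its window
  half <- (span - 1) / 2
  win  <- embed(sm, span)                  # one row per full window
  mid  <- win[, half + 1]                  # centre value of each window
  is.min <- mid == apply(win, 1, min)      # NA where the window contains NA
  idx <- which(is.min) + half              # back to positions in x
  # Step 3: keep only minima that lie below the median of the data
  idx[sm[idx] < median(x, na.rm = TRUE)]
}

# Usage on a toy signal with two troughs
x <- cos(2 * pi * (1:200) / 100)
find.valleys(x)
```

Like the embed()-based code earlier in this thread, this builds a matrix with span columns, so for millions of values it would have to be applied in chunks.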

How can I find low regions in a graph using Perl/R?
