Randomly distributing DNA sequence reads over a genomic feature using numpy

Time: 2014-02-01 13:10:31

Tags: python random numpy sequence

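Hi, I have written a script that randomly shuffles sequence reads over the gene they were mapped to. This is useful if you want to determine whether a peak you observe over your gene of interest is statistically significant. I use this code to calculate false discovery rates for peaks in my genes of interest. The code is below: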

import numpy as np
import matplotlib.pyplot as plt
iterations = 1000      # number of times each read is shuffled
featurelength = 1000   # length of the gene
a = np.zeros((iterations, featurelength))  # one row per iteration, one column per position in the feature
b = np.arange(iterations)                  # 1-D array of row indices (0-999)
reads = np.random.randint(10, 50, 1000)    # example dataset: 1000 DNA read lengths drawn from [10, 50)

The code below fills the large matrix (a):

for i in reads:               # for each read of length i
    r = np.random.randint(-i, featurelength-1, iterations)  # one random start position per row (upper bound is exclusive)
    for j in b:               # for each row in a
        pos = r[j]            # start position of this read in row j
        if pos < 0:           # start can be negative: a read need not lie completely inside the feature
            a[j][:pos+i] += 1
        else:
            a[j][pos:pos+i] += 1  # add the read to the row

Then I generate a heat map to see whether the distribution is roughly uniform:

plt.imshow(a)
plt.show()
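
A numeric summary can complement the heat map. The following is a minimal sketch, not part of the original post: it repeats the same placement rule on a smaller matrix (the sizes, the seed, and the use of np.random.default_rng are assumptions made to keep it quick and repeatable) and reports the mean coverage per position together with its relative spread across positions.

```python
import numpy as np

rng = np.random.default_rng(0)           # seed chosen arbitrarily for repeatability
iterations, featurelength = 200, 300     # assumed sizes, smaller than in the post
reads = rng.integers(10, 50, 100)        # 100 read lengths drawn from [10, 50)
a = np.zeros((iterations, featurelength))

# Same placement rule as the nested loops in the post, on the smaller matrix.
for i in reads:
    r = rng.integers(-i, featurelength - 1, iterations)
    for j, pos in enumerate(r):
        # max(pos, 0) handles reads that start left of the feature
        a[j, max(pos, 0):pos + i] += 1

profile = a.mean(axis=0)                 # mean coverage at each position
spread = profile.std() / profile.mean()  # relative variation across positions
print(f"mean coverage per position: {profile.mean():.2f}")
print(f"relative spread across positions: {spread:.3f}")
```

A small relative spread suggests the reads are distributed roughly evenly; plotting profile with plt.plot shows where any deviation sits.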

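This produces the desired result, but it is very slow because of the many for loops. I tried some fancy numpy indexing, but I keep getting a "too many indices" error.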

Does anyone have a better idea of how to do this?

1 answer:

Answer 0: (score: 0)

Fancy indexing is a bit tricky, but it is still possible:

for i in reads:
    r = np.random.randint(-i,featurelength-1,iterations)
    idx = np.clip(np.arange(i)[:,None]+r, 0, featurelength-1)
    a[b,idx] += 1

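To break this down, we are:

  1. Creating a simple index array as a column vector, from 0 to i-1: np.arange(i)[:,None]

  2. Adding each element of r (a row vector), which broadcasts to a matrix of shape (i, iterations) holding the correct column offsets into a.

  3. Clamping the indices to the range [0, featurelength) with np.clip.

  4. Finally, fancy-indexing a with b for the rows and idx for the corresponding columns.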
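
These steps can be traced on a tiny example. The sketch below is not part of the original answer: the helper names (add_read_loop, add_read_clip, add_read_masked), the sizes, and the hand-picked start positions are assumptions for illustration. It also shows one edge case worth knowing. When a start equals -i, the read lies entirely outside the feature and the nested loops add nothing, whereas the clipped indices all collapse to column 0 and add one count there. Using a boolean mask in place of np.clip is one possible way to avoid that.

```python
import numpy as np

def add_read_loop(a, r, i):
    """Reference: the nested-loop placement from the question."""
    for j, pos in enumerate(r):
        if pos < 0:
            a[j, :pos + i] += 1
        else:
            a[j, pos:pos + i] += 1

def add_read_clip(a, r, i):
    """The clip-based fancy indexing from the answer."""
    b = np.arange(a.shape[0])                      # row indices
    idx = np.clip(np.arange(i)[:, None] + r, 0, a.shape[1] - 1)
    a[b, idx] += 1                                 # duplicate (row, col) pairs count once

def add_read_masked(a, r, i):
    """Hypothetical variant: drop out-of-range columns instead of clipping them."""
    cols = np.arange(i)[:, None] + r               # shape (i, iterations)
    rows = np.broadcast_to(np.arange(a.shape[0]), cols.shape)
    valid = (cols >= 0) & (cols < a.shape[1])
    a[rows[valid], cols[valid]] += 1

featurelength, i = 12, 4
# Assumed starts: fully outside, left overhang, interior (twice), right overhang
r = np.array([-4, -3, 0, 4, 10])
print((np.arange(i)[:, None] + r).shape)           # (4, 5), i.e. (i, iterations)

out_loop = np.zeros((len(r), featurelength))
out_clip = np.zeros_like(out_loop)
out_masked = np.zeros_like(out_loop)
add_read_loop(out_loop, r, i)
add_read_clip(out_clip, r, i)
add_read_masked(out_masked, r, i)

print(np.array_equal(out_masked, out_loop))        # masked variant matches the loops
print(out_loop[0, 0], out_clip[0, 0])              # row with start == -i differs under clip
```

Apart from the start == -i case, the clipped version agrees with the loops because a[b, idx] += 1 applies each distinct (row, column) pair once, even when clipping produces repeated indices.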