Question

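I have a list of genes (as a bed file) and a genome-wide RNA-seq dataset (also stored as a bed file). I am currently trying to write a Python script that extracts the read counts from 500 bp upstream to 2000 bp downstream of the transcription start site, i.e. the start of the gene, and stores these values in an array for later use.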

At the moment my script looks like this:

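import numpy as np

# One row per gene, one column per base from -500 to +2000 around the TSS.
feature_genes = np.zeros((6576, 2501))

for lines in feature:                  # one strand-specific read-count record
    for i, line in enumerate(genes):   # one gene from the gene list
        if line[5] == lines[5] and line[5] == '+' and line[0] == lines[0] and int(line[1]) - 500 <= int(lines[1]) <= int(line[1]) + 2000:
            feature_genes[i][int(lines[1]) - int(line[1]) + 500] = float(lines[4])
        # On '-' the TSS is column 2 and upstream lies at higher coordinates.
        elif line[5] == '-' and lines[5] == '-' and line[0] == lines[0] and int(line[2]) - 2000 <= int(lines[2]) <= int(line[2]) + 500:
            feature_genes[i][int(line[2]) - int(lines[2]) + 500] = float(lines[4])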

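Here feature holds the reads from my bed file and genes holds my gene list. Each line contains, respectively, either the read count at a specific nucleotide (this is strand-specific and omits any base pair where no reads were observed) or the position of one gene.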
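The index arithmetic the script relies on can be isolated into one small function. This is only a sketch: the names window_index, UPSTREAM and DOWNSTREAM are hypothetical, and it assumes the TSS is column 1 for '+' genes and column 2 for '-' genes, as in the script.

```python
UPSTREAM = 500     # bases kept upstream of the TSS (from the question)
DOWNSTREAM = 2000  # bases kept downstream of the TSS (from the question)


def window_index(read_pos, tss, strand):
    """Return the array column for a read position, or None if it is outside the window.

    Column 0 is 500 bp upstream of the TSS and column 2500 is 2000 bp
    downstream, both measured in the direction of transcription.
    """
    if strand == '+':
        offset = read_pos - tss   # positive offset means downstream on '+'
    else:
        offset = tss - read_pos   # coordinates run backwards on '-'
    if -UPSTREAM <= offset <= DOWNSTREAM:
        return offset + UPSTREAM
    return None
```

Keeping both strands in one function makes it easier to check that a '-' read never produces a negative column, which NumPy would silently treat as counting from the end of the row.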

NB. The .bed files are formatted as follows:

Position 

0 chromosome
1 transcription start site
2 transcription termination site
3 feature name
4 read count
5 strand
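For illustration, a line in this layout could be turned into a typed record as sketched below. BedRecord and parse_bed_line are hypothetical names, and the sketch assumes tab-separated columns and a numeric value in column 4.

```python
from collections import namedtuple

# Hypothetical record type mirroring the six columns listed above.
BedRecord = namedtuple('BedRecord', 'chrom start end name count strand')


def parse_bed_line(text):
    """Split one tab-separated line into a BedRecord with numeric positions."""
    fields = text.rstrip('\n').split('\t')
    return BedRecord(
        chrom=fields[0],
        start=int(fields[1]),     # column 1: transcription start site
        end=int(fields[2]),       # column 2: transcription termination site
        name=fields[3],
        count=float(fields[4]),   # assumes column 4 is always numeric
        strand=fields[5],
    )
```

Converting the positions once at parse time avoids calling int() on the same fields inside every loop iteration.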

Can anyone think of an efficient way to do this? My code takes forever to run (Python newbie).

Answer 1

The simple answer is not to use Python and to use bedtools instead. There are several ways to do this; here is one:

1) Extend your TSS by x nucleotides upstream and x nucleotides downstream (here 500 and 2000), so that the math is already worked out.

2) Use intersectBed with the -abam option to output the RNA-Seq reads that overlap the regions of interest (or coverageBed if you only want the depth of coverage).
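Step 1 could be sketched in Python roughly as follows. The function name tss_window is hypothetical, the TSS is taken from column 1 for '+' genes and column 2 for '-' genes as in the question, and no adjustment is made for BED's half-open end coordinate, so shift the end by one if your files need it.

```python
def tss_window(chrom, start, end, name, strand, upstream=500, downstream=2000):
    """Return a BED-style (chrom, start, end, name, score, strand) tuple around the TSS."""
    if strand == '+':
        tss = start
        lo, hi = tss - upstream, tss + downstream
    else:
        tss = end                                   # '-' genes start at the higher coordinate
        lo, hi = tss - downstream, tss + upstream
    # Clamp at the chromosome start; a clamped window is shorter than 2501 bases.
    return (chrom, max(lo, 0), hi, name, 0, strand)
```

Each tuple can be written out as one tab-separated line, and the resulting file intersected with the reads. Since the reads here are already in bed format, something like `bedtools intersect -a reads.bed -b windows.bed -wa -wb -s` should work, where reads.bed and windows.bed are placeholder file names and -s restricts matches to the same strand.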
