Question

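I ran an RNA-Seq analysis using a GTF file from Ensembl. The cuffdiff output replaced the Ensembl IDs with XLOC_ identifiers, although it still reports gene names (e.g. MX2). The Ensembl IDs are no longer present.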

I found a post that shows how to modify merged.gtf with this Python script:

http://seqanswers.com/forums/showthread.php?t=51071

#!/usr/bin/python
gtf_handle = "/PATH/TO/merged.gtf"
fh = open(gtf_handle, "r")
import re
trans_ids = {}
with open('merged2.gtf', 'w') as f:
for line in fh:
line = line.strip('\n') ##strip the line to remove white spaces
##print line
cuffID = re.findall(r'gene_id \"([\w\.]+)"', line) ##use RE to get lists of cuffid, ensemblId etc
cuffTx = re.findall(r'transcript_id \"([\w\.]+)"', line)
ensemblTx = re.findall(r'oId \"([\w\.]+)"', line)
geneName = re.findall(r'gene_name \"([\w\.]+)"', line)
##print cuffTx[0]
line = str(line).replace(cuffTx[0], ensemblTx[0]) ##unlist the transcript identifiers and replace cufflinksID with ensemblIDs
print line
f.write("%s\n" % str(line)) ##write file out to a .gtf file

I followed this script, but got an error:

File "modify_merge.py", line 12 
  for line in fh: 
IndentationError: expected an indented block

Answer 1

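As @ppperry commented, indentation matters in Python. The body of the with block and the body of the for loop must each be indented one level deeper than the line that opens them: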

#!/usr/bin/python
import re

gtf_path = "/PATH/TO/merged.gtf"

# Open input and output together so both files are closed automatically.
with open(gtf_path, "r") as fh, open("merged2.gtf", "w") as f:
    for line in fh:
        line = line.rstrip("\n")  # drop the trailing newline
        # Use regexes to pull out the Cufflinks transcript ID and the
        # original (Ensembl) transcript ID stored in the oId attribute.
        cuffTx = re.findall(r'transcript_id "([\w.]+)"', line)
        ensemblTx = re.findall(r'oId "([\w.]+)"', line)
        # Replace the Cufflinks ID with the Ensembl ID only when both are
        # present; any other line is passed through unchanged.
        if cuffTx and ensemblTx:
            line = line.replace(cuffTx[0], ensemblTx[0])
        print(line)
        f.write("%s\n" % line)  # write the line out to the new .gtf file
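
To check the substitution without running it over a whole file, here is a minimal sketch that applies the same regex idea to a single line. The helper names (swap_transcript_id, xloc_to_original_ids) and the sample line with its IDs are made up for illustration. It assumes each record carries gene_id, transcript_id and oId attributes, as the script above does. Note that oId holds a transcript identifier, so the second helper only lists which original transcripts fall under each XLOC gene_id. It does not recover an Ensembl gene ID.

```python
import re
from collections import defaultdict

# Attribute patterns; the attribute names are the ones used by the script above.
GENE_ID = re.compile(r'gene_id "([\w.]+)"')
TRANSCRIPT_ID = re.compile(r'transcript_id "([\w.]+)"')
ORIGINAL_ID = re.compile(r'oId "([\w.]+)"')


def swap_transcript_id(line):
    """Replace the transcript_id value with the oId value, if both exist."""
    tx = TRANSCRIPT_ID.search(line)
    original = ORIGINAL_ID.search(line)
    if tx is None or original is None:
        return line  # nothing to swap on this line
    # Rewrite only the transcript_id attribute, not every occurrence of the ID.
    return line.replace(tx.group(0), 'transcript_id "%s"' % original.group(1), 1)


def xloc_to_original_ids(lines):
    """Map each gene_id (XLOC_...) to the set of oId values seen with it."""
    mapping = defaultdict(set)
    for line in lines:
        gene = GENE_ID.search(line)
        original = ORIGINAL_ID.search(line)
        if gene and original:
            mapping[gene.group(1)].add(original.group(1))
    return dict(mapping)


# Hypothetical merged.gtf-style line; the IDs are invented for illustration.
sample = (
    'chr1\tCufflinks\texon\t100\t200\t.\t+\t.\t'
    'gene_id "XLOC_000001"; transcript_id "TCONS_00000001"; '
    'exon_number "1"; gene_name "MX2"; oId "ENST00000000001";'
)

print(swap_transcript_id(sample))
print(xloc_to_original_ids([sample]))
```

Unlike the script above, this sketch rewrites only the transcript_id attribute itself, so the oId attribute and any other field that happens to contain the same string are left untouched.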
