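I ran an RNA-Seq analysis using a GTF file from Ensembl. The cuffdiff output replaces the Ensembl IDs with XLOC_ identifiers; it still reports gene names (for example MX2), but the Ensembl IDs are gone.
I found a post that suggests modifying merged.gtf with this Python script:
http://seqanswers.com/forums/showthread.php?t=51071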
#!/usr/bin/python
gtf_handle = "/PATH/TO/merged.gtf"
fh = open(gtf_handle, "r")
import re
trans_ids = {}
with open('merged2.gtf', 'w') as f:
for line in fh:
line = line.strip('\n') ##strip the line to remove white spaces
##print line
cuffID = re.findall(r'gene_id \"([\w\.]+)"', line) ##use RE to get lists of cuffid, ensemblId etc
cuffTx = re.findall(r'transcript_id \"([\w\.]+)"', line)
ensemblTx = re.findall(r'oId \"([\w\.]+)"', line)
geneName = re.findall(r'gene_name \"([\w\.]+)"', line)
##print cuffTx[0]
line = str(line).replace(cuffTx[0], ensemblTx[0]) ##unlist the transcript identifiers and replace cufflinksID with ensemblIDs
print line
f.write("%s\n" % str(line)) ##write file out to a .gtf file
I followed this script, but I get an error:
File "modify_merge.py", line 12
for line in fh:
IndentationError: expected an indented block
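Answer 0 (score: 0)
As @ppperry commented, indentation matters in Python. The body of the with statement and the body of the for loop must each be indented one level deeper than the line that opens them: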
    #!/usr/bin/python
    import re
    gtf_handle = "/PATH/TO/merged.gtf"
    fh = open(gtf_handle, "r")
    trans_ids = {}
    with open('merged2.gtf', 'w') as f:
        for line in fh:
            line = line.strip('\n')  ## strip the trailing newline
            ## use regular expressions to get lists of the Cufflinks IDs, Ensembl IDs, etc.
            cuffID = re.findall(r'gene_id \"([\w\.]+)"', line)
            cuffTx = re.findall(r'transcript_id \"([\w\.]+)"', line)
            ensemblTx = re.findall(r'oId \"([\w\.]+)"', line)
            geneName = re.findall(r'gene_name \"([\w\.]+)"', line)
            ## take the first match from each list and replace the Cufflinks transcript ID with the Ensembl one
            line = str(line).replace(cuffTx[0], ensemblTx[0])
            print(line)  ## with a single argument this works in Python 2 and Python 3
            f.write("%s\n" % str(line))  ## write the modified line to the new .gtf file
    fh.close()
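The script above raises an IndexError on any line that has no transcript_id or oId attribute, because it indexes the findall results unconditionally. The following is a minimal sketch of a more defensive variant, not the original poster's method. The function names restore_transcript_id and convert are hypothetical. It assumes attributes are written as key "value"; and, unlike the script above, it rewrites only the transcript_id attribute rather than every occurrence of the ID in the line. As in the original, gene_id keeps its XLOC_ value.

```python
import re

# Assumption: attributes look like  key "value";  with no escaped quotes inside.
TRANSCRIPT_RE = re.compile(r'transcript_id "([^"]+)"')
OID_RE = re.compile(r'oId "([^"]+)"')


def restore_transcript_id(line):
    """Return the GTF line with its transcript_id replaced by the oId value.

    Lines that lack either attribute are returned unchanged.
    (Hypothetical helper name, introduced for illustration.)
    """
    cuff_tx = TRANSCRIPT_RE.search(line)
    ensembl_tx = OID_RE.search(line)
    if cuff_tx is None or ensembl_tx is None:
        return line
    # Rewrite only the transcript_id attribute, and only its first occurrence.
    return line.replace(
        'transcript_id "%s"' % cuff_tx.group(1),
        'transcript_id "%s"' % ensembl_tx.group(1),
        1,
    )


def convert(in_path, out_path):
    """Copy in_path to out_path, rewriting transcript IDs line by line."""
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            dst.write(restore_transcript_id(line.rstrip("\n")) + "\n")
```

It could be called as convert("/PATH/TO/merged.gtf", "merged2.gtf"), using the same paths as the script above. Opening both files in one with statement closes them even if an error occurs partway through.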