我有两个不同的文件如下: file1.txt是制表符分隔的
AT5G54940.1 3182
pfam
PF01253 SUI1#Translation initiation factor SUI1
mf
GO:0003743 translation initiation factor activity
GO:0008135 translation factor activity, nucleic acid binding
bp
GO:0006413 translational initiation
GO:0006412 translation
GO:0044260 cellular macromolecule metabolic process
GRMZM2G158629_P02 4996
pfam
PF01575 MaoC_dehydratas#MaoC like domain
mf
GO:0016491 oxidoreductase activity
GO:0033989 3alpha,7alpha,
OS08T0174000-01 560919
和包含不同蛋白质名称的file2.txt,
GRMZM2G158629_P02
AT5G54940.1
OS05T0566300-01
OS08T0174000-01
我需要运行一个程序,它会从file1中找到file2中存在的蛋白质名称,但也会打印出与该蛋白质相关的所有“GO:”(如果适用)。对我来说困难的部分是解析第一个文件..格式很奇怪。我试过这样的事情,但是非常感谢任何其他方式,
import re
with open('file2.txt') as mylist:
proteins = set(line.strip() for line in mylist)
with open('file1.txt') as mydict:
with open('a.txt', 'w') as output:
for line in mydict:
new_list = line.strip().split()
protein = new_list[0]
if protein in proteins:
if re.search(r'GO:\d+', line):
output.write(protein+'\t'+line)
所需的输出,无论哪种格式都可以,只要我拥有所有相应的GO
AT5G54940.1 GO:0003743 translation initiation factor activity
GO:0008135 translation factor activity, nucleic acid binding
GO:0006413 translational initiation
GO:0006412 translation
GO:0044260 cellular macromolecule metabolic process
GRMZM2G158629_P02 GO:0016491 oxidoreductase activity
GO:0033989 3alpha,7alpha,
OS08T0174000-01
答案 0 :(得分:2)
只是为了让您知道如何解决这个问题。属于输入文件中一个蛋白质的“组”由从缩进行到非缩进行的更改分隔。搜索此转换并获得您的组(或“块”)。组的第一行包含蛋白质名称。所有其他行可能是GO:行。
您可以使用if line.startswith(" ")
(而非" "
来检测缩进,而不是"\t"
,具体取决于您的输入文件格式。
def get_protein_chunks(filepath):
chunk = []
last_indented = False
with open(filepath) as f:
for line in f:
if not line.startswith(" "):
current_indented = False
else:
current_indented = True
if last_indented and not current_indented:
yield chunk
chunk = []
chunk.append(line.strip())
last_indented = current_indented
look_for_proteins = set(line.strip() for line in open('file2.txt'))
for p in get_protein_chunks("input.txt"):
proteinname = p[0].split()[0]
proteindata = p[1:]
if proteinname not in look_for_proteins:
continue
print "Protein: %s" % proteinname
golines = [l for l in proteindata if l.startswith("GO:")]
for g in golines:
print g
在这里,一个块只是一个剥离线列表。我用生成器从输入文件中提取蛋白质块。如您所见,逻辑仅基于从缩进线到非缩进线的转换。
使用生成器时,您可以根据需要使用数据。我只是打印出来。但是,您可能希望将数据放入字典中并进行进一步分析。
输出:
$ python test.py
Protein: AT5G54940.1
GO:0003743 translation initiation factor activity
GO:0008135 translation factor activity, nucleic acid binding
GO:0006413 translational initiation
GO:0006412 translation
GO:0044260 cellular macromolecule metabolic process
Protein: GRMZM2G158629_P02
GO:0016491 oxidoreductase activity
GO:0033989 3alpha,7alpha,
答案 1 :(得分:1)
一种选择是建立一个列表字典,使用蛋白质的名称作为关键:
#!/usr/bin/env python
import pprint
pp = pprint.PrettyPrinter()
proteins = set(line.strip() for line in open('file2.txt'))
d = {}
with open('file1.txt') as file:
for line in file:
line = line.strip()
parts = line.split()
if parts[0] in proteins:
key = parts[0]
d[key] = []
elif parts[0].split(':')[0] == 'GO':
d[key].append(line)
pp.pprint(d)
我已经使用pprint
模块打印字典,正如您所说,您对格式并不太挑剔。现有的输出是:
{'AT5G54940.1': ['GO:0003743 translation initiation factor activity',
'GO:0008135 translation factor activity, nucleic acid binding',
'GO:0006413 translational initiation',
'GO:0006412 translation',
'GO:0044260 cellular macromolecule metabolic process'],
'GRMZM2G158629_P02': ['GO:0016491 oxidoreductase activity',
'GO:0033989 3alpha,7alpha,']}
您可以使用循环获取问题中指定的输出,而不是使用pprint
:
with open('out.txt', 'w') as out:
for k,v in d.iteritems():
out.write('Protein: {}\n'.format(k))
out.write('{}\n'.format('\n'.join(v)))
out.txt
:
Protein: GRMZM2G158629_P02
GO:0016491 oxidoreductase activity
GO:0033989 3alpha,7alpha,
Protein: AT5G54940.1
GO:0003743 translation initiation factor activity
GO:0008135 translation factor activity, nucleic acid binding
GO:0006413 translational initiation
GO:0006412 translation
GO:0044260 cellular macromolecule metabolic process