Finding the common elements between two files

Time: 2014-06-26 09:21:49

Tags: python

I have two different files, shown below. file1.txt is tab-delimited:

AT5G54940.1 3182
            pfam
            PF01253 SUI1#Translation initiation factor SUI1
            mf
            GO:0003743  translation initiation factor activity
            GO:0008135  translation factor activity, nucleic acid binding
            bp
            GO:0006413  translational initiation
            GO:0006412  translation
            GO:0044260  cellular macromolecule metabolic process
GRMZM2G158629_P02   4996
                pfam
                PF01575 MaoC_dehydratas#MaoC like domain
                mf
                GO:0016491  oxidoreductase activity
                GO:0033989  3alpha,7alpha,
OS08T0174000-01 560919

and file2.txt, which contains a list of protein names:

GRMZM2G158629_P02
AT5G54940.1
OS05T0566300-01
OS08T0174000-01

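I need a program that finds the protein names from file2 in file1 and also prints all the "GO:" entries associated with each protein, if there are any. The hard part for me is parsing the first file, since its format is unusual. I tried something like the following, but I would really appreciate any other approach: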

import re

# Protein names to look for, one per line
with open('file2.txt') as mylist:
    proteins = set(line.strip() for line in mylist)

with open('file1.txt') as mydict:
    with open('a.txt', 'w') as output:
        for line in mydict:
            new_list = line.strip().split()
            protein = new_list[0]
            if protein in proteins:
                if re.search(r'GO:\d+', line):
                    output.write(protein+'\t'+line)

Desired output; any format is fine, as long as I get all the corresponding GO entries:

AT5G54940.1 GO:0003743  translation initiation factor activity
            GO:0008135  translation factor activity, nucleic acid binding
            GO:0006413  translational initiation
            GO:0006412  translation
            GO:0044260  cellular macromolecule metabolic process
GRMZM2G158629_P02   GO:0016491  oxidoreductase activity
                    GO:0033989  3alpha,7alpha,
OS08T0174000-01

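2 Answers:

Answer 0 (score: 2)

Just to give you an idea of how to tackle this: the "group" belonging to one protein in your input file is delimited by a change from indented lines to a non-indented line. Look for this transition and you have your groups (or "chunks"). The first line of a group contains the protein name. All other lines may be GO: lines.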

You can detect indentation with if line.startswith(" "); depending on your input file format, you may need to look for "\t" instead of " ".
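If you are not sure whether the file is indented with spaces or tabs, note that str.startswith also accepts a tuple of prefixes, so a single check can cover both. A minimal sketch; is_indented is a hypothetical helper name used only for illustration:

```python
def is_indented(line):
    """Return True if the line begins with a space or a tab."""
    # str.startswith accepts a tuple and matches any of its members.
    return line.startswith((" ", "\t"))


# Lines taken from the sample data in the question.
print(is_indented("\tGO:0006412  translation"))
print(is_indented("    pfam"))
print(is_indented("AT5G54940.1 3182"))
```

The first two calls should report an indented line and the third a non-indented one.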

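def get_protein_chunks(filepath):
    """Yield each protein's block of lines as a list of stripped strings."""
    chunk = []
    last_indented = False
    with open(filepath) as f:
        for line in f:
            # Use "\t" here instead if the file is indented with tabs.
            current_indented = line.startswith(" ")
            # Going from an indented line back to a non-indented one
            # means the previous protein's block is complete.
            if last_indented and not current_indented:
                yield chunk
                chunk = []
            chunk.append(line.strip())
            last_indented = current_indented
    # Note: a block that is still open when the file ends (for example a
    # final protein line with nothing after it) is never yielded.


# Protein names to look for, one per line in file2.txt
with open('file2.txt') as f:
    look_for_proteins = set(line.strip() for line in f)


# input.txt: the data file (file1.txt in the question)
for p in get_protein_chunks("input.txt"):
    proteinname = p[0].split()[0]
    proteindata = p[1:]
    if proteinname not in look_for_proteins:
        continue
    print("Protein: %s" % proteinname)
    golines = [l for l in proteindata if l.startswith("GO:")]
    for g in golines:
        print(g)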

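Here, a chunk is simply a list of stripped lines. I use a generator to extract the protein chunks from the input file. As you can see, the logic is based solely on the transition from an indented line to a non-indented line.

With a generator you can consume the data however you like. I just print it, but you may want to put the data into a dictionary and analyze it further.

Output: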

$ python test.py 
Protein: AT5G54940.1
GO:0003743  translation initiation factor activity
GO:0008135  translation factor activity, nucleic acid binding
GO:0006413  translational initiation
GO:0006412  translation
GO:0044260  cellular macromolecule metabolic process
Protein: GRMZM2G158629_P02
GO:0016491  oxidoreductase activity
GO:0033989  3alpha,7alpha,
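As a variation on the same idea, the sketch below starts a new block at every non-indented line and also emits the block that is still open when the input ends, so a protein with no GO: lines (such as OS08T0174000-01 in the question) is kept. It then collects the result into a dictionary. The name iter_protein_blocks and the abbreviated sample lines are assumptions made for illustration; the function takes any iterable of lines, so an open file object should work in place of the list.

```python
def iter_protein_blocks(lines):
    """Yield (name, go_lines) for each protein block in an iterable of lines.

    Assumes a block starts at a non-indented line whose first token is the
    protein name, and that indentation consists of spaces or tabs.
    """
    name, go_lines = None, []
    for line in lines:
        if not line.strip():
            continue  # ignore blank lines
        if line[0] in " \t":
            text = line.strip()
            if name is not None and text.startswith("GO:"):
                go_lines.append(text)
        else:
            if name is not None:
                yield name, go_lines
            name, go_lines = line.split()[0], []
    if name is not None:
        yield name, go_lines  # the last block has no following header line


# Abbreviated sample based on the data in the question.
sample = [
    "AT5G54940.1\t3182\n",
    "\tpfam\n",
    "\tPF01253 SUI1#Translation initiation factor SUI1\n",
    "\tmf\n",
    "\tGO:0003743  translation initiation factor activity\n",
    "\tbp\n",
    "\tGO:0006413  translational initiation\n",
    "GRMZM2G158629_P02\t4996\n",
    "\tGO:0016491  oxidoreductase activity\n",
    "OS08T0174000-01\t560919\n",
]
wanted = {"AT5G54940.1", "OS08T0174000-01"}

# Keep only the proteins we are looking for.
go_by_protein = {name: gos
                 for name, gos in iter_protein_blocks(sample)
                 if name in wanted}
print(go_by_protein)
```

With this sample, go_by_protein should contain two GO entries for AT5G54940.1 and an empty list for OS08T0174000-01.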

Answer 1 (score: 1)

One option is to build a dictionary of lists, using the protein name as the key:

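#!/usr/bin/env python

import pprint
pp = pprint.PrettyPrinter()

# Protein names to look for, one per line in file2.txt
with open('file2.txt') as f:
    proteins = set(line.strip() for line in f)
d = {}

with open('file1.txt') as f:
    key = None  # wanted protein whose block is currently being read
    for raw in f:
        parts = raw.split()
        if not parts:
            continue  # skip blank lines
        if not raw[0].isspace():
            # Non-indented line: start of a new protein block.
            # Resetting key here keeps GO lines of unwanted proteins
            # from being added to the previous match.
            key = parts[0] if parts[0] in proteins else None
            if key is not None:
                d[key] = []
        elif key is not None and parts[0].startswith('GO:'):
            d[key].append(raw.strip())

pp.pprint(d)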

I used the pprint module to print the dictionary since, as you said, you are not too picky about the format. The output as it stands is:

{'AT5G54940.1': ['GO:0003743  translation initiation factor activity',
                 'GO:0008135  translation factor activity, nucleic acid binding',
                 'GO:0006413  translational initiation',
                 'GO:0006412  translation',
                 'GO:0044260  cellular macromolecule metabolic process'],
 'GRMZM2G158629_P02': ['GO:0016491  oxidoreductase activity',
                       'GO:0033989  3alpha,7alpha,']}

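Edit

Instead of using pprint, you can produce output like that specified in the question with a loop:

with open('out.txt', 'w') as out:
    # items() works on both Python 2 and Python 3
    for k, v in d.items():
        out.write('Protein: {}\n'.format(k))
        out.write('{}\n'.format('\n'.join(v)))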

out.txt

Protein: GRMZM2G158629_P02
GO:0016491  oxidoreductase activity
GO:0033989  3alpha,7alpha,
Protein: AT5G54940.1
GO:0003743  translation initiation factor activity
GO:0008135  translation factor activity, nucleic acid binding
GO:0006413  translational initiation
GO:0006412  translation
GO:0044260  cellular macromolecule metabolic process
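Note that the proteins in out.txt do not appear in the same order as in the input; plain dictionary iteration order is only guaranteed to follow insertion order from Python 3.7 on. If a stable order and the aligned layout from the question matter, something along these lines may help. This is a sketch under assumptions: format_block is a hypothetical helper, and the small dictionary stands in for the d built above.

```python
def format_block(name, go_lines):
    """Return the output lines for one protein: the name plus its first GO
    entry, then the remaining entries indented to line up under the first."""
    if not go_lines:
        return [name]
    pad = " " * (len(name) + 1)
    return [name + " " + go_lines[0]] + [pad + g for g in go_lines[1:]]


# Small stand-in for the dictionary d (values taken from the question).
d = {
    "GRMZM2G158629_P02": ["GO:0016491  oxidoreductase activity",
                          "GO:0033989  3alpha,7alpha,"],
    "OS08T0174000-01": [],
}

# sorted() gives a reproducible order regardless of the Python version.
out_lines = []
for name in sorted(d):
    out_lines.extend(format_block(name, d[name]))
print("\n".join(out_lines))
```

Sorting by name is only one choice; iterating over the names in the order they appear in file2.txt would work the same way.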