Searching a file using the contents of another file in Python

Time: 2013-02-18 21:34:38

Tags: python file list search strip

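I have a file with a unique ID number on each line. I am trying to search a different file for occurrences of these ID numbers and return the lines of that second file in which they appear, writing them to an output file in this case. I am new to programming, and this is what I have so far.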

outlist = []
with open('readID.txt', 'r') as readID, \
     open('GOlines.txt', 'w') as output, \
     open('GO.txt', 'r') as GO:  
     x = readID.readlines()
     print x
     for line in GO:
        if x[1:-1] in line:
            outlist.append(line)
            outlist.append('\n')
     print outlist
     output.writelines(outlist)

The files look like this:

readID.txt

00073810.1
00082422.1
00018647.1
00063072.1

GO.txt

#query  GO  reference DB    reference family    
HumanDistalGut_READ_00048904.2  GO:0006412  TIGRFAM TIGR00001    
HumanDistalGut_READ_00043244.3  GO:0022625  TIGRFAM TIGR00001    
HumanDistalGut_READ_00048644.4  GO:0000315  TIGRFAM TIGR00001   
HumanDistalGut_READ_00067264.5  GO:0003735  TIGRFAM TIGR00001

The IDs in readID.txt match some, but not all, of the IDs that follow READ_ ...

3 Answers:

Answer 0: (score: 0)

#!/usr/bin/env python
# encoding: utf-8

import sys
import re

def extract_id(line):
    """
    input: HumanDistalGut_READ_00048904.2  GO:0006412  TIGRFAM TIGR00001
    returns: 00048904.2
    """
    result = re.search(r'READ_(\d{8}\.\d)', line)
    if result is not None:
        return result.group(1)
    else:
        return None

def extract_go_num(line):
    """
    input: HumanDistalGut_READ_00048904.2  GO:0006412  TIGRFAM TIGR00001
    returns: 0006412
    """
    result = re.search(r'GO:(\d{7})', line)
    if result is not None:
        return result.group(1)
    else:
        return None

def main(argv = None):
    if argv is None:
        argv = sys.argv

    with open('readID.txt', 'r') as f:
        # strip the trailing newline so each ID compares equal to extract_id()'s result
        ids = frozenset(line.strip() for line in f)

    with open('GO.txt', 'r') as haystack, \
        open('GOLines.txt', 'w') as output:

        for line in haystack:
            if extract_id(line) in ids:
                go_num = extract_go_num(line)
                # skip a matching line that carries no GO number
                if go_num is not None:
                    output.write(go_num + '\n')

if __name__ == "__main__":
    sys.exit(main())

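This trades memory for speed: the IDs are held in a set, so each line of GO.txt costs one lookup and the search is O(n) rather than O(n^2).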

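I use regular expressions to extract the ID and the GO number, but they are fragile: they will stop matching if the number of digits changes.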
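If the digit counts are a concern, one option is to avoid fixed-width patterns altogether. The following is only a sketch, assuming that the ID is always everything after `READ_` in the first whitespace-separated column, which is what the sample GO.txt suggests but which I have not checked against the full file. `matching_lines` is a helper name I made up for the example, and the set of IDs at the bottom is illustrative data, not the real readID.txt.

```python
def extract_id(line):
    """Return whatever follows 'READ_' in the first column, or None.

    No assumption is made about how many digits the ID has.
    """
    fields = line.split()
    if not fields or fields[0].startswith('#'):
        return None  # blank line or header line
    # partition() returns (before, 'READ_', after); 'after' is '' if not found
    _prefix, _sep, read_id = fields[0].partition('READ_')
    return read_id or None


def matching_lines(go_lines, ids):
    """Yield the lines whose READ ID is a member of the set `ids`."""
    for line in go_lines:
        if extract_id(line) in ids:
            yield line


# Illustrative data only: two rows from the sample GO.txt plus its header,
# and an ID set chosen so that exactly one row matches.
sample = [
    "#query  GO  reference DB    reference family\n",
    "HumanDistalGut_READ_00048904.2  GO:0006412  TIGRFAM TIGR00001\n",
    "HumanDistalGut_READ_00043244.3  GO:0022625  TIGRFAM TIGR00001\n",
]
ids = frozenset(["00043244.3", "00073810.1"])
hits = list(matching_lines(sample, ids))
```

Because `ids` is a set, the membership test stays a single lookup per line, so this keeps the O(n) behaviour while yielding whole lines, which is what the question asked for.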

Answer 1: (score: 0)

Maybe something like this:

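with open('readID.txt', 'r') as readID, open('GOlines.txt', 'w') as output, open('GO.txt', 'r') as GO:
    # read the IDs once, dropping newlines and blank lines
    IDs = [ID.strip() for ID in readID if ID.strip()]
    for line in GO:
        # a file can only be iterated once, so GO has to be the outer loop
        if any(ID in line for ID in IDs):
            output.write(line)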

Answer 2: (score: 0)

If your files are small enough to fit in memory:

with open('/somepath/GO.txt') as f:
    pool = f.readlines()

with open('/somepath/readID.txt') as f:    
    tokens = f.readlines()

# strip spaces/new lines and drop blank lines (an empty token would match every line)
tokens = [t.strip() for t in tokens if t.strip()]
found = [(t, lno) for t in tokens for (lno, l) in enumerate(pool) if t in l]

You can then write the found list out to your file.
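For completeness, here is one way that last step might look. It is a sketch under a few assumptions: `found` holds `(token, line_number)` pairs as built above, you want each matching line written once even if several tokens hit it, and `write_found` is just a name I picked. The two `pool` rows come from the sample GO.txt, while the tokens and the temporary output path are made up so the example can run on its own.

```python
import os
import tempfile


def write_found(found, pool, path):
    """Write each matched line of `pool` to `path` once, in file order."""
    # several tokens may hit the same line, so de-duplicate the line numbers
    line_numbers = sorted(set(lno for (_token, lno) in found))
    with open(path, 'w') as output:
        for lno in line_numbers:
            output.write(pool[lno])
    return line_numbers


# Illustrative data only, not the real files.
pool = [
    "HumanDistalGut_READ_00048904.2  GO:0006412  TIGRFAM TIGR00001\n",
    "HumanDistalGut_READ_00043244.3  GO:0022625  TIGRFAM TIGR00001\n",
]
tokens = ["00043244.3", "00073810.1"]
found = [(t, lno) for t in tokens for (lno, l) in enumerate(pool) if t in l]

# a throwaway location; in practice this would be your GOlines.txt
out_path = os.path.join(tempfile.mkdtemp(), "GOlines.txt")
written = write_found(found, pool, out_path)
```

Note that the list comprehension compares every token against every line, so it is O(n^2); that is fine for small files but will be slow for large ones.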