Find shared IDs with a specific pattern

Time: 2017-02-03 11:16:31

Tags: python

I have a tab-delimited file with 100k lines: (edited)

PROT1B2 PROT1A1
PROT1A1 PROT1B2  
PROT1A5 PROT1B6   
PROT2A1 PROT2B2   
PROT1A2 PROT3B2
PROT3B2 PROT1A2

I want the ID pairs that match in both directions and have the form 1A-1B or 1B-1A saved in one file, and the rest saved in another file, so:

out.txt) PROT1B2 PROT1A1     rest.txt) PROT1A5  PROT1B6 
         PROT1A1 PROT1B2               PROT2A1  PROT2B2
                                       PROT1A2 PROT3B2
                                       PROT3B2 PROT1A2

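My script gives me the bidirectional IDs, but I don't know how to look for the specific pattern. With re? I would appreciate it if you commented your script, so that I can understand and modify it.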

fileA = open("input.txt",'r')
fileB = open("input_copy.txt",'r')
output = open("out.txt",'w')
out2=open("rest.txt",'w')
dictA = dict()
for line1 in fileA:
    new_list=line1.rstrip('\n').split('\t')
    query=new_list[0]
    subject=new_list[1]
    dictA[query] = subject
dictB = dict()
for line1 in fileB:
    new_list=line1.rstrip('\n').split('\t')
    query=new_list[0]
    subject=new_list[1]
    dictB[query] = subject
SharedPairs ={}
NotSharedPairs ={}
for id1 in dictA.keys():
    value1=dictA[id1]
    if value1 in dictB.keys():
        if id1 == dictB[value1]: # may be re should go here?
            SharedPairs[value1] = id1
        else:
            NotSharedPairs[value1] = id1
for key in SharedPairs.keys():
    line = key +'\t' + SharedPairs[key]+'\n'
    output.write(line)
for key in NotSharedPairs.keys():
    line = key +'\t' + NotSharedPairs[key]+'\n'
    out2.write(line)

1 answer:

Answer 0 (score: 1)

For the final specification, here is the suggested answer:

a_file_name = "input.txt"
a_dict, b_dict = {}, {}
with open(a_file_name, "rt") as i_f:  # read file in context secured block
    for line in i_f:
        line = line.rstrip()  # no trailing white space (includes \n et al.)
        if line:
            pair = [a.strip() for a in line.split('\t')]  # fragile split
            a_dict[pair[0]] = pair[1]
            b_dict[pair[1]] = pair[0]

must_match = sorted(('1A', '1B'))  # Accept only unordered 1A-1B pairs
fix_pos_slice = slice(4, 6)  # Sample: 'PROT1A1' has '1A' on slice(4, 6)

shared, rest = {}, {}
for key_a, val_a in a_dict.items():
    # Below we prepare a candidate for matching against the unordered pair
    fix_pos_match_cand = sorted(x[fix_pos_slice] for x in (key_a, val_a))
    if must_match == fix_pos_match_cand and b_dict.get(key_a) == val_a:
        shared[val_a] = key_a
    else:
        rest[val_a] = key_a

# Output shared and rest into corresponding files
for f_name, data in (("out.txt", shared), ("rest.txt", rest)):
    with open(f_name, 'wt') as o_f:  # Again secured in context block
        for key, val in data.items():
            o_f.write(key + '\t' + val + '\n')

Run on the given (new) input.txt (note that tabs are expected between the words on a line!):

PROT1B2 PROT1A1
PROT1A1 PROT1B2  
PROT1A5 PROT1B6   
PROT2A1 PROT2B2   
PROT1A2 PROT3B2
PROT3B2 PROT1A2

this yields in out.txt:

PROT1A1 PROT1B2
PROT1B2 PROT1A1

and in rest.txt:

PROT1B6 PROT1A5
PROT2B2 PROT2A1
PROT3B2 PROT1A2
PROT1A2 PROT3B2

Comments have been added to highlight some parts of the code.
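To make the fixed-position check concrete, here is a minimal sketch of what the slice-and-sort comparison does for a few IDs from the sample input. The helper name `is_wanted_pair` is hypothetical and not part of the script above; it assumes, as the script does, that the two-character code always sits at index 4 and 5 of an ID.

```python
must_match = sorted(('1A', '1B'))  # the unordered pair of codes we accept
fix_pos_slice = slice(4, 6)        # 'PROT1A1'[4:6] == '1A'


def is_wanted_pair(first, second):
    """Return True if the two IDs carry the codes 1A and 1B, in any order.

    Hypothetical helper for illustration only.
    """
    candidate = sorted(x[fix_pos_slice] for x in (first, second))
    return candidate == must_match


print(is_wanted_pair('PROT1B2', 'PROT1A1'))  # True
print(is_wanted_pair('PROT1A5', 'PROT1B6'))  # True
print(is_wanted_pair('PROT2A1', 'PROT2B2'))  # False
```

Note that this only tests the pattern. PROT1A5/PROT1B6 passes here, yet it still ends up in rest.txt because the script additionally requires the reversed line to exist.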

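By special request, a sketch of how to extend this to more pairs to match:

To use the same input file (but show a different result), add a hypothetical 1A-3B pair to the allowed matches (the input has no 2A-2B pair in both directions), like this (one possible solution):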

a_file_name = "input.txt"
a_dict, b_dict = {}, {}
with open(a_file_name, "rt") as i_f:  # read file in context secured block
    for line in i_f:
        line = line.rstrip()  # no trailing white space (includes \n et al.)
        if line:
            pair = [a.strip() for a in line.split('\t')]  # fragile split
            a_dict[pair[0]] = pair[1]
            b_dict[pair[1]] = pair[0]

must_match_once = sorted(  # Accept only unordered 1A-1B or 1A-3B pairs
    (sorted(pair) for pair in (('1A', '1B'), ('1A', '3B'))))
fix_pos_slice = slice(4, 6)  # Sample: 'PROT1A1' has '1A' on slice(4, 6)

shared, rest = {}, {}
for key_a, val_a in a_dict.items():
    # Below we prepare a candidate for matching against the unordered pair
    fix_pos_match_cand = sorted(x[fix_pos_slice] for x in (key_a, val_a))
    has_match = any(
        [must_match == fix_pos_match_cand for must_match in must_match_once])
    if has_match and b_dict.get(key_a) == val_a:
        shared[val_a] = key_a
    else:
        rest[val_a] = key_a

# Output shared and rest into corresponding files
for f_name, data in (("out.txt", shared), ("rest.txt", rest)):
    with open(f_name, 'wt') as o_f:  # Again secured in context block
        for key, val in data.items():
            o_f.write(key + '\t' + val + '\n')

Run on the given (same) input.txt (note that tabs are still expected between the words on a line!):

PROT1B2 PROT1A1
PROT1A1 PROT1B2  
PROT1A5 PROT1B6   
PROT2A1 PROT2B2   
PROT1A2 PROT3B2
PROT3B2 PROT1A2

this yields in out.txt:

PROT1A1 PROT1B2
PROT1B2 PROT1A1
PROT3B2 PROT1A2
PROT1A2 PROT3B2

and in rest.txt:

PROT1B6 PROT1A5
PROT2B2 PROT2A1
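As a possible variation (not taken from the script above), the allowed code pairs could also be kept as a set of frozensets, so that the order inside a pair does not matter and the check is a single membership test. The names `allowed` and `has_allowed_codes` are hypothetical; the fixed position of the code is the same assumption as before.

```python
# Each allowed combination is a frozenset, so ('1A', '1B') and ('1B', '1A')
# are the same entry.
allowed = {frozenset(p) for p in (('1A', '1B'), ('1A', '3B'))}
fix_pos_slice = slice(4, 6)  # 'PROT1A1'[4:6] == '1A'


def has_allowed_codes(first, second):
    """Return True if the codes of both IDs form one of the allowed pairs."""
    codes = frozenset((first[fix_pos_slice], second[fix_pos_slice]))
    return codes in allowed


print(has_allowed_codes('PROT1A2', 'PROT3B2'))  # True
print(has_allowed_codes('PROT3B2', 'PROT1A2'))  # True
print(has_allowed_codes('PROT2A1', 'PROT2B2'))  # False
```

Extending the filter then only means adding another tuple to the list that builds `allowed`.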

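By very special request, a sketch of how to also leave partially duplicated pairs unaffected:

As usual, real life introduces "duplicates", so at the OP's special request, here is a final variant that handles duplicated first-column or second-column tokens.

One could also store each whole pair as a string in a set, but keeping the dict for the input (using the nice setdefault method and lists as values) is the more suitable approach, IMO.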

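In the variant below, two more things differ from the other variants:

  1. The output is collected by appending the pairs (as tuples) to lists
  2. The output now follows the source column order (the other variants write it the other way around)

Sample of (partially) duplicated data that is handled: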

    PROT1A1 PROT1B1
    PROT1A1 PROT2B1
    

Source code:

    a_file_name = "input.txt"
    a_dict, b_dict = {}, {}
    with open(a_file_name, "rt") as i_f:  # read file in context secured block
        for line in i_f:
            line = line.rstrip()  # no trailing white space (includes \n et al.)
            if line:
                pair = [a.strip() for a in line.split('\t')]  # fragile split
                # Build a dict with list as values
                # ... to keep same key, different value pairs
                a_dict.setdefault(pair[0], []).append(pair[1])
                b_dict.setdefault(pair[1], []).append(pair[0])
    
    must_match_once = sorted(  # Accept only unordered 1A-1B or 1A-3B pairs
        (sorted(pair) for pair in (('1A', '1B'), ('1A', '3B'))))
    fix_pos_slice = slice(4, 6)  # Sample: 'PROT1A1' has '1A' on slice(4, 6)
    
    shared, rest = [], []  # Store respective output in lists of pairs (tuples)
    for key_a, seq_val_a in a_dict.items():
        for val_a in seq_val_a:
            # Below we prepare a candidate for matching against the unordered pair
            fix_pos_mc = sorted(x[fix_pos_slice] for x in (key_a, val_a))
            has_match = any(
                [must_match == fix_pos_mc for must_match in must_match_once])
            if has_match and b_dict.get(key_a) and val_a in b_dict.get(key_a):
                # Preserve first, second source order by appending in that order
                shared.append((key_a, val_a))
            else:
                rest.append((key_a, val_a))
    
    # Output shared and rest into corresponding files
    for f_name, data in (("out.txt", shared), ("rest.txt", rest)):
        with open(f_name, 'wt') as o_f:  # Again secured in context block
            for pair in data:
                o_f.write('\t'.join(pair) + '\n')
    

Run on the given (degenerate) input.txt (note that tabs are still expected between the words on a line!):

    PROT1B2 PROT1A1
    PROT1A1 PROT1B2  
    PROT1A1 PROT2B2  
    PROT1A5 PROT1B6   
    PROT2A1 PROT2B2   
    PROT1A2 PROT3B2
    PROT3B2 PROT1A2
    

this yields in out.txt:

    PROT1B2 PROT1A1
    PROT1A1 PROT1B2
    PROT1A2 PROT3B2
    PROT3B2 PROT1A2
    

and in rest.txt:

    PROT1A1 PROT2B2
    PROT1A5 PROT1B6
    PROT2A1 PROT2B2
    

So the "duplicate" PROT1A1 PROT2B2 is preserved and does not interfere with the sought pair PROT1A1 PROT1B2.
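For comparison, the set-based alternative mentioned before this variant could look roughly like the following. This is only a sketch under assumptions: it stores the pairs as tuples rather than joined strings, reads the lines from an in-memory list instead of input.txt, and keeps the input order for the output. The names `lines`, `seen` and `code` are made up for the example.

```python
lines = [
    "PROT1B2\tPROT1A1",
    "PROT1A1\tPROT1B2",
    "PROT1A1\tPROT2B2",
    "PROT1A5\tPROT1B6",
    "PROT2A1\tPROT2B2",
    "PROT1A2\tPROT3B2",
    "PROT3B2\tPROT1A2",
]
allowed = [sorted(p) for p in (('1A', '1B'), ('1A', '3B'))]
code = slice(4, 6)  # same fixed-position assumption as in the answer

# Parse every line into a (first, second) tuple and remember all of them.
pairs = [tuple(part.strip() for part in line.split('\t')) for line in lines]
seen = set(pairs)  # a duplicated first ID is harmless, each pair is its own entry

shared, rest = [], []
for first, second in pairs:
    codes_ok = sorted((first[code], second[code])) in allowed
    # A pair is bidirectional if its mirror image was also read.
    if codes_ok and (second, first) in seen:
        shared.append((first, second))
    else:
        rest.append((first, second))

for pair in shared:
    print('\t'.join(pair))
```

For this input the sketch sorts the pairs into the same two groups as the dict-based variant.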

Previous update

Now that the task has been specified better in the comments (maybe also update the question):

    a_file_name = "input.txt"
    a_dict, b_dict = {}, {}
    fix_pos_match_query = sorted(('1A', '1B'))
    fix_pos_slice = slice(4, 6)  # sample 'PROT1A1' has '1A' on slice(4, 6)
    with open(a_file_name, "rt") as i_f:
        for line in i_f:
            line = line.rstrip('\n')
            if line:
                pair = [a.strip() for a in line.split('\t')]
                fix_pos_match_cand = sorted(x[fix_pos_slice] for x in pair)
                if fix_pos_match_query == fix_pos_match_cand:
                    a_dict[pair[0]] = pair[1]
                    b_dict[pair[1]] = pair[0]
    
    shared, rest = {}, {}
    for key_a, val_a in a_dict.items():
        if b_dict.get(key_a) == val_a:
            shared[val_a] = key_a
        else:
            rest[val_a] = key_a
    
    for f_name, data in (("out.txt", shared), ("rest.txt", rest)):
        with open(f_name, 'wt') as o_f:
            for key, val in data.items():
                o_f.write(key + '\t' + val + '\n')
    

Run on the given input.txt:

    PROT1B2 PROT1A1
    PROT1A1 PROT1B2 
    PROT1A5 PROT1B6  
    PROT2A1 PROT2B2
    

this yields in out.txt:

    PROT1A1 PROT1B2
    PROT1B2 PROT1A1
    

and in rest.txt:

    PROT1B6 PROT1A5
    

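This works by expecting the given 2 characters at exact positions, as understood from the sample and the comments.

If this is a protein-domain-specific task with further variation (e.g. the match is not at a fixed position), a regular expression would certainly be better, but the question does not show any such variability. If needed, simply replace the filter on the input lines with one that uses a regular expression match (or non-match).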
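A regular expression based filter could be sketched as follows. The ID layout is an assumption, since the question only shows IDs like PROT1A1: a run of letters, then the code (one digit plus one uppercase letter), then a running number. The names `ID_PATTERN` and `codes_match` are hypothetical, and the pattern would have to be adapted to the real data.

```python
import re

# Assumed layout: letters, then the code (digit + uppercase letter), then digits.
ID_PATTERN = re.compile(r'^[A-Za-z]+(\d[A-Z])\d+$')
wanted = sorted(('1A', '1B'))


def codes_match(pair):
    """Return True if both IDs parse and their codes are 1A and 1B, any order."""
    matches = [ID_PATTERN.match(token) for token in pair]
    if not all(matches):
        return False  # at least one ID does not follow the assumed layout
    return sorted(m.group(1) for m in matches) == wanted


print(codes_match(('PROT1B2', 'PROT1A1')))       # True
print(codes_match(('LONGERPROT1A7', 'P1B12')))   # True, prefix length no longer matters
print(codes_match(('PROT2A1', 'PROT2B2')))       # False
```

In the script, `codes_match(pair)` would take the place of the comparison between `fix_pos_match_query` and `fix_pos_match_cand`.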

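Old

First answer, using a simple exact-match filter:

An attempt at an answer, or at least some hints on how to write code that other people can read, together with a simple filter on the presence of the requested specific pair while reading the input: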

    a_file_name = "input.txt"
    a_dict, b_dict = {}, {}
    filter_on_read = sorted(('PROT1B2', 'PROT1A1'))
    with open(a_file_name, "rt") as i_f:
        for line in i_f:
            line = line.rstrip('\n')
            if line:
                pair = [a.strip() for a in line.split('\t')]
                if filter_on_read == sorted(pair):
                    a_dict[pair[0]] = pair[1]
                    b_dict[pair[1]] = pair[0]
    
    shared, rest = {}, {}
    for key_a, val_a in a_dict.items():
        if b_dict.get(key_a) == val_a:
            shared[val_a] = key_a
        else:
            rest[val_a] = key_a
    
    for f_name, data in (("out.txt", shared), ("rest.txt", rest)):
        with open(f_name, 'wt') as o_f:
            for key, val in data.items():
                o_f.write(key + '\t' + val + '\n')
    

On my machine, given the input:

    PROT1B2 PROT1A1
    PROT1A1 PROT1B2 
    PROT1A5 PROT1B6  
    PROT2A1 PROT2B2
    

this yields in out.txt:

    PROT1A1 PROT1B2
    PROT1B2 PROT1A1
    

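and nothing in rest.txt (for this input), since the remainder has already been filtered out.

Please note: there will be more elegant versions, especially when reading large files ... We may also miss data, because (like the code in the question) we only loop over one side of the mapping, from first to second entries. If a first entry occurs again with a different second entry, it overwrites the data read earlier, and the earlier pair will never be output.

So there may be entries in b_dict that never show up in the output files.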
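The overwriting effect is easy to reproduce in isolation. The following small sketch uses two made-up rows that share the same first ID and contrasts a plain assignment with the setdefault approach used in the duplicate-aware variant:

```python
rows = [("PROT1A1", "PROT1B2"), ("PROT1A1", "PROT2B2")]

plain = {}
for first, second in rows:
    plain[first] = second  # the second row silently replaces the first

multi = {}
for first, second in rows:
    multi.setdefault(first, []).append(second)  # both values are kept

print(plain)  # {'PROT1A1': 'PROT2B2'}
print(multi)  # {'PROT1A1': ['PROT1B2', 'PROT2B2']}
```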

HTH. Python is a great language ;-)