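Searching and recording a text file in Python

Time: 2015-12-15 19:21:52

Tags: python bioinformatics

I'm looking for advice on the search script below. Any help would be great.

The following line is an example from my input (query) file ("out.list.txt"):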

IVVTGPHKFNRCPLKKLAQSFTMPTSTFVDI*GLNFDITEQHFVKEKP**SSEEAQFFAK

I can find this line and 50,000 others in the alignment file ("out.test.txt") and print the output. Here is an excerpt from the alignment file:

Query_13               388   IVVQADGSQVVEDRKADVMNAAYNALQAGLRTIKVGNTNT*VTEVMNKAIEPFECNMLEG  567
c18644_g2_i1_3         122   LVVGASAETPITGNKADVVLAAYNAIQAALRLIKPGNSNLEVTEVFNKIATDYQCNVLEG  181
c18644_g1_i1_2         121   LVVGATAEAPIAGNKADVTLAAYNAIQAALRLIKPGSTNTEVTQVFNKIAADYHCNVLEG  180
c11476_g1_i1_2         119   VVVQ-DPSAKVTGEKADLLLAALNAMQAALRLVRPGNTNTQVTEAMSKIAEAYGCTMLEG  177
c7710_g1_i1_1          147   IVVSEKADAVVEGRKADVVHAAYNALQVALRLLKPGQKNNDVTEHIAKVVESYKCNPVEG  206
c37_g1_i1_3            145   VVVGKDKSTGAEGRKAEVILAAYNALQASLRHLRPGSKNYDVTETVEKISETFGCNPVEG  204
c2897_g1_i1_3          144   FILGATAENPASGKKADVILAAKQAIDAAVRKIRVGETNLTLTETIARVAAAYGVNSVEG  203
c4999_g1_i1_2          167   VVI---GKEKVDDKRADVVKCAWDAAEAALRLVQVGNTNTQVTEAFTKIADEYGCKPMQG  223

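If the query line contains a '*', is it possible to record what is at that position on the other lines of the output? That is, E, E, Q, D, D, T, Q.

All my attempts so far have been unsuccessful, and I'd like to know whether what I'm trying is feasible.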

seq_list = open("out.list.txt")
query_sequences = []
for sequence in seq_list:
    query_sequences.append(sequence.strip())  # one query sequence per line
seq_list.close()

hits = []
alignments = open("out.test.txt")
for line in alignments:
    alignment_hit = line.split()
    for query_sequence in query_sequences:
        if query_sequence in alignment_hit:
            hits.append(line)
            break
alignments.close()

2 answers:

Answer 0 (score: 1)

with open("out.list.txt") as f:
    sequence = f.read().strip()  # read the query file as a single string

with open("out.test.txt") as f:
    alignment_rows = f.readlines()  # read the alignment file as a list of lines

# Split each row on the tab character "\t" and keep only the sequence (third column).
# I assume you're using tabs as the separator in your alignment;
# if the columns are padded with spaces instead, use row.split()[2].
alignment_sequences = [row.split("\t")[2] for row in alignment_rows if row.strip()]

# A dict whose keys are the indices of positions holding "*" and whose values
# are lists of characters, e.g. {1: ['A', 'C'], 2: ['D', 'E']}
output = {}
for index, char in enumerate(sequence):
    if char == "*":
        output[index] = []
        for alignment_sequence in alignment_sequences:
            output[index].append(alignment_sequence[index])
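
As a quick check of the idea, here is a minimal sketch run against three rows copied from the excerpt in the question, held in memory rather than read from disk. It assumes the '*' is taken from the Query_13 row and that columns are separated by whitespace; `collect_star_columns` is just an illustrative name, not something from the original post.

```python
def collect_star_columns(query, aligned):
    """Map each index of '*' in `query` to the characters at that index in `aligned`."""
    return {
        index: [seq[index] for seq in aligned if index < len(seq)]
        for index, char in enumerate(query)
        if char == "*"
    }


# Three rows taken from the alignment excerpt in the question.
rows = [
    "Query_13               388   IVVQADGSQVVEDRKADVMNAAYNALQAGLRTIKVGNTNT*VTEVMNKAIEPFECNMLEG  567",
    "c18644_g2_i1_3         122   LVVGASAETPITGNKADVVLAAYNAIQAALRLIKPGNSNLEVTEVFNKIATDYQCNVLEG  181",
    "c11476_g1_i1_2         119   VVVQ-DPSAKVTGEKADLLLAALNAMQAALRLVRPGNTNTQVTEAMSKIAEAYGCTMLEG  177",
]

# split() with no argument splits on any run of whitespace; the sequence is the third field.
sequences = [row.split()[2] for row in rows]
result = collect_star_columns(sequences[0], sequences[1:])
print(result)  # {40: ['E', 'Q']}
```

The `index < len(seq)` guard simply skips rows that are shorter than the query, which avoids an IndexError on truncated lines.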

Answer 1 (score: 1)

If you only want the aligned sequence characters, try this (it also handles multiple * per line):

lines = [line.rstrip() for line in open('out.test.txt')]
star_indices = []  # positions of '*' in the most recent Query row
for line in lines:
    data = line.split()
    if len(data) < 3:
        continue  # skip blank or malformed rows
    sequence = data[2]
    if data[0].startswith("Query"):
        star_indices = [i for i, c in enumerate(sequence) if c == '*']
    else:
        print([sequence[star_index] for star_index in star_indices])

Output for the sample input:

['E']
['E']
['Q']
['D']
['D']
['T']
['Q']
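
If the alignment file holds many query blocks, it may be handier to collect the characters instead of printing them. The following is only a sketch under the same assumption as above: each block starts with a row whose first field begins with "Query", and the hit rows follow it until the next such row. The function name `star_columns_by_query` and the toy rows are made up for illustration.

```python
def star_columns_by_query(lines):
    """Return {query_id: {star_index: [characters from the hit rows]}}."""
    results = {}
    current = None  # dict for the query block being filled; None until a Query row is seen
    for line in lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed rows
        name, sequence = fields[0], fields[2]
        if name.startswith("Query"):
            # Start a new block: one empty list per '*' position in the query.
            current = {i: [] for i, c in enumerate(sequence) if c == "*"}
            results[name] = current
        elif current is not None:
            for i, chars in current.items():
                if i < len(sequence):
                    chars.append(sequence[i])
    return results


# Toy rows (not real alignment data) with two query blocks.
sample = [
    "Query_1   1   AC*GT*   6",
    "hit_a     1   ACDGTE   6",
    "hit_b     1   ACEGTF   6",
    "",
    "Query_2   1   *KLM     4",
    "hit_c     1   QKLM     4",
]
print(star_columns_by_query(sample))
```

For a real file you could pass the open file object directly, since iterating over it yields one line at a time.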