I'm looking for advice on the search script below. Any help would be great.
The following line is a sample from my input (query) file ("out.list.txt"):
IVVTGPHKFNRCPLKKLAQSFTMPTSTFVDI*GLNFDITEQHFVKEKP**SSEEAQFFAK
I can find this line, and 50,000 others, in the alignment file ("out.test.txt") and print the output. Here is an excerpt from the alignment file:
Query_13 388 IVVQADGSQVVEDRKADVMNAAYNALQAGLRTIKVGNTNT*VTEVMNKAIEPFECNMLEG 567
c18644_g2_i1_3 122 LVVGASAETPITGNKADVVLAAYNAIQAALRLIKPGNSNLEVTEVFNKIATDYQCNVLEG 181
c18644_g1_i1_2 121 LVVGATAEAPIAGNKADVTLAAYNAIQAALRLIKPGSTNTEVTQVFNKIAADYHCNVLEG 180
c11476_g1_i1_2 119 VVVQ-DPSAKVTGEKADLLLAALNAMQAALRLVRPGNTNTQVTEAMSKIAEAYGCTMLEG 177
c7710_g1_i1_1 147 IVVSEKADAVVEGRKADVVHAAYNALQVALRLLKPGQKNNDVTEHIAKVVESYKCNPVEG 206
c37_g1_i1_3 145 VVVGKDKSTGAEGRKAEVILAAYNALQASLRHLRPGSKNYDVTETVEKISETFGCNPVEG 204
c2897_g1_i1_3 144 FILGATAENPASGKKADVILAAKQAIDAAVRKIRVGETNLTLTETIARVAAAYGVNSVEG 203
c4999_g1_i1_2 167 VVI---GKEKVDDKRADVVKCAWDAAEAALRLVQVGNTNTQVTEAFTKIADEYGCKPMQG 223
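If the query line contains a '*', is it possible to record what the other lines of the output have at that position? I.e. E, E, Q, D, D, T, Q
None of my attempts so far have worked, and I would like to know whether what I'm trying is feasible.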
seq_list = open("out.list.txt")
query_sequences = []
for sequence in seq_list:
    query_sequences.append(sequence.strip())  # strip the line, not the file object
seq_list.close()
hits = []
alignments = open("out.test.txt")
for line in alignments:
    alignment_hit = line.split()
    for query_sequence in query_sequences:
        if query_sequence in alignment_hit:
            hits.append(line)
            break
alignments.close()
Answer 0 (score: 1)
# read the query file as a single string, without the trailing newline
with open("out.list.txt") as query_file:
    sequence = query_file.read().strip()
# read the alignment file as a list of non-empty lines
with open("out.test.txt") as alignment_file:
    alignment_rows = [row for row in alignment_file if row.strip()]
# split each row on whitespace and extract sequences only - third column
# (split() with no argument works whether the separator is tabs or spaces)
alignment_sequences = [row.split()[2] for row in alignment_rows]
output = {}  # this is a dict, where keys are indices of positions with * and values are lists e.g. {1: ['A', 'C'], 2: ['D', 'E']}
for index, char in enumerate(sequence):
    if char == "*":
        output[index] = []
        for alignment_sequence in alignment_sequences:
            output[index].append(alignment_sequence[index])
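The same idea can be tried without any files by passing the sequences in as strings. This is only a minimal sketch: the function name `collect_star_columns` is made up for illustration, and returning `None` for rows shorter than the query is an assumption about how short rows should be handled, not something stated in the question.

```python
def collect_star_columns(query, aligned_sequences):
    """Map each index of '*' in `query` to the characters found at that
    index in every aligned sequence (hypothetical helper)."""
    star_positions = [i for i, c in enumerate(query) if c == "*"]
    columns = {i: [] for i in star_positions}
    for seq in aligned_sequences:
        for i in star_positions:
            # Assumption: a row shorter than the query yields None
            # instead of raising IndexError.
            columns[i].append(seq[i] if i < len(seq) else None)
    return columns


# Query_13 and two of the aligned rows from the excerpt in the question.
query = "IVVQADGSQVVEDRKADVMNAAYNALQAGLRTIKVGNTNT*VTEVMNKAIEPFECNMLEG"
hits = [
    "LVVGASAETPITGNKADVVLAAYNAIQAALRLIKPGNSNLEVTEVFNKIATDYQCNVLEG",
    "VVVQ-DPSAKVTGEKADLLLAALNAMQAALRLVRPGNTNTQVTEAMSKIAEAYGCTMLEG",
]
print(collect_star_columns(query, hits))
```

Keeping the file reading separate from the column lookup makes it easier to check the logic on a handful of rows before running it on the full alignment.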
Answer 1 (score: 1)
If you only want the aligned sequence characters, try this (it also handles multiple * per line):
lines = [line.rstrip() for line in open('out.test.txt')]
for line in lines:
    data = line.split()
    sequence = data[2]
    if data[0].startswith("Query"):
        star_indicies = [i for i,c in enumerate(sequence) if c == '*']
    else:
        print(list(sequence[star_index] for star_index in star_indicies))
Output for the sample input:
['E']
['E']
['Q']
['D']
['D']
['T']
['Q']
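If the alignment file holds many query blocks, it may help to collect the characters per block rather than print them. The following is a rough sketch under a few assumptions: every block starts with a row whose name begins with "Query", the sequence is the third whitespace-separated field, and blank or short lines can be skipped. The name `star_columns_per_block` is hypothetical.

```python
def star_columns_per_block(lines):
    """Return a list of (query_name, {star_index: [chars]}) tuples,
    one per block that starts with a 'Query' row (hypothetical helper)."""
    blocks = []
    star_indices = []
    for line in lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # assumption: skip blank or malformed lines
        name, sequence = fields[0], fields[2]
        if name.startswith("Query"):
            # A new block begins: remember where its '*' characters are.
            star_indices = [i for i, c in enumerate(sequence) if c == "*"]
            blocks.append((name, {i: [] for i in star_indices}))
        elif blocks:
            columns = blocks[-1][1]
            for i in star_indices:
                if i < len(sequence):
                    columns[i].append(sequence[i])
    return blocks


# Three rows taken from the alignment excerpt in the question.
sample = [
    "Query_13 388 IVVQADGSQVVEDRKADVMNAAYNALQAGLRTIKVGNTNT*VTEVMNKAIEPFECNMLEG 567",
    "c18644_g2_i1_3 122 LVVGASAETPITGNKADVVLAAYNAIQAALRLIKPGNSNLEVTEVFNKIATDYQCNVLEG 181",
    "c18644_g1_i1_2 121 LVVGATAEAPIAGNKADVTLAAYNAIQAALRLIKPGSTNTEVTQVFNKIAADYHCNVLEG 180",
]
for name, columns in star_columns_per_block(sample):
    print(name, columns)
```

To run it on the real file, the lines could be passed in with something like `star_columns_per_block(open("out.test.txt"))`. A list of tuples is used instead of a dict keyed by query name so that repeated query names in separate blocks do not overwrite each other.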