If you have a list of names...
query = ['link','zelda','saria','ganon','volvagia']
and a list of lines from a file
data = ['>link is the first','OIGFHFH','AGIUUIIUFG','>peach is the second',
'AGFDA','AFGDSGGGH','>luigi is the third','SAGSGFFG','AFGDFGDFG',
'DSGSFGAAA','>ganon is the fourth','ADGGHHHHHH','>volvagia is the last',
'AFGDAAFGDA','ADFGAFD','ADFDFFDDFG','AHUUERR','>ness is another','ADFGGGGH',
'HHHDFDA']
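How can you look through all the lines that start with '>'? Then, if such a line contains one of the names in query, put the line with the '>' and the sequence that follows it (the sequences are always uppercase) into two separate lists.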
#example output file
name_list = ['>link is the first','>ganon is the fourth','>volvagia is the last']
seq_list = ['OIGFHFHAGIUUIIUFG','ADGGHHHHHH','AFGDAAFGDAADFGAFDADFDFFDDFGAHUUERR']
I would rather not use dictionaries for this, which is what has been suggested to me in similar situations.
So what I have so far is:
for line,name in zip(data,query):
    if bool(line[0] == '>' and re.search(name,line))==True:
        #but then i'm stuck because len(query) and len(data) are not equal
....Any help would be greatly appreciated
Answer 0 (score: 1)
result = []
names = ['link', 'zelda', 'saria', 'ganon', 'volvagia']
lines = iter(data)
for line in lines:
    while line.startswith(">") and any(name in line for name in names):
        name = line
        upper_seq = []
        for line in lines:
            if not line.isupper():
                break
            upper_seq.append(line)
        else:
            line = "" # guard against infinite loop at EOF
        result.append((name, ''.join(upper_seq)))
If there are many names, then set() might be faster for finding the names in a line than any(...):
names = set(names)
# ...
if line.startswith(">") and names.intersection(line[1:].split()):
# ...
[('>link is the first', 'OIGFHFHAGIUUIIUFG'),
('>ganon is the fourth', 'ADGGHHHHHH'),
('>volvagia is the last', 'AFGDAAFGDAADFGAFDADFDFFDDFGAHUUERR')]
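The question asks for two separate lists rather than a list of pairs. A minimal sketch of that last step, assuming result has the shape shown above (a list of (header, sequence) tuples); the names name_list and seq_list are taken from the question:

```python
# Assumes `result` is a list of (header, sequence) pairs, as in the output above.
result = [('>link is the first', 'OIGFHFHAGIUUIIUFG'),
          ('>ganon is the fourth', 'ADGGHHHHHH'),
          ('>volvagia is the last', 'AFGDAAFGDAADFGAFDADFDFFDDFGAHUUERR')]

# Split the pairs into two parallel lists.
name_list = [name for name, seq in result]
seq_list = [seq for name, seq in result]

print(name_list)
print(seq_list)
```

The two lists stay aligned by index, so seq_list[i] is the sequence that belongs to name_list[i].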
Answer 1 (score: 0)
Use a list comprehension
print([line for line in lines if line.startswith(">") and set(my_words).intersection(line[1:].split())])
This breaks down into a for loop as follows
matched_lines = []
for line in lines:
    if line.startswith(">") and set(my_words).intersection(line[1:].split()):
        matched_lines.append(line)
Using set intersection should be noticeably faster than looping over every word in the list and checking whether it is in the string
>>> print([line for line in data if line.startswith(">") and set(query).intersection(line[1:].split())])
['>link is the first', '>ganon is the fourth', '>volvagia is the last']
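One thing to keep in mind: splitting the line and intersecting with a set matches whole words only, whereas a check like name in line also matches substrings. A small sketch of the difference, using a made-up header ('>linked is not a name' is hypothetical and not part of the question's data). It also builds the set once, outside the comprehension, instead of once per line:

```python
query = ['link', 'zelda', 'saria', 'ganon', 'volvagia']
# Hypothetical sample lines, for illustration only.
sample = ['>link is the first', '>linked is not a name', 'AGFDA']

wanted = set(query)  # build the set once, not once per line

# Whole-word match: split the header into words and intersect with the set.
by_word = [line for line in sample
           if line.startswith('>') and wanted.intersection(line[1:].split())]

# Substring match: 'link' is also found inside 'linked'.
by_substring = [line for line in sample
                if line.startswith('>') and any(name in line for name in query)]

print(by_word)        # ['>link is the first']
print(by_substring)   # ['>link is the first', '>linked is not a name']
```

Which behavior is right depends on the data: if a name can only ever appear as a separate word in the header, the set version is the stricter and safer choice.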
Answer 2 (score: 0)
There are more elegant ways to do this, but I think this approach is probably the easiest one for you to understand:
>>> found_lines = []
>>> sequences = []
>>> for line in data:
...     if line.startswith(">"):
...         for name in query:
...             if name in line:
...                 found_lines.append(line)
...     else:
...         sequences.append(line)
...
>>> print(found_lines)
['>link is the first', '>ganon is the fourth', '>volvagia is the last']
Always start simple and think the problem through. What is the first thing you need to do? You want to loop over each line in data (for line in data).
For each line, you want to check whether it starts with > (if line.startswith(">")). If it does not start with that character, we can assume it is a "sequence" and add it to the sequences list (sequences.append(line)).
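If it does, you want to check whether any of the names in query appears in the line. What is the simplest way to do that? Loop over each name (for name in query) and check each one individually (if name in line).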