I have to loop through a large file with 2 million lines that looks like this:
P61981 1433G_HUMAN
P61982 1433G_MOUSE
Q5RC20 1433G_PONAB
P61983 1433G_RAT
P68253 1433G_SHEEP
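Currently I have the following function, which takes every entry in a list and, if the entry occurs in this large file, collects the matching row. It is slow, though (~10 min), possibly because of the looping scheme. Could you suggest an optimization?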
up = "database.txt"

def mplist(somelist):
    newlist = []
    with open(up) as U:
        for row in U:
            for i in somelist:
                if i in row:
                    newlist.append(row)
    return newlist
where somelist is, for example:
somelist = [
'P68250',
'P31946',
'Q4R572',
'Q9CQV8',
'A4K2U9',
'P35213',
'P68251'
]
Answer 0 (score: 6)
If your somelist contains only values found in the first column, split each line and test only the first value, against a set rather than a list:
def mplist(somelist):
    someset = set(somelist)
    with open(up) as U:
        return [line for line in U if line.split(None, 1)[0] in someset]
Testing membership in a set is an O(1) constant-time operation on average (independent of the size of the set).
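To make the difference concrete, here is a rough sketch that times a membership test against a list and against a set. The sample IDs and sizes are invented for illustration and are not taken from the question; absolute timings will vary by machine. Note that the original function is slower still, because it runs a substring search for every ID on every row.

```python
import timeit

# Hypothetical sample data (not from the question): 1000 made-up IDs.
ids = ['P%05d' % n for n in range(1000)]
id_list = list(ids)
id_set = set(ids)

probe = 'P00999'  # last element: the worst case for a list scan

# A list membership test compares elements one by one: O(len(id_list)).
list_time = timeit.timeit(lambda: probe in id_list, number=10000)
# A set membership test hashes the probe and looks it up: O(1) on average.
set_time = timeit.timeit(lambda: probe in id_set, number=10000)

print('list: %.4fs  set: %.4fs' % (list_time, set_time))
```

With 2 million rows, that per-row saving is multiplied 2 million times, which is where most of the speedup comes from.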
Demo:
>>> up = '/tmp/database.txt'
>>> open(up, 'w').write('''\
... P61981 1433G_HUMAN
... P61982 1433G_MOUSE
... Q5RC20 1433G_PONAB
... P61983 1433G_RAT
... P68253 1433G_SHEEP
... ''')
>>> def mplist(somelist):
...     someset = set(somelist)
...     with open(up) as U:
...         return [line for line in U if line.split(None, 1)[0] in someset]
...
>>> mplist(['P61981', 'Q5RC20'])
['P61981 1433G_HUMAN\n', 'Q5RC20 1433G_PONAB\n']
You may want to return a generator instead, so that the lines are only filtered rather than collected into a list in memory:
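def mplist(somelist):
    someset = set(somelist)
    with open(up) as U:
        # Yield inside the with block so the file stays open while the
        # caller iterates; returning a generator expression from here
        # would close the file before the first line is read.
        for line in U:
            if line.split(None, 1)[0] in someset:
                yield line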
You can loop over this result, but you cannot index it:
for match in mplist(somelist):
    # do something with match
    pass
and the matching entries never all need to be held in memory at once.
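Since a generator cannot be indexed, one way to take just the first few matches is itertools.islice. The following is a minimal sketch under a few assumptions: iter_matches and the sample lines are hypothetical names and data introduced for illustration, the function takes any iterable of lines (such as an open file) rather than opening database.txt itself, and it skips blank lines, which would otherwise raise an IndexError on the [0] lookup.

```python
from itertools import islice

def iter_matches(lines, wanted):
    """Yield each line whose first whitespace-separated field is in wanted."""
    wanted = set(wanted)
    for line in lines:
        fields = line.split(None, 1)
        # Skip blank lines, which have no first column.
        if fields and fields[0] in wanted:
            yield line

# Hypothetical in-memory sample standing in for the open file.
sample = [
    'P61981 1433G_HUMAN\n',
    'P61982 1433G_MOUSE\n',
    '\n',
    'Q5RC20 1433G_PONAB\n',
    'P61983 1433G_RAT\n',
]

# Take only the first two matches; the rest of the input is never read.
first_two = list(islice(iter_matches(sample, ['P61981', 'Q5RC20', 'P61983']), 2))
print(first_two)
# ['P61981 1433G_HUMAN\n', 'Q5RC20 1433G_PONAB\n']
```

Passing the lines in as an argument also makes the function easy to test without touching the real file.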