How can I search faster in Python?

Time: 2013-06-08 19:01:31

Tags: python performance search int

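I am searching one file for values taken from another file. Each exact value occurs only once in the file being searched. How can I make this faster? Here is my current code: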

import re

filltaxlist = open("file with query number.txt", "r")
fulltaxa = open("output file with hit line match", "w")

for line in filltaxlist:
    line = line.strip()
    taxid = re.split("\t", line)
    lookup = taxid[5] # this value is a number and I need the exact match only so I convert it to an integer
    int1 = int(lookup)
    for line in open("File to search.txt", "r"):
        data = re.split(',', line)
        hit = int(data[0]) # every value in this file is a number separated by a ,
        if lookup in line:
            if int1 == hit:
                fulltaxa.write(line)

This works, but as written it is very slow. The file I am searching is over a GB in size.

Sample lines from filltaxlist:

cvvel_1234    403454663    29.43    3e-30    55.55555555234    1172189
cvell_1444    2342333      30.00    1e-50    34.34584359345    5911
cvell_1444    234230055    23.23    1e-60    32.23445983454    46245
cvell_1444    233493003    23.44    1e-43    35.23595604593    46245

What fulltaxa should contain:

1172189, 5943, 1002030, 12345
5911, 11234, 112356, 234, 3456, 44568, 78356
46245, 123, 3432456, 123488976, 23564, 334
46245, 123, 3432456, 123488976, 23564, 334

2 Answers:

Answer 0: (score: 4)

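Use a database

As others have mentioned, the simplest approach is probably to dump the data into a database (e.g. SQLite). If you need to work with it from the language, you can use the Python bindings.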
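A minimal sketch of the database route, assuming the standard-library `sqlite3` module. The table name `taxa`, its columns, and the helpers `build_index` and `find_lines` are made up for illustration; the sample lines are taken from the question. Each line of the search file is stored under its first number, that column is indexed, and each query becomes an indexed lookup instead of a full scan.

```python
import sqlite3

def build_index(conn, search_lines):
    """Load the search file's lines into an indexed table (hypothetical schema)."""
    conn.execute("CREATE TABLE IF NOT EXISTS taxa (taxid INTEGER, line TEXT)")
    # The first comma-separated field of each line is the number to match on.
    rows = ((int(line.split(",", 1)[0]), line) for line in search_lines)
    conn.executemany("INSERT INTO taxa VALUES (?, ?)", rows)
    conn.execute("CREATE INDEX IF NOT EXISTS taxa_taxid ON taxa (taxid)")
    conn.commit()

def find_lines(conn, taxid):
    """Return every stored line whose first number equals taxid."""
    cur = conn.execute("SELECT line FROM taxa WHERE taxid = ?", (taxid,))
    return [row[0] for row in cur]

# In-memory database and inline sample data, for demonstration only.
conn = sqlite3.connect(":memory:")
build_index(conn, [
    "1172189, 5943, 1002030, 12345\n",
    "5911, 11234, 112356, 234, 3456, 44568, 78356\n",
    "46245, 123, 3432456, 123488976, 23564, 334\n",
])

# Look up the queries in their original order; duplicates are simply repeated.
hits = [line for q in (1172189, 5911, 46245, 46245) for line in find_lines(conn, q)]
```

With a real multi-gigabyte file you would probably connect to a database file on disk rather than `":memory:"`, so the index is built once and reused across runs.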

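Pure Python solution

You are reading the whole search file for every entry in filltaxlist (because of the order of nesting). It is more efficient to cache all the queries first, then read the search file only once, then sort the output to recover the order of filltaxlist.

Since the order of the queries is important, we need to keep track of each query's position. The code below records the line index of each query and groups the hits by that index.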

import re
from collections import defaultdict

filltaxlist = open("file with query number.txt", "r")
fulltaxa = open("output file with hit line match", "w")

# Map each query number to every line index where it occurs, so duplicates are kept.
possibles = defaultdict(list)
for i, line in enumerate(filltaxlist):
    line = line.strip()
    taxid = re.split("\t", line)
    lookup = taxid[5] # this value is a number and I need the exact match only so I convert it to an integer
    int1 = int(lookup)
    possibles[int1].append(i)

# Read the large search file exactly once, grouping hits by query line index.
output_lines = defaultdict(list)
for line in open("File to search.txt", "r"):
    data = re.split(',', line)
    hit = int(data[0]) # every value in this file is a number separated by a ,
    if hit in possibles:
        for i in possibles[hit]:
            output_lines[i].append(line)

# Write the hits in the order of the query file, not the order they were found.
fulltaxa.writelines(line for i in sorted(output_lines) for line in output_lines[i])
fulltaxa.close()

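Note that the code above raises an IndexError on any query line with fewer than six tab-separated fields, such as a blank line at the end of the file.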

Some other minor improvements.

data = re.split(',', line)

is probably slower than

data = line.split(',')

but you should profile to make sure the difference is significant in your case.
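One way to check this is with the standard-library `timeit` module. The sketch below times both calls on a single sample line from the question; the function names and the repetition count are arbitrary choices, and the numbers will vary by machine and Python version.

```python
import re
import timeit

# One line in the format of the search file, taken from the question.
line = "46245, 123, 3432456, 123488976, 23564, 334\n"

def with_re():
    return re.split(",", line)

def with_str():
    return line.split(",")

# Both calls must produce the same fields for the comparison to be fair.
assert with_re() == with_str()

re_time = timeit.timeit(with_re, number=100_000)
str_time = timeit.timeit(with_str, number=100_000)
print(f"re.split:  {re_time:.3f} s for 100,000 calls")
print(f"str.split: {str_time:.3f} s for 100,000 calls")
```

If the two timings are close, the split call is unlikely to be the bottleneck compared with reading a file of more than a gigabyte.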

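Answer 1: (score: 1)

Your algorithm is O(m * n). An O(m + n) algorithm can be made by using a dictionary. Even if m is small, it is likely to be a significant improvement in Python, where the constant factor of a dictionary access is not much different from that of any other statement.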

import re
from collections import defaultdict

filltaxalist = open("file with query number.txt", "r")
fulltaxa = open("output file with hit line match", "w")

# Map each query number to every line index where it occurs (handles duplicates).
filltaxadict = defaultdict(list)
n_queries = 0
for i, line in enumerate(filltaxalist):
    line = line.strip()
    taxid = re.split("\t", line)
    lookup = taxid[5] # this value is a number and I need the exact match only so I convert it to an integer
    int1 = int(lookup)

    filltaxadict[int1].append(i)
    n_queries = i + 1

# One independent list per query line; [[]] * n would alias a single shared list.
results = [[] for _ in range(n_queries)]
for line in open("File to search.txt", "r"):
    data = re.split(',', line)
    hit = int(data[0]) # every value in this file is a number separated by a ,
    for match in filltaxadict.get(hit, ()):
        results[match].append(line)

for result in results:
    fulltaxa.writelines(result)
fulltaxa.close()

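This handles duplicates in the correct order; if you don't need that, it could be slightly simpler. The file to search may be large; this does not keep its contents in memory, only the matching lines and (part of) the contents of filltaxalist, which I assume is not very large.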
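To check the ordering and duplicate handling without touching the real files, here is a small in-memory sketch of the same dictionary approach run on the question's sample data. The function name `match_in_query_order` is hypothetical, the query lines are split on any whitespace (an assumption, since the real file is tab-separated), and the `777` line is an invented non-matching line added for illustration.

```python
from collections import defaultdict

def match_in_query_order(query_lines, search_lines):
    """Return matching search lines in query order, repeating them for duplicate queries."""
    # Map each query number (sixth field) to every query line index where it appears.
    positions = defaultdict(list)
    n_queries = 0
    for i, line in enumerate(query_lines):
        positions[int(line.split()[5])].append(i)
        n_queries = i + 1

    # Single pass over the search lines; each hit is filed under its query index.
    results = [[] for _ in range(n_queries)]
    for line in search_lines:
        hit = int(line.split(",", 1)[0])
        for i in positions.get(hit, ()):
            results[i].append(line)
    return [line for group in results for line in group]

queries = [
    "cvvel_1234    403454663    29.43    3e-30    55.55555555234    1172189",
    "cvell_1444    2342333      30.00    1e-50    34.34584359345    5911",
    "cvell_1444    234230055    23.23    1e-60    32.23445983454    46245",
    "cvell_1444    233493003    23.44    1e-43    35.23595604593    46245",
]
# Deliberately listed in a different order from the queries.
search = [
    "46245, 123, 3432456, 123488976, 23564, 334\n",
    "777, 1, 2\n",  # invented line that matches no query
    "5911, 11234, 112356, 234, 3456, 44568, 78356\n",
    "1172189, 5943, 1002030, 12345\n",
]

output = match_in_query_order(queries, search)
```

Here `output` holds four lines in the order of the queries, with the `46245` line appearing twice, which matches the expected fulltaxa contents shown in the question.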