Question

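Summary: I want to find out which of 50,000 "words", each 3-15 characters long, occur at least once in a database of 100 million "sentences", each 50 to 1200 characters long, with no spaces but with line breaks.

(Why? This is a proteomics project. The "words" are peptide sequences, e.g. MRQNTWAAV, and the "sentences" are full protein sequences, e.g. MRQNTWAAVTGGQTNRALI... There are proteomics tools that can do this search, but they would be less efficient because they are optimized for long query strings and inexact matches.)

Also, I will be doing this on a regular PC with 8 GB of RAM.

I'm new to Python and a scientist by trade, not a programmer; I wrote a script, but it is slow (in my opinion). Since I only want to know which terms occur at least once, I thought I would speed things up by: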

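splitting the reference database into 200 parts of 500,000 sentences each
iterating over those partial databases, using mmap
loading the list of query terms into an in-memory list
iterating over the list using mmap's find (not regex, of course) and writing the terms that are not found to a new list of query terms
when the loop moves on to the next database, building a new list from the shorter file of query terms
etc.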

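Here is my code. As I said, I'm not a programmer, so I know it is not optimal. It does work on a cut-down sample set. Are there some basic design features that would help it run faster? (I don't mind if it takes overnight, but I'd rather it not take days... I admit I haven't timed it systematically yet.)

Some things that occur to me right away:
- Would database files larger or smaller than 50 MB be more optimal?
- I'm sure I should keep the list of "not found" terms in memory and only write it to disk at the end of the process. I did it this way so that I could evaluate the process during this design phase.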

import os
import mmap
import glob

os.chdir("C:/mysearch/")
searchtermfile = "original_search_terms.txt"

# load list of 50,000 search terms into memory as a list
with open(searchtermfile, 'r') as f:
    searchtermlist = [line.strip() for line in f]
    numberofsearchterms = len(searchtermlist)


#make a list of database files in the directory
dblist = glob.glob('databasepart*.txt') 
sizedblist = len(dblist)

counterdb = 0 #counts the iterations over the database files
countersearchterms = 0 #counts the iterations over the search terms
previousstring = "DUMMY" #a dummy value just for the first time it's used

#iterate first over list of file names
for nameoffile in dblist:
    counterdb += 1
    countersearchterms = 0
    #remove old notfound list, this iteration will make a new, shorter one.
    os.remove("notfound.txt") #returns an error if there is not already a notfound.txt file; I always make sure there's an empty file with that name
    #read current database file (50 MB) into memory
    with open(nameoffile, 'r+b') as f:
        m = mmap.mmap(f.fileno(), 0) #length 0 maps the entire file; pages are loaded on demand
        #iterate over search terms
        for searchstring in searchtermlist:
            countersearchterms += 1
            if m.find(searchstring) == -1:
                with open("notfound.txt", "a") as myfile:
                    myfile.write(searchstring + "\n")
            #this print line won't be there in the final code, it's allowing me to see how fast this program runs
            print str(counterdb) + " of " + str(sizedblist) + " & " + str(countersearchterms) + " of " + str(numberofsearchterms)
            previousstring = searchstring
        m.close()
    #reload saved list of not found terms as new search term list
    with open('notfound.txt', 'r') as f:
        searchtermlist = [line.strip() for line in f]
        numberofsearchterms = len(searchtermlist)
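
The in-memory variant mentioned in the question (keep the "not found" terms in memory instead of rewriting notfound.txt for every database part) could look roughly like the sketch below. It is written for Python 3, whereas the script above is Python 2. The function name `terms_not_found` is made up for illustration, and the sketch assumes the terms are plain ASCII and that no database part is empty.

```python
import mmap

def terms_not_found(terms, db_paths):
    """Return the set of terms that occur in none of the given files."""
    remaining = set(terms)
    for path in db_paths:
        if not remaining:
            break  # every term has been found, no need to read more files
        with open(path, "rb") as f:
            # Read-only mapping of the whole file (length 0 = entire file).
            m = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
            try:
                # mmap.find works on bytes in Python 3, so encode each term.
                found = {t for t in remaining
                         if m.find(t.encode("ascii")) != -1}
            finally:
                m.close()
        # Terms found in this part never need to be searched again.
        remaining -= found
    return remaining
```

With the file layout from the question this would be called as `terms_not_found(searchtermlist, glob.glob('databasepart*.txt'))`, and the result written to disk once at the end. As in the original script, a term that is split by a line break inside a sentence is not matched.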

Answer 1

Maybe you could try using regular expressions:

>>> import re
>>> searchterms = ["A", "B", "AB", "ABC", "C", "BC"]
>>> # To match the longest sequences first, you need to place them at the beginning
>>> searchterms.sort(key=len, reverse=True)
>>> searchterms
['ABC', 'AB', 'BC', 'A', 'B', 'C']
>>> # Compile a big regex searching all terms together
>>> _regex =re.compile("("+"|".join(searchterms)+")")
>>> _regex.findall("ABCBADCBDACBDACBDCBADCBADBCBCBDACBDACBDACBDABCDABC")
['ABC', 'B', 'A', 'C', 'B', 'A', 'C', 'B', 'A', 'C', 'B', 'C', 'B', 'A', 'C', 'B', 'A', 'BC', 'BC', 'B', 'A', 'C', 'B', 'A', 'C', 'B', 'A', 'C', 'B', 'ABC', 'ABC']
>>>

If you only want to count the matches, you could use finditer.
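
A minimal sketch of counting with finditer, assuming Python 3. The helper name `count_matches` is made up for illustration, and `re.escape` is added so that any special characters in a term are matched literally.

```python
import re
from collections import Counter

def count_matches(terms, text):
    """Count non-overlapping matches of each term in text."""
    # Longest terms first, so the alternation prefers them.
    ordered = sorted(terms, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(t) for t in ordered))
    # finditer yields match objects lazily instead of building a full list.
    return Counter(m.group(0) for m in pattern.finditer(text))

counts = count_matches(["A", "B", "AB", "ABC", "C", "BC"], "ABCBADCB")
print(counts["ABC"], counts["B"], counts["AB"])
```

Note that the matches do not overlap: in this example "AB" only occurs inside the longer match "ABC", so it is never reported. For the "does this term occur at least once" question, a term that is missing from the counts is therefore not necessarily absent from the text.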

Answer 2

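I have less experience with Python, so personally I would do this in C or C++. The problem is simple, because you are only looking for exact matches.

The inner loop is where all the time is spent, so I would concentrate on that.

First, I would take the list of 5e4 terms, sort them, and put them in a table for binary search, or (better yet) put them in a trie structure for letter-by-letter search.

Then, at each character position in the "sentence", call the search function. It should be very fast. In principle, a hash table would have O(1) performance, but the constant factor matters. I would bet that a trie still beats it in this case, and you can tune the daylights out of it.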
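
The answer has C or C++ in mind; as a rough illustration of the sorted-table option in Python, here is a sketch that uses the standard `bisect` module. The names `build_table` and `in_table` are hypothetical.

```python
from bisect import bisect_left

def build_table(terms):
    """Sort the terms once so they can be binary-searched."""
    return sorted(set(terms))

def in_table(table, candidate):
    """Binary search: True if candidate is one of the terms."""
    i = bisect_left(table, candidate)
    return i < len(table) and table[i] == candidate

table = build_table(["TGGQ", "MRQNTWAAV", "NRALI"])
print(in_table(table, "TGGQ"))  # exact term
print(in_table(table, "TGG"))   # only a prefix of a term
```

Each lookup costs about log2(50,000), roughly 16, string comparisons, but it has to be repeated for every candidate length at every position, which is one reason the trie is the better option here.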
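
The trie approach can be sketched in Python as follows. This is only an illustration of the idea, not a tuned implementation: it uses nested dicts as trie nodes, the names `build_trie` and `find_terms` are hypothetical, and the empty string is used as an end-of-term marker on the assumption that every term is non-empty.

```python
END = ""  # marker key; a single character of the sentence can never equal it

def build_trie(terms):
    """Build a trie of nested dicts, one level per letter."""
    root = {}
    for term in terms:
        node = root
        for ch in term:
            node = node.setdefault(ch, {})
        node[END] = term  # a complete term ends at this node
    return root

def find_terms(trie, sentence):
    """Return every term that occurs anywhere in sentence."""
    found = set()
    n = len(sentence)
    for start in range(n):        # try every character position
        node = trie
        for pos in range(start, n):
            node = node.get(sentence[pos])
            if node is None:      # no term continues with this letter
                break
            if END in node:
                found.add(node[END])
    return found

trie = build_trie(["MRQNTWAAV", "TGGQ", "NRALI", "XYZ", "AAV"])
print(find_terms(trie, "MRQNTWAAVTGGQTNRALI"))
```

The inner loop can never go deeper than the longest term (15 letters here), so the work per sentence is bounded by its length times 15, independent of how many terms are in the trie. Overlapping terms such as "AAV" inside "MRQNTWAAV" are found as well.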

Strategies for optimizing the speed of a large-scale search
