Question

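Problem: given a set of ~250,000 integer user IDs and roughly 1 TB of JSON records, one record per line, load the records whose user ID matches into a database.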

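Only about 1% of all records will match one of the 250,000 user IDs. Rather than JSON-decoding every record, which takes a long time, I am trying to use string matching to decide whether a user ID appears in the raw JSON; if it does, the JSON is decoded, the record is checked, and then it is inserted.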

The problem is that matching one raw JSON string against a set containing ~250k string entries is slow.

Here is the code so far:

# get the list of integer user IDs
cur.execute('select distinct user_id from users')

# load them as text into a set
users = set()
for result in cur.fetchall():
    users.add(str(result[0]))

# start working on f, the one-json-record-per-line text file
scanned = 0
for line in f:
    scanned += 1
    # linear scan: up to one substring search per user ID, for every line
    if any(user in line for user in users):
        print("got one!")
        # decode json
        # check for correct decoded user ID match
        # do insert

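Am I approaching this the right way? What is a faster way to match these strings? Currently, when looking for this many user IDs, this manages ~2 entries per second on a 3 GHz machine (not great). When the list of user IDs is very short, it manages ~200,000 entries per second.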

Answer 1

Aho-Corasick seems to be built for exactly this purpose. There is even a handy Python module (easy_install ahocorasick).

import ahocorasick

# build a match structure
print('init empty tree')
tree = ahocorasick.KeywordTree()

cur.execute('select distinct user_id from users')

print('add usernames to tree')
for result in cur.fetchall():
    tree.add(str(result[0]))

print('build fsa')
tree.make()

for line in f:
    scanned += 1
    # a non-None result means some keyword occurs in the line
    if tree.search(line) is not None:
        print("got one!")

This gets closer to about 450 entries per second.
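For readers who want to see what the module is doing, here is a minimal pure-Python sketch of the Aho-Corasick idea: a trie of keywords plus failure links, so each line is scanned once regardless of how many IDs there are. This is not the ahocorasick module's implementation, and the names build_automaton and find_first are made up for illustration. It will be far slower than a C extension.

```python
from collections import deque

def build_automaton(keywords):
    """Build Aho-Corasick goto/fail/output tables for an iterable of strings."""
    goto = [{}]   # goto[state][char] -> next state
    fail = [0]    # fail[state] -> state to fall back to on a mismatch
    out = [[]]    # out[state] -> keywords that end at this state

    # Phase 1: insert every keyword into a trie.
    for word in keywords:
        state = 0
        for ch in word:
            nxt = goto[state].get(ch)
            if nxt is None:
                nxt = len(goto)
                goto[state][ch] = nxt
                goto.append({})
                fail.append(0)
                out.append([])
            state = nxt
        out[state].append(word)

    # Phase 2: breadth-first pass to fill in the failure links.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # inherit keywords that end at the fallback state (proper suffixes)
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out

def find_first(automaton, text):
    """Return (keyword, end_index) for the first keyword found, or None."""
    goto, fail, out = automaton
    state = 0
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        if out[state]:
            return out[state][0], i
    return None

automaton = build_automaton(["1234", "42", "900"])
hit = find_first(automaton, '{"user_id": 1234, "x": 5}')
miss = find_first(automaton, '{"user_id": 77}')
```

As in the question, a hit only says that some ID occurs somewhere in the line (for example "42" inside "1423"), so the record still has to be decoded and checked before the insert.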

Answer 2

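Try inverting the matching algorithm. Instead of testing every user ID against the line, pull the digit sequences out of the line and look each one up in the set: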

import re

for digit_sequence in re.findall(r'[0-9]+', line):
    if digit_sequence in users:
        ...
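A slightly fuller sketch of the same idea, combined with the decode-and-verify step from the question. It assumes every line is a JSON object and that the ID lives in a top-level field called "user_id"; that field name and the helper matching_record are hypothetical, so adjust them to the real record layout.

```python
import json
import re

DIGITS = re.compile(r'[0-9]+')

def matching_record(line, users):
    """Return the decoded record if its user ID is in `users`, else None."""
    # Cheap prefilter: one set lookup per digit run found in the raw line.
    if not any(run in users for run in DIGITS.findall(line)):
        return None
    # Only now pay for the JSON decode, then confirm the actual field matches.
    record = json.loads(line)
    if str(record.get("user_id")) in users:   # "user_id" is an assumed field name
        return record
    return None

users = {"1234", "42"}
hit = matching_record('{"user_id": 1234, "n": 7}', users)
false_alarm = matching_record('{"user_id": 99, "n": 42}', users)
```

The cost per line now depends on the number of digit runs in the line rather than on the number of user IDs, because each set lookup takes constant time on average.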

Answer 3

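I am a freelance C++ developer. My clients are usually beginners who have some slow Python / Java / .NET / etc. code that they want to run faster. I can usually make it about 100x faster. Just recently I had a similar task: searching for 5 million substrings in terabytes of text data.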

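I tested several algorithms. For Aho-Corasick I used the open-source http://sourceforge.net/projects/multifast/. It was not the fastest algorithm. The fastest was my own algorithm, which I put together from a mix of a hash table and some ideas taken from the Rabin-Karp search algorithm. This simple algorithm was 5x faster and used 5x less memory than AC. The average substring length was 32 bytes. So AC may not be the fastest algorithm.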
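The answer does not show its algorithm, so the following is only a guess at the general shape of "hash table plus Rabin-Karp", not the author's code. The assumed approach: hash the first m characters of every pattern (m being the shortest pattern length) into a dictionary, slide a rolling hash of width m over the text, and verify candidates with a direct comparison. BASE, MOD, and all function names are illustrative choices.

```python
BASE = 257
MOD = (1 << 61) - 1   # large prime modulus keeps hash collisions rare

def poly_hash(s):
    """Polynomial hash of a string, most significant character first."""
    h = 0
    for ch in s:
        h = (h * BASE + ord(ch)) % MOD
    return h

def build_index(patterns):
    """Map the hash of each pattern's first m chars to the patterns sharing it."""
    m = min(len(p) for p in patterns)   # window width = shortest pattern
    index = {}
    for p in patterns:
        index.setdefault(poly_hash(p[:m]), []).append(p)
    return m, index

def find_any(text, m, index):
    """Return (position, pattern) for the first indexed pattern in text, or None."""
    if len(text) < m:
        return None
    top = pow(BASE, m - 1, MOD)   # weight of the character leaving the window
    h = poly_hash(text[:m])
    for i in range(len(text) - m + 1):
        # A hash hit is only a candidate; confirm with a real comparison.
        for p in index.get(h, ()):
            if text.startswith(p, i):
                return i, p
        if i + m < len(text):
            # Roll the window: drop text[i], append text[i + m].
            h = ((h - ord(text[i]) * top) * BASE + ord(text[i + m])) % MOD
    return None

m, index = build_index(["needle", "pin", "haystack"])
found = find_any("find the pin in the haystack", m, index)
```

Each text position costs one hash update and one dictionary lookup, and memory is one dictionary entry per pattern prefix rather than a full trie, which is at least consistent with the speed and memory claims above.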

Searching for a long list of substrings in text
