Python:如何快速搜索集合中的子字符串?

时间:2013-10-03 08:25:12

标签: python performance set substring lookup

我有一个包含~300.000元组的集合

In [26]: sa = set(o.node for o in vrts_l2_5) 
In [27]: len(sa)
Out[27]: 289798
In [31]: random.sample(sa, 1)
Out[31]: [('835644', '4696507')]

现在我想基于公共子字符串查找元素,例如前4个'数字'(实际上元素是字符串)。这是我的方法:

def lookup_set(x_appr, y_appr):
    return [n for n in sa if n[0].startswith(x_appr) and n[1].startswith(y_appr)]

In [36]: lookup_set('6652','46529')
Out[36]: [('665274', '4652941'), ('665266', '4652956')]

是否有更高效,更快捷的方法?

7 个答案:

答案 0 :(得分:2)

您可以在O(log(n) + m)时间内执行此操作,其中n是元组数,m是匹配元组的数量,如果,您可以负担得起保留元组的两个排序副本。 排序本身将花费O(nlog(n)),即它将渐近然后你的天真方法,但如果你必须做一定数量的查询(超过log(n),这是它几乎肯定很小)它会得到回报。

这个想法是你可以使用二分法来找到具有正确的第一个值和正确的第二个值的候选者,然后将这些集合交叉。

但请注意,您需要进行一种奇怪的比较:您需要关注以给定参数开头的所有字符串。这只是意味着在搜索最右边的事件时,您应该使用9 s填充密钥。

完整的工作(尽管没有经过严格测试)代码:

from random import randint
from operator import itemgetter

first = itemgetter(0)
second = itemgetter(1)

sa = [(str(randint(0, 1000000)), str(randint(0, 1000000))) for _ in range(300000)]
f_sorted = sorted(sa, key=first)
s_sorted = sa
s_sorted.sort(key=second)
max_length = max(len(s) for _,s in sa)

# See: bisect module from stdlib
def bisect_right(seq, element, key):
    lo = 0
    hi = len(seq)
    element = element.ljust(max_length, '9')
    while lo < hi:
        mid = (lo+hi)//2
        if element < key(seq[mid]):
            hi = mid
        else:
            lo = mid + 1
    return lo


def bisect_left(seq, element, key):
    lo = 0
    hi = len(seq)
    while lo < hi:
        mid = (lo+hi)//2
        if key(seq[mid]) < element:
            lo = mid + 1
        else:
            hi = mid
    return lo


def lookup_set(x_appr, y_appr):
    x_left = bisect_left(f_sorted, x_appr, key=first)
    x_right = bisect_right(f_sorted, x_appr, key=first)
    x_candidates = f_sorted[x_left:x_right + 1]
    y_left = bisect_left(s_sorted, y_appr, key=second)
    y_right = bisect_right(s_sorted, y_appr, key=second)
    y_candidates = s_sorted[y_left:y_right + 1]
    return set(x_candidates).intersection(y_candidates)

与初始解决方案的比较:

In [2]: def lookup_set2(x_appr, y_appr):
   ...:     return [n for n in sa if n[0].startswith(x_appr) and n[1].startswith(y_appr)]

In [3]: lookup_set('123', '124')
Out[3]: set([])

In [4]: lookup_set2('123', '124')
Out[4]: []

In [5]: lookup_set('123', '125')
Out[5]: set([])

In [6]: lookup_set2('123', '125')
Out[6]: []

In [7]: lookup_set('12', '125')
Out[7]: set([('12478', '125908'), ('124625', '125184'), ('125494', '125940')])

In [8]: lookup_set2('12', '125')
Out[8]: [('124625', '125184'), ('12478', '125908'), ('125494', '125940')]

In [9]: %timeit lookup_set('12', '125')
1000 loops, best of 3: 589 us per loop

In [10]: %timeit lookup_set2('12', '125')
10 loops, best of 3: 145 ms per loop

In [11]: %timeit lookup_set('123', '125')
10000 loops, best of 3: 102 us per loop

In [12]: %timeit lookup_set2('123', '125')
10 loops, best of 3: 144 ms per loop

正如您所看到的,此解决方案比您的天真方法快240-1400倍(在这些示例中)。

如果你有一大堆比赛:

In [19]: %timeit lookup_set('1', '2')
10 loops, best of 3: 27.1 ms per loop

In [20]: %timeit lookup_set2('1', '2')
10 loops, best of 3: 152 ms per loop

In [21]: len(lookup_set('1', '2'))
Out[21]: 3587
In [23]: %timeit lookup_set('', '2')
10 loops, best of 3: 182 ms per loop

In [24]: %timeit lookup_set2('', '2')
1 loops, best of 3: 212 ms per loop

In [25]: len(lookup_set2('', '2'))
Out[25]: 33053

正如您所看到的,即使匹配数量约为总大小的10%,此解决方案也会更快。但是,如果您尝试匹配 all 数据:

In [26]: %timeit lookup_set('', '')
1 loops, best of 3: 360 ms per loop

In [27]: %timeit lookup_set2('', '')
1 loops, best of 3: 221 ms per loop

它变得(不是那么)慢,虽然这是一个非常特殊的情况,我怀疑你会经常匹配几乎所有的元素。

请注意,sort数据的时间非常短:

In [13]: from random import randint
    ...: from operator import itemgetter
    ...: 
    ...: first = itemgetter(0)
    ...: second = itemgetter(1)
    ...: 
    ...: sa2 = [(str(randint(0, 1000000)), str(randint(0, 1000000))) for _ in range(300000)]

In [14]: %%timeit
    ...: f_sorted = sorted(sa2, key=first)
    ...: s_sorted = sorted(sa2, key=second)
    ...: max_length = max(len(s) for _,s in sa2)
    ...: 
1 loops, best of 3: 881 ms per loop

正如您所看到的,执行两个已排序的副本只需不到一秒钟。实际上上面的代码会稍快一些,因为它会对第二个副本进行“就地”排序(虽然时间排序仍然需要O(n)内存)。

这意味着如果您必须执行超过6-8次查询,此解决方案将更快。


注意:python'd标准库提供了bisect模块。但是它不允许key参数(即使我记得读过Guido想要它,所以它可能在将来添加)。因此,如果你想直接使用它,你将不得不使用“decorate-sort-undecorate”习语。

而不是:

f_sorted = sorted(sa, key=first)

你应该这样做:

f_sorted = sorted((first, (first,second)) for first,second in sa)

即。您明确地插入键作为元组的第一个元素。之后,您可以使用('123', '')作为元素传递给bisect_*函数,它应该找到正确的索引。

我决定避免这个。我从模块的源代码中复制粘贴代码并稍加修改,以便为您的用例提供更简单的界面。


最后评论:如果你可以将元组元素转换为整数,那么比较会更快。但是,大部分时间仍然会用于执行集合的交集,因此我不确切知道它会提高多少性能。

答案 1 :(得分:1)

整数操作比字符串快得多。 (内存也较小)

因此,如果您可以比较整数,那么您将会更快。 我怀疑这样的事情对你有用:

sa = set(int(o.node) for o in vrts_l2_5) 

然后这可能适合你:

def lookup_set(samples, x_appr, x_len, y_appr, y_len):
    """

    x_appr == SSS0000  where S is the digit to search for
    x_len == number of digits to S (if SSS0000 then x_len == 4)
    """
    return ((x, y) for x, y in samples if round(x, -x_len) ==  x_appr and round(y, -y_len) == y_approx)

此外,它返回一个生成器,因此您不会立即将所有结果加载到内存中。

更新为使用Bakuriu提到的圆方法

答案 2 :(得分:1)

您可以使用trie data structure。可以使用dict对象树构建一个(参见How to create a TRIE in Python)但是有一个包marisa-trie通过绑定到c ++库来实现一个内存高效的版本

我以前没有使用过这个库,但是玩弄它,我得到了它的工作:

from random import randint
from marisa_trie import RecordTrie

sa = [(str(randint(1000000,9999999)),str(randint(1000000,9999999))) for i in range(100000)]
# make length of string in packed format big enough!
fmt = ">10p10p"
sa_tries = (RecordTrie(fmt, zip((unicode(first) for first, _ in sa), sa)),
            RecordTrie(fmt, zip((unicode(second) for _, second in sa), sa)))

def lookup_set(sa_tries, x_appr, y_appr):
    """lookup prefix in the appropriate trie and intersect the result"""
     return (set(item[1] for item in sa_tries[0].items(unicode(x_appr))) & 
             set(item[1] for item in sa_tries[1].items(unicode(y_appr))))

lookup_set(sa_tries, "2", "4")

答案 3 :(得分:1)

我经历了4个建议的解决方案,以比较它们的效率。我使用不同的前缀长度运行测试,以查看输入将如何影响性能。 trie和排序列表性能肯定对输入的长度敏感,随着输入变得越来越快(我认为它实际上对输出的大小敏感,因为输出随着前缀变长而变小)。但是,在所有情况下,有序集合解决方案肯定更快。

在这些时序测试中,sa中有20万个元组,每种方法有10个元组:

for prefix length 1
  lookup_set_startswith    : min=0.072107 avg=0.073878 max=0.077299
  lookup_set_int           : min=0.030447 avg=0.037739 max=0.045255
  lookup_set_trie          : min=0.111548 avg=0.124679 max=0.147859
  lookup_set_sorted        : min=0.012086 avg=0.013643 max=0.016096
for prefix length 2
  lookup_set_startswith    : min=0.066498 avg=0.069850 max=0.081271
  lookup_set_int           : min=0.027356 avg=0.034562 max=0.039137
  lookup_set_trie          : min=0.006949 avg=0.010091 max=0.032491
  lookup_set_sorted        : min=0.000915 avg=0.000944 max=0.001004
for prefix length 3
  lookup_set_startswith    : min=0.065708 avg=0.068467 max=0.079485
  lookup_set_int           : min=0.023907 avg=0.033344 max=0.043196
  lookup_set_trie          : min=0.000774 avg=0.000854 max=0.000929
  lookup_set_sorted        : min=0.000149 avg=0.000155 max=0.000163
for prefix length 4
  lookup_set_startswith    : min=0.065742 avg=0.068987 max=0.077351
  lookup_set_int           : min=0.026766 avg=0.034558 max=0.052269
  lookup_set_trie          : min=0.000147 avg=0.000167 max=0.000189
  lookup_set_sorted        : min=0.000065 avg=0.000068 max=0.000070

以下是代码:

import random
def random_digits(num_digits):
    return random.randint(10**(num_digits-1), (10**num_digits)-1)

sa = [(str(random_digits(6)),str(random_digits(7))) for _ in range(200000)]

### naive approach
def lookup_set_startswith(x_appr, y_appr):
    return [item for item in sa if item[0].startswith(x_appr) and item[1].startswith(y_appr) ]

### trie approach
from marisa_trie import RecordTrie

# make length of string in packed format big enough!
fmt = ">10p10p"
sa_tries = (RecordTrie(fmt, zip([unicode(first) for first, second in sa], sa)),
         RecordTrie(fmt, zip([unicode(second) for first, second in sa], sa)))

def lookup_set_trie(x_appr, y_appr):
 # lookup prefix in the appropriate trie and intersect the result
 return set(item[1] for item in sa_tries[0].items(unicode(x_appr))) & \
        set(item[1] for item in sa_tries[1].items(unicode(y_appr)))

### int approach
sa_ints = [(int(first), int(second)) for first, second in sa]

sa_lens = tuple(map(len, sa[0]))

def lookup_set_int(x_appr, y_appr):
    x_limit = 10**(sa_lens[0]-len(x_appr))
    y_limit = 10**(sa_lens[1]-len(y_appr))

    x_int = int(x_appr) * x_limit
    y_int = int(y_appr) * y_limit

    return [sa[i] for i, int_item in enumerate(sa_ints) \
        if (x_int <= int_item[0] and int_item[0] < x_int+x_limit) and \
           (y_int <= int_item[1] and int_item[1] < y_int+y_limit) ]

### sorted set approach
from operator import itemgetter

first = itemgetter(0)
second = itemgetter(1)

sa_sorted = (sorted(sa, key=first), sorted(sa, key=second))
max_length = max(len(s) for _,s in sa)

# See: bisect module from stdlib
def bisect_right(seq, element, key):
    lo = 0
    hi = len(seq)
    element = element.ljust(max_length, '9')
    while lo < hi:
        mid = (lo+hi)//2
        if element < key(seq[mid]):
            hi = mid
        else:
            lo = mid + 1
    return lo


def bisect_left(seq, element, key):
    lo = 0
    hi = len(seq)
    while lo < hi:
        mid = (lo+hi)//2
        if key(seq[mid]) < element:
            lo = mid + 1
        else:
            hi = mid
    return lo


def lookup_set_sorted(x_appr, y_appr):
    x_left = bisect_left(sa_sorted[0], x_appr, key=first)
    x_right = bisect_right(sa_sorted[0], x_appr, key=first)
    x_candidates = sa_sorted[0][x_left:x_right]
    y_left = bisect_left(sa_sorted[1], y_appr, key=second)
    y_right = bisect_right(sa_sorted[1], y_appr, key=second)
    y_candidates = sa_sorted[1][y_left:y_right]
    return set(x_candidates).intersection(y_candidates)     


####
# test correctness
ntests = 10

candidates = [lambda x, y: set(lookup_set_startswith(x,y)), 
              lambda x, y: set(lookup_set_int(x,y)),
              lookup_set_trie, 
              lookup_set_sorted]
print "checking correctness (or at least consistency)..."
for dlen in range(1,5):
    print "prefix length %d:" % dlen,
    for i in range(ntests):
        print " #%d" % i,
        prefix = map(str, (random_digits(dlen), random_digits(dlen)))
        answers = [c(*prefix) for c in candidates]
        for i, ans in enumerate(answers):
            for j, ans2 in enumerate(answers[i+1:]):
                assert ans == ans2, "answers for %s for #%d and #%d don't match" \
                                    % (prefix, i, j+i+1)
    print


####
# time calls
import timeit
import numpy as np

ntests = 10

candidates = [lookup_set_startswith,
              lookup_set_int,
              lookup_set_trie, 
              lookup_set_sorted]

print "timing..."
for dlen in range(1,5):
    print "for prefix length", dlen

    times = [ [] for c in candidates ]
    for _ in range(ntests):
        prefix = map(str, (random_digits(dlen), random_digits(dlen)))

        for c, c_times in zip(candidates, times):
            tstart = timeit.default_timer()
            trash = c(*prefix)
            c_times.append(timeit.default_timer()-tstart)
    for c, c_times in zip(candidates, times):
        print "  %-25s: min=%f avg=%f max=%f" % (c.func_name, min(c_times), np.mean(c_times), max(c_times))

答案 4 :(得分:0)

可能有,但不是非常多。 str.startswithand都是快捷操作符(一旦发现失败就可以返回),索引元组是一个快速操作。这里花费的大部分时间都来自对象查找,例如为每个字符串查找startswith方法。可能最有价值的选择是通过Pypy运行它。

答案 5 :(得分:0)

更快的解决方案是创建字典dict并将第一个值作为键,第二个值作为值。

  1. 然后,您将在dict的有序键列表中搜索与x_appr匹配的键(有序列表将允许您在键列表中优化搜索,例如二分法)。这将提供一个以例如k_list命名的密钥列表。

  2. 然后查找具有k_list中的密钥并匹配y_appr的dict的值。

  3. 您还可以在追加到k_list之前包含第二步(与y_appr匹配的值)。这样k_list将包含dict的正确元素的所有键。

答案 6 :(得分:0)

在这里,我只是比较了'in'方法和'find'方法:

CSV输入文件包含URL列表

# -*- coding: utf-8 -*-

### test perfo str in set

import re
import sys
import time
import json
import csv
import timeit

cache = set()

#######################################################################

def checkinCache(c):
  global cache
  for s in cache:
    if c in s:
      return True
  return False

#######################################################################

def checkfindCache(c):
  global cache
  for s in cache:
    if s.find(c) != -1:
      return True
  return False

#######################################################################

print "1/3-loading pages..."
with open("liste_all_meta.csv.clean", "rb") as f:
    reader = csv.reader(f, delimiter=",")
    for i,line in enumerate(reader):
      cache.add(re.sub("'","",line[2].strip()))

print "  "+str(len(cache))+" PAGES IN CACHE"

print "2/3-test IN..."
tstart = timeit.default_timer()
for i in range(0, 1000):
  checkinCache("string to find"+str(i))
print timeit.default_timer()-tstart

print "3/3-test FIND..."
tstart = timeit.default_timer()
for i in range(0, 1000):
  checkfindCache("string to find"+str(i))
print timeit.default_timer()-tstart

print "\n\nBYE\n"

以秒为单位的结果:

1/3-loading pages...
  482897 PAGES IN CACHE
2/3-test IN...
107.765980005
3/3-test FIND...
167.788629055


BYE

所以,'in'方法比'find'方法:)

玩得开心