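Optimizing regex search over a text corpus in Python

Date: 2015-02-15 23:28:36

Tags: python regex

Here is my requirement in Python:

I load a dictionary, for example from /usr/share/dict/words (not to be confused with the dict type), and use it to search for valid words. Currently I do the following: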

import re

# Load the word list once; the file is closed when the block exits.
with open('dictionary', 'r') as f:
    dict_list = f.read().split()

def search_dictionary(key):
    p = re.compile(key)
    # Comment: Is 'key' a prefix for a valid word in dictionary?
    # If yes, return True, else return False
    tmp_list = [x for x in dict_list if p.match(x)]
    if not tmp_list:
        ....
    else:
        ....

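Note that search_dictionary may be called many times, and it is currently the bottleneck. Is there a more efficient way to do this string search, for example by precompiling the dictionary? Think of a dictionary-attack use case. I am a relative beginner.

Edit: I have updated the code with a comment. As suggested in the comments, I may be doing more work than necessary.

2 answers:

Answer 0: (score: 3)

Your algorithm runs in O(n) time with a large constant. That seems wrong when a simple binary search can do the job in O(lg n). If your regex contains no special characters, why not: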

import bisect
with open('dictionary') as f:
    dictionary = f.read().split()

    # .sort is slow, so better to sort
    # words on disk! And/or run many searches
    # for one invocation
    dictionary.sort()

def bisect_search(key):
    i = bisect.bisect_left(dictionary, key)
    if i != len(dictionary):
        return dictionary[i].startswith(key)

    return False

Sort the array, then find the lexicographically "smallest" word with word >= key, and check whether the given key is a prefix of that word.
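To see why checking a single candidate is enough, here is a minimal sketch on a small in-memory list. The sample words and the name has_prefix are made up for illustration; the real dictionary would be loaded from disk as above.

```python
import bisect

# Hypothetical sample data; the list must be sorted for bisect to work.
words = sorted(["apple", "thud", "thumb", "thunder", "zebra"])

def has_prefix(sorted_words, key):
    """Return True if some word in sorted_words starts with key."""
    # Index of the first word that compares >= key.
    i = bisect.bisect_left(sorted_words, key)
    # Every word starting with key sorts at or after key, so if any such
    # word exists, the word at index i is one of them.
    return i < len(sorted_words) and sorted_words[i].startswith(key)

print(has_prefix(words, "thu"))  # True: "thud" is the first word >= "thu"
print(has_prefix(words, "thx"))  # False: the first word >= "thx" is "zebra"
print(has_prefix(words, "zz"))   # False: no word is >= "zz"
```

Words that share a prefix form one contiguous run in the sorted list, which is why the first word at or after the key decides the answer.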

Speed test using the dictionary from Padraic's answer: bisect first, then the linear search:

In [1]: %timeit bisect_search('thu')
1000000 loops, best of 3: 1.07 µs per loop

In [2]: %timeit search_dictionary('thu')
1000 loops, best of 3: 595 µs per loop

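The dictionary contains 146880 words, 5760 of which start with t.

Answer 1: (score: 2)

If you want to stop at the first match, use any, which short-circuits as soon as a match is found. Your code always looks at every word in the dictionary, even when the very first one matches, and it also builds a list unnecessarily: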

import re

dict_list = open('dictionary').read().split()

def search_dictionary(key):
    p = re.compile(key)
    if any(p.match(x) for x in dict_list):
    .....
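The short-circuiting can be made visible with a small sketch. The sample list and the counting wrapper are hypothetical, added only to count how many words each approach inspects.

```python
import re

words = ["thumb", "apple", "zebra", "thunder"]  # hypothetical sample data
pattern = re.compile("thu")

calls = {"list": 0, "any": 0}

def counted_match(word, label):
    """Call pattern.match and record that the call happened."""
    calls[label] += 1
    return pattern.match(word)

# List comprehension: tests every word before the result is examined.
found_list = bool([w for w in words if counted_match(w, "list")])

# any() over a generator: stops at the first truthy match.
found_any = any(counted_match(w, "any") for w in words)

print(found_list, calls["list"])  # True 4
print(found_any, calls["any"])    # True 1
```

When no word matches, both versions scan the whole list, so the gain from any depends on how early a match tends to appear.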

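You should also create the dictionary only once rather than every time you call the function. Define it at the start of your code and pass it as an argument where needed.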
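One way to arrange that, as a rough sketch: the names load_words and words are assumptions for illustration, and the file name 'dictionary' is taken from the question.

```python
def load_words(path):
    """Read a whitespace-separated word list once and return it as a list."""
    with open(path) as f:
        return f.read().split()

def search_dictionary(key, words):
    """Return True if key is a prefix of any word in words."""
    return any(w.startswith(key) for w in words)

# Usage sketch: load once at start-up, then reuse the list for every call.
# words = load_words('dictionary')
# search_dictionary('thu', words)

# In-memory demonstration with hypothetical sample words.
sample = ["apple", "thumb", "zebra"]
print(search_dictionary("thu", sample))  # True
print(search_dictionary("xy", sample))   # False
```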

If you are looking for a prefix, str.startswith may be faster:

if any(x.startswith(key) for x in dict_list): 

Optimizing the call by binding the method to a local name first:

check = str.startswith
if any(check(x, key) for x in dict_list):

Or, if the key can appear anywhere in the word, just use in:

if any(key in x for x in dict_list): 

With CPython 2.7, the optimized str.startswith approach appears to be more efficient:

In [15]: s ="efficient"

In [16]: timeit p.match(s)
1000000 loops, best of 3: 359 ns per loop

In [17]: check = str.startswith

In [18]: timeit check(s,"eff")
1000000 loops, best of 3: 212 ns per loop

The difference for non-matches is roughly the same.
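Outside IPython, a comparison along these lines can be run with the standard timeit module. This is only a sketch: the pattern "eff" is an assumption, since the transcript above does not show how p was compiled, and the lambda wrappers add some overhead to both measurements.

```python
import re
import timeit

s = "efficient"
p = re.compile("eff")      # assumed pattern; not shown in the original
check = str.startswith

n = 100000                 # number of calls per measurement

# timeit.timeit returns the total time in seconds for n executions.
t_regex = timeit.timeit(lambda: p.match(s), number=n)
t_starts = timeit.timeit(lambda: check(s, "eff"), number=n)

print("regex match:    %.0f ns per call" % (t_regex / n * 1e9))
print("str.startswith: %.0f ns per call" % (t_starts / n * 1e9))
```

Absolute numbers vary with the machine and the Python version, so only the ratio between the two is worth comparing.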

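If you build an actual dict from the word list, where the keys are the letters a to z and the values are lists of the words starting with that letter, your function can look up the first letter of each key and search only the words that start with the same letter.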

from collections import defaultdict
word_dict = defaultdict(list)

with open("/usr/share/dict/words") as f:
    for line in f:
        line = line.rstrip().lower()
        if line:  # skip blank lines, which have no first letter
            word_dict[line[0]].append(line)

You can see sample output for the key "z":

word_dict["z"]
['z', "z's", 'zachariah', "zachariah's", 'zachary', "zachary's", 'zachery', "zachery's", 'zagreb', "zagreb's", 'zaire', "zaire's", 'zairian', 'zambezi', "zambezi's", 'zambia', "zambia's", 'zambian', "zambian's", 'zambians', 'zamboni', 'zamenhof', "zamenhof's", 'zamora', 'zane', "zane's", 'zanuck', "zanuck's", 'zanzibar', "zanzibar's", 'zapata', 'zaporozhye', 'zapotec', 'zappa', "zappa's", 'zara', "zara's", 'zebedee', 'zechariah', 'zedekiah', "zedekiah's", 'zedong', "zedong's", 'zeffirelli', "zeffirelli's", 'zeke', "zeke's", 'zelig', 'zelma', "zelma's", 'zen', "zen's", 'zenger', "zenger's", 'zeno', "zeno's", 'zens', 'zephaniah', 'zephyrus', 'zeppelin', 'zest', "zest's", 'zeus', "zeus's", 'zhengzhou', 'zhivago', "zhivago's", 'zhukov', 'zibo', "zibo's", 'ziegfeld', 'ziegler', "ziegler's", 'ziggy', "ziggy's", 'zimbabwe', "zimbabwe's", 'zimbabwean', "zimbabwean's", 'zimbabweans', 'zimmerman', "zimmerman's", 'zinfandel', "zinfandel's", 'zion', "zion's", 'zionism', "zionism's", 'zionisms', 'zionist', "zionist's", 'zionists', 'zions', 'ziploc', 'zn', "zn's", 'zoe', "zoe's", 'zola', "zola's", 'zollverein', 'zoloft', 'zomba', "zomba's", 'zorn', 'zoroaster', "zoroaster's", 'zoroastrian', "zoroastrian's", 'zoroastrianism', "zoroastrianism's", 'zoroastrianisms', 'zorro', "zorro's", 'zosma', "zosma's", 'zr', "zr's", 'zsigmondy', 'zubenelgenubi', "zubenelgenubi's", 'zubeneschamali', "zubeneschamali's", 'zukor', "zukor's", 'zulu', "zulu's", 'zulus', 'zuni', 'zwingli', "zwingli's", 'zworykin', 'zyrtec', "zyrtec's", 'zyuganov', "zyuganov's", 'zürich', "zürich's", 'z', 'zanier', 'zanies', 'zaniest', 'zaniness', "zaniness's", 'zany', "zany's", 'zap', "zap's", 'zapped', 'zapping', 'zaps', 'zeal', "zeal's", 'zealot', "zealot's", 'zealots', 'zealous', 'zealously', 'zealousness', "zealousness's", 'zebra', "zebra's", 'zebras', 'zebu', "zebu's", 'zebus', 'zed', "zed's", 'zeds', 'zenith', "zenith's", 'zeniths', 'zephyr', "zephyr's", 'zephyrs', 'zeppelin', "zeppelin's", 'zeppelins', 'zero', 
"zero's", 'zeroed', 'zeroes', 'zeroing', 'zeros', 'zest', "zest's", 'zestful', 'zestfully', 'zests', 'zeta', 'zigzag', "zigzag's", 'zigzagged', 'zigzagging', 'zigzags', 'zilch', "zilch's", 'zillion', "zillion's", 'zillions', 'zinc', "zinc's", 'zinced', 'zincing', 'zincked', 'zincking', 'zincs', 'zing', "zing's", 'zinged', 'zinger', "zinger's", 'zingers', 'zinging', 'zings', 'zinnia', "zinnia's", 'zinnias', 'zip', "zip's", 'zipped', 'zipper', "zipper's", 'zippered', 'zippering', 'zippers', 'zippier', 'zippiest', 'zipping', 'zippy', 'zips', 'zircon', "zircon's", 'zirconium', "zirconium's", 'zircons', 'zit', "zit's", 'zither', "zither's", 'zithers', 'zits', 'zodiac', "zodiac's", 'zodiacal', 'zodiacs', 'zombi', "zombi's", 'zombie', "zombie's", 'zombies', 'zombis', 'zonal', 'zone', "zone's", 'zoned', 'zones', 'zoning', 'zonked', 'zoo', "zoo's", 'zoological', 'zoologist', "zoologist's", 'zoologists', 'zoology', "zoology's", 'zoom', "zoom's", 'zoomed', 'zooming', 'zooms', 'zoos', 'zucchini', "zucchini's", 'zucchinis', 'zwieback', "zwieback's", 'zygote', "zygote's", 'zygotes']

So use word_dict[key[0]] to get the list of words that share the key's first letter:

def search_dictionary(key):
    check = str.startswith
    # Only scan the words that share the key's first letter.
    vals = word_dict[key[0]]
    return any(check(x, key) for x in vals)

I am not sure whether you care about case, so you may want to remove the call to lower depending on your needs.
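If case-insensitive matching is what you want, the key has to be lower-cased as well. A minimal sketch under that assumption follows; the names build_index and search_index and the sample words are hypothetical, and treating an empty key as a non-match is a choice made here, not something stated in the question.

```python
from collections import defaultdict

def build_index(words):
    """Group lower-cased words by their first character."""
    index = defaultdict(list)
    for word in words:
        word = word.strip().lower()
        if word:  # skip blank entries
            index[word[0]].append(word)
    return index

def search_index(index, key):
    """Return True if key, ignoring case, is a prefix of an indexed word."""
    key = key.lower()
    if not key:
        return False  # assumption: an empty key matches nothing
    # .get avoids creating an empty bucket in the defaultdict on a miss.
    return any(w.startswith(key) for w in index.get(key[0], ()))

index = build_index(["Zebra", "zest", "Apple"])  # hypothetical sample words
print(search_index(index, "ZE"))  # True
print(search_index(index, "zi"))  # False
print(search_index(index, "b"))   # False
```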