Question

Let's say I have a database of about 2,000 keywords, each of which maps to a few common variants.

For example:

 "Node" : ["node.js", "nodejs", "node js", "node"] 

 "Ruby on Rails" : ["RoR", "Rails", "Ruby on Rails"]

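I want to search a string (OK, a document) and return a list of all the keywords it contains.

I know I could loop through a ton of regex searches, but is there a more efficient way to do this? Something approximating "real time" or near-real time for a web app?

I'm currently looking at the Elasticsearch documentation, but I'd like to know whether there is a Pythonic way to achieve my result.

I'm pretty familiar with regex, but I don't want to write that many regular expressions right now. I'd appreciate your answers, or a pointer in the right direction.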

Answer 1

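You could use a data structure that inverts this keyword dictionary, so that each of ["node.js", "nodejs", "node js", "node", "Node"] is a key with the value "Node". With 10 or so variants for each of the 2,000 keywords, every variant pointing to its keyword, that is a dictionary of about 20,000 entries, which is not much.

Using that dict, you can retokenize your text so that it consists only of the normalized forms of the keywords, and then go on to count them.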

from collections import Counter

primary_dict = {
    "Node": ["node.js", "nodejs", "node js", "node", "Node"],
    "Ruby_on_Rails": ["RoR", "Rails", "Ruby on Rails"],
}

def invert_dict(src):
    # Map every variant back to its canonical keyword.
    dst = {}
    for key, values in src.items():
        for value in values:
            dst[value] = key
    return dst

words = invert_dict(primary_dict)

def count_keywords(text):
    counted = Counter()
    for word in text.split():  # or use a regex to split on punctuation signs as well
        keyword = words.get(word)
        if keyword is not None:  # skip words that are not known variants
            counted[keyword] += 1
    return counted

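As for efficiency, this approach is quite good: each word in the text is looked up in the dictionary only once, and Python's dict lookup is O(1) on average, so the whole pass is O(n) in the number of words. Trying a single mega-regex, as you were thinking, can cost more, because a backtracking engine such as Python's may try many of the alternatives at each position in the text, however fast the regex matching itself is (and it is not that fast compared with a dict lookup).

If the text is too long, pre-tokenizing it with a simple split (or a regex) may not be feasible. In that case, you can read just one chunk of text at a time and split those small chunks into words.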
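A minimal sketch of that chunked reading, assuming the text arrives as a file-like object and that one line is a reasonable chunk. The `WORDS` mapping, the function name, and the punctuation stripped from each word are illustrative choices, not part of the original answer.

```python
import io
from collections import Counter

# Hypothetical inverted dictionary: lowercase variant -> canonical keyword.
WORDS = {
    "node": "Node", "nodejs": "Node", "node.js": "Node",
    "rails": "Ruby_on_Rails", "ror": "Ruby_on_Rails",
}

def count_keywords_in_stream(stream, words=WORDS):
    """Count keywords while holding only one line of text in memory."""
    counted = Counter()
    for line in stream:  # file-like objects yield one line at a time
        for word in line.lower().split():
            # Strip punctuation from both ends only, so "node.js" survives.
            keyword = words.get(word.strip(".,;:!?()\"'"))
            if keyword is not None:
                counted[keyword] += 1
    return counted

sample = io.StringIO("We use nodejs in production.\nRails, too, and node.js elsewhere.\n")
result = count_keywords_in_stream(sample)
print(result)
```

The same function should work with an object returned by `open()`, since it only relies on line-by-line iteration.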

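Other approach

Since you don't need a count for each word, an alternative is to create Python sets from the words in the document and from all the keyword variants in your list, and then take the intersection of the two sets. You then only have to look up the members of this intersection in the inverted `words` dictionary above.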
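The set-based variant might look like the sketch below. The small `words` mapping stands in for the inverted dictionary, and `find_keywords` is a name made up for this example.

```python
# Hypothetical inverted dictionary: variant -> canonical keyword.
words = {
    "node.js": "Node", "nodejs": "Node", "node": "Node",
    "RoR": "Ruby_on_Rails", "Rails": "Ruby_on_Rails",
}

def find_keywords(text, words=words):
    """Return the set of canonical keywords whose variants occur in text."""
    document_words = set(text.split())
    # Iterating a dict yields its keys, so this intersects with the variants.
    matched_variants = document_words.intersection(words)
    return {words[variant] for variant in matched_variants}

found = find_keywords("We moved from Rails to nodejs and node last year")
print(found)
```

Only the variants that actually occur in the document are looked up, so the work after the intersection is proportional to the number of hits.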

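Catch: none of this takes into account terms that contain spaces. I have been assuming that words can be tokenized and matched individually, but str.split and a simple punctuation-stripping regex cannot account for compound terms such as 'ruby on rails' and 'node js'. If there is no other workaround, then instead of 'split' you will have to write a custom tokenizer that tries to match sets of one, two, and three words across the whole text against the inverted dict.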
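One way such a tokenizer could work is a greedy longest-match scan over word n-grams. This is only a sketch under assumptions: variants are stored in lowercase, words are separated by whitespace, and `count_keywords_ngrams` and its `max_words` parameter are hypothetical names.

```python
from collections import Counter

# Hypothetical inverted dictionary with lowercase variants.
words = {
    "node.js": "Node", "nodejs": "Node", "node js": "Node", "node": "Node",
    "ror": "Ruby_on_Rails", "rails": "Ruby_on_Rails",
    "ruby on rails": "Ruby_on_Rails",
}

def count_keywords_ngrams(text, words=words, max_words=3):
    """Match runs of up to max_words words against the inverted dict."""
    tokens = [t.strip(".,;:!?()\"'") for t in text.lower().split()]
    counted = Counter()
    i = 0
    while i < len(tokens):
        # Try the longest candidate first so "ruby on rails" beats "rails".
        for size in range(max_words, 0, -1):
            keyword = words.get(" ".join(tokens[i:i + size]))
            if keyword is not None:
                counted[keyword] += 1
                i += size  # skip the words consumed by this match
                break
        else:
            i += 1  # no variant starts at this word
    return counted

result = count_keywords_ngrams(
    "I like Ruby on Rails, but node js and Node.js are fine too."
)
print(result)
```

Each position costs at most `max_words` dict lookups, so the scan stays linear in the length of the text.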

Answer 2

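An alternative approach, useful for tokenizing long strings, is to construct a single omnibus regular expression and then use named groups to identify the tokens. It takes a little setup, but the recognition phase is pushed into C/native code and takes just one pass, so it can be quite efficient. For example: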

import re

tokens = {
    'a': ['andy', 'alpha', 'apple'],
    'b': ['baby']
}

def create_macro_re(tokens, flags=0):
    """
    Given a dict in which keys are token names and values are lists
    of strings that signify the token, return a macro re that encodes
    the entire set of tokens.
    """
    d = {}
    for token, vals in tokens.items():
        d[token] = '(?P<{}>{})'.format(token, '|'.join(vals))
    combined = '|'.join(d.values())
    return re.compile(combined, flags)

def find_tokens(macro_re, s):
    """
    Given a macro re constructed by `create_macro_re()` and a string,
    return a list of tuples giving the token name and actual string matched
    against the token.
    """
    found = []
    for match in re.finditer(macro_re, s):
        found.append([(t, v) for t, v in match.groupdict().items() if v is not None][0])
    return found

Final step, running it:

macro_pat = create_macro_re(tokens, re.I)
print(find_tokens(macro_pat, 'this is a string of baby apple Andy'))

macro_pat ends up corresponding to:

re.compile(r'(?P<a>andy|alpha|apple)|(?P<b>baby)', re.IGNORECASE)

And the second line prints a list of tuples, each giving the token and the actual string matched against the token:

[('b', 'baby'), ('a', 'apple'), ('a', 'Andy')]

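This example shows how lists of tokens can be compiled into a single regular expression that can be run efficiently against a string in a single pass.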
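Applied to the keywords from the question, the same idea might look like the sketch below. It departs from the code above in a few ways that are assumptions of this example: variants are treated as literal strings and passed through `re.escape`, longer variants are tried first, lookarounds keep matches from starting or ending inside a larger word, and `build_pattern` is a made-up name. Group names must be valid identifiers, hence `Ruby_on_Rails`.

```python
import re

keywords = {
    "Node": ["node.js", "nodejs", "node js", "node"],
    "Ruby_on_Rails": ["RoR", "Rails", "Ruby on Rails"],
}

def build_pattern(keywords):
    """Compile one regex with a named group per canonical keyword."""
    groups = []
    for name, variants in keywords.items():
        # Longest variants first, so "node.js" is tried before "node".
        ordered = sorted(variants, key=len, reverse=True)
        alternatives = "|".join(re.escape(v) for v in ordered)
        groups.append("(?P<{}>{})".format(name, alternatives))
    # The lookarounds reject matches embedded in a longer word.
    combined = r"(?<!\w)(?:{})(?!\w)".format("|".join(groups))
    return re.compile(combined, re.IGNORECASE)

pattern = build_pattern(keywords)
text = "We run Node.js services next to a Ruby on Rails app; RoR came first."
# match.lastgroup is the name of the group that matched.
found = sorted({m.lastgroup for m in pattern.finditer(text)})
print(found)
```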

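Left unshown is one of its great strengths: tokens can be defined not just through strings but through regular expressions. So if we want alternate spellings of the b token, for example, we don't have to list them exhaustively. Normal regex patterns suffice. Say we also wanted to recognize 'babby' as a b token. We could write 'b': ['baby', 'babby'] as before, or we could use a regex to do the same thing: 'b': ['babb?y']. Or 'bab+y' if you also want to include arbitrary internal 'b' characters.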
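A quick check of the regex-defined token, written as a standalone sketch so it runs on its own. It rebuilds the pattern inline the same way `create_macro_re()` does and reads the token name from `match.lastgroup` rather than from `groupdict()`; that substitution is a choice made for this example.

```python
import re

# Token definitions may mix literal strings and regex fragments.
tokens = {
    'a': ['andy', 'alpha', 'apple'],
    'b': ['bab+y'],  # matches 'baby', 'babby', 'babbby', ...
}

# One alternation of named groups, built as in create_macro_re().
combined = '|'.join(
    '(?P<{}>{})'.format(name, '|'.join(patterns))
    for name, patterns in tokens.items()
)
macro_pat = re.compile(combined, re.I)

# lastgroup names the group that produced each match.
found = [(m.lastgroup, m.group())
         for m in macro_pat.finditer('a babby ate an Apple, baby')]
print(found)
```

Because the variants are inserted into the pattern unescaped, any literal string that contains regex metacharacters (such as the dot in 'node.js') would need `re.escape` if it is meant literally.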

Doing string searches in Python efficiently
