Question

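I'm building a backend and trying to solve the following problem.

Clients submit text to the backend (about 2,000 characters on average)
The backend endpoint that receives the request has to apply phrase highlighting to the submitted text
There are about 80k phrases to match. A phrase is a simple object: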
```
{
    'phrase': 'phrase to match',
    'link': 'link_url'
}
```
After finding all the phrases that occur in the text, the backend returns the matches to the client - basically a map:
```
range in text -> phrase
```

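Most of this is done. I'm about to tackle coding the phrase-matching part. Everything else works fine. Since I don't want to reinvent the wheel, I tried googling for a Python library that efficiently finds phrases (from a huge list) in text. However, I couldn't find anything.

I looked at BlueSoup and the Natural Language Toolkit. However, they don't seem to do what I'm looking for.

Do you know of a library that would help with this task? It seems like a common thing to implement, and I don't want to go custom if there is a well-established library for it.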

Answer 1

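To get reasonable speed while matching 80k patterns, you definitely need some preprocessing of the patterns; single-shot algorithms like Boyer-Moore won't help much.

You'll probably also need to do the work in compiled code (think C extension) to get reasonable throughput. As for how to preprocess the patterns - one option is a state machine like Aho-Corasick or some generic finite state transducer. The next option is something like a suffix array based index, and the last one that comes to my mind is an inverted index.

If your matches are exact and the patterns respect word boundaries, chances are that a well-implemented word or word-ngram keyed inverted index will be fast enough even in pure Python. The index is not a complete solution: it gives you a few candidate phrases, which you then need to check with normal string matching to confirm a complete match.

If you need approximate matching, a character-ngram inverted index is your choice.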
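To make the character-ngram idea concrete, here is a minimal pure-Python sketch. It is only an illustration under assumptions that are not in the answer: trigrams, the helper names (`char_ngrams`, `build_ngram_index`, `approx_lookup`), the two thresholds, and `difflib.SequenceMatcher` as the verification step are all my own choices. It looks up one candidate span at a time; cutting the submitted text into candidate spans (for example, word windows) is left out.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def char_ngrams(s, n=3):
    # Pad with spaces so that the edges of the string form n-grams too.
    s = ' ' + s.lower() + ' '
    return {s[i:i + n] for i in range(len(s) - n + 1)}


def build_ngram_index(phrases, n=3):
    # Inverted index: n-gram -> ids of the phrases that contain it.
    index = defaultdict(set)
    for pid, phrase in enumerate(phrases):
        for gram in char_ngrams(phrase, n):
            index[gram].add(pid)
    return index


def approx_lookup(query, phrases, index, n=3, min_overlap=0.5, min_ratio=0.8):
    # Step 1: count shared n-grams per phrase (cheap candidate filter).
    grams = char_ngrams(query, n)
    shared = defaultdict(int)
    for gram in grams:
        for pid in index.get(gram, ()):
            shared[pid] += 1
    # Step 2: verify the surviving candidates with a real similarity measure.
    results = []
    for pid, count in shared.items():
        if count / len(grams) < min_overlap:
            continue
        ratio = SequenceMatcher(None, query.lower(), phrases[pid].lower()).ratio()
        if ratio >= min_ratio:
            results.append((phrases[pid], ratio))
    return sorted(results, key=lambda r: -r[1])


phrases = ['phrase to match', 'another phrase', 'this is']
index = build_ngram_index(phrases)
hits = approx_lookup('phrase to mtach', phrases, index)  # note the typo
print(hits)  # only 'phrase to match' survives both steps
```

The index only narrows the 80k phrases down to a handful of candidates; the similarity check on those few candidates is what decides the match.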

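As for actual implementations - flashtext, mentioned in another answer here, seems to be a reasonable pure Python solution if you're OK with the full-phrase-only limitation.

Otherwise you can get reasonable results with generic multi-pattern capable regexp libraries: one of the fastest should be Intel's hyperscan - there are even some rudimentary python bindings available.

Another option is Google's RE2 with the Python bindings from Facebook. You want to use RE2::Set in this case.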

Answer 2

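I had almost exactly the same problem in my own chat page system. I wanted to be able to add links to a number of keywords (with slight variations) present in the text. I only had around 200 phrases to check, though.

I decided to try a standard regular expression to see how fast it would be. The main bottleneck was constructing the regular expression. I decided to pre-compile it and found that the match time was very fast for shorter texts.

The following approach takes a list of phrases, where each entry contains phrase and link keys. It first constructs a reverse lookup dictionary: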

{'phrase to match' : 'link_url', 'another phrase' : 'link_url2'}

Next it compiles a regular expression of the following form, which allows matches that contain varying amounts of whitespace between words:

(phrase\s+to\s+match|another\s+phrase)

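Then for each piece of text (e.g. about 2000 characters each), it uses finditer() to get each match. The match object gives you .span(), the start and end positions of the matching text, and group(1) gives the matched text. As the text may contain extra whitespace, re_whitespace is applied first to collapse it and bring the match back to the form stored in the reverse dictionary. With this, the required link can be looked up automatically: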

import re

texts = ['this is a phrase   to    match', 'another phrase this  is']
phrases = [{'phrase': 'phrase to match', 'link': 'link_url'}, {'phrase': 'this is', 'link': 'link_url2'}]

# Reverse lookup: phrase -> link
reverse = {d['phrase']: d['link'] for d in phrases}
re_whitespace = re.compile(r'\s+')
# One alternation of all phrases; any run of whitespace may separate the words
re_phrases = re.compile('({})'.format('|'.join(d['phrase'].replace(' ', r'\s+') for d in phrases)))

for text in texts:
    matches = [(match.span(), reverse[re_whitespace.sub(' ', match.group(1))]) for match in re_phrases.finditer(text)]
    print(matches)

This displays the matches for the two texts as:

[((0, 7), 'link_url2'), ((10, 30), 'link_url')]
[((15, 23), 'link_url2')]
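One thing to be aware of: the pattern above has no word boundaries, so 'this is' would also match inside 'this issue'. A possible variant, offered only as a sketch, wraps the alternation in \b, escapes each word, and tries longer phrases first. The helper names (`phrase_to_pattern`, `build_matcher`, `find_links`) and the extra phrase 'this' are hypothetical additions for illustration, and \b assumes phrases start and end with word characters.

```python
import re

phrases = [
    {'phrase': 'phrase to match', 'link': 'link_url'},
    {'phrase': 'this is', 'link': 'link_url2'},
    {'phrase': 'this', 'link': 'link_url3'},  # hypothetical extra phrase
]


def phrase_to_pattern(phrase):
    # Escape each word, then allow any run of whitespace between words.
    return r'\s+'.join(re.escape(word) for word in phrase.split())


def build_matcher(phrases):
    # Longest phrases first, so 'this is' is tried before 'this'.
    ordered = sorted(phrases, key=lambda d: len(d['phrase']), reverse=True)
    alternation = '|'.join(phrase_to_pattern(d['phrase']) for d in ordered)
    # \b on both sides keeps 'this is' from matching inside 'this issue'.
    return re.compile(r'\b({})\b'.format(alternation))


reverse = {d['phrase']: d['link'] for d in phrases}
re_whitespace = re.compile(r'\s+')
re_phrases = build_matcher(phrases)


def find_links(text):
    # Normalise whitespace in the matched text before the reverse lookup.
    return [(m.span(), reverse[re_whitespace.sub(' ', m.group(1))])
            for m in re_phrases.finditer(text)]


print(find_links('this issue: this  is a phrase to match'))
# [((0, 4), 'link_url3'), ((12, 20), 'link_url2'), ((23, 38), 'link_url')]
```

At position 0 the longer 'this is' is rejected by the trailing \b (it would end in the middle of 'issue'), so the shorter 'this' is used there instead.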

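To test how this scales, I imported a list of English words from nltk and automatically created 80,000 two-to-six-word phrases along with unique links. I then timed it on two suitably long texts: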

import re
import random
from nltk.corpus import words
import time

english = words.words()

def random_phrase(l=2, h=6):
    return ' '.join(random.sample(english, random.randint(l, h)))


texts = ['this is a phrase   to    match', 'another phrase this  is']
# Make texts ~2000 characters
texts = ['{} {}'.format(t, random_phrase(200, 200)) for t in texts]

phrases = [{'phrase': 'phrase to match', 'link': 'link_url'}, {'phrase': 'this is', 'link': 'link_url2'}]
#Simulate 80k phrases
for x in range(80000):
    phrases.append({'phrase': random_phrase(), 'link': 'link{}'.format(x)})

construct_time = time.time()    

reverse = {d['phrase']:d['link'] for d in phrases}
re_whitespace = re.compile(r'\s+')
re_phrases = re.compile('({})'.format('|'.join(d['phrase'].replace(' ', r'\s+') for d in sorted(phrases, key=lambda x: len(x['phrase'])))))

print('Time to construct:', time.time() - construct_time)
print()

for text in texts:
    start_time = time.time()
    print('{} characters - "{}..."'.format(len(text), text[:60]))
    matches = [(match.span(), reverse[re_whitespace.sub(' ', match.group(1))]) for match in re_phrases.finditer(text)]
    print(matches)
    print('Time taken:', time.time() - start_time)        
    print()

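This takes about 17 seconds to construct the regular expression and the reverse lookup (which is only needed once). It then takes about 6 seconds per text. For very short text it takes about 0.06 seconds per text.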

Time to construct: 16.812477111816406

2092 characters - "this is a phrase   to    match totaquine externize intoxatio..."
[((0, 7), 'link_url2'), ((10, 30), 'link_url')]
Time taken: 6.000027656555176

2189 characters - "another phrase this  is political procoracoidal playstead as..."
[((15, 23), 'link_url2')]
Time taken: 6.190425715255737

This will at least give you a baseline to compare against.

Answer 3

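Maybe you should try flashtext. According to the author, it is much faster than regex.

The author even published a paper for this library.

I've personally tried this library for one of my projects, and in my opinion its API is quite friendly and usable.

Hope it helps.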

Answer 4

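You should try a string search / pattern matching algorithm. The best-known algorithm for your task is Aho-Corasick. There is a python library for it (off the top of a Google search).

Most pattern matching / string search algorithms require you to convert your "collection of words/phrases" into a trie.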
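For readers who want to see the mechanics, here is a compact pure-Python sketch of Aho-Corasick: a trie of the phrases plus failure links, so the text is scanned once regardless of how many phrases there are. It is a teaching sketch, not the library mentioned above; the function names (`build_automaton`, `find_all`) are my own, it matches raw substrings (no word boundaries, no whitespace normalisation), and a compiled implementation would be much faster.

```python
from collections import deque


def build_automaton(phrases):
    # goto[s]: char -> next state; fail[s]: fallback state; out[s]: phrase ids
    # that end at state s. State 0 is the root of the trie.
    goto, fail, out = [{}], [0], [[]]
    for pid, phrase in enumerate(phrases):
        state = 0
        for ch in phrase:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(pid)
    # Breadth-first pass: a state's failure link points at the longest proper
    # suffix of its path that is also a prefix of some phrase.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # Inherit the matches that end at the fallback state.
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out


def find_all(text, phrases, automaton):
    goto, fail, out = automaton
    state, matches = 0, []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for pid in out[state]:
            start = i + 1 - len(phrases[pid])
            matches.append(((start, i + 1), phrases[pid]))
    return matches


phrases = ['this is', 'is a', 'phrase to match']
automaton = build_automaton(phrases)
print(find_all('this is a phrase to match', phrases, automaton))
# [((0, 7), 'this is'), ((5, 9), 'is a'), ((10, 25), 'phrase to match')]
```

Note that overlapping matches ('this is' and 'is a') are both reported; the caller decides which one to highlight.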

Answer 5

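The pyparsing module - a python tool for extracting information from text - will help you write the phrase matching. It returns all matches of a phrase together with the index range of each match, and you describe the phrase using BNF (Backus-Naur Form), i.e. a grammar. In my experience it is easy to use, expressive in the kinds of patterns you can define, and impressively fast.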

from pyparsing import Word, alphas
greet = Word( alphas ) + "," + Word( alphas ) + "!" # <-- grammar defined here
hello = "Hello, World!"
print (hello, "->", greet.parseString( hello ))

Use scanString to return the index of each match:

for item in greet.scanString(hello):
    print(item)

>>> ((['Hello', ',', 'World', '!'], {}), 0, 13)

If you assemble your list of phrases with pyparsing as a dictionary of the form

phrase_list = {phrase_defined_with_pyparsing: phrase_name}

then your grammar can be one giant OR statement with labeled phrases.

import pyparsing as pp
your_grammar = pp.Or([phrase.setResultsName(phrase_name) for phrase, phrase_name in phrase_list.items()])
all_matches = your_grammar.scanString(big_document)

Each match is a tuple that is labeled (via setResultsName) and has an index range.

Answer 6

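Assuming the list of phrases changes over time and keeps growing, I'd recommend using software that already does what you need, e.g. elasticsearch, which is open source and has a Python client. Running a service like that in the background solves everything you want, and probably more than you can imagine. Also, it's really not that hard to implement.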

Answer 7

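You have far more pattern data than text data. Invert the problem: match the patterns against the text.

For the purposes of this, I assume the text can reasonably be tokenized into words (or something word-like). I also assume that the phrases, even if they can't be tokenized themselves (for example because they are regexes), nevertheless usually contain words, and (most of the time) have to match at least one of the words they contain.

Here is a sketch of a solution with three parts:

Tokenize and index the patterns (once) - this produces a map from each token to the patterns that contain it
Tokenize the text and filter the patterns to find candidates that could match the text
Test the candidate patterns and perform substitutions

Here is the code: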

import re
import random
# from nltk.corpus import words
import time

""" Prepare text and phrases, same as in Martin Evans's answer """

# english = words.words()
with open('/usr/share/dict/american-english') as fh:
    english = [x.strip() for x in fh.readlines()]


def random_phrase(l=2, h=6):
    return ' '.join(random.sample(english, random.randint(l, h)))


texts = ['this is a phrase   to    match', 'another phrase this  is']
# Make texts ~2000 characters
texts = ['{} {}'.format(t, random_phrase(200, 200)) for t in texts]

phrases = [{'phrase': 'phrase to match', 'link': 'link_url'}, {'phrase': 'this is', 'link': 'link_url2'}]
# Simulate 80k phrases
for x in range(80000):
    phrases.append({'phrase': random_phrase(), 'link': 'link{}'.format(x)})

""" Index the patterns """

construct_time = time.time()

reverse = {d['phrase']: d['link'] for d in phrases}
re_phrases = [re.compile(d['phrase'].replace(' ', r'\s+')) for d in phrases]
re_whitespace = re.compile(r'\s+')


def tokenize(text):
    return text.split()


# index maps each token to the numbers of the phrases that contain it
index = {}
for n in range(len(phrases)):
    tokens = tokenize(phrases[n]['phrase'])
    for token in tokens:
        if token not in index:
            index[token] = []
        index[token].append(n)

print('Time to construct:', time.time() - construct_time)
print()

for text in texts:
    start_time = time.time()
    print('{} characters - "{}..."'.format(len(text), text[:60]))

    """ Filter patterns to find candidates that *could* match the text """
    tokens = tokenize(text)
    phrase_ns = []
    for token in tokens:
        if token not in index:
            continue
        for n in index[token]:
            phrase_ns.append(n)
    phrase_ns = list(set(phrase_ns))

    """ Test the candidate patterns and perform substitutions """
    for n in phrase_ns:
        match = re.search(re_phrases[n], text)
        if match:
            # Collapse extra whitespace so the match equals the stored phrase
            print(match.span(), reverse[re_whitespace.sub(' ', match.group())])
    print('Time taken:', time.time() - start_time)
    print()

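In my environment, this version builds the index in 16.2 seconds and does the matching in 0.0042 and 0.0037 seconds (versus 4.7 seconds for the simple regex version, a speedup of roughly 1000x). The exact performance depends on the statistical properties of the text and the phrases, of course, but this will almost always be a huge win.

Bonus: if a phrase must match several words (tokens), you can add it only to the index entry for the single least common token it must match, for another huge speedup.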
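A minimal sketch of that bonus idea follows. It assumes plain whitespace-tokenized, non-empty phrases, and it estimates how common a token is by counting the phrases it appears in; that proxy, like the helper names (`build_rarest_token_index`, `candidates`), is my assumption and not part of the code above. Frequencies taken from typical submitted texts may work better.

```python
from collections import Counter


def build_rarest_token_index(phrases):
    # Document frequency: in how many phrases does each token occur?
    freq = Counter(tok for p in phrases for tok in set(p.split()))
    index = {}
    for n, phrase in enumerate(phrases):
        # File the phrase under its rarest token only (ties broken by name).
        rarest = min(phrase.split(), key=lambda t: (freq[t], t))
        index.setdefault(rarest, []).append(n)
    return index


def candidates(text, index):
    # Phrase numbers whose rarest token occurs in the text.
    found = set()
    for token in set(text.split()):
        found.update(index.get(token, ()))
    return sorted(found)


phrases = ['this is', 'this is a phrase', 'phrase to match', 'this match']
index = build_rarest_token_index(phrases)
print(index)
# {'is': [0], 'a': [1], 'to': [2], 'match': [3]}
print(candidates('this is the match', index))
# [0, 3]
```

With the full index, the token 'this' alone would have pulled in phrases 0, 1 and 3; here phrase 1 is never tested because its rarest token 'a' is absent. The candidates still have to be verified against the text, as in the code above.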

Answer 8

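A "Patricia tree" is a good solution for this kind of problem. It is a kind of radix tree, where the radix is the set of characters involved. So, to find out whether "the dog" is in the tree, you start at the root, take the "t" branch, then the "h" branch, and so on. Except Patricia trees do this really fast.

So you run through your text and get all the tree locations (phrases) that hit. This will even give you overlapping matches, if you want them.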
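As a rough illustration of walking the text through a tree, here is a sketch using a plain, uncompressed character trie built from nested dicts. To be clear, this is not a Patricia tree (which compresses chains of single-child nodes); it only shows the lookup pattern and the overlapping hits. The names `build_trie` and `scan` and the use of `None` as the end-of-phrase key are my own choices.

```python
def build_trie(phrases):
    # Each node is a dict: character -> child node.
    # The key None marks "a phrase ends here" and stores that phrase.
    root = {}
    for phrase in phrases:
        node = root
        for ch in phrase:
            node = node.setdefault(ch, {})
        node[None] = phrase
    return root


def scan(text, trie):
    # Start a walk at every position and report every phrase passed on the way.
    hits = []
    for start in range(len(text)):
        node = trie
        for end in range(start, len(text)):
            node = node.get(text[end])
            if node is None:
                break
            if None in node:
                hits.append(((start, end + 1), node[None]))
    return hits


trie = build_trie(['the dog', 'the', 'dog house'])
print(scan('the dog house', trie))
# [((0, 3), 'the'), ((0, 7), 'the dog'), ((4, 13), 'dog house')]
```

Each walk stops as soon as no phrase continues with the next character, so most start positions cost only a lookup or two.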

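The main article about them is Donald R. Morrison, PATRICIA - Practical Algorithm To Retrieve Information Coded in Alphanumeric, Journal of the ACM, 15(4):514-534, October 1968. There are several implementations on github, though I don't know which ones are good.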
