I have done a lot of research but haven't found anything that really helps me. Maybe my approach is odd - perhaps someone can point my idea in the right direction.
This is the situation:
I need to process a large number of texts (hundreds of thousands). In these texts I need to find and process certain strings:
Obviously this leads to naive iteration, since every text is fed into a function that runs through hundreds of thousands of regular expressions, which results in a very long runtime.
Is there a better, faster way to accomplish this task? The way it works now does run, but it is slow and puts heavy load on the server for weeks.
Some example code to illustrate my idea:
import re
cases = [] # 100 000 case numbers from db
suffixes = [] # 500 different suffixes to try from db
texts = [] # 100 000 for the beginning - will become less after initial run
def process_item(text: str) -> str:
    for s in suffixes:
        pattern = '(...)(.*?)(%s|...)' % s
        x = re.findall(pattern, text, re.IGNORECASE)
        for match in x:
            # process the matches, where I need to know which suffix matched
            pass
    for c in cases:
        escaped = re.escape(c)
        x = re.findall(escaped, text, re.IGNORECASE)
        for match in x:
            # process the matches, where I need to know which number matched
            pass
    return text

for text in texts:
    processed = process_item(text)
Every idea is highly appreciated!
Answer 0 (score: 2)
I can't comment, so just a few thoughts:
From what you posted, the things you want to search for are always the same, so why not join them into big regexps and compile those big regexps once, before running the loop.
That way you don't compile the regular expressions on every iteration, but only once.
For example:
import re
cases = [] # 100 000 case numbers from db
suffixes = [] # 500 different suffixes to try from db
texts = [] # 100 000 for the beginning - will become less after initial run
bre1 = re.compile('|'.join(suffixes), re.IGNORECASE)
bre2 = re.compile('|'.join([re.escape(c) for c in cases]), re.IGNORECASE)
def process_item(text: str) -> str:
    x = bre1.findall(text)
    for match in x:
        # process the matches, where I need to know which suffix matched
        pass
    x = bre2.findall(text)
    for match in x:
        # process the matches, where I need to know which number matched
        pass
    return text

for text in texts:
    processed = process_item(text)
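One detail worth noting: with a bare alternation, `findall` (or `finditer`) returns the matched substring itself, so the suffix that matched is simply the match text. A minimal sketch with made-up suffixes, which also escapes them in case they contain regex metacharacters, and folds case back to the canonical spelling since the match is case-insensitive:

```python
import re

# Hypothetical suffixes; the real ones would come from the database.
suffixes = ["GmbH", "Ltd.", "Inc."]

# Escape each suffix so metacharacters like '.' match literally,
# then compile the combined alternation once.
big = re.compile('|'.join(re.escape(s) for s in suffixes), re.IGNORECASE)

# Map the case-folded match text back to the canonical suffix.
canonical = {s.lower(): s for s in suffixes}

text = "Acme inc. bought Foo GmbH last year."
hits = [canonical[m.group().lower()] for m in big.finditer(text)]
# hits == ['Inc.', 'GmbH']
```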
If you can reliably locate the case number in the text (for example, if it is preceded by some identifier), it would be better to use re.search to find the case number, put the case numbers from the database into a set, and test membership in that set.
For example:
cases = ["123", "234"]
cases_set = set(cases)
texts = ["id:123", "id:548"]
sre = re.compile(r'(?<=id:)\d{3}')

for t in texts:
    m = sre.search(t)
    if m and m.group() in cases_set:
        # do stuff ....
        pass