I have done a lot of research but haven't found anything that really helps me. Maybe my approach is odd - perhaps someone can point my idea in the right direction.
This is the situation:
I need to process a large number of texts (hundreds of thousands). In these texts I need to find and process certain strings:
Obviously this leads to naive iteration, since every text is fed into a function that runs through hundreds of thousands of regular expressions, which results in a very long runtime.
Is there a better, faster way to accomplish this task? The way it works now does run, but it is slow and puts heavy load on the server for weeks.
Some example code to illustrate my idea:
import re
cases = [] # 100 000 case numbers from db
suffixes = [] # 500 different suffixes to try from db
texts = [] # 100 000 for the beginning - will become less after initial run
def process_item(text: str) -> str:
    for s in suffixes:
        pattern = '(...)(.*?)(%s|...)' % s
        x = re.findall(pattern, text, re.IGNORECASE)
        for match in x:
            # process the matches, where I need to know which suffix matched
            pass
    for c in cases:
        escaped = re.escape(c)
        x = re.findall(escaped, text, re.IGNORECASE)
        for match in x:
            # process the matches, where I need to know which number matched
            pass
    return text

for text in texts:
    processed = process_item(text)
Every idea is highly appreciated!
Answer 0 (score: 2)
I can't comment, so just a few thoughts:
From what you posted, the things you want to search for are always the same, so why not join them into big regexps and compile those big regexps once, before running the loop.
That way you don't compile the regular expressions on every iteration, but only once.
For example:
import re
cases = [] # 100 000 case numbers from db
suffixes = [] # 500 different suffixes to try from db
texts = [] # 100 000 for the beginning - will become less after initial run
bre1 = re.compile('|'.join(suffixes), re.IGNORECASE)
bre2 = re.compile('|'.join([re.escape(c) for c in cases]), re.IGNORECASE)
def process_item(text: str) -> str:
    x = bre1.findall(text)
    for match in x:
        # process the matches, where I need to know which suffix matched
        pass
    x = bre2.findall(text)
    for match in x:
        # process the matches, where I need to know which number matched
        pass
    return text

for text in texts:
    processed = process_item(text)
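One detail worth noting: with a bare alternation, `findall` (or `finditer`) returns the matched substring itself, so the suffix that matched is simply the match text. A minimal sketch with made-up suffixes, which also escapes them in case they contain regex metacharacters, and folds case back to the canonical spelling since the match is case-insensitive:

```python
import re

# Hypothetical suffixes; the real ones would come from the database.
suffixes = ["GmbH", "Ltd.", "Inc."]

# Escape each suffix so metacharacters like '.' match literally,
# then compile the combined alternation once.
big = re.compile('|'.join(re.escape(s) for s in suffixes), re.IGNORECASE)

# Map the case-folded match text back to the canonical suffix.
canonical = {s.lower(): s for s in suffixes}

text = "Acme inc. bought Foo GmbH last year."
hits = [canonical[m.group().lower()] for m in big.finditer(text)]
# hits == ['Inc.', 'GmbH']
```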
If you can reliably locate the case number in the text (for example, if it is preceded by some identifier), it would be better to use re.search to find the case number, put the case numbers from the database into a set, and test membership in that set.
For example:
cases = ["123", "234"]
cases_set = set(cases)
texts = ["id:123", "id:548"]
sre = re.compile(r'(?<=id:)\d{3}')

for t in texts:
    m = sre.search(t)
    if m and m.group() in cases_set:
        # do stuff ....
        pass