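Suppose you have thousands of lines and hundreds of regular expressions.
How can I make this code faster?
As output I need the indices into both arrays (eventually written to a file).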
import re
from timeit import default_timer as timer

a = ['apple_789456',
     'banana_741',
     'pear_11112222',
     'orange_454545',
     'pineapple_7777888',
     'banana_999999'
     ]
regs = [r'ple.*?7',
        r'a.*?74',
        r'range.*?5',
        r'45'
        ]
regs_re = [re.compile(r) for r in regs]

start = timer()
for i in range(len(a)):
    for j in range(len(regs)):
        if re.search(regs_re[j], a[i]):
            print('regs_re['+str(j)+'] found in a['+str(i)+']: '+a[i])
print(timer() - start)
Answer 0 (score: 1)
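One way to speed this up is to run each expression only once over the whole text (all lines joined together). The results will not come out in the same order, but on thousands of lines it runs several times faster. This relies on no pattern matching across a line break, which holds for the patterns here because . does not match a newline by default.
Printing as you go is obviously a complete waste of time, so I collect the results in a list in order to compare execution times.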
from bisect import bisect_left

start = timer()
text = "\n".join(a)  # single string with all lines
# positions of the newlines, used to map a match offset to a line number
lineIndex = [i for i, c in enumerate(text) if c == "\n"]
result = []  # accumulate results in a list
for j, expr in enumerate(regs):  # execute each expression only once
    previ = -1  # finditer may find multiple occurrences on the same line
    for m in re.finditer(expr, text):  # go through all occurrences
        i = bisect_left(lineIndex, m.start())  # determine line number
        if i == previ:
            continue
        previ = i
        result.append((i, j))  # build result list
print(timer() - start)

for i, j in result:
    print(f"regs_re[{j}] found in a[{i}]: {a[i]}")
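Since the joined-text approach returns results grouped by expression rather than by line, a small self-contained sketch like the one below could be used to restore line order and to check the result against the plain per-line loop. The function names match_pairs_joined and match_pairs_per_line are made up for this illustration, and the sketch assumes that no pattern can match across a newline.

```python
import re
from bisect import bisect_left


def match_pairs_joined(lines, patterns):
    """Return sorted (line_index, pattern_index) pairs, scanning the joined text once per pattern."""
    text = "\n".join(lines)
    # Offsets of every newline; the number of newlines before an offset is its line number.
    newline_pos = [pos for pos, ch in enumerate(text) if ch == "\n"]
    pairs = set()  # a set drops repeated hits of one pattern on the same line
    for j, pattern in enumerate(patterns):
        for m in re.finditer(pattern, text):
            pairs.add((bisect_left(newline_pos, m.start()), j))
    return sorted(pairs)  # sorting restores line-major order


def match_pairs_per_line(lines, patterns):
    """Baseline: test every compiled pattern against every line."""
    compiled = [re.compile(p) for p in patterns]
    return [(i, j)
            for i, line in enumerate(lines)
            for j, rx in enumerate(compiled)
            if rx.search(line)]


# Sample data taken from the question.
lines = ['apple_789456', 'banana_741', 'pear_11112222',
         'orange_454545', 'pineapple_7777888', 'banana_999999']
patterns = [r'ple.*?7', r'a.*?74', r'range.*?5', r'45']

joined = match_pairs_joined(lines, patterns)
baseline = match_pairs_per_line(lines, patterns)
print(joined == baseline)
for i, j in joined:
    print(f"patterns[{j}] found in lines[{i}]: {lines[i]}")
```

Sorting costs a little extra time, so it is only worth doing if the output file has to be in line order. If a pattern could match a newline (for example via \s or a negated character class), the two functions may disagree, and comparing them on a sample of the real data is a cheap way to find out.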