Question

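I'm looking for pro tips. I have a list of strings in a database: titles such as "MD", "PHD", "MR", and so on. It's a few hundred rows, and I receive it in a specific order (MD is more important than MR). I also have a collection of person objects that I'll iterate over, and I need a very efficient way to match them. I've tried two approaches, and maybe there is no other way.

My first attempt was, on receiving the list, to re.compile each entry and put the results into a list. Then...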

import re

# Titles in priority order, each compiled once up front (several hundred in practice)
theregexlist = [re.compile(title) for title in ["MR", "DR", "MRS", "MISS", "PHD"]]
personname = "MR JOEY SMITH"  # other examples are similar, like "BOBBY DR MD JOE"
for theregex in theregexlist:
    if theregex.search(personname):
        print(theregex.pattern)  # do stuff...
        break  # since my list is ordered, I only want the first match

That works. I also tried looping over the regex list and building one huge regex with capturing parens, compiling it, and then:

hugeregex = re.compile("(?:(MR)|(MR)|(PHD)| ...  |(DR)|(MD))")  # "..." stands for the rest of the list
personname = "FRED DR FLINTSTONE"
maybematch = hugeregex.search(personname)
if maybematch:
    print(maybematch.group(0))

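Is there some kind of map, key lookup, or iteration feature I just haven't thought of that would be more efficient? Any and all ideas are appreciated! Even if it's "Yup, it's just going to be slow, try timeit to see which is faster", then I can stop searching :) Thanks!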

Answer 1

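A single "big" regex containing all the "particules" (such as "MR", "MS", and so on) will be more efficient, because it is compiled only once. It also reduces the number of function calls per name (which is an optimization).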

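If one of the particules contains a special character, you may need to escape it with re.escape.

You can compile the regex and keep a reference to its search method.

Here is an example:

import re

particules = ["MR", "DR", "MRS", "MISS", "PHD"]
# \b ... \b restricts matches to whole words, so "MR" does not match inside "MRS"
regex = r"\b(?:" + "|".join(map(re.escape, particules)) + r")\b"
search_any_particule = re.compile(regex, flags=re.IGNORECASE).search

personname = "FRED DR FLINTSTONE"
mo = search_any_particule(personname)
if mo:
    print(mo.group())

You get: 'DR'.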
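One thing to check against the ordering requirement in the question: an alternation returns the leftmost match in the name, not necessarily the highest-priority title in the list. Below is a minimal sketch of one way to keep the priority, assuming the list order is the ranking. The names `titles`, `rank`, and `best_title` are made up for illustration and are not part of the answer above.

```python
import re

# Hypothetical titles, ordered from most to least important.
titles = ["MD", "PHD", "DR", "MRS", "MR", "MISS"]

# Map each title to its position in the list (lower means more important).
rank = {title: position for position, title in enumerate(titles)}

# One compiled pattern; \b keeps "MR" from matching inside "MRS".
find_titles = re.compile(
    r"\b(?:" + "|".join(map(re.escape, titles)) + r")\b"
).findall


def best_title(personname):
    """Return the highest-priority title found in personname, or None."""
    found = find_titles(personname)
    if not found:
        return None
    return min(found, key=rank.__getitem__)


print(best_title("BOBBY DR MD JOE"))
print(best_title("FRED FLINTSTONE"))
```

For "BOBBY DR MD JOE" this picks MD even though DR appears first in the name, and it returns None when no title is present. The pattern is still compiled once and scanned once per name; the only extra work is a dictionary lookup per title found.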

Edit

The best way to make sure your implementation is efficient is to profile it. To do that, you can use the cProfile library.

For instance:

def find_particule(personname):
    mo = search_any_particule(personname)
    if mo:
        return mo.group()
    return None

import cProfile
cProfile.runctx('for i in range(1000000): find_particule("FRED DR FLINTSTONE")', globals(), locals())

The profiler will give you something like this:

         3000003 function calls in 2.110 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.353    0.353    2.110    2.110 <string>:1(<module>)
  1000000    0.495    0.000    1.757    0.000 python:10(find_particule)
        1    0.000    0.000    2.110    2.110 {built-in method builtins.exec}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
  1000000    0.185    0.000    0.185    0.000 {method 'group' of '_sre.SRE_Match' objects}
  1000000    1.078    0.000    1.078    0.000 {method 'search' of '_sre.SRE_Pattern' objects}
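Since the question mentions timeit, here is a rough sketch of how the two approaches could be compared head to head. The title list is a short made-up stand-in for the real one with several hundred entries, and `first_match_loop` and `first_match_combined` are illustrative names. Timings depend on the machine, the list length, and the names being tested, so treat the numbers as relative only.

```python
import re
import timeit

# Hypothetical short list; the real one has several hundred entries.
titles = ["MR", "DR", "MRS", "MISS", "PHD"]
personname = "FRED DR FLINTSTONE"

# Approach 1: one compiled pattern per title, tried in list order.
compiled_each = [(title, re.compile(re.escape(title))) for title in titles]

# Approach 2: a single alternation compiled once.
combined_search = re.compile("|".join(map(re.escape, titles))).search


def first_match_loop(name):
    """Return the first title (in list order) found in name, or None."""
    for title, pattern in compiled_each:
        if pattern.search(name):
            return title
    return None


def first_match_combined(name):
    """Return the leftmost title found in name, or None."""
    mo = combined_search(name)
    return mo.group() if mo else None


loop_seconds = timeit.timeit(lambda: first_match_loop(personname), number=100_000)
combined_seconds = timeit.timeit(lambda: first_match_combined(personname), number=100_000)
print(f"loop: {loop_seconds:.3f}s  combined: {combined_seconds:.3f}s")
```

Names that match an early entry favor the loop, while names with no title at all force it through every pattern, so it is worth timing a realistic mix of names rather than a single one.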

python - Regex pattern for a large number of items, best practice?
