Question

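I have a script that fetches information from the internet. In short, I end up with a variable containing a string. Based on this string, my script decides whether to discard or further process the information the string belongs to, depending on whether: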

it contains one specific word,
or several words in a specific order.

I would like to know what the best algorithm is to do this efficiently, with good, though not necessarily 100%, accuracy.

Currently I have the following code (heavily reduced to the relevant part; normally there are loops etc. around this):

#!/usr/bin/env python
import re
def findWord(z):
    return re.compile(r'\b({0})\b'.format(z), flags=re.IGNORECASE).search

filterList = [
              "term-1","term-2","term-n"
             ]
uncleanString = "This! is* a test [string],.}".lower()

#Remove all punctuation
cleanString = uncleanString
for c in "!@#%&*()[]{}/?<>,.'":
    cleanString = cleanString.replace(c, "")

#Check if the words in filterList are present, if not then process further
no = 0
for word in filterList:
    result = findWord(word)(cleanString)
    if result is not None:
        no = 1
        break

if no == 0:
    #then do further processing here, e.g.
    print(cleanString)
    #reset condition (when implementing code in loop(s))
    no = 0

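In my actual script, filterList is large. The script is slow and takes about 30 minutes to complete, although I think that is more due to the platform I run it on (RPi rather than PyPy), the communication with the internet (BS4 / HTTPlib), and the interaction with a MySQL database... Before I improve the other parts, do you have any ideas on how to speed up this part, or would you say the above is adequate?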

Answer 1

Make one big regex with alternation.

reg=re.compile(r'\b('+"|".join(filterList)+r')\b')

which looks like this:

\b(term-1|term-2|term-n)\b

You don't have to loop over the terms; they are all in the one regex.

Call it once on the compiled object, instead of making the "findWord" calls:

reg.search
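
A minimal sketch of this approach, assuming the terms are literal strings rather than regex patterns. The `build_filter` and `should_discard` helper names, the `re.escape` call, and the `re.IGNORECASE` flag (carried over from the question's `findWord`) are additions for illustration and are not part of the original answer.

```python
import re

def build_filter(terms):
    """Compile one case-insensitive, whole-word pattern covering all terms."""
    # Escape each term so characters such as "." or "+" are matched literally.
    alternation = "|".join(re.escape(term) for term in terms)
    return re.compile(r"\b(?:" + alternation + r")\b", flags=re.IGNORECASE)

filterList = ["term-1", "term-2", "term-n"]
reg = build_filter(filterList)  # compile once, outside any loop

def should_discard(text):
    # A single search call replaces the per-term loop from the question.
    return reg.search(text) is not None
```

Compiling the pattern once and reusing `reg` for every string avoids recompiling a regex per term on each iteration, which is likely where most of the time in this part goes.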

Answer 2

You can make it more readable:

if not any(word in cleanString for word in filterList):
    # further processing

This removes the string formatting and regex compilation steps.
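
One thing to keep in mind: `word in cleanString` is a substring test, so unlike the `\b` regex in the question it also matches terms embedded inside longer words. The sketch below illustrates the difference and shows a set-based variant that only works for single-word terms; the helper names and the sample terms are hypothetical.

```python
def has_substring(text, terms):
    # True if any term occurs anywhere in the text, even inside another word.
    return any(term in text for term in terms)

def has_whole_word(text, terms):
    # Split on whitespace and compare whole tokens.
    # Assumes every term is a single word and punctuation is already removed.
    words = set(text.split())
    return not words.isdisjoint(terms)

terms = ["cat", "dog"]
text = "concatenate the lists"

substring_hit = has_substring(text, terms)    # "cat" is inside "concatenate"
whole_word_hit = has_whole_word(text, terms)  # no token equals "cat" or "dog"
```

If false positives like this are acceptable, the substring version is the simplest option; otherwise the whole-word check or the combined regex is the safer choice.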

(Python) Best way to filter a string for specific terms?
