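Python: check whether any word in a list of words matches any pattern in a list of regular expression patterns

Asked: 2013-06-12 14:45:54

Tags: python regex

I have a long list of words and regular expression patterns in a .txt file, which I read in like this: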

with open(fileName, "r") as f1:
    pattern_list = f1.read().split('\n')

To illustrate, the first seven look like this:

print(pattern_list[:7])
# ['abandon*', 'abuse*', 'abusi*', 'aching', 'advers*', 'afraid', 'aggress*']

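I want to know whenever a word in an input string matches any of the words/patterns in pattern_list. The code below sort of works, but I see two problems:

  1. First, it seems very inefficient to re.compile() every item in my pattern_list each time I check a new string_input. But when I tried storing the re.compile(raw_str) objects in a list (so that I could reuse the already-compiled regexes for something more like if w in regex_compile_list:), it didn't work properly.
  2. Second, it sometimes doesn't work as I expect. Notice how
    • abuse* matches abusive
    • abusi* matches abused and abuse
    • ache* matches aching

What am I doing wrong, and how can I be more efficient? Thanks in advance for your patience with a noob, and thanks for any insights!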

    import re

    string_input = "People who have been abandoned or abused will often be afraid of adversarial, abusive, or aggressive behavior. They are aching to abandon the abuse and aggression."
    for raw_str in pattern_list:
        # Each raw pattern is recompiled on every pass (problem 1 above)
        pat = re.compile(raw_str)
        for w in string_input.split():
            if pat.match(w):
                print("matched: {} with: {}".format(raw_str, w))
    #matched: abandon* with: abandoned
    #matched: abandon* with: abandon
    #matched: abuse* with: abused
    #matched: abuse* with: abusive,
    #matched: abuse* with: abuse
    #matched: abusi* with: abused
    #matched: abusi* with: abusive,
    #matched: abusi* with: abuse
    #matched: ache* with: aching
    #matched: aching with: aching
    #matched: advers* with: adversarial,
    #matched: afraid with: afraid
    #matched: aggress* with: aggressive
    #matched: aggress* with: aggression.
    

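4 Answers:

Answer 0 (score: 8)

To match shell-style wildcards, you could (ab)use the fnmatch module.

Since fnmatch is designed mainly for filename comparison, whether the test is case-sensitive depends on your operating system. So you have to normalize both the text and the patterns (here, I use lower() for that purpose).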

>>> import fnmatch

>>> pattern_list = ['abandon*', 'abuse*', 'abusi*', 'aching', 'advers*', 'afraid', 'aggress*']
>>> string_input = "People who have been abandoned or abused will often be afraid of adversarial, abusive, or aggressive behavior. They are aching to abandon the abuse and aggression."


>>> words = string_input.lower().split()
>>> for pattern in pattern_list:
...     l = fnmatch.filter(words, pattern.lower())
...     if l:
...         print("{} match {}".format(pattern, l))

Producing:

abandon* match ['abandoned', 'abandon']
abuse* match ['abused', 'abuse']
abusi* match ['abusive,']
aching match ['aching']
advers* match ['adversarial,']
afraid match ['afraid']
aggress* match ['aggressive', 'aggression.']
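
If the same patterns will be checked against many input strings, one option is to convert each wildcard to a regex once with fnmatch.translate and keep the compiled objects around. The following is a minimal sketch for Python 3, not a definitive implementation: find_matches is a hypothetical helper name, and stripping surrounding punctuation from each word is my own assumption about the desired normalization.

```python
import fnmatch
import re
import string

pattern_list = ['abandon*', 'abuse*', 'abusi*', 'aching', 'advers*', 'afraid', 'aggress*']

# Translate each shell-style wildcard to a regex and compile it exactly once.
compiled = [(p, re.compile(fnmatch.translate(p))) for p in pattern_list]


def find_matches(text):
    """Return (pattern, word) pairs for every word matched by a wildcard pattern."""
    # Assumption: lower-case each word and strip surrounding punctuation.
    words = [w.strip(string.punctuation).lower() for w in text.split()]
    return [(p, w) for p, rx in compiled for w in words if rx.match(w)]


string_input = ("People who have been abandoned or abused will often be afraid of "
                "adversarial, abusive, or aggressive behavior. They are aching to "
                "abandon the abuse and aggression.")

matches = find_matches(string_input)
for p, w in matches:
    print("{} match {}".format(p, w))
```

The regex produced by fnmatch.translate is anchored at the end, so rx.match(w) only succeeds when the whole word fits the pattern.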

Answer 1 (score: 2)

As a regex, abandon* matches abandonnnnnnnnnnnnnnnnnnnnnnn (the * repeats only the preceding n), not abandonasfdsafdasf. You want

abandon.*

instead.
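
To make the difference concrete, here is a small sketch for Python 3 that compares the two patterns. The word list is made up for illustration. It also shows why abuse* appeared to match abusive in the question: re.match only anchors at the start of the string, and abuse* is satisfied by "abus" followed by zero "e" characters.

```python
import re

# Hypothetical sample words, chosen only to illustrate the quantifier.
words = ["abandon", "abandonnnn", "abandoned", "abando"]

# 'abandon*' means 'abando' followed by zero or more 'n' characters.
literal_star = re.compile(r"abandon*")
# 'abandon.*' means 'abandon' followed by any characters.
dot_star = re.compile(r"abandon.*")

# fullmatch requires the whole word to fit the pattern.
full_literal = [w for w in words if literal_star.fullmatch(w)]
full_dot = [w for w in words if dot_star.fullmatch(w)]

# match only requires the pattern to fit at the start of the word.
prefix_literal = [w for w in words if literal_star.match(w)]

# 'abuse*' fits the prefix 'abus' of 'abusive' (zero repetitions of 'e').
abuse_hits_abusive = bool(re.match(r"abuse*", "abusive"))

print(full_literal)
print(full_dot)
print(prefix_literal)
print(abuse_hits_abusive)
```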

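Answer 2 (score: 2)

If the * is always at the end of the string, you might want to do this instead:

words = string_input.split()
for pat in pattern_list:
    for w in words:
        # A trailing '*' means prefix match; otherwise require an exact match
        if (pat.endswith('*') and w.startswith(pat[:-1])) or w == pat:
            print("matched: {} with: {}".format(pat, w))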

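Answer 3 (score: 1)

If the patterns use regex syntax:

import re
# Join all patterns into one alternation and search the input once
m = re.search(r"\b({})\b".format("|".join(patterns)), input_string)
if m:
    print("found match: {}".format(m.group(1)))

If the words are delimited by whitespace, use (?:\s+|^) and (?:\s+|$) in place of the leading and trailing \b.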
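
The patterns in the question look like shell wildcards rather than regexes, so a single combined regex needs a conversion step first. Below is a minimal sketch for Python 3, assuming * only ever appears at the end of a pattern and stands for "any further word characters". to_regex is a hypothetical helper, and re.IGNORECASE is my own choice for normalizing case.

```python
import re

patterns = ['abandon*', 'abuse*', 'abusi*', 'aching', 'advers*', 'afraid', 'aggress*']
input_string = ("People who have been abandoned or abused will often be afraid of "
                "adversarial, abusive, or aggressive behavior. They are aching to "
                "abandon the abuse and aggression.")


def to_regex(pattern):
    """Convert a word with an optional trailing '*' into a regex fragment."""
    # Assumption: '*' is only used as a trailing wildcard.
    if pattern.endswith("*"):
        return re.escape(pattern[:-1]) + r"\w*"
    return re.escape(pattern)


# Build and compile the alternation once; reuse it for every input string.
combined = re.compile(
    r"\b(?:{})\b".format("|".join(to_regex(p) for p in patterns)),
    re.IGNORECASE,
)

found = combined.findall(input_string)
print(found)
```

Because the word boundaries sit outside the alternation, trailing punctuation such as the comma after "adversarial" is left out of the match, and a pattern without a wildcard only matches the exact word.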