Question

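I'm using a regular expression in a script to collect some values from a page. When I use re.match in a condition, the condition is false, but if I use finditer it is true and the body of the condition runs. I tested the regex in a tester I built myself and it works there, but not in the script. Here is a sample script.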

import re

result = []
RE_Add0 = re.compile(r"\d{5}(?:(?:-| |)\d{4})?", re.IGNORECASE)
each = 'Expiration Date:\n05/31/1996\nBusiness Address: 23901 CALABASAS ROAD #2000 CALABASAS, CA 91302\n'
if RE_Add0.match(each):
    result0 = RE_Add0.match(each).group(0)
    print(result0)
    if len(result0) < 100:
        result.append(result0)
    else:
        print('Address ignore')
else:
    pass

Answer 1

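First, re.finditer() returns an iterator object even when there are no matches (so if RE_Add0.finditer(each) is always true). You have to actually iterate over the object to see whether there is a real match.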
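To make the point about truthiness concrete, here is a minimal sketch (written for Python 3, reusing the question's pattern as a raw string). The input string `no_zip` is a made-up example, and using `next(..., None)` is just one possible way to test for a first match.

```python
import re

RE_Add0 = re.compile(r"\d{5}(?:(?:-| |)\d{4})?")

# Hypothetical input that contains no five-digit run at all.
no_zip = "no postal code here"

# finditer() hands back an iterator whether or not anything matches,
# and an iterator object is always truthy.
it = RE_Add0.finditer(no_zip)
print(bool(it))       # True, even though nothing matches

# Consuming the iterator is what reveals whether a match exists.
found = list(it)
print(found)          # []

# One way to ask "is there at least one match?": take the first item,
# falling back to None when the iterator is empty.
first = next(RE_Add0.finditer(no_zip), None)
print(first is None)  # True
```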

Second, re.match() only matches at the beginning of the string, unlike re.search() or re.finditer(), which match anywhere in the string.
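The difference can be checked against the string from the question. This is a small sketch in Python 3; the variable names `at_start`, `anywhere`, and `all_hits` are only illustrative.

```python
import re

RE_Add0 = re.compile(r"\d{5}(?:(?:-| |)\d{4})?")
each = ("Expiration Date:\n05/31/1996\n"
        "Business Address: 23901 CALABASAS ROAD #2000 CALABASAS, CA 91302\n")

# match() is anchored at position 0. The string starts with "Expiration",
# not with five digits, so there is no match.
at_start = RE_Add0.match(each)
print(at_start)            # None

# search() scans forward and returns the first match anywhere in the string.
anywhere = RE_Add0.search(each)
print(anywhere.group(0))   # 23901

# findall() collects every non-overlapping match.
all_hits = RE_Add0.findall(each)
print(all_hits)            # ['23901', '91302']
```

Note that the first hit is the street number, not the postal code, so switching to search() alone may not give the value you want.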

Third, the regex can be written more compactly as r"\d{5}(?:[ -]?\d{4})?".

Fourth, always use raw strings for regular expressions.
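As an illustration of why raw strings matter, the sketch below uses `\b`, which means "backspace" in an ordinary Python string literal but "word boundary" to the regex engine. The pattern and the sample text are made up for this example.

```python
import re

# In a normal string literal, "\b" is the single backspace character.
# In a raw string, r"\b" stays as a backslash followed by "b", which the
# regex engine reads as a word boundary.
plain = "\bcat"
raw = r"\bcat"
print(len(plain), len(raw))  # 4 5

text = "a cat sat"
plain_hit = re.search(plain, text)
raw_hit = re.search(raw, text)
print(plain_hit)             # None: it looks for a literal backspace
print(raw_hit.group())       # cat
```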

Answer 2

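re.match matches only once, at the beginning of the string. In this respect re.finditer is similar to re.search: it looks for matches anywhere in the string, iterating over them. Compare: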

>>> re.match('a', 'abc')
<_sre.SRE_Match object at 0x01057AA0>
>>> re.match('b', 'abc')
>>> re.finditer('a', 'abc')
<callable_iterator object at 0x0106AD30>
>>> re.finditer('b', 'abc')
<callable_iterator object at 0x0106EA10>

ETA: Since you mention a page, I can only guess that you're talking about HTML parsing. If that's the case, use BeautifulSoup or a similar HTML parser. Don't use regexes.
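A rough sketch of that approach, assuming the page really is HTML: the standard library's html.parser.HTMLParser stands in here for BeautifulSoup so the example has no third-party dependency (Python 3). The markup, the `TextCollector` class, and the `zip_re` pattern are all hypothetical; the parser extracts the text, and a regex is applied only to that plain text.

```python
import re
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collects the text nodes of a document and ignores the tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


# Hypothetical markup; the real page structure is not shown in the question.
html = ("<div><b>Business Address:</b> "
        "23901 CALABASAS ROAD #2000 CALABASAS, CA 91302</div>")

parser = TextCollector()
parser.feed(html)
parser.close()
text = "".join(parser.chunks)

# With the tags gone, a small pattern on the plain text is enough:
# a five-digit code, an optional +4 part, then the end of the text.
zip_re = re.compile(r"(\d{5})(?:[ -]?\d{4})?\s*$")
m = zip_re.search(text)
print(m.group(1) if m else None)  # 91302
```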

Answer 3

Try this:

import re

postalCode = re.compile(r'((\d{5})([ -])?(\d{4})?(\s*))$')
primaryGroup = lambda x: x[1]

sampleStr = """
    Expiration Date:
    05/31/1996
    Business Address: 23901 CALABASAS ROAD #2000 CALABASAS, CA 91302  
"""
result = []

# findall() returns one tuple of captured groups per match
matches = postalCode.findall(sampleStr)
if matches:
    for match in matches:
        pc = primaryGroup(match)
        print(pc)
        result.append(pc)
else:
    print("No postal code found in this string")

This returns '12345' for any of the following:

12345\n
12345  \n
12345 6789\n
12345 6789    \n
12345 \n
12345     \n
12345-6789\n
12345-6789    \n
12345-\n
12345-    \n
123456789\n
123456789    \n
12345\n
12345    \n

I only match at the end of a line, because otherwise it would also match '23901' (from the street address) in your example.
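One caveat: without flags, `$` anchors only at the end of the whole string (or just before a final newline). If the input may carry a postal code at the end of several lines, a variant with re.MULTILINE is one option. The sketch below is an assumption about what is wanted, not part of the original answer; `zip_at_eol` and `sample` are illustrative names (Python 3).

```python
import re

# (\d{5})          the five-digit code, captured
# (?:[ -]?\d{4})?  optional +4 part, with an optional space or hyphen
# [ \t]*$          only trailing blanks may follow before the line ends
zip_at_eol = re.compile(r"(\d{5})(?:[ -]?\d{4})?[ \t]*$", re.MULTILINE)

each = ("Expiration Date:\n05/31/1996\n"
        "Business Address: 23901 CALABASAS ROAD #2000 CALABASAS, CA 91302\n")

# '23901' is skipped because other text follows it on the same line.
from_question = zip_at_eol.findall(each)
print(from_question)  # ['91302']

# With MULTILINE, every line ending is a candidate anchor point.
sample = "12345-6789  \nfoo 54321\n"
per_line = zip_at_eol.findall(sample)
print(per_line)       # ['12345', '54321']
```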

Different behavior when using re.finditer and re.match
