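Python: ensure an address matches a specific format

Time: 2015-06-19 14:11:38

Tags: python regex street-address

I have been playing around with regular expressions but have not had any luck yet. I need to introduce some address validation. I need to make sure a user-defined address matches this format: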

"717 N 2ND ST, MANKATO, MN 56001"

or possibly this one:

"717 N 2ND ST, MANKATO, MN, 56001"

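and reject everything else, alerting the user that the format is incorrect. I have been going through the documentation and have tried many regex patterns. I tried this one (and many variations) without any luck: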

pat = r'\d{1,6}(\w+),\s(w+),\s[A-Za-z]{2}\s{1,6}'

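This one works, but it lets through too much junk, because (I think) it only ensures that the string starts with a house number and ends with a ZIP code: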

pat = r'\d{1,6}( \w+){1,6}'
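
A quick way to see why this pattern lets junk through: `re.match` anchors only at the start of the string, so whatever follows the first few words is never checked. The sketch below is illustrative only; the `junk` sample string is made up, and `re.fullmatch` assumes Python 3.4 or later.

```python
import re

pat = r'\d{1,6}( \w+){1,6}'

good = "717 N 2ND ST, MANKATO, MN 56001"
junk = "717 N 2ND ST ### not an address"

# re.match only requires the pattern to match a prefix of the string.
print(re.match(pat, good).group(0))   # 717 N 2ND ST
print(re.match(pat, junk).group(0))   # 717 N 2ND ST

# re.fullmatch requires the whole string to match. This pattern cannot
# consume the commas, so even the well-formed address is rejected.
print(re.fullmatch(pat, good))        # None
print(re.fullmatch(pat, junk))        # None
```

So the pattern checks the house number and a few words, but neither the commas nor the ZIP code at the end.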

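Comma placement is critical, because I split the input string on commas: my first item is the street address, then the city, and then the state and ZIP code are split on a space (this is where I would like to use a second regex, in case there is a comma between the state and the ZIP code).

Basically, I want to do this: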

# check for this format "717 N 2ND ST, MANKATO, MN 56001"
pat_1 = 'regex to match above pattern'
# check for this format "717 N 2ND ST, MANKATO, MN, 56001"
pat_2 = 'regex to match above format'

if re.match(pat_1, addr, re.IGNORECASE):
    pass  # extract address
elif re.match(pat_2, addr, re.IGNORECASE):
    pass  # extract address
else:
    raise ValueError('"{}" must match this format: "717 N 2ND ST, MANKATO, MN 56001"'.format(addr))

# do stuff with address

If anyone can help me build a regex that ensures the pattern matches, I would be very grateful!

4 answers:

Answer 0 (score: 1)

How about this:

((\w|\s)+),((\w|\s)+)?,\s*(\w{2})\s*,\s*(\d{5})*

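You can also use it to extract the street, city, state, and ZIP code from \1, \3, \5, and \6 respectively. Groups \2 and \4 separately capture the last character of the street and the city, but that does not affect validity.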
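
A small sketch of how those group numbers line up, assuming the pattern reads as written in the code below (lowercase `\w`, `\s`, and `\d`). Under that reading the pattern requires a comma between the state and the ZIP code, so it covers only the second input format from the question.

```python
import re

# Assumed reading of the pattern from this answer.
pat = r'((\w|\s)+),((\w|\s)+)?,\s*(\w{2})\s*,\s*(\d{5})*'

m = re.match(pat, "717 N 2ND ST, MANKATO, MN, 56001")
street, city, state, zip_code = m.group(1, 3, 5, 6)
print(street)            # 717 N 2ND ST
print(city.strip())      # MANKATO
print(state, zip_code)   # MN 56001

# Groups 2 and 4 hold only the last character of the street and the city.
print(m.group(2), m.group(4))  # T O

# Without the comma before the ZIP code there is no match.
print(re.match(pat, "717 N 2ND ST, MANKATO, MN 56001"))  # None
```

The city group keeps the space that follows the first comma, hence the `strip()` call.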

Answer 1 (score: 1)

\d{1,6}\s\w+\s\w+\s[A-Za-z]{2},\s([A-Za-z]+),\s[A-Za-z]{2}(,\s\d{1,6}|\s\d{1,6})

You can test the regex at this link: https://regex101.com/r/yN7hU9/1

Answer 2 (score: 1)

You can use this:

\d{1,6}(\s\w+)+,(\s\w+)+,\s[A-Z]{2},?\s\d{1,6}

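It matches a string that starts with a house number, followed by one or more words and then a comma. It then looks for a city name made up of at least one word followed by a comma. Next it looks for exactly 2 uppercase letters, followed by an optional comma. Finally comes the ZIP code.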
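
To see the behavior described above, the pattern can be run against both input formats. This is only a sketch: the last two sample strings are made up, and `re.fullmatch` (Python 3.4+) is an assumption added here so that trailing junk is rejected, which the answer itself does not specify.

```python
import re

pat = re.compile(r'\d{1,6}(\s\w+)+,(\s\w+)+,\s[A-Z]{2},?\s\d{1,6}')

samples = [
    "717 N 2ND ST, MANKATO, MN 56001",    # state and ZIP separated by a space
    "717 N 2ND ST, MANKATO, MN, 56001",   # state and ZIP separated by a comma
    "717 N 2ND ST MANKATO MN 56001",      # commas missing
    "717 N 2ND ST, MANKATO, mn 56001",    # lowercase state
]

for s in samples:
    print(bool(pat.fullmatch(s)), s)
```

The first two strings are accepted and the last two are rejected. Passing `re.IGNORECASE`, as the pseudocode in the question does, would also accept the lowercase state.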

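Answer 3 (score: 1)

This might help. For maintainability, I tend to use verbose regular expressions with embedded comments.

Also note the use of (?P<name>pattern). It helps document the intent of each match, and it provides a handy mechanism for extracting data if your needs go beyond simple regex validation.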

import re

# Goal:  '717 N 2ND ST, MANKATO, MN 56001',
# Goal:  '717 N 2ND ST, MANKATO, MN, 56001',
regex = r'''(?xi)   # verbose + ignore case; since Python 3.11, inline flags must start the pattern
    (?P<HouseNumber>\d+)\s+        # Matches '717 '
    (?P<Direction>[news])\s+       # Matches 'N '
    (?P<StreetName>\w+)\s+         # Matches '2ND '
    (?P<StreetDesignator>\w+),\s+  # Matches 'ST, '
    (?P<TownName>.*),\s+           # Matches 'MANKATO, '
    (?P<State>[A-Z]{2}),?\s+       # Matches 'MN ' and 'MN, '
    (?P<ZIP>\d{5})                 # Matches '56001'
'''

regex = re.compile(regex)

for item in (
    '717 N 2ND ST, MANKATO, MN 56001',
    '717 N 2ND ST, MANKATO, MN, 56001',
    '717 N 2ND, Makata, 56001',   # Should reject this one
    '1234 N D AVE, East Boston, MA, 02134',
    ):
    match = regex.match(item)
    print(item)
    if match:
        print("    House is on {Direction} side of {TownName}".format(**match.groupdict()))
    else:
        print("    invalid entry")
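
As a sketch of the extraction use mentioned above, named groups can be read one at a time with `match.group(name)` or all at once with `match.groupdict()`. The shortened pattern `tail` is hypothetical and captures only the state and ZIP code; it is not the full pattern from this answer.

```python
import re

# Shortened, illustrative pattern: only the state and ZIP code are captured.
tail = re.compile(r'(?P<State>[A-Z]{2}),?\s+(?P<ZIP>\d{5})$', re.IGNORECASE)

m = tail.search('717 N 2ND ST, MANKATO, MN, 56001')
print(m.group('State'))   # MN
print(m.group('ZIP'))     # 56001
print(m.groupdict())      # {'State': 'MN', 'ZIP': '56001'}
```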

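To make some fields optional, replace + with *, since + means one or more while * means zero or more. Here is a version that matches the new requirements from the comments: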

import re

# Goal:  '717 N 2ND ST, MANKATO, MN 56001',
# Goal:  '717 N 2ND ST, MANKATO, MN, 56001',
# Goal:  '717 N 2ND ST NE, MANKATO, MN, 56001',
# Goal:  '717 N 2ND, MANKATO, MN, 56001',
regex = r'''(?xi)   # verbose + ignore case; since Python 3.11, inline flags must start the pattern
    (?P<HouseNumber>\d+)\s+         # Matches '717 '
    (?P<Direction>[news])\s+        # Matches 'N '
    (?P<StreetName>\w+)\s*          # Matches '2ND ', with optional trailing space
    (?P<StreetDesignator>\w*)\s*    # Optionally Matches 'ST '
    (?P<StreetDirection>[news]*)\s* # Optionally Matches 'NE'
    ,\s+                            # Force a comma after the street
    (?P<TownName>.*),\s+            # Matches 'MANKATO, '
    (?P<State>[A-Z]{2}),?\s+        # Matches 'MN ' and 'MN, '
    (?P<ZIP>\d{5})                  # Matches '56001'
'''

regex = re.compile(regex)

for item in (
    '717 N 2ND ST, MANKATO, MN 56001',
    '717 N 2ND ST, MANKATO, MN, 56001',
    '717 N 2ND, Makata, 56001',   # Should reject this one
    '1234 N D AVE, East Boston, MA, 02134',
    '717 N 2ND ST NE, MANKATO, MN, 56001',
    '717 N 2ND, MANKATO, MN, 56001',
    ):
    match = regex.match(item)
    print(item)
    if match:
        print("    House is on {Direction} side of {TownName}".format(**match.groupdict()))
    else:
        print("    invalid entry")
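
The difference between `+` and `*` can be checked in isolation. In this sketch the two small patterns, `required` and `optional`, are hypothetical simplifications of the street part and not the full pattern above.

```python
import re

# Designator required: one or more word characters.
required = re.compile(r'(?P<Name>\w+)\s+(?P<Designator>\w+),')
# Designator optional: zero or more word characters.
optional = re.compile(r'(?P<Name>\w+)\s*(?P<Designator>\w*),')

print(required.match('2ND ST,').groupdict())  # {'Name': '2ND', 'Designator': 'ST'}
print(required.match('2ND,'))                 # None
print(optional.match('2ND ST,').groupdict())  # {'Name': '2ND', 'Designator': 'ST'}
print(optional.match('2ND,').groupdict())     # {'Name': '2ND', 'Designator': ''}
```

An optional group that matches nothing still appears in `groupdict()`, as an empty string.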

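Next, consider the OR operator | and the non-capturing group operator (?:pattern). Together they can describe complex alternatives in the input format. This version matches the new requirement that some addresses have a direction before the street name and some have one after it, but no address has a direction in both places.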

import re

# Goal:  '717 N 2ND ST, MANKATO, MN 56001',
# Goal:  '717 N 2ND ST, MANKATO, MN, 56001',
# Goal:  '717 2ND ST NE, MANKATO, MN, 56001',
# Goal:  '717 N 2ND, MANKATO, MN, 56001',
regex = r'''(?xi)   # verbose + ignore case; since Python 3.11, inline flags must start the pattern
    (?: # Matches any sort of street address
        (?: # Matches '717 N 2ND ST' or '717 N 2ND'
            (?P<HouseNumber>\d+)\s+      # Matches '717 '
            (?P<Direction>[news])\s+     # Matches 'N '
            (?P<StreetName>\w+)\s*       # Matches '2ND ', with optional trailing space
            (?P<StreetDesignator>\w*)\s* # Optionally Matches 'ST '
        )
        | # OR
        (?:  # Matches '717 2ND ST NE' or '717 2ND NE'
            (?P<HouseNumber2>\d+)\s+      # Matches '717 '
            (?P<StreetName2>\w+)\s+       # Matches '2ND '
            (?P<StreetDesignator2>\w*)\s* # Optionally Matches 'ST '
            (?P<Direction2>[news]+)       # Matches 'NE'
        )
    )
    ,\s+                             # Force a comma after the street
    (?P<TownName>.*),\s+             # Matches 'MANKATO, '
    (?P<State>[A-Z]{2}),?\s+         # Matches 'MN ' and 'MN, '
    (?P<ZIP>\d{5})                   # Matches '56001'
'''

regex = re.compile(regex)

for item in (
    '717 N 2ND ST, MANKATO, MN 56001',
    '717 N 2ND ST, MANKATO, MN, 56001',
    '717 N 2ND, Makata, 56001',   # Should reject this one
    '1234 N D AVE, East Boston, MA, 02134',
    '717 2ND ST NE, MANKATO, MN, 56001',
    '717 N 2ND, MANKATO, MN, 56001',
    ):
    match = regex.match(item)
    print(item)
    if match:
        d = match.groupdict()
        print("    House is on {0} side of {1}".format(
            d['Direction'] or d['Direction2'],
            d['TownName']))
    else:
        print("    invalid entry")
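
Tying this back to the question, a validation helper might look like the sketch below. The function name `parse_address`, the simplified single pattern (with an optional comma between the state and the ZIP code, as in the answers above), and the use of `re.fullmatch` (Python 3.4+) are assumptions made for illustration, not code from any of the answers.

```python
import re

# Hypothetical, simplified pattern covering both formats from the question.
ADDRESS = re.compile(r'''
    (?P<Street>\d{1,6}(?:\s\w+)+),\s+   # '717 N 2ND ST, '
    (?P<City>[^,]+),\s+                 # 'MANKATO, '
    (?P<State>[A-Z]{2}),?\s+            # 'MN ' or 'MN, '
    (?P<ZIP>\d{5})                      # '56001'
''', re.VERBOSE | re.IGNORECASE)


def parse_address(addr):
    """Return the address parts as a dict, or raise ValueError."""
    match = ADDRESS.fullmatch(addr.strip())
    if match is None:
        raise ValueError(
            '"{}" must match this format: '
            '"717 N 2ND ST, MANKATO, MN 56001"'.format(addr))
    return match.groupdict()


print(parse_address('717 N 2ND ST, MANKATO, MN 56001'))
print(parse_address('717 N 2ND ST, MANKATO, MN, 56001'))
```

Because the comma before the ZIP code is optional, one pattern handles both formats and the two-branch check from the question is not needed.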