Question

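I'm working on some address cleaning/geocoding software, and I recently ran into a particular address format that is causing me problems.

My external geocoding module fails to find addresses such as 30 w 60th new york (30 w 60th street new york is the correct format for the address).

Basically, what I need to do is parse the string and check the following:

Are there any numbers followed by th, st, nd, or rd (plus the space that follows them)? E.g. 33rd 34th 21st 24th
If so, is the word street what follows?

If yes, do nothing.

If not, add the word street immediately after that pattern.

Is a regular expression the best way to handle this situation?

Further clarification: I have no problems with other address suffixes such as avenue, road, and so on. I've analyzed very large datasets (I run about 12,000 addresses/day through my application), and the instances where street is omitted are what cause me the most headaches. I've looked into address-parsing modules such as usaddress, smartystreets, and others. I really just need a clean (hopefully regex-based) solution to the specific problem I've described.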

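I'm thinking of something along these lines:

Convert the string to a list.
Find the element in the list that matches the pattern I've described.
Check whether the next element is street. If so, do nothing.
If not, rebuild the list using [:targetword + len(targetword)] + 'street' + [:targetword + len(targetword)]. (targetword would be 47th or whatever is in the string.)
Join the list back into a string.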
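
The steps above can be sketched roughly as follows. This is only a minimal illustration, assuming lowercase, whitespace-separated addresses; the function name add_street_tokens and the suffixes parameter are made up for the example and are not from any library.

```python
import re

# A whole token such as 60th, 21st, 22nd or 23rd
ORDINAL = re.compile(r'\d+(?:st|nd|rd|th)$')


def add_street_tokens(address, suffixes=('street',)):
    """Insert 'street' after the first ordinal token unless a known suffix follows.

    `suffixes` is an assumption for this sketch: pass more words
    (e.g. 'avenue', 'road') if those should also be left alone.
    """
    tokens = address.split()                      # step 1: string -> list
    for i, token in enumerate(tokens):
        if ORDINAL.match(token):                  # step 2: find the ordinal
            next_token = tokens[i + 1] if i + 1 < len(tokens) else None
            if next_token not in suffixes:        # step 3: is a suffix next?
                tokens.insert(i + 1, 'street')    # step 4: add it if not
            break                                 # only the first ordinal is handled
    return ' '.join(tokens)                       # step 5: list -> string


print(add_street_tokens('30 w 60th new york'))
print(add_street_tokens('30 w 60th street new york'))
```

Note that split() followed by join() also collapses repeated whitespace, which may or may not be what you want for your data.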

I'm not the best with regular expressions, so I'm looking for some input.

Thanks.

Answer 1

It looks like you're looking for a regular expression. =P

Here is some code I built specifically for you:

import re


def check_th_add_street(address):
    # compile regexp rule
    has_th_st_nd_rd = re.compile(r"(?P<number>[\d]{1,3}(st|nd|rd|th)\s)(?P<following>.*)")

    # first check if the address has number followed by something like 'th, st, nd, rd'
    has_number = has_th_st_nd_rd.search(address)
    if has_number is not None:
        # then check if not followed by 'street'
        if re.match('street', has_number.group('following')) is None:
            # then add the 'street' word
            new_address = re.sub(r'(?P<number>[\d]{1,3}(st|nd|rd|th)\s)', r'\g<number>street ', address)
            return new_address
        else:
            return True # the format is good (followed by 'street')
    else:
        return True # there is no number like 'th, st, nd, rd'

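I'm a Python learner myself, so thanks for letting me know whether it solves your problem.

Tested on a small set of addresses.

Hope it helps or leads you to a solution.

Thanks!

Edit

You need to take care of the case where the number is followed by "avenue" or "road" as well as "street":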

import re


def check_th_add_street(address):
    # compile regexp rule
    has_th_st_nd_rd = re.compile(r'(?P<number>[\d]{1,3}(th|st|nd|rd)\s)(?P<following>.*)')

    # first check if the address has number followed by something like 'th, st, nd, rd'
    has_number = has_th_st_nd_rd.search(address)
    if has_number is not None:
        # check if followed by "avenue" or "road" or "street"
        if re.match(r'(avenue|road|street)', has_number.group('following')):
            return True # do nothing
        # else add the "street" word
        else:
            # then add the 'street' word
            new_address = re.sub(r'(?P<number>[\d]{1,3}(st|nd|rd|th)\s)', r'\g<number>street ', address)
            return new_address
    else:
        return True # there is no number like 'th, st, nd, rd'

Re-edit

I made some improvements based on your needs and added a usage example:

import re


# build the original address list includes bad format
address_list = [
    '30 w 60th new york',
    '30 w 60th new york',
    '30 w 21st new york',
    '30 w 23rd new york',
    '30 w 1231st new york',
    '30 w 1452nd new york',
    '30 w 1300th new york',
    '30 w 1643rd new york',
    '30 w 22nd new york',
    '30 w 60th street new york',
    '30 w 60th street new york',
    '30 w 21st street new york',
    '30 w 22nd street new york',
    '30 w 23rd street new york',
    '30 w brown street new york',
    '30 w 1st new york',
    '30 w 2nd new york',
    '30 w 116th new york',
    '30 w 121st avenue new york',
    '30 w 121st road new york',
    '30 w 123rd road new york',
    '30 w 12th avenue new york',
    '30 w 151st road new york',
    '30 w 15th road new york',
    '30 w 16th avenue new york'
]


def check_th_add_street(address):
    # compile regexp rule
    has_th_st_nd_rd = re.compile(r'(?P<number>[\d]{1,4}(th|st|nd|rd)\s)(?P<following>.*)')

    # first check if the address has number followed by something like 'th, st, nd, rd'
    has_number = has_th_st_nd_rd.search(address)
    if has_number is not None:
        # check if followed by "avenue" or "road" or "street"
        if re.match(r'(avenue|road|street)', has_number.group('following')):
            return address # return original address
        # else add the "street" word
        else:
            new_address = re.sub(r'(?P<number>[\d]{1,4}(st|nd|rd|th)\s)', r'\g<number>street ', address)
            return new_address
    else:
        return address # there is no number like 'th, st, nd, rd' -> return original address


# initialisation of the new list
new_address_list = []

# build the new clean list
for address in address_list:
    new_address_list.append(check_th_add_street(address))
    # or you could use it straight here i.e. :
    # address = check_th_add_street(address)
    # print(address)

# use the new list to do your work
for address in new_address_list:
    print("Formated address is : %s" % address) # or whatever you want to do with 'address'

Which will output:

Formated address is : 30 w 60th street new york
Formated address is : 30 w 60th street new york
Formated address is : 30 w 21st street new york
Formated address is : 30 w 23rd street new york
Formated address is : 30 w 1231st street new york
Formated address is : 30 w 1452nd street new york
Formated address is : 30 w 1300th street new york
Formated address is : 30 w 1643rd street new york
Formated address is : 30 w 22nd street new york
Formated address is : 30 w 60th street new york
Formated address is : 30 w 60th street new york
Formated address is : 30 w 21st street new york
Formated address is : 30 w 22nd street new york
Formated address is : 30 w 23rd street new york
Formated address is : 30 w brown street new york
Formated address is : 30 w 1st street new york
Formated address is : 30 w 2nd street new york
Formated address is : 30 w 116th street new york
Formated address is : 30 w 121st avenue new york
Formated address is : 30 w 121st road new york
Formated address is : 30 w 123rd road new york
Formated address is : 30 w 12th avenue new york
Formated address is : 30 w 151st road new york
Formated address is : 30 w 15th road new york
Formated address is : 30 w 16th avenue new york

Re-edit

Final function: added the count parameter to re.sub()

def check_th_add_street(address):
    # compile regexp rule
    has_th_st_nd_rd = re.compile(r'(?P<number>[\d]{1,4}(th|st|nd|rd)\s)(?P<following>.*)')

    # first check if the address has number followed by something like 'th, st, nd, rd'
    has_number = has_th_st_nd_rd.search(address)
    if has_number is not None:
        # check if followed by "avenue" or "road" or "street"
        if re.match(r'(avenue|road|street)', has_number.group('following')):
            return address # do nothing
        # else add the "street" word
        else:
            # then add the 'street' word
            new_address = re.sub(r'(?P<number>[\d]{1,4}(st|nd|rd|th)\s)', r'\g<number>street ', address, 1) # the last parameter is the maximum number of pattern occurrences to be replaced
            return new_address
    else:
        return address # there is no number like 'th, st, nd, rd'
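
To see what the count argument changes, here is a small sketch using the same ordinal pattern. The address with two ordinals is a made-up example, not one from the question's data.

```python
import re

# Same idea as the pattern above: 1-4 digits, an ordinal suffix, then whitespace
ORDINAL = r'(?P<number>\d{1,4}(?:st|nd|rd|th)\s)'

# Hypothetical address that happens to contain two ordinals
address = '30 w 60th 2nd floor new york'

# Without count, every ordinal gets 'street' appended
replaced_all = re.sub(ORDINAL, r'\g<number>street ', address)
print(replaced_all)    # 30 w 60th street 2nd street floor new york

# With count=1, only the first ordinal is rewritten
replaced_first = re.sub(ORDINAL, r'\g<number>street ', address, count=1)
print(replaced_first)  # 30 w 60th street 2nd floor new york
```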

Answer 2

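While you could certainly use a regular expression for this sort of problem, I can't help but think that there is most likely a Python library out there that has already solved it. I've never used any of these, but a quick search turned up: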

https://github.com/datamade/usaddress

https://pypi.python.org/pypi/postal-address

https://github.com/SwoopSearch/pyaddress

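PyParsing also has an address example here that you could take a look at: http://pyparsing.wikispaces.com/file/view/streetAddressParser.py

You could also take a look at this previous question: is there a library for parsing US addresses?

Is there any reason you can't use a third-party library to solve the problem?

Edit: Pyparsing moved their URL: https://github.com/pyparsing/pyparsing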

Answer 3

You can do this by converting each string to a list and looking for certain groups of characters in those lists. For example:

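def check_th(address):
    addressList = list(address)
    # enumerate yields each character's own position; addressList.index()
    # would always return the position of the first 't' in the string
    for charIndex, character in enumerate(addressList):
        # a 't' followed by 'h', with room for two characters before it
        if character == 't' and charIndex >= 2 and addressList[charIndex + 1:charIndex + 2] == ['h']:
            numberList = addressList[charIndex - 2:charIndex]
            # only accept the match when both preceding characters are digits
            if all(x.isdigit() for x in numberList):
                return int(''.join(numberList))
    return None  # no two-digit number followed by 'th' was found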

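This looks very messy, but it should get the job done as long as the number is two digits long. However, if you need to look for a lot of things, you should look for a more convenient and simpler way.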
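
One simpler way, sketched here with the standard re module, handles numbers of any length and all four suffixes. The function name find_ordinal_number is invented for this example, and lowercase input is assumed.

```python
import re

# One or more digits directly followed by an ordinal suffix, as a whole word
ORDINAL = re.compile(r'\b(\d+)(?:st|nd|rd|th)\b')


def find_ordinal_number(address):
    """Return the number of the first ordinal (60 for '60th'), or None if there is none."""
    match = ORDINAL.search(address)
    return int(match.group(1)) if match else None


print(find_ordinal_number('30 w 60th new york'))
print(find_ordinal_number('30 w 1231st new york'))
```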

Answer 4

To check for and add the word street, the following function should work as long as the street number comes before its name:

def check_add_street(address):

    addressList = list(address)
    newIndex = None

    # walk over positions rather than using addressList.index(), which
    # would always return the first occurrence of a character
    for charIndex in range(1, len(addressList) - 1):
        suffix = addressList[charIndex] + addressList[charIndex + 1]
        # an ordinal suffix must directly follow a digit and end the word
        if (suffix in ('th', 'st', 'nd', 'rd')
                and addressList[charIndex - 1].isdigit()
                and (charIndex + 2 == len(addressList) or addressList[charIndex + 2] == ' ')):
            newIndex = charIndex + 1  # index of the last character of the suffix
            break

    # no ordinal found: return the address unchanged
    if newIndex is None:
        return address

    rest = address[newIndex + 1:]

    # already followed by the word 'street': return the address unchanged
    if rest == ' street' or rest.startswith(' street '):
        return address

    # otherwise insert ' street' right after the ordinal
    return address[:newIndex + 1] + ' street' + rest

This will add "street" if it is not already there, provided the format you gave above is consistent.
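
For comparison, the same check-and-insert can be written as one substitution with a negative lookahead. This is a sketch, not a drop-in replacement: the name add_street and the list of suffixes to leave alone (street, avenue, road) are assumptions, and lowercase input is assumed.

```python
import re

# An ordinal (60th, 21st, ...) that is NOT already followed by a known suffix.
# The suffix list is an assumption for this sketch; extend it for your data.
MISSING_STREET = re.compile(
    r'\b(\d+(?:st|nd|rd|th))\b(?!\s+(?:street|avenue|road)\b)'
)


def add_street(address):
    """Append ' street' to the first ordinal that lacks a suffix."""
    return MISSING_STREET.sub(r'\1 street', address, count=1)


print(add_street('30 w 60th new york'))
print(add_street('30 w 12th avenue new york'))
```

Because the lookahead does not require a trailing space, this also covers an ordinal at the very end of the string. Keep in mind that count=1 limits the number of replacements, not the number of ordinals inspected, so an ordinal that already has a suffix is skipped and a later one may be rewritten instead.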

Search for a pattern in a string and add characters if found
