Search for a pattern in a string and add characters if found

Time: 2016-08-10 18:20:44

Tags: python

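I'm working on some address cleaning/geocoding software, and I recently ran into a specific address format that is causing me some problems.

My external geocoding module has trouble finding addresses such as 30 w 60th new york (30 w 60th street new york is the proper format of the address).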

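Basically, what I need to do is parse the string and check the following:

  1. Are there any numbers followed by th, st, nd or rd (plus a space after them)? I.e. 33rd 34th 21st 24th
  2. If so, are they followed by the word street?
  3. If yes, do nothing.

If not, add the word street immediately after the specific pattern.

Would regex be the best way to go about this situation?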

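Further clarification: I'm not having any issues with other address suffixes such as avenue, road, and so on. I've analyzed very large data sets (I run about 12,000 addresses/day through my application), and the instances where street is left off are what cause me the most headaches. I've looked into address parsing modules such as usaddress, smartystreets and others. I really just need a clean (hopefully regex) solution to the specific problem I've described.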

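I was thinking something along these lines:

1. Convert the string to a list.
2. Find the index of the element in the list that meets the criteria I've explained.
3. Check whether the next element is street. If so, do nothing.
4. If not, rebuild the list with [:targetword + len(targetword)] + 'street' + [:targetword + len(targetword)]. (targetword would be 47th or whatever is in the string.)
5. Join the list back into a string.

I'm not exactly the best with regex, so I'm looking for some input.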
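The list-based steps above can be sketched roughly as follows. This is only a minimal illustration using plain string methods and no regex; the helper names is_ordinal and add_street are made up for the example, and it checks only for the word street, as the steps describe.

```python
SUFFIXES = ('st', 'nd', 'rd', 'th')


def is_ordinal(word):
    # hypothetical helper: True for tokens such as '60th' or '21st'
    # (one or more digits followed by one of the four suffixes)
    return len(word) > 2 and word[:-2].isdigit() and word.endswith(SUFFIXES)


def add_street(address):
    # convert the string to a list of words
    words = address.split()
    # find the index of the first element that looks like an ordinal
    for i, word in enumerate(words):
        if is_ordinal(word):
            # if the next element is already 'street', do nothing
            if i + 1 < len(words) and words[i + 1] == 'street':
                return address
            # otherwise rebuild the list with 'street' right after the ordinal
            words = words[:i + 1] + ['street'] + words[i + 1:]
            break
    # join the list back into a string
    return ' '.join(words)


print(add_street('30 w 60th new york'))
print(add_street('30 w 60th street new york'))
```

One caveat of this sketch: when an address is modified, split() and join() collapse any repeated whitespace into single spaces.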

Thanks.

4 answers:

Answer 0 (score: 2)

It seems you are looking for regular expressions. =P

Here is some code I built specially for you:

import re


def check_th_add_street(address):
    # compile regexp rule
    has_th_st_nd_rd = re.compile(r"(?P<number>[\d]{1,3}(st|nd|rd|th)\s)(?P<following>.*)")

    # first check if the address has number followed by something like 'th, st, nd, rd'
    has_number = has_th_st_nd_rd.search(address)
    if has_number is not None:
        # then check if not followed by 'street'
        if re.match('street', has_number.group('following')) is None:
            # then add the 'street' word
            new_address = re.sub(r'(?P<number>[\d]{1,3}(st|nd|rd|th)\s)', r'\g<number>street ', address)
            return new_address
        else:
            return True # the format is good (followed by 'street')
    else:
        return True # there is no number like 'th, st, nd, rd'

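I'm a Python learner myself, so thanks for letting me know whether it solves your problem.

Tested on a small list of addresses.

Hope it helps or leads you to a solution.

Thank you!

EDIT

Improved so that it also takes care of the case where the number is followed by "avenue" or "road" as well as "street":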

import re


def check_th_add_street(address):
    # compile regexp rule
    has_th_st_nd_rd = re.compile(r'(?P<number>[\d]{1,3}(th|st|nd|rd)\s)(?P<following>.*)')

    # first check if the address has number followed by something like 'th, st, nd, rd'
    has_number = has_th_st_nd_rd.search(address)
    if has_number is not None:
        # check if followed by "avenue" or "road" or "street"
        if re.match(r'(avenue|road|street)', has_number.group('following')):
            return True # do nothing
        # else add the "street" word
        else:
            # then add the 'street' word
            new_address = re.sub(r'(?P<number>[\d]{1,3}(st|nd|rd|th)\s)', r'\g<number>street ', address)
            return new_address
    else:
        return True # there is no number like 'th, st, nd, rd'

RE-EDIT

I made some improvements based on your needs and added a usage example:

import re


# build the original address list includes bad format
address_list = [
    '30 w 60th new york',
    '30 w 60th new york',
    '30 w 21st new york',
    '30 w 23rd new york',
    '30 w 1231st new york',
    '30 w 1452nd new york',
    '30 w 1300th new york',
    '30 w 1643rd new york',
    '30 w 22nd new york',
    '30 w 60th street new york',
    '30 w 60th street new york',
    '30 w 21st street new york',
    '30 w 22nd street new york',
    '30 w 23rd street new york',
    '30 w brown street new york',
    '30 w 1st new york',
    '30 w 2nd new york',
    '30 w 116th new york',
    '30 w 121st avenue new york',
    '30 w 121st road new york',
    '30 w 123rd road new york',
    '30 w 12th avenue new york',
    '30 w 151st road new york',
    '30 w 15th road new york',
    '30 w 16th avenue new york'
]


def check_th_add_street(address):
    # compile regexp rule
    has_th_st_nd_rd = re.compile(r'(?P<number>[\d]{1,4}(th|st|nd|rd)\s)(?P<following>.*)')

    # first check if the address has number followed by something like 'th, st, nd, rd'
    has_number = has_th_st_nd_rd.search(address)
    if has_number is not None:
        # check if followed by "avenue" or "road" or "street"
        if re.match(r'(avenue|road|street)', has_number.group('following')):
            return address # return original address
        # else add the "street" word
        else:
            new_address = re.sub(r'(?P<number>[\d]{1,4}(st|nd|rd|th)\s)', r'\g<number>street ', address)
            return new_address
    else:
        return address # there is no number like 'th, st, nd, rd' -> return original address


# initialisation of the new list
new_address_list = []

# build the new clean list
for address in address_list:
    new_address_list.append(check_th_add_street(address))
    # or you could use it straight here i.e. :
    # address = check_th_add_street(address)
    # print address

# use the new list to do your work
for address in new_address_list:
    print("Formated address is : %s" % address) # or whatever you want to do with 'address'

Will output:

Formated address is : 30 w 60th street new york
Formated address is : 30 w 60th street new york
Formated address is : 30 w 21st street new york
Formated address is : 30 w 23rd street new york
Formated address is : 30 w 1231st street new york
Formated address is : 30 w 1452nd street new york
Formated address is : 30 w 1300th street new york
Formated address is : 30 w 1643rd street new york
Formated address is : 30 w 22nd street new york
Formated address is : 30 w 60th street new york
Formated address is : 30 w 60th street new york
Formated address is : 30 w 21st street new york
Formated address is : 30 w 22nd street new york
Formated address is : 30 w 23rd street new york
Formated address is : 30 w brown street new york
Formated address is : 30 w 1st street new york
Formated address is : 30 w 2nd street new york
Formated address is : 30 w 116th street new york
Formated address is : 30 w 121st avenue new york
Formated address is : 30 w 121st road new york
Formated address is : 30 w 123rd road new york
Formated address is : 30 w 12th avenue new york
Formated address is : 30 w 151st road new york
Formated address is : 30 w 15th road new york
Formated address is : 30 w 16th avenue new york
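For comparison, the same cleanup can probably be written as a single re.sub() call with a negative lookahead. This is only a sketch under stated assumptions: the name add_missing_street is made up for the example, and the suffix list (street, avenue, road) is taken from the sample addresses above rather than from any complete list of street types.

```python
import re

# An ordinal such as 60th or 121st that is NOT already followed by
# one of the known suffixes (the list of suffixes is an assumption).
ORDINAL_WITHOUT_SUFFIX = re.compile(
    r'\b(\d+(?:st|nd|rd|th))\b(?!\s+(?:street|avenue|road)\b)'
)


def add_missing_street(address):
    # hypothetical helper: insert ' street' after the first bare ordinal
    return ORDINAL_WITHOUT_SUFFIX.sub(r'\1 street', address, count=1)


for sample in ['30 w 60th new york',
               '30 w 60th street new york',
               '30 w 121st avenue new york',
               '30 w brown street new york']:
    print(add_missing_street(sample))
```

The lookahead does the "is it already followed by a suffix?" check inside the pattern itself, so no separate search() step is needed.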

RE-RE-EDIT

Final function: added the count parameter to re.sub()

def check_th_add_street(address):
    # compile regexp rule
    has_th_st_nd_rd = re.compile(r'(?P<number>[\d]{1,4}(th|st|nd|rd)\s)(?P<following>.*)')

    # first check if the address has number followed by something like 'th, st, nd, rd'
    has_number = has_th_st_nd_rd.search(address)
    if has_number is not None:
        # check if followed by "avenue" or "road" or "street"
        if re.match(r'(avenue|road|street)', has_number.group('following')):
            return address # do nothing
        # else add the "street" word
        else:
            # then add the 'street' word
            new_address = re.sub(r'(?P<number>[\d]{1,4}(st|nd|rd|th)\s)', r'\g<number>street ', address, 1) # the last parameter is the maximum number of pattern occurrences to be replaced
            return new_address
    else:
        return address # there is no number like 'th, st, nd, rd'
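To illustrate what the count argument changes, here is a small sketch. The sample address with two ordinals is invented for the demonstration; the pattern is the same one used in the function above.

```python
import re

ORDINAL = r'(?P<number>[\d]{1,4}(st|nd|rd|th)\s)'

# made-up address containing two ordinals
address = '30 w 60th and 5th new york'

# without count, every match is replaced
all_replaced = re.sub(ORDINAL, r'\g<number>street ', address)

# with count=1, only the first match is replaced
first_only = re.sub(ORDINAL, r'\g<number>street ', address, count=1)

print(all_replaced)  # 30 w 60th street and 5th street new york
print(first_only)    # 30 w 60th street and 5th new york
```

Since the function only inspects what follows the first ordinal it finds, limiting the substitution to one occurrence keeps the check and the replacement consistent with each other.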

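Answer 1 (score: 1)

While you could certainly use regex for this sort of problem, I can't help but think that there's most likely a Python library out there that has already solved it for you. I've never used any of these, but a quick search turned up: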

https://github.com/datamade/usaddress

https://pypi.python.org/pypi/postal-address

https://github.com/SwoopSearch/pyaddress

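PyParsing also has an address example here that you might look at: http://pyparsing.wikispaces.com/file/view/streetAddressParser.py

You might also take a look at this earlier question: is there a library for parsing US addresses?

Is there any reason why you can't just use a third-party library to solve the problem?

EDIT: Pyparsing moved their URL: https://github.com/pyparsing/pyparsing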

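Answer 2 (score: 0)

You could probably do this by converting each string to a list and looking for certain groups of characters in those lists. For example: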

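def check_th(address):
    addressList = list(address)
    # enumerate gives the real position of each character;
    # list.index() would return the first 't' in the address every time
    for charIndex, character in enumerate(addressList):
        # look for 't' followed by 'h', with room for two characters before it
        if character == 't' and charIndex >= 2 and addressList[charIndex + 1:charIndex + 2] == ['h']:
            numberList = [addressList[charIndex - 2], addressList[charIndex - 1]]
            number = ''.join(numberList)
            # skip a 'th' that is not preceded by two digits (e.g. in a street name)
            if number.isdigit():
                return int(number)
    return None # no two-digit number followed by 'th' was found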

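This looks really messy, but it should get the job done as long as the number is two digits long. However, if there are many things you need to look for, you should probably look for a more convenient and simpler way to do it.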
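As one possible way around the two-digit limitation, the scan can work on whole words instead of single characters. This is just a sketch, not a replacement for a proper parser; the function name find_ordinal_number is hypothetical.

```python
def find_ordinal_number(address):
    # hypothetical helper: return the number of the first ordinal word
    # (e.g. 60 for '60th', 1231 for '1231st'), or None if there is none
    for word in address.split():
        digits, suffix = word[:-2], word[-2:]
        # ''.isdigit() is False, so short words such as 'w' or '30' are skipped
        if digits.isdigit() and suffix in ('st', 'nd', 'rd', 'th'):
            return int(digits)
    return None


print(find_ordinal_number('30 w 60th new york'))
print(find_ordinal_number('30 w 1231st new york'))
print(find_ordinal_number('30 w brown street new york'))
```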

Answer 3 (score: 0)

To check for and add the word street, the following function should work as long as the street number comes before its name:

def check_add_street(address):

    addressList = list(address)
    suffixes = (['t', 'h'], ['s', 't'], ['n', 'd'], ['r', 'd'])

    # find the first digit that is directly followed by th, st, nd or rd;
    # newIndex is the index of the last character of that suffix
    newIndex = None
    for n in range(1, len(addressList) - 1):
        if addressList[n - 1].isdigit() and addressList[n:n + 2] in suffixes:
            newIndex = n + 1
            break

    # no number followed by th, st, nd or rd: return the address unchanged
    if newIndex is None:
        return address

    # ' street' already follows the number: return the address unchanged
    if addressList[newIndex + 1:newIndex + 8] == list(' street'):
        return address

    # otherwise rebuild the list with ' street' inserted right after the suffix
    newAddressList = addressList[:newIndex + 1] + list(' street') + addressList[newIndex + 1:]

    return ''.join(newAddressList)

This will add "street" if it is not already there, provided that the format you gave above is consistent.