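I'm working on some address-cleaning/geocoding software, and I recently ran into a particular address format that is giving me trouble.
My external geocoding module fails to find addresses such as 30 w 60th new york (the correct format of the address is 30 w 60th street new york).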
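Basically, what I need to do is parse the string and check the following:
Is there a number followed by th, st, nd, or rd (plus a space after it)? E.g. 33rd, 34th, 21st, 24th.
Is the next word street? If it is, do nothing.
If it is not, add the word street immediately after that pattern.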
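Is a regex the best way to approach this situation?
Further clarification: I don't have any problem with other address suffixes such as avenue, road, and so on. I analyze very large datasets (I run about 12,000 addresses a day through my application), and the instances where street is left out are what cause me the most headaches. I have looked into address-parsing modules such as usaddress, smartystreets, etc. I really just need a clean (hopefully regex-based) solution to the specific problem I've described.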
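I'm thinking of something along these lines:
Check whether the next element is street. If so, do nothing.
If not, rebuild the list as [:targetword + len(targetword)] + 'street' + [:targetword + len(targetword)] (targetword would be 47th, or whatever is in the string).
I'm not the best with regex, so I'm looking for some input.
Thanks.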
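The list-based plan described in the question can be sketched roughly as follows. This is only an illustration under stated assumptions: the name add_missing_street and the set of accepted street types are hypothetical, words are assumed to be separated by whitespace, and runs of whitespace are collapsed to single spaces when the address is rebuilt.

```python
ORDINAL_SUFFIXES = ("st", "nd", "rd", "th")
# Assumption: these are the street types that should be left alone.
KNOWN_STREET_TYPES = {"street", "avenue", "road"}

def add_missing_street(address):
    """Insert 'street' after the first ordinal word that lacks a street type."""
    words = address.split()
    for i, word in enumerate(words):
        lowered = word.lower()
        # An ordinal is digits followed by st/nd/rd/th, e.g. '60th' or '121st'.
        if lowered.endswith(ORDINAL_SUFFIXES) and lowered[:-2].isdigit():
            next_word = words[i + 1].lower() if i + 1 < len(words) else ""
            if next_word not in KNOWN_STREET_TYPES:
                words.insert(i + 1, "street")
            break  # only the first ordinal is considered
    return " ".join(words)

print(add_missing_street("30 w 60th new york"))
```

Working on whole words avoids index arithmetic on the string, at the cost of normalising the spacing of the input.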
Answer 0 (score: 2)
It looks like you are looking for a regex. =P
Here is some code I put together for you:
import re

def check_th_add_street(address):
    # compile regexp rule
    has_th_st_nd_rd = re.compile(r"(?P<number>[\d]{1,3}(st|nd|rd|th)\s)(?P<following>.*)")
    # first check if the address has a number followed by something like 'th, st, nd, rd'
    has_number = has_th_st_nd_rd.search(address)
    if has_number is not None:
        # then check if not followed by 'street'
        if re.match('street', has_number.group('following')) is None:
            # then add the 'street' word
            new_address = re.sub(r'(?P<number>[\d]{1,3}(st|nd|rd|th)\s)', r'\g<number>street ', address)
            return new_address
        else:
            return True # the format is good (followed by 'street')
    else:
        return True # there is no number like 'th, st, nd, rd'
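I'm a Python learner, so please let me know whether it solves your problem.
I tested it on a small set of addresses.
I hope it helps or leads you to a solution.
Thanks!
EDIT
You need to take care when the number is followed by "avenue" or "road" as well as "street":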
import re

def check_th_add_street(address):
    # compile regexp rule
    has_th_st_nd_rd = re.compile(r'(?P<number>[\d]{1,3}(th|st|nd|rd)\s)(?P<following>.*)')
    # first check if the address has a number followed by something like 'th, st, nd, rd'
    has_number = has_th_st_nd_rd.search(address)
    if has_number is not None:
        # check if followed by "avenue" or "road" or "street"
        if re.match(r'(avenue|road|street)', has_number.group('following')):
            return True # do nothing
        # else add the "street" word
        else:
            new_address = re.sub(r'(?P<number>[\d]{1,3}(st|nd|rd|th)\s)', r'\g<number>street ', address)
            return new_address
    else:
        return True # there is no number like 'th, st, nd, rd'
RE-EDIT
I made some improvements based on your needs and added a usage example:
import re

# build the original address list, including badly formatted entries
address_list = [
    '30 w 60th new york',
    '30 w 60th new york',
    '30 w 21st new york',
    '30 w 23rd new york',
    '30 w 1231st new york',
    '30 w 1452nd new york',
    '30 w 1300th new york',
    '30 w 1643rd new york',
    '30 w 22nd new york',
    '30 w 60th street new york',
    '30 w 60th street new york',
    '30 w 21st street new york',
    '30 w 22nd street new york',
    '30 w 23rd street new york',
    '30 w brown street new york',
    '30 w 1st new york',
    '30 w 2nd new york',
    '30 w 116th new york',
    '30 w 121st avenue new york',
    '30 w 121st road new york',
    '30 w 123rd road new york',
    '30 w 12th avenue new york',
    '30 w 151st road new york',
    '30 w 15th road new york',
    '30 w 16th avenue new york'
]
def check_th_add_street(address):
    # compile regexp rule
    has_th_st_nd_rd = re.compile(r'(?P<number>[\d]{1,4}(th|st|nd|rd)\s)(?P<following>.*)')
    # first check if the address has a number followed by something like 'th, st, nd, rd'
    has_number = has_th_st_nd_rd.search(address)
    if has_number is not None:
        # check if followed by "avenue" or "road" or "street"
        if re.match(r'(avenue|road|street)', has_number.group('following')):
            return address # return original address
        # else add the "street" word
        else:
            new_address = re.sub(r'(?P<number>[\d]{1,4}(st|nd|rd|th)\s)', r'\g<number>street ', address)
            return new_address
    else:
        return address # there is no number like 'th, st, nd, rd' -> return original address
# initialisation of the new list
new_address_list = []
# build the new clean list
for address in address_list:
    new_address_list.append(check_th_add_street(address))
    # or you could use it straight here, i.e.:
    # address = check_th_add_street(address)
    # print(address)
# use the new list to do your work
for address in new_address_list:
    print("Formatted address is : %s" % address) # or whatever you want to do with 'address'
This will output:
Formatted address is : 30 w 60th street new york
Formatted address is : 30 w 60th street new york
Formatted address is : 30 w 21st street new york
Formatted address is : 30 w 23rd street new york
Formatted address is : 30 w 1231st street new york
Formatted address is : 30 w 1452nd street new york
Formatted address is : 30 w 1300th street new york
Formatted address is : 30 w 1643rd street new york
Formatted address is : 30 w 22nd street new york
Formatted address is : 30 w 60th street new york
Formatted address is : 30 w 60th street new york
Formatted address is : 30 w 21st street new york
Formatted address is : 30 w 22nd street new york
Formatted address is : 30 w 23rd street new york
Formatted address is : 30 w brown street new york
Formatted address is : 30 w 1st street new york
Formatted address is : 30 w 2nd street new york
Formatted address is : 30 w 116th street new york
Formatted address is : 30 w 121st avenue new york
Formatted address is : 30 w 121st road new york
Formatted address is : 30 w 123rd road new york
Formatted address is : 30 w 12th avenue new york
Formatted address is : 30 w 151st road new york
Formatted address is : 30 w 15th road new york
Formatted address is : 30 w 16th avenue new york
RE-EDIT
Final function: added the count parameter to re.sub()
def check_th_add_street(address):
    # compile regexp rule
    has_th_st_nd_rd = re.compile(r'(?P<number>[\d]{1,4}(th|st|nd|rd)\s)(?P<following>.*)')
    # first check if the address has a number followed by something like 'th, st, nd, rd'
    has_number = has_th_st_nd_rd.search(address)
    if has_number is not None:
        # check if followed by "avenue" or "road" or "street"
        if re.match(r'(avenue|road|street)', has_number.group('following')):
            return address # do nothing
        # else add the "street" word
        else:
            # count is the maximum number of pattern occurrences to be replaced
            new_address = re.sub(r'(?P<number>[\d]{1,4}(st|nd|rd|th)\s)', r'\g<number>street ', address, count=1)
            return new_address
    else:
        return address # there is no number like 'th, st, nd, rd'
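As a more compact variant of the search-then-substitute function above, here is a minimal sketch that folds the "already followed by street, avenue or road" check into the pattern itself with a negative lookahead. The name add_street and the list of accepted street types are assumptions for illustration. Unlike the function above, it accepts any number of digits, ignores case, and does not require a space after the ordinal, so an ordinal at the very end of the string is also handled.

```python
import re

# Hypothetical helper; the accepted street types are an assumption.
ORDINAL_WITHOUT_TYPE = re.compile(
    r"\b(\d+(?:st|nd|rd|th))\b"          # an ordinal such as 60th or 121st
    r"(?!\s+(?:street|avenue|road)\b)",  # not already followed by a street type
    re.IGNORECASE,
)

def add_street(address):
    # count=1 limits the change to the first ordinal that lacks a street type
    return ORDINAL_WITHOUT_TYPE.sub(r"\1 street", address, count=1)

print(add_street("30 w 60th new york"))
print(add_street("30 w 121st avenue new york"))
```

The first call prints 30 w 60th street new york, and the second leaves the address unchanged because the ordinal is already followed by avenue.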
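Answer 1 (score: 1)
While you could certainly solve this kind of problem with a regex, I can't help but think that there is most likely a Python library that has already solved it. I've never used any of these, but a quick search turned up: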
https://github.com/datamade/usaddress
https://pypi.python.org/pypi/postal-address
https://github.com/SwoopSearch/pyaddress
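PyParsing also has an address example here that you could take a look at: http://pyparsing.wikispaces.com/file/view/streetAddressParser.py
You could also look at this previous question: is there a library for parsing US addresses?
Is there any reason you can't use a third-party library to solve the problem?
EDIT: Pyparsing moved their URL: https://github.com/pyparsing/pyparsing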
Answer 2 (score: 0)
You could do this by converting each string to a list and looking for certain groups of characters in those lists. For example:
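def check_th(address):
    addressList = list(address)
    # enumerate() yields every position; list.index() would always return the first 't'
    for charIndex, character in enumerate(addressList):
        # need two characters before the 't' and one after it
        if character == 't' and 2 <= charIndex < len(addressList) - 1:
            if addressList[charIndex + 1] == 'h':
                numberList = [addressList[charIndex - 2], addressList[charIndex - 1]]
                number = ''.join(numberList)
                if number.isdigit(): # skip words such as 'north'
                    return int(number)
    return None # no two-digit number followed by 'th' was found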
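This looks pretty messy, but it should get the job done as long as the number is two digits long. If you need to look for many different things, though, you should look for a more convenient and simpler way.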
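One way to lift the two-digit restriction is to work on whole words instead of single characters. The following is a minimal sketch, assuming words are separated by whitespace; find_ordinal is a hypothetical name, not something from the answer, and it also covers the st, nd and rd suffixes.

```python
ORDINAL_SUFFIXES = ("st", "nd", "rd", "th")

def find_ordinal(address):
    """Return the number of the first ordinal word (e.g. 116 for '116th'), or None."""
    for word in address.lower().split():
        # Everything before the two-letter suffix must be digits,
        # which rules out ordinary words such as 'north' or 'west'.
        if word.endswith(ORDINAL_SUFFIXES) and word[:-2].isdigit():
            return int(word[:-2])
    return None

print(find_ordinal("30 w 116th new york"))
```

Because the digits are taken from the whole word, the length of the street number no longer matters.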
Answer 3 (score: 0)
To check for and add the word street, the following function should work as long as the street number comes before its name:
def check_add_street(address):
    addressList = list(address)
    newIndex = None
    # enumerate() yields every position; list.index() would always return the first match
    for charIndex, character in enumerate(addressList[:-1]):
        # a suffix only counts when it directly follows a digit, e.g. '60th'
        if charIndex == 0 or not addressList[charIndex - 1].isdigit():
            continue
        if character == 't':
            if addressList[charIndex + 1] == 'h':
                newIndex = charIndex + 1
                break
        elif character == 's':
            if addressList[charIndex + 1] == 't':
                newIndex = charIndex + 1
                break
        elif character == 'n':
            if addressList[charIndex + 1] == 'd':
                newIndex = charIndex + 1
                break
        elif character == 'r':
            if addressList[charIndex + 1] == 'd':
                newIndex = charIndex + 1
                break
    if newIndex is None:
        return address # no suffix such as 'th, st, nd, rd' was found
    following = ''.join(addressList[newIndex + 1:])
    if following != ' street' and not following.startswith(' street '):
        # keep everything up to the suffix, insert ' street', then append the rest
        newAddressList = addressList[:newIndex + 1] + list(' street') + addressList[newIndex + 1:]
        return ''.join(str(x) for x in newAddressList)
    else:
        return ''.join(str(x) for x in addressList)
This will add "street" if it is not already there, provided the addresses consistently follow the format you gave above.