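I have address data that I need to standardize. This includes cleaning up elements such as Rd to Road and Dr to Drive.
However, I'm completely stumped on how to distinguish between Street and Saint. Both are abbreviated as St.
Has anyone done something like this before? Any ideas on how to solve it?
My code so far (adapted from here); note st mary's road in the last row: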
import re
import pandas as pd
# set up a df with fake addresses:
adds = pd.DataFrame({'address':['1 main st','2 garden dr.','4 foo apts','7 orchard gdns','st mary\'s road']})
print(adds)
          address
0       1 main st
1    2 garden dr.
2      4 foo apts
3  7 orchard gdns
4  st mary's road
# set up a dictionary of names to change
def suffixDict():
    return {'dr': 'drive',
            'rd': 'road',
            'st': 'Street',  # or 'st': 'Saint' ??
            'apts': 'apartments',
            'gdns': 'gardens'}

# function to fix suffixes
def normalizeStreetSuffixes(inputValue):
    abbv = suffixDict()  # get dict
    words = inputValue.split()  # split address line
    for i, word in enumerate(words):
        w = word.lower()  # lowercase
        w = re.sub(r'[^\w\'\s]*', '', w)  # remove some special characters
        rep = abbv[w] if w in abbv.keys() else words[i]  # check dict
        words[i] = (rep[0].upper() + rep[1:])  # proper case
    return ' '.join(words)  # return cleaned address line

# apply function to address data
adds.address.apply(normalizeStreetSuffixes)
0         1 Main Street
1        2 Garden Drive
2      4 Foo Apartments
3     7 Orchard Gardens
4    Street Mary's Road
You can see that "st mary's road" has been changed to "Street Mary's Road" instead of "Saint Mary's Road".
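One possible direction, as a rough sketch only: decide by position. It assumes that an `st` at the very end of the address line is a street suffix, and that an `st` followed by another word is a prefix to a name, as in "st mary's". The names `SUFFIXES`, `expand_st` and `normalize_address` are hypothetical helpers made up for this illustration, and the rule has not been checked against real address data.

```python
import re

# Same abbreviations as in the question, minus the ambiguous 'st'.
SUFFIXES = {'dr': 'drive', 'rd': 'road', 'apts': 'apartments', 'gdns': 'gardens'}


def expand_st(words, i):
    """Guess whether the 'st' at index i means 'street' or 'saint'.

    Assumption: a trailing 'st' is a street suffix, while an 'st'
    followed by another word is a prefix to a name.
    """
    return 'street' if i == len(words) - 1 else 'saint'


def normalize_address(line):
    words = line.split()
    # Lowercased keys with punctuation (except apostrophes) stripped.
    keys = [re.sub(r"[^\w']", '', w.lower()) for w in words]
    out = []
    for i, (word, key) in enumerate(zip(words, keys)):
        if key == 'st':
            rep = expand_st(keys, i)
        else:
            rep = SUFFIXES.get(key, word)  # unknown words pass through
        out.append(rep[:1].upper() + rep[1:])  # proper case
    return ' '.join(out)


print(normalize_address("st mary's road"))
print(normalize_address("1 main st"))
```

This handles both rows from the example, and even "1 st mary's st", but the positional assumption breaks as soon as something follows the suffix: "1 main st apt 4" would come out as "1 Main Saint Apt 4". A fuller version would probably need to look at the neighbouring tokens (a unit designator or number after `st`, a house number before it) or fall back to a dedicated address-parsing library.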