Cleaning address data: replacing "St" with "Street" and "St" with "Saint"... how to differentiate?

Time: 2019-08-09 14:58:01

Tags: python pandas

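I have address data that I want to standardize.

This includes cleaning up elements such as Rd to Road and Dr to Drive.

However, I am completely stumped on how to differentiate between Street and Saint. Both are abbreviated as St.

Has anyone done something like this before? Any ideas on how to solve it?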

My code so far (adapted from here). Watch what happens to st mary's road in the last row:

import re
import pandas as pd

# set up a df with fake addresses:
adds = pd.DataFrame({'address':['1 main st','2 garden dr.','4 foo apts','7 orchard gdns','st mary\'s road']})
print(adds)

          address
0       1 main st
1    2 garden dr.
2      4 foo apts
3  7 orchard gdns
4  st mary's road

# set up a dictionary of abbreviations to expand
def suffixDict():
    return {'dr': 'drive',
            'rd': 'road',
            'st': 'street',  # or 'st': 'saint' ??
            'apts': 'apartments',
            'gdns': 'gardens'}

# function to fix suffixes
def normalizeStreetSuffixes(inputValue):
    abbv = suffixDict()  # get dict
    words = inputValue.split()  # split address line
    for i, word in enumerate(words):
        w = word.lower()  # lowercase
        w = re.sub(r"[^\w']", '', w)  # strip punctuation other than apostrophes
        rep = abbv.get(w, word)  # expand if it is a known abbreviation
        words[i] = rep[0].upper() + rep[1:]  # capitalize first letter
    return ' '.join(words)  # return cleaned address line


# apply function to address data
adds.address.apply(normalizeStreetSuffixes)

0         1 Main Street
1        2 Garden Drive
2      4 Foo Apartments
3     7 Orchard Gardens
4    Street Mary's Road

You can see that "St Mary's Road" has been changed to "Street Mary's Road".
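One direction that might be worth trying is a positional heuristic: treat "st" as Saint when another word follows it, and as Street when it is the last word of the line. The sketch below is only an illustration of that idea under that assumption. The names `SUFFIXES`, `expand_st` and `normalize_address` are made up for this example, and the table simply mirrors the suffix dictionary above without the ambiguous `st` entry.

```python
import re

# Hypothetical lookup table: same abbreviations as the dictionary above,
# minus 'st', which is handled separately because it is ambiguous.
SUFFIXES = {"dr": "drive", "rd": "road", "apts": "apartments", "gdns": "gardens"}


def expand_st(words, i):
    """Guess whether the 'st' at index i means Saint or Street.

    Assumption (not a rule): 'st' followed by another word is a prefix
    (Saint); 'st' as the last word of the line is a suffix (Street).
    """
    return "saint" if i + 1 < len(words) else "street"


def normalize_address(line):
    words = line.split()
    # Lowercased keys with punctuation removed (apostrophes are kept).
    keys = [re.sub(r"[^\w']", "", w.lower()) for w in words]
    out = []
    for i, (word, key) in enumerate(zip(words, keys)):
        if key == "st":
            rep = expand_st(keys, i)
        else:
            rep = SUFFIXES.get(key, word)  # fall back to the original word
        out.append(rep[:1].upper() + rep[1:])  # capitalize first letter
    return " ".join(out)


print(normalize_address("st mary's road"))   # Saint Mary's Road
print(normalize_address("1 main st"))        # 1 Main Street
print(normalize_address("12 st john's st"))  # 12 Saint John's Street
```

It can be applied to the column the same way as before, with `adds.address.apply(normalize_address)`. The heuristic is easy to fool: a line such as "1 main st apt 4" has a word after "st" and would be read as Saint, so real data would probably need extra rules (for example, checking whether the next word is a unit designator or a town name).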

0 answers:

No answers yet