How to extract European street names, housing numbers and adjectives from pandas using a regex function?

Time: 2018-05-31 10:58:14

Tags: python regex pandas dataframe street-address

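I have a DataFrame with 6000 records and need to extract/split the Streetname column into "Streetname", "Housingnumber" and "Adjectives". Unfortunately, regex functions have not solved the problem so far, because the values in df["Streetname"] follow no consistent structure: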

**Input from df["Streetname"]**

St. edward's Lane 26

Vineyardlane3a

High Street 0-9

ParkRoad near #33

Queens Road ??

s-Georgelane9abc

Kings Road 9b

1st Park Avenue 67 near cyclelane 

**Desired output:**

df["Street"]                    df["housingnumber"]             df["adjective"]

St. Edward's lane               26

Vineyardlane                    3                               a

High Street                     0-9

ParkRoad                        33

Queens Road                    

s-Georgelane                    9                               abc

Kings Road                      9                               b 

1st Park Avenue                 67

I tried:

Filter = r'(?P<S>.*)(?P<H>\s[0-9].*)'

df["Streetname"] = df["Streetname"].str.extract(Filter)

I lose a lot of data and the result is written to only one column... I hope someone can help!

1 Answer:

Answer 0: (score: 0)

Not 100% perfect (I doubt that is possible without a database or a machine learning algorithm), but a starting point:

^                         # start of line/string
(?P<street>\w+?\D+)       # [a-zA-Z0-9_]+? followed by not a number
(?P<nr>\d*[-\d]*)         # optional digits, possibly followed by - and more digits
(?P<adjective>[a-zA-Z]*)  # a-z
.*                        # consume the rest of the string

See a demo on regex101.com


Afterwards, you may want to strip #, ? or whitespace from the end of street.
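As a rough sketch of how the pattern could be applied in pandas: the snippet below assumes the column is called Streetname as in the question, passes the commented pattern to Series.str.extract with re.VERBOSE, and then strips trailing #, ? and spaces from the street group. The names PATTERN and parts are only illustrative, and the sample rows are the ones from the question.

```python
import re

import pandas as pd

# Sample rows copied from the question.
df = pd.DataFrame({
    "Streetname": [
        "St. edward's Lane 26",
        "Vineyardlane3a",
        "High Street 0-9",
        "ParkRoad near #33",
        "Queens Road ??",
        "s-Georgelane9abc",
        "Kings Road 9b",
        "1st Park Avenue 67 near cyclelane ",
    ]
})

# The pattern from the answer. re.VERBOSE lets the inline comments stay
# inside the pattern (whitespace and "# ..." are ignored by the regex engine).
PATTERN = r"""
    ^                         # start of string
    (?P<street>\w+?\D+)       # word characters (lazy), then a run of non-digits
    (?P<nr>\d*[-\d]*)         # optional digits, possibly a range such as 0-9
    (?P<adjective>[a-zA-Z]*)  # optional letters directly after the number
    .*                        # consume the rest of the string
"""

# Named groups become the column names of the returned DataFrame.
parts = df["Streetname"].str.extract(PATTERN, flags=re.VERBOSE)

# Clean-up step: drop trailing spaces, '#' and '?' from the street part.
parts["street"] = parts["street"].str.rstrip(" #?")

# Attach the new columns next to the original one instead of overwriting it.
df = df.join(parts)
print(df[["street", "nr", "adjective"]])
```

Joining the extracted columns back onto the frame keeps the original Streetname values, so nothing is lost if a row is split badly. The result is still imperfect: a row such as "ParkRoad near #33" yields the street "ParkRoad near" rather than "ParkRoad", so free-text qualifiers like "near" would need separate handling.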