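I have a dataframe with 6000 records and need to extract/split the Streetname column into "Streetname", "Housingnumber" and "Adjectives". Unfortunately, regex functions have not solved the problem so far, because the notation in df["Streetname"] has no consistent structure: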
**Input from df["Streetname"]**
St. edward's Lane 26
Vineyardlane3a
High Street 0-9
ParkRoad near #33
Queens Road ??
s-Georgelane9abc
Kings Road 9b
1st Park Avenue 67 near cyclelane
**Desired output:
df["Street"] df["housingnumber"] df["adjective"]**
St. Edward's lane 26
Vineyardlane 3 a
High Street 0-9
ParkRoad 33
Queens Road
s-Georgelane 9 abc
Kings Road 9 b
1st Park Avenue 67
I tried:
Filter = r'(?P<S>.*)(?P<H>\s[0-9].*)'
df["Streetname"] = df["Streetname"].str.extract(Filter)
I lose a lot of data, and the result is written to only one column... I hope someone can help!
Answer 0 (score: 0)
Not 100% perfect (I doubt that is possible without a database or a machine learning algorithm), but a starting point:
^ # start of line/string
(?P<street>\w+?\D+) # one or more word characters (as few as possible), followed by non-digits
(?P<nr>\d*[-\d]*) # optional digits, optionally followed by - and further digits
(?P<adjective>[a-zA-Z]*) # letters directly after the number
.* # consume the rest of the string
Afterwards, strip any trailing #, ? or whitespace from street.
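The pattern and the clean-up step could be applied to the dataframe roughly as follows. This is a minimal sketch, assuming the column is called Streetname as in the question; the helper name split_streetname is made up for illustration. It passes the pattern to str.extract in verbose mode and joins all three named groups back onto the frame instead of assigning them to a single column.

```python
import re

import pandas as pd

# The pattern from the answer, kept as a verbose regex so the comments stay inline.
PATTERN = r"""
    ^                          # start of the string
    (?P<street>\w+?\D+)        # word characters (as few as possible), then non-digits
    (?P<nr>\d*[-\d]*)          # optional digits, optionally "-" and further digits
    (?P<adjective>[a-zA-Z]*)   # letters that directly follow the number
    .*                         # consume the rest of the string
"""


def split_streetname(names: pd.Series) -> pd.DataFrame:
    """Hypothetical helper: split raw strings into street, nr and adjective."""
    parts = names.str.extract(PATTERN, flags=re.VERBOSE)
    # Clean-up step: drop trailing "#", "?" and spaces from the street part.
    parts["street"] = parts["street"].str.rstrip(" #?")
    return parts


df = pd.DataFrame({"Streetname": [
    "St. edward's Lane 26", "Vineyardlane3a", "High Street 0-9",
    "ParkRoad near #33", "Queens Road ??", "s-Georgelane9abc",
    "Kings Road 9b", "1st Park Avenue 67 near cyclelane",
]})

# Keep the original column and add one new column per named group.
df = df.join(split_streetname(df["Streetname"]))
print(df)
```

As the answer says, this is only a starting point: "ParkRoad near #33" still ends up with the street "ParkRoad near", so words such as "near" would need a separate rule.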