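I know this question comes up all the time, but the potential solutions I found were in PHP or Java, which I don't know. I need this to work in Python.
I have streets in this format: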
import pandas as pd

df = pd.DataFrame({'street':[
'ABC Street',
'ABC Street 1',
'SDF Street 1a',
'KSD Street 30 a',
'URR-AC Place 1-5'
]})
And, unsurprisingly, they need to be split apart, so that I end up with:
         street number
0    ABC Street    NaN
1    ABC Street      1
2    SDF Street     1a
3    KSD Street   30 a
4  URR-AC Place    1-5
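My idea is nothing new: search from the end of the string until the last number is found, and split the string there. I can split with str.split, but it does not work for #4. I suppose this is a regex problem, but I know nothing about regex.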
Answer 0 (score: 0)
Ok, for my special case, I seem to have found an answer.
First, I am making sure there are no leading/trailing spaces:
df.street = df.street.str.strip()
Then, I am extracting the street name. The regex matches one or more non-digit characters, so it stops as soon as it hits the first digit and thus gives me the name:
df['street_name'] = df.street.str.extract(r'(\D+)', expand=False)
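To see what this pattern actually captures, here is a small sketch that uses the standard `re` module instead of pandas (`re.search` finds the first match, which is what `str.extract` does for each row). The variable names and the '12 Main Road' example are my own assumptions, not part of the data above.

```python
import re

samples = ['ABC Street', 'ABC Street 1', 'SDF Street 1a',
           'KSD Street 30 a', 'URR-AC Place 1-5']

# First run of non-digit characters in each string.
captured = [re.search(r'(\D+)', s).group(1) for s in samples]

# The space in front of the number belongs to the match, so strip it afterwards.
cleaned = [c.strip() for c in captured]

# Assumed extra case: the pattern is not anchored to the start of the string,
# so a leading number is simply skipped.
leading_number = re.search(r'(\D+)', '12 Main Road').group(1)

print(captured)
print(cleaned)
print(repr(leading_number))
```

So the extracted names probably carry a trailing space wherever a number follows; chaining `.str.strip()` after the `str.extract` call should remove it.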
To separate the number, I am using the same functionality, but here I am looking for the first digit and every character that follows it:
df['number'] = df.street.str.extract(r'(\d+.*)', expand=False)
This then results in the following dataframe:
             street   street_name number
0        ABC Street    ABC Street    NaN
1      ABC Street 1    ABC Street      1
2     SDF Street 1a    SDF Street     1a
3   KSD Street 30 a    KSD Street   30 a
4  URR-AC Place 1-5  URR-AC Place    1-5
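Both columns could probably also be produced in one pass with a single pattern that has two named groups. The following is only a sketch: the pattern `STREET_PATTERN` and the helper `split_at_first_digit` are hypothetical names of mine, and it is written against the plain `re` module so it runs without pandas.

```python
import re

# Hypothetical single pattern: a lazy run of non-digits for the name,
# optional whitespace, then everything from the first digit to the end.
STREET_PATTERN = re.compile(r'^(?P<street_name>\D+?)\s*(?P<number>\d.*)?$')

def split_at_first_digit(street):
    """Return (street_name, number); number is None if there is no digit."""
    match = STREET_PATTERN.match(street.strip())
    if match is None:  # e.g. the string starts with a digit
        return None, None
    return match.group('street_name'), match.group('number')

samples = ['ABC Street', 'ABC Street 1', 'SDF Street 1a',
           'KSD Street 30 a', 'URR-AC Place 1-5']
rows = [split_at_first_digit(s) for s in samples]
for row in rows:
    print(row)
```

When a pattern with named groups is passed to `df.street.str.extract(...)`, pandas uses the group names as column names, so the same pattern should give `street_name` and `number` columns directly. Unlike the two-step version, this one yields no match at all for a string that starts with a digit.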
Caution: This will fail when you have a street name like "Strasse-des-17. Juli, 5", where a number is part of the name.
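One possible way around this is the idea from the question: anchor the number at the end of the string instead of splitting at the first digit. The sketch below is not tested beyond the examples in this thread. It assumes a house number looks like digits, an optional single letter, and an optional "-digits letter" range; that grammar, the comma handling, and the names `HOUSE_NUMBER`, `END_ANCHORED` and `split_from_end` are all my own assumptions.

```python
import re

# Assumed house-number grammar: digits, optional letter, optional range part.
HOUSE_NUMBER = r'\d+\s*[a-zA-Z]?(?:\s*-\s*\d+\s*[a-zA-Z]?)?'

# The name is matched lazily; the number, if present, must end the string.
# Spaces and commas between name and number are dropped.
END_ANCHORED = re.compile(r'^(.*?)[\s,]*(' + HOUSE_NUMBER + r')?$')

def split_from_end(street):
    """Return (name, number); number is None without a trailing house number."""
    street = street.strip()
    match = END_ANCHORED.match(street)
    if match is None:  # e.g. the string contains a line break
        return street, None
    return match.group(1), match.group(2)

for s in ['ABC Street', 'KSD Street 30 a', 'URR-AC Place 1-5',
          'Strasse-des-17. Juli, 5']:
    print(split_from_end(s))
```

A street whose name itself ends in a number and has no house number after it stays ambiguous with this approach as well.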