Question

I know this question comes up all the time. But the potential solutions I found are in PHP or Java, which I don't know. I need this to work in Python.

I have streets in this format:

import pandas as pd

df = pd.DataFrame({'street': [
    'ABC Street',
    'ABC Street 1',
    'SDF Street 1a',
    'KSD Street 30 a',
    'URR-AC Place 1-5'
]})

And, surprise, I need to split them so that I end up with:

   street       number
0  ABC Street   NaN
1  ABC Street   1
2  SDF Street   1a 
3  KSD Street   30 a
4  URR-AC Place 1-5

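My idea is nothing new: start searching from the end of the string until the last number is found, and split the string there. With str.split I can split, but #4 doesn't work. I guess this is a regex problem, but I know nothing about regex.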

Answer 1

Ok, for my special case, I seem to have found an answer.

First, I am making sure there are no leading/trailing spaces:

df['street'] = df['street'].str.strip()

Then, I am extracting the street name. The regex matches the first run of one or more non-digit characters, so it stops as soon as it hits the first digit and gives me the name. That match still includes the space in front of the number, so I strip it again:

df['street_name'] = df['street'].str.extract(r'(\D+)', expand=False).str.strip()

To separate the number, I am using the same functionality. But here I am looking for the first digit to appear and everything that follows it.

df['number'] = df['street'].str.extract(r'(\d+.*)', expand=False)

This then results in the following dataframe:

  street             street_name    number
0 ABC Street         ABC Street     NaN 
1 ABC Street 1       ABC Street     1 
2 SDF Street 1a      SDF Street     1a 
3 KSD Street 30 a    KSD Street     30 a 
4 URR-AC Place 1-5   URR-AC Place   1-5
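
For reference, the two extraction steps above can also be folded into one str.extract call with named groups. This is only a sketch under the same assumption as the answer (the name contains no digits); the variable names pattern and parts are made up for the example. The lazy \D+? followed by \s* also drops the space between name and number, which (\D+) on its own would keep.

```python
import pandas as pd

df = pd.DataFrame({'street': [
    'ABC Street',
    'ABC Street 1',
    'SDF Street 1a',
    'KSD Street 30 a',
    'URR-AC Place 1-5',
]})

# street_name: shortest run of non-digits that lets the rest of the pattern match.
# \s*        : swallows the gap so the name carries no trailing space.
# number     : optional; starts at the first digit and runs to the end of the string.
pattern = r'^(?P<street_name>\D+?)\s*(?P<number>\d.*)?$'

# With named groups, str.extract returns one column per group.
parts = df['street'].str.strip().str.extract(pattern)
df = df.join(parts)
print(df)
```

Rows without a number get NaN in the number column, because the optional group does not take part in the match.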

Caution: This will fail when you have a street name like "Strasse-des-17. Juli, 5", where a number is part of the name, because both patterns split at the first digit in the string.
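
One possible way around that caveat is to anchor the number at the end of the string instead of at the first digit, which is closer to the idea in the question. The sketch below is an assumption on my part, not something tested against real address data: house_no is a hypothetical sub-pattern that only accepts shapes like "30", "1a", "30 a" or "1-5", and the number has to be preceded by whitespace or a comma.

```python
import pandas as pd

streets = pd.Series([
    'ABC Street',
    'KSD Street 30 a',
    'URR-AC Place 1-5',
    'Strasse-des-17. Juli, 5',
], name='street')

# Hypothetical shape of a house number: digits, an optional letter,
# and an optional "-digits letter" range part.
house_no = r'\d+\s?[a-zA-Z]?(?:-\d+\s?[a-zA-Z]?)?'

# The number group is optional and must sit at the very end of the string,
# separated from the name by whitespace and/or a comma.
pattern = rf'^(?P<street_name>.*?)(?:[\s,]+(?P<number>{house_no}))?$'

split = streets.str.strip().str.extract(pattern)
print(split)
```

This trades one failure mode for another: a street whose name ends in a number and has no house number (say, "Route 66") would now be split, so check it against your own data first.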
