How to split an address column into separate columns in pandas

Time: 2018-05-15 09:03:47

Tags: python regex pandas

I have the following data frame, which contains an address column:

import numpy as np
import pandas as pd
df = pd.DataFrame(index=np.arange(10))
df["address"] = "Iso Omena 8 a 2"

I need to split it into separate columns so that the resulting data frame looks like this:

address          street_name  building_number door_number_letter appartment_numner
Iso Omena 8 a 2  Iso Omena    8                  a                2

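What makes it tricky is:

1. Street names may or may not contain spaces, as in the example above.

2. door_number_letter is sometimes not a letter (for example, " Iso Omena 8 5 2").

The most complete form of an address is: [address, street_name, building_number, door_number_letter, appartment_numner]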

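4 answers:

Answer 0: (score: 2)

Assuming the street name consists only of letters and spaces, the remaining parts are separated by spaces, and the building number always starts with a digit, this can be done as follows: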

import re
s = ['Iso Omena 8 a 2', 'Xstreet 2']
for addr in s:
    street = re.findall('[a-zA-Z ]*', addr)[0].strip()
    rest = addr[len(street):].strip().split(' ')
    print(street, rest)

Iso Omena ['8', 'a', '2']
Xstreet ['2']

Or, if you want everything in a single data frame:

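import re
import pandas as pd

df = pd.DataFrame()

df['address'] = ['Iso Omena 8 a 2', 'Xstreet 2', 'Asdf 7 c']

# Start with empty columns; addresses with fewer parts keep None
df['street'] = None; df['building'] = None; df['door'] = None; df['appartment'] = None
for i, s in enumerate(df['address']):
    # Street name: the leading run of letters and spaces
    street = re.findall('[a-zA-Z ]*', s)[0].strip()
    df.loc[i,('street')] = street
    # The remaining space-separated parts fill building, door, appartment in order
    for col, val in zip(['building', 'door', 'appartment'], s[len(street):].strip().split(' ')):
        df.loc[i,(col)] = val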

In: df
Out:
           address     street building  door appartment
0  Iso Omena 8 a 2  Iso Omena        8     a          2
1        Xstreet 2    Xstreet        2  None       None
2         Asdf 7 c       Asdf        7     c       None

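Edit: to keep only the part of the building number to the left of a '-' sign, you can replace

df.loc[i,(col)] = val

with

df.loc[i,(col)] = re.findall('[^-]*', val)[0]

provided that this is also acceptable for the door and apartment. Otherwise, you need an if-test on col == 'building' so that this version is used only for that column.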
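The if-test mentioned above could look roughly like the following. This is a minimal sketch, not a definitive implementation: the sample addresses containing '-' are made up for illustration, and it assumes that only the building number should be cut at the first '-'.

```python
import re
import pandas as pd

# Hypothetical sample data; the '-' values are invented for illustration
df = pd.DataFrame({'address': ['Iso Omena 8-10 a 2', 'Xstreet 2', 'Asdf 7 c-d']})
for col in ['street', 'building', 'door', 'appartment']:
    df[col] = None

for i, s in enumerate(df['address']):
    # Street name: the leading run of letters and spaces
    street = re.findall('[a-zA-Z ]*', s)[0].strip()
    df.loc[i, 'street'] = street
    parts = s[len(street):].strip().split(' ')
    for col, val in zip(['building', 'door', 'appartment'], parts):
        if col == 'building':
            # Keep only what is left of the first '-' in the building number
            val = re.findall('[^-]*', val)[0]
        df.loc[i, col] = val
```

With this variant a door value such as 'c-d' is left untouched, while a building value such as '8-10' is reduced to '8'.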

Answer 1: (score: 2)

You can use:

In [116]: s1 = df.address.str.findall(r'([\w ]+?) +(\d+) +([\d\w]+) +(\d+)').map(lambda s: s[0])

In [117]: s1
Out[117]: 
0    (Iso Omena, 8, a, 2)
1    (Iso Omena, 8, a, 2)
2    (Iso Omena, 8, a, 2)
3    (Iso Omena, 8, a, 2)
4    (Iso Omena, 8, a, 2)
5    (Iso Omena, 8, a, 2)
6    (Iso Omena, 8, a, 2)
7    (Iso Omena, 8, a, 2)
8    (Iso Omena, 8, a, 2)
9    (Iso Omena, 8, a, 2)
Name: address, dtype: object

Then build a data frame from these tuples:

In [118]: pd.DataFrame(s1.values.tolist(), index=s1.index, columns=['street_name', 'building_number', 'door_number_letter', 'appartment_numner'])
Out[118]: 
  street_name building_number door_number_letter appartment_numner
0   Iso Omena               8                  a                 2
1   Iso Omena               8                  a                 2
2   Iso Omena               8                  a                 2
3   Iso Omena               8                  a                 2
4   Iso Omena               8                  a                 2
5   Iso Omena               8                  a                 2
6   Iso Omena               8                  a                 2
7   Iso Omena               8                  a                 2
8   Iso Omena               8                  a                 2
9   Iso Omena               8                  a                 2
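Note that findall returns an empty list for an address that lacks any of the four parts, so the lambda's s[0] would raise an IndexError on such a row. A possible variant, sketched under the assumption that the door and apartment parts may be missing, uses str.extract with optional groups; the sample addresses and the exact pattern are illustrative only:

```python
import pandas as pd

# Hypothetical sample data covering complete and shorter addresses
df = pd.DataFrame({'address': ['Iso Omena 8 a 2', 'Iso Omena 8 5 2', 'Xstreet 2', 'Asdf 7 c']})

pattern = (
    r'^(?P<street_name>\D+?)\s+'              # everything before the first number
    r'(?P<building_number>\d+)'               # building number (digits)
    r'(?:\s+(?P<door_number_letter>\w+))?'    # optional door letter or number
    r'(?:\s+(?P<appartment_numner>\d+))?\s*$' # optional apartment number
)

# str.extract returns one column per named group; unmatched groups become NaN
parts = df['address'].str.extract(pattern)
```

One consequence of this choice: in a three-part address such as 'Asdf 7 2', the last token lands in door_number_letter, because the pattern alone cannot tell a numeric door from an apartment number.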

Answer 2: (score: 2)

Taking some inspiration from this answer, I came up with this regex + extract solution:

In [77]: df.loc[df.index[1], 'address'] = 'Big Apple 19 21 7'

In [78]: df.address.str.extract('(?P<street>^[^0-9]*) (?P<building>.+?) (?P<door>.+?) (?P<apartment>.+?$)')

Out[78]: 
      street building door apartment
0  Iso Omena        8    a         2
1  Big Apple       19   21         7
2  Iso Omena        8    a         2
3  Iso Omena        8    a         2
4  Iso Omena        8    a         2
5  Iso Omena        8    a         2
6  Iso Omena        8    a         2
7  Iso Omena        8    a         2
8  Iso Omena        8    a         2
9  Iso Omena        8    a         2
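This pattern requires all four parts, so an address with fewer parts comes back as a row of NaN. A small sketch of how such rows could be flagged and the extracted columns attached to the original data; the sample Series and the name `incomplete` are assumptions for illustration:

```python
import pandas as pd

# Hypothetical sample data: two complete addresses and one short one
addresses = pd.Series(['Iso Omena 8 a 2', 'Big Apple 19 21 7', 'Xstreet 2'], name='address')

pattern = r'(?P<street>^[^0-9]*) (?P<building>.+?) (?P<door>.+?) (?P<apartment>.+?$)'
parts = addresses.str.extract(pattern)

# Rows where the pattern did not match have NaN in every extracted column
incomplete = parts['street'].isna()

# Attach the extracted columns to the original addresses
result = addresses.to_frame().join(parts)
```

Here `incomplete` is True only for 'Xstreet 2', which could then be handled separately.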

Answer 3: (score: 1)

Something like this?

import re

addr = "Iso Omena 8 a 2"

pattern = r'[a-öA-Ö]{3,100} *[a-öA-Ö]{3,100}'
street = re.findall(pattern, addr)[0]

bda = addr[len(street):].split()
print(street, bda,addr[len(street):])
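The pattern above needs at least six letters in the street name, so findall returns an empty list for a short name such as 'Asdf' and the [0] lookup fails. A possible variant is sketched below; the function name `split_address` and the pattern are hypothetical, and it assumes the street name is a run of letter-only words at the start of the address:

```python
import re

# One or more words made only of letters (including non-ASCII), separated by spaces
STREET = re.compile(r'[^\W\d_]+(?: +[^\W\d_]+)*')

def split_address(addr):
    """Return (street, remaining_parts); street is None if no leading name is found."""
    addr = addr.strip()
    m = STREET.match(addr)
    if m is None:
        return None, addr.split()
    street = m.group(0)
    return street, addr[len(street):].split()

print(split_address("Iso Omena 8 a 2"))
```

The character class [^\W\d_] matches any Unicode letter, which avoids relying on the a-ö and A-Ö ranges; those ranges also include punctuation such as '[' and '{'.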