I have the following DataFrame, which contains an address column:
import numpy as np
import pandas as pd
df = pd.DataFrame(index=np.arange(10))
df["address"] = "Iso Omena 8 a 2"
I need to split it into separate columns so that the resulting DataFrame looks like this:
address          street_name  building_number  door_number_letter  appartment_numner
Iso Omena 8 a 2  Iso Omena    8                a                   2
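What makes it tricky:
1. The street name may or may not contain spaces, as in the example above.
2. door_number_letter is sometimes not a letter (for example, "Iso Omena 8 5 2").
The most complete form of address is: [address, street_name, building_number, door_number_letter, appartment_numner]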
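Answer 0 (score: 2)
Assuming the street name consists only of letters and spaces, the remaining parts are space-separated, and the building number always starts with a digit, this can be done as follows: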
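import re
s = ['Iso Omena 8 a 2', 'Xstreet 2']
for addr in s:
    # Street name: the leading run of letters and spaces.
    street = re.findall('[a-zA-Z ]*', addr)[0].strip()
    # Everything after the street name, split on single spaces.
    rest = addr[len(street):].strip().split(' ')
    print(street, rest)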
Iso Omena ['8', 'a', '2']
Xstreet ['2']
Or, if you want everything in a single DataFrame:
import re
import pandas as pd
df = pd.DataFrame()
df['address'] = ['Iso Omena 8 a 2', 'Xstreet 2', 'Asdf 7 c']
df['street'] = None; df['building'] = None; df['door'] = None; df['appartment'] = None
for i, s in enumerate(df['address']):
    # Street name: the leading run of letters and spaces.
    street = re.findall('[a-zA-Z ]*', s)[0].strip()
    df.loc[i,('street')] = street
    # The remaining tokens fill building, door and appartment, in that order.
    for col, val in zip(['building', 'door', 'appartment'], s[len(street):].strip().split(' ')):
        df.loc[i,(col)] = val
In: df
Out:
           address     street building  door appartment
0  Iso Omena 8 a 2  Iso Omena        8     a          2
1        Xstreet 2    Xstreet        2  None       None
2         Asdf 7 c       Asdf        7     c       None
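Edit: To keep only the part of the building number to the left of a '-' sign, you can replace
df.loc[i,(col)] = val
with
df.loc[i,(col)] = re.findall('[^-]*', val)[0]
if this is also suitable for the door and the apartment. Otherwise, you will need an if-test on col == 'building' so that this version is used only for that column.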
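The if-test mentioned in the edit could look roughly like the sketch below. The helper name split_address and the sample addresses containing '-' are hypothetical and serve only as illustration. The parsing assumptions are the same as in the answer: the street name consists of letters and spaces, and the remaining parts are space-separated.

```python
import re
import pandas as pd

def split_address(addr):
    """Hypothetical helper: split one address; missing parts stay None."""
    # Street name: the leading run of letters and spaces.
    street = re.findall('[a-zA-Z ]*', addr)[0].strip()
    parts = addr[len(street):].strip().split()
    row = {'street': street, 'building': None, 'door': None, 'appartment': None}
    for col, val in zip(['building', 'door', 'appartment'], parts):
        if col == 'building':
            # Only for the building: keep the text left of the first '-'.
            val = re.findall('[^-]*', val)[0]
        row[col] = val
    return row

# Made-up sample data; the '-' cases are assumptions for the demo.
df = pd.DataFrame({'address': ['Iso Omena 8-10 a 2', 'Xstreet 2', 'Asdf 7 c-d']})
parsed = pd.DataFrame([split_address(a) for a in df['address']], index=df.index)
result = df.join(parsed)
print(result)
```

With this version, a door value such as 'c-d' is left untouched, while a building value such as '8-10' is reduced to '8'.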
Answer 1 (score: 2)
You can use:
In [116]: s1 = df.address.str.findall(r'([\w ]+?) +(\d+) +([\d\w]+) +(\d+)').map(lambda s: s[0])
In [117]: s1
Out[117]:
0 (Iso Omena, 8, a, 2)
1 (Iso Omena, 8, a, 2)
2 (Iso Omena, 8, a, 2)
3 (Iso Omena, 8, a, 2)
4 (Iso Omena, 8, a, 2)
5 (Iso Omena, 8, a, 2)
6 (Iso Omena, 8, a, 2)
7 (Iso Omena, 8, a, 2)
8 (Iso Omena, 8, a, 2)
9 (Iso Omena, 8, a, 2)
Name: address, dtype: object
Then build a DataFrame from these tuples:
In [118]: pd.DataFrame(s1.values.tolist(), index=s1.index, columns=['street_name', 'building_number', 'door_number_letter', 'appartment_numner'])
Out[118]:
  street_name building_number door_number_letter appartment_numner
0   Iso Omena               8                  a                 2
1   Iso Omena               8                  a                 2
2   Iso Omena               8                  a                 2
3   Iso Omena               8                  a                 2
4   Iso Omena               8                  a                 2
5   Iso Omena               8                  a                 2
6   Iso Omena               8                  a                 2
7   Iso Omena               8                  a                 2
8   Iso Omena               8                  a                 2
9   Iso Omena               8                  a                 2
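To get the layout asked for in the question (the original address column followed by the new columns), the extracted parts can be joined back onto df. The following is a minimal sketch using the same regular expression. The names cols, parts and result are chosen here for illustration, and the frame is shortened to three rows. Note that the lambda assumes every address matches the pattern; a row with no match would produce an empty list and raise an IndexError.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(index=np.arange(3))
df["address"] = "Iso Omena 8 a 2"

cols = ['street_name', 'building_number', 'door_number_letter', 'appartment_numner']

# One tuple of four captured groups per row (assumes every row matches).
s1 = df.address.str.findall(r'([\w ]+?) +(\d+) +([\d\w]+) +(\d+)').map(lambda s: s[0])

# Turn the tuples into columns and attach them to the original frame.
parts = pd.DataFrame(s1.values.tolist(), index=s1.index, columns=cols)
result = df.join(parts)
print(result)
```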
Answer 2 (score: 2)
Taking some inspiration from this answer, I came up with this regex + extract solution:
In [77]: df.loc[1, 'address'] = 'Big Apple 19 21 7'
In [78]: df.address.str.extract(r'(?P<street>^[^0-9]*) (?P<building>.+?) (?P<door>.+?) (?P<apartment>.+?$)')
Out[78]:
      street building door apartment
0  Iso Omena        8    a         2
1  Big Apple       19   21         7
2  Iso Omena        8    a         2
3  Iso Omena        8    a         2
4  Iso Omena        8    a         2
5  Iso Omena        8    a         2
6  Iso Omena        8    a         2
7  Iso Omena        8    a         2
8  Iso Omena        8    a         2
9  Iso Omena        8    a         2
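The pattern above requires all four parts, so a shorter address such as 'Xstreet 2' would come back as a row of NaN. One possible variant, sketched below, makes the door and apartment groups optional. It assumes that the street name contains no digits and that the building number starts with a digit; the sample addresses and the variable names are illustrative only.

```python
import pandas as pd

addresses = pd.Series(
    ['Iso Omena 8 a 2', 'Big Apple 19 21 7', 'Xstreet 2', 'Asdf 7 c'],
    name='address',
)

pattern = (
    r'^(?P<street>[^0-9]*?)\s+'      # street: shortest digit-free prefix
    r'(?P<building>\d\S*)'           # building: token starting with a digit
    r'(?:\s+(?P<door>\S+))?'         # door: optional next token
    r'(?:\s+(?P<apartment>\S+))?$'   # apartment: optional last token
)

# Strip surrounding whitespace first, then extract the named groups.
# Groups that do not take part in the match come back as NaN.
parts = addresses.str.strip().str.extract(pattern)
print(parts)
```

Addresses with more than three tokens after the street name do not match this pattern and would still yield NaN in every column.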
Answer 3 (score: 1)
Something like this?
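import re
addr = "Iso Omena 8 a 2"
# Street name: two runs of 3 to 100 characters from the ranges a-ö and A-Ö, with optional spaces between them.
pattern = r'[a-öA-Ö]{3,100} *[a-öA-Ö]{3,100}'
street = re.findall(pattern, addr)[0]
# Everything after the street name, split on whitespace.
bda = addr[len(street):].split()
print(street, bda, addr[len(street):])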