Question

I need some help splitting a field with a regular expression in pandas and building a DataFrame from the result.


A	B	C
1129	19-APR-2021	Zip Code Details: City: Huntsville_Alabama , Zip: 35808 , 801thru816 City: Anchorage_Alaska , Zip: 99506 , 501thru524
1139	20-APR-2021	Zip Code Details: City: Miami_Florida , Zip: 33128 , 124thru190 City: Atlanta_Georgia , Zip: 30301 , 301thru381

In column C, I need to extract the multiple City and Zip Code details and lay them out in the following format:


No	Date	City	Zip
1129	19-APR-2021	Huntsville_Alabama	35808
1129	19-APR-2021	Anchorage_Alaska	99506
1139	20-APR-2021	Miami_Florida	33128
1139	20-APR-2021	Atlanta_Georgia	30301

My re.findall expressions are below and work fine:

city_regex_extract = r" [a-z|A-Z|0-9|_]*\_[a-z|A-Z|0-9|_]*"    (https://regex101.com/r/VM8oFF/1)
zip_regex_extract = r"[0-9]{5}"                            (https://regex101.com/r/oBYJZX/1)

Below is the code so far, but I can't work out how to add the Zip field to the same code.

import pandas as pd
import json, re, sys, time


df = pd.DataFrame({
   'No': ['1129', '1139'],
   'Date': ['19-APR-2021','20-APR-2021'],
   'C': ['Zip Code Details: City: Huntsville_Alabama , Zip: 35808 , 801thru816  City: Anchorage_Alaska , Zip: 99506 , 501thru524','Zip Code Details: City: Miami_Florida , Zip: 33128 , 124thru190  City: Atlanta_Georgia , Zip: 30301 , 301thru381'] 
})


city_regex_extract = r" [a-z|A-Z|0-9|_]*\_[a-z|A-Z|0-9|_]*"
zip_regex_extract = r"[0-9]{5}"


df['City'] =  [re.findall(city_regex_extract, str(x)) for x in df['C']]
df['Zip'] =  [re.findall(zip_regex_extract, str(x)) for x in df['C']]

df = (df
.set_index(['No','Date'])['City']
.apply(pd.Series)
.stack()
.reset_index()
.drop('level_2', axis=1)
.rename(columns={0:'City'}))

print(df)

Any help is appreciated.

Answer 1

`Series.str.extractall`

s = df['C'].str.extractall(r'City:\s*(?P<City>[^,]+?)\s*,\s*Zip:\s*(?P<Zip>\d+)')
df[['No', 'Date']].join(s.droplevel(1))

     No         Date                City    Zip
0  1129  19-APR-2021  Huntsville_Alabama  35808
0  1129  19-APR-2021    Anchorage_Alaska  99506
1  1139  20-APR-2021       Miami_Florida  33128
1  1139  20-APR-2021     Atlanta_Georgia  30301

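Regex details:

City: : matches the characters City: literally
\s* : matches zero or more whitespace characters
(?P<City>[^,]+?) : first named capture group
- [^,]+? : matches any character except , one or more times, but as few times as possible
\s*,\s* : matches zero or more whitespace characters, then a comma, then zero or more whitespace characters
Zip: : matches the characters Zip: literally
\s* : matches zero or more whitespace characters
(?P<Zip>\d+) : second named capture group
- \d+ : matches a digit one or more times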
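
To see the two named groups at work outside pandas, the pattern above can be run through Python's built-in re module. This is only a minimal sketch using the first sample row from the question; the names pattern, sample and pairs are illustrative and not part of the answer.

```python
import re

# Same pattern as in the answer; each named group becomes a dict key.
pattern = re.compile(r'City:\s*(?P<City>[^,]+?)\s*,\s*Zip:\s*(?P<Zip>\d+)')

# Value of column C for the first row of the question's DataFrame.
sample = ('Zip Code Details: City: Huntsville_Alabama , Zip: 35808 , 801thru816  '
          'City: Anchorage_Alaska , Zip: 99506 , 501thru524')

# finditer yields one match object per "City: ... , Zip: ..." pair.
pairs = [m.groupdict() for m in pattern.finditer(sample)]
print(pairs)
```

Series.str.extractall does the same thing for every row and returns one output row per match, which is why a single call is enough to fill both the City and Zip columns.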

See the online regex demo.

Answer 2

As I see it, you don't even need the regex library here. pandas string methods already support regular expressions, so you can simply split:

df['C'] = df['C'].str.split(' City: ').str[1:]
df = df.explode('C')
df[['City','Zip']] = df['C'].str.split(' , Zip: | , ', expand=True).iloc[:,:2]

print(df)

     No         Date                City    Zip
0  1129  19-APR-2021  Huntsville_Alabama  35808
0  1129  19-APR-2021    Anchorage_Alaska  99506
1  1139  20-APR-2021       Miami_Florida  33128
1  1139  20-APR-2021     Atlanta_Georgia  30301

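The expand=True argument returns the split pieces as separate columns in one go. .iloc[] then selects which of those columns to keep after the split.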
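
The split step can be looked at in isolation. The sketch below assumes a small Series s holding two fragments shaped like the values left in column C after explode; the names s, parts and result are made up for illustration.

```python
import pandas as pd

# Two fragments shaped like the values in column C after df.explode('C').
s = pd.Series(['Huntsville_Alabama , Zip: 35808 , 801thru816 ',
               'Anchorage_Alaska , Zip: 99506 , 501thru524'])

# A multi-character pattern is treated as a regular expression by default,
# so the string is cut at either ' , Zip: ' or ' , '.
parts = s.str.split(' , Zip: | , ', expand=True)

# Three columns come back (city, zip, range); keep the first two and name them.
result = parts.iloc[:, :2].rename(columns={0: 'City', 1: 'Zip'})
print(result)
```

The third column holds the trailing range such as 501thru524, which is why the answer slices with .iloc[:, :2] before assigning.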

Answer 3

Try .explode() on City and on Zip, each followed by reset_index(), and finally join the two exploded results on the index:

df.explode('City').reset_index()[['No', 'Date', 'City']]\
    .join(df.explode('Zip').reset_index()[['Zip']])
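
For completeness, here is a self-contained sketch of this approach starting from the question's DataFrame. It assumes City and Zip are list columns built with findall; the two patterns are tidied-up variants of the asker's expressions (an assumption, not the original regexes), and cities, zips and result are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({
    'No': ['1129', '1139'],
    'Date': ['19-APR-2021', '20-APR-2021'],
    'C': ['Zip Code Details: City: Huntsville_Alabama , Zip: 35808 , 801thru816  '
          'City: Anchorage_Alaska , Zip: 99506 , 501thru524',
          'Zip Code Details: City: Miami_Florida , Zip: 33128 , 124thru190  '
          'City: Atlanta_Georgia , Zip: 30301 , 301thru381'],
})

# Assumed patterns: a name containing an underscore, and a standalone 5-digit run.
city_regex_extract = r"[A-Za-z0-9]+_[A-Za-z0-9_]+"
zip_regex_extract = r"\b[0-9]{5}\b"

# Each cell becomes a list of matches, in the order they appear in the text.
df['City'] = df['C'].str.findall(city_regex_extract)
df['Zip'] = df['C'].str.findall(zip_regex_extract)

# Explode each list column separately, renumber the rows, then join by position.
cities = df.explode('City').reset_index(drop=True)[['No', 'Date', 'City']]
zips = df.explode('Zip').reset_index(drop=True)[['Zip']]
result = cities.join(zips)
print(result)
```

This only lines up correctly when every row has the same number of cities and zip codes, in the same order. On pandas 1.3 or later, df.explode(['City', 'Zip']) explodes both columns in one call.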
