Question

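I am trying to extract several strings from several sets of parentheses in a pandas DataFrame and put them into new columns.

The following string is in one column of the df: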

Unfurnished 1 Bdrm 1st flr Flat. Hall. Lounge. Kitch. Bdrm. Shower rm (CT band - A). Deposit & references required. No pets. No smokers. Rent £500 p.m Entry by arr. Viewing Owner 07425 163047 or contact solicitors. Landlord reg: 305350/110/22531. (EPC band - C).

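I have been trying to extract the CT band and EPC band values into 2 new columns (one for each piece of information). I have tried several versions of this code, and also tried using the information at https://regex101.com/r/5XjNqh/1

For example, the code below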

properties['Council_tax']=properties.Description.str.extract('(\(CT[^()*&?%])',expand=False)

returns

(CT

Expected output:

| Description        | Council_tax_band | EPC_band |
|--------------------|------------------|----------|
| Above string       |        A         |     C    | 
| Example string 2   |        B         |     F    |
| Example string 3   |        C         |     D    |

Also, the word 'band' sometimes appears capitalised as 'Band'.

I don't think I have a good understanding of how to use regex correctly here. Any ideas?

Answer 1

You can use

df['Council_tax_band'] = df['Description'].str.extract(r'(?i)\(CT\s+band\s*-\s*([^()]+)\)', expand=False)
df['EPC_band'] = df['Description'].str.extract(r'(?i)\(EPC\s+band\s*-\s*([^()]+)\)', expand=False)

See regex demo #1 and regex demo #2.
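To try the two extractions end to end, here is a minimal sketch. It assumes the column is named Description, as in the question. The first row is the string from the question; the second row is made up purely to exercise the capitalised 'Band' spelling.

```python
import pandas as pd

# Sample data: row 0 is the description from the question,
# row 1 is a hypothetical string using the capitalised "Band".
df = pd.DataFrame({
    "Description": [
        "Unfurnished 1 Bdrm 1st flr Flat. Hall. Lounge. Kitch. Bdrm. "
        "Shower rm (CT band - A). Deposit & references required. No pets. "
        "No smokers. Rent £500 p.m Entry by arr. Viewing Owner 07425 163047 "
        "or contact solicitors. Landlord reg: 305350/110/22531. (EPC band - C).",
        "Furnished 2 Bdrm Flat (CT Band - B). No pets. (EPC Band - F).",
    ]
})

# With one capture group and expand=False, str.extract returns a Series,
# which can be assigned directly to a new column.
df["Council_tax_band"] = df["Description"].str.extract(
    r"(?i)\(CT\s+band\s*-\s*([^()]+)\)", expand=False
)
df["EPC_band"] = df["Description"].str.extract(
    r"(?i)\(EPC\s+band\s*-\s*([^()]+)\)", expand=False
)

print(df[["Council_tax_band", "EPC_band"]])
```

Rows where a pattern does not match get NaN in the corresponding column.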

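Regex details

(?i) - case-insensitive modifier
\( - a literal ( character
EPC - the literal string EPC (CT in the first pattern)
\s+ - one or more whitespace characters
band - the word band
\s*-\s* - a hyphen with optional whitespace on either side
([^()]+) - Group 1: one or more characters other than ( and )
\) - a literal ) character.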
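The breakdown above can also be checked outside pandas with Python's re module. The sketch below writes the EPC pattern in re.VERBOSE form so each piece carries its own comment, and then runs the pattern from the question on the same text. The variable names and the shortened text are only for illustration. In the question's pattern the character class [^()*&?%] has no quantifier, so it matches exactly one character and the capture is "(CT" plus the single character that follows it (a space here).

```python
import re

# Shortened version of the description from the question.
text = "Shower rm (CT band - A). Landlord reg: 305350/110/22531. (EPC band - C)."

# Same pattern as in the answer, spread out with re.VERBOSE.
band_pattern = re.compile(
    r"""
    \(          # a literal opening parenthesis
    EPC         # the literal text EPC
    \s+         # one or more whitespace characters
    band        # the word band (any case, because of re.IGNORECASE)
    \s*-\s*     # a hyphen with optional whitespace on either side
    ([^()]+)    # group 1: one or more characters other than parentheses
    \)          # a literal closing parenthesis
    """,
    re.VERBOSE | re.IGNORECASE,
)
epc_band = band_pattern.search(text).group(1)

# The pattern from the question: the negated class consumes one character
# only, so the capture ends immediately after "(CT" and that one character.
attempt = re.search(r"(\(CT[^()*&?%])", text).group(1)

print(repr(epc_band))
print(repr(attempt))
```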
