Regex for a block of text that starts with an all-caps word?

Asked: 2014-04-30 16:20:18

Tags: python regex text text-files

I have a file with the following structure:

.
.
LAST_NAME_IN_CAPS_1, First_1    Cell Phone: 999-999-999
Address
Needham MA 02135
Status: Attender    Marital:    Married Adult:  Y   M/F:    M   Env.No.:

Salutation:


LAST_NAME_IN_CAPS_2, First_2    Cell Phone: 999-999-999
Address 
Needham MA 02135
E-mail :    email@gmail.com
Status: Attender    Marital:    Married Adult:  Y   M/F:    F   Env.No.:

Salutation:
.
.

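The information between the lines varies a lot. What I want is a regex that captures the block of text between two all-caps words of more than 3 letters, including the first all-caps word. In this case, I want to capture LAST_NAME_IN_CAPS_1 and everything up to (but not including) LAST_NAME_IN_CAPS_2. What regex could handle this?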

2 Answers:

Answer 0: (score: 1)

Assuming ASCII letters only, and that your data is in a variable called text, you should be able to use the following:

import re
matches = re.findall(r'^[A-Z]{3}.*?(?=^[A-Z]{3}|\Z)', text, re.S | re.M)

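re.S (or re.DOTALL) makes . match newlines, and re.M (or re.MULTILINE) makes ^ and $ match at the start and end of each line respectively, rather than only at the start and end of the string. \Z is an anchor for the end of the string, and (?=...) is a positive lookahead.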
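As a quick illustration of those flags, here is a minimal sketch on a made-up three-line string (the string and variable names are assumptions for demonstration only, not part of the original data):

```python
import re

s = "ab\nCD\nef"  # hypothetical three-line string

# Without re.M, ^ anchors only at the very start of the string.
start_only = re.findall(r'^\w+', s)
# With re.M, ^ anchors at the start of every line.
every_line = re.findall(r'^\w+', s, re.M)

# Without re.S, . refuses to cross a newline, so nothing matches.
no_dotall = re.search(r'b.C', s)
# With re.S, . also matches the newline between "b" and "C".
dotall = re.search(r'b.C', s, re.S).group()

# \Z matches only at the end of the whole string, even with re.M.
end_only = re.findall(r'\w+\Z', s, re.M)

print(start_only)    # ['ab']
print(every_line)    # ['ab', 'CD', 'ef']
print(no_dotall)     # None
print(repr(dotall))  # 'b\nC'
print(end_only)      # ['ef']
```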

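So here is a description of this regex:
Start matching at a line that begins with three uppercase characters (^[A-Z]{3}), then match as few characters as possible, including newlines (.*?), until you reach either the start of a line that begins with three uppercase characters or the end of the string ((?=^[A-Z]{3}|\Z)).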
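To see the pattern in action, here is a small sketch that runs it on hypothetical data shaped like the file in the question (the names SMITH and JONES are invented for illustration):

```python
import re

# Hypothetical sample data, shaped like the file in the question.
text = (
    ".\n"
    "SMITH, John    Cell Phone: 999-999-999\n"
    "Address\n"
    "Needham MA 02135\n"
    "\n"
    "Salutation:\n"
    "\n"
    "JONES, Mary    Cell Phone: 999-999-999\n"
    "Address\n"
    "E-mail :    email@gmail.com\n"
)

# Same pattern and flags as in the answer above.
pattern = re.compile(r'^[A-Z]{3}.*?(?=^[A-Z]{3}|\Z)', re.S | re.M)
blocks = pattern.findall(text)

# Each block starts at an all-caps line and stops just before the next one.
for block in blocks:
    print(repr(block))
```

Note that any line starting with three capital letters opens a new block, so a line such as "USA" inside a record would split it. If that can happen in your data, a stricter header pattern is probably needed.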

Answer 1: (score: 1)

It may be easier to use a simple regex to detect the lines of interest and then split the text into blocks manually.

Given this test string:

txt='''\
.
.
LAST_NAME_IN_CAPS_1, First_1    Cell Phone: 999-999-999
Address
Needham MA 02135
Status: Attender    Marital:    Married Adult:  Y   M/F:    M   Env.No.:

Salutation 1:

LAST_NAME_IN_CAPS_2, First_2    Cell Phone: 999-999-999
Address
Needham MA 02135
Status: Attender    Marital:    Married Adult:  Y   M/F:    M   Env.No.:

Salutation 2:


LAST_NAME_IN_CAPS_3, First_3    Cell Phone: 999-999-999
Address 
Needham MA 02135
E-mail :    email@gmail.com
Status: Attender    Marital:    Married Adult:  Y   M/F:    F   Env.No.:

Salutation 3:
.
.'''

Try:

import re
# Offsets where a line starts with an all-caps name followed by a comma
idx = [m.start(1) for m in re.finditer(r'^([A-Z_0-9]+,\s+)', txt, re.S | re.M)]
print([txt[i:j] for i, j in zip(idx, idx[1:] + [None])])

Prints:

['LAST_NAME_IN_CAPS_1, First_1    Cell Phone: 999-999-999\nAddress\nNeedham MA 02135\nStatus: Attender    Marital:    Married Adult:  Y   M/F:    M   Env.No.:\n\nSalutation 1:\n\n', 
 'LAST_NAME_IN_CAPS_2, First_2    Cell Phone: 999-999-999\nAddress\nNeedham MA 02135\nStatus: Attender    Marital:    Married Adult:  Y   M/F:    M   Env.No.:\n\nSalutation 2:\n\n\n', 
 'LAST_NAME_IN_CAPS_3, First_3    Cell Phone: 999-999-999\nAddress \nNeedham MA 02135\nE-mail :    email@gmail.com\nStatus: Attender    Marital:    Married Adult:  Y   M/F:    F   Env.No.:\n\nSalutation 3:\n.\n.']

Note: I only used r'^([A-Z_0-9]+,\s+)' to match the pattern in the example; if your all-caps names follow a different pattern, obviously use that instead.
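If the header pattern is likely to change, one option is to wrap the slicing step in a small helper that takes the pattern as an argument. The sketch below is only an illustration: split_blocks is a hypothetical name, and the pattern ^[A-Z]{4,} (four or more capital letters at the start of a line) is an assumed reading of "more than 3 letters" from the question.

```python
import re

def split_blocks(text, header_pattern):
    """Split text into blocks, each starting at a line matching header_pattern."""
    # Offsets of every header match; re.M lets ^ match at each line start.
    starts = [m.start() for m in re.finditer(header_pattern, text, re.M)]
    # Slice from each header to the next one (None means end of string).
    return [text[i:j] for i, j in zip(starts, starts[1:] + [None])]

# Hypothetical sample; anything before the first header is dropped.
sample = "intro\nSMITH, John\nAddress\n\nJONES, Mary\nAddress\n"
blocks = split_blocks(sample, r'^[A-Z]{4,}')
print(blocks)
# ['SMITH, John\nAddress\n\n', 'JONES, Mary\nAddress\n']
```

If no line matches the header pattern, the helper returns an empty list.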


For a pure-Python solution (no regex), you can do this:

lines = txt.splitlines()

# Indices of lines whose text before the first comma is all uppercase
line_idx = [i for i, line in enumerate(lines)
            if line.partition(',')[0].isupper()]

print([lines[i:j] for i, j in zip(line_idx, line_idx[1:] + [None])])

Prints:

[['LAST_NAME_IN_CAPS_1, First_1    Cell Phone: 999-999-999', 'Address', 'Needham MA 02135', 'Status: Attender    Marital:    Married Adult:  Y   M/F:    M   Env.No.:', '', 'Salutation 1:', ''], 
 ['LAST_NAME_IN_CAPS_2, First_2    Cell Phone: 999-999-999', 'Address', 'Needham MA 02135', 'Status: Attender    Marital:    Married Adult:  Y   M/F:    M   Env.No.:', '', 'Salutation 2:', '', ''], 
 ['LAST_NAME_IN_CAPS_3, First_3    Cell Phone: 999-999-999', 'Address ', 'Needham MA 02135', 'E-mail :    email@gmail.com', 'Status: Attender    Marital:    Married Adult:  Y   M/F:    F   Env.No.:', '', 'Salutation 3:', '.', '.']]

The advantage of the non-regex version is that it more easily supports international characters:

# -*- coding: utf-8 -*-

txt='''\
ABÇ, 
ABC, 
abc, 
ĖFG, '''

print([i for i, line in enumerate(txt.splitlines())
       if line.partition(',')[0].isupper()])
# [0, 1, 3]
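Building on the same isupper() test, the following sketch groups lines into records in a single pass, which may be convenient when reading a file line by line. It assumes Python 3, where str.isupper() is Unicode-aware; group_records and the sample data are hypothetical names introduced here for illustration.

```python
def group_records(lines):
    """Group lines into records; each all-caps header line opens a new record."""
    records = []
    for line in lines:
        if line.partition(',')[0].isupper():
            records.append([line])       # header line starts a new record
        elif records:
            records[-1].append(line)     # continuation of the current record
        # lines before the first header are skipped
    return records

# Hypothetical sample with non-ASCII capital letters in the headers.
sample = "preamble\nABÇ, First\nAddress\nĖFG, Second\nStatus: x\n"
records = group_records(sample.splitlines())
print(records)
# [['ABÇ, First', 'Address'], ['ĖFG, Second', 'Status: x']]
```

One caveat: isupper() returns True for any line whose cased characters are all uppercase, so a comma-free line such as "MA 02135" would also be treated as a header. A stricter check may be needed for real data.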