I have a file with the following structure:
.
.
LAST_NAME_IN_CAPS_1, First_1 Cell Phone: 999-999-999
Address
Needham MA 02135
Status: Attender Marital: Married Adult: Y M/F: M Env.No.:
Salutation:
LAST_NAME_IN_CAPS_2, First_2 Cell Phone: 999-999-999
Address
Needham MA 02135
E-mail : email@gmail.com
Status: Attender Marital: Married Adult: Y M/F: F Env.No.:
Salutation:
.
.
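The information between the lines varies a lot. What I want is a regular expression that captures the block of text between two all-caps words of more than 3 letters, including the first all-caps word. In this case I want to include LAST_NAME_IN_CAPS_1 and everything that comes before LAST_NAME_IN_CAPS_2. What regular expression can handle this?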
Answer 0: (score: 1)
Assuming ASCII letters only, and that your data is in a variable named text, you should be able to use the following:
import re
matches = re.findall(r'^[A-Z]{3}.*?(?=^[A-Z]{3}|\Z)', text, re.S | re.M)
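re.S (or re.DOTALL) makes . match newlines as well, and the re.M (or re.MULTILINE) flag makes ^ and $ match at the start and end of each line, rather than only at the start and end of the string. \Z is an anchor for the end of the string, and (?=...) is a positive lookahead.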
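The effect of the two flags can be checked on a tiny string. This is only an illustrative sketch; the string s and the variable names are made up for the demonstration and are not part of the original answer.

```python
import re

s = "ab\ncd"  # hypothetical two-line input

# Without re.M, ^ matches only at the very start of the string.
first_only = re.findall(r'^\w+', s)        # ['ab']
# With re.M, ^ also matches right after each newline.
every_line = re.findall(r'^\w+', s, re.M)  # ['ab', 'cd']

# Without re.S, . does not cross the newline, so a.*d cannot match.
no_dotall = re.search(r'a.*d', s)          # None
# With re.S, . matches the newline too.
dotall = re.search(r'a.*d', s, re.S)       # matches the whole string

print(first_only, every_line, no_dotall, repr(dotall.group()))
```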
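So here is a description of this regex:
Start matching at a line that begins with three uppercase characters (^[A-Z]{3}), then match any number of characters (as few as possible), including newlines (.*?), until you can match the start of another line that begins with three uppercase characters or you have reached the end of the string ((?=^[A-Z]{3}|\Z)).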
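As a rough check of the description above, the same pattern can be run on a small made-up sample. The names SMITH and JONES and the shortened records are assumptions for illustration only. Note that any line starting with three capitals opens a new block, so a data line such as "USA ..." would also be treated as a block start.

```python
import re

# Hypothetical sample data, shaped like the records in the question.
text = """.
.
SMITH, John Cell Phone: 999-999-999
Address
Needham MA 02135
Status: Attender
JONES, Mary Cell Phone: 999-999-999
Address
E-mail : email@gmail.com
Status: Attender
"""

# Each match starts at a line beginning with three capitals and stops
# just before the next such line (or at the end of the string).
pattern = re.compile(r'^[A-Z]{3}.*?(?=^[A-Z]{3}|\Z)', re.S | re.M)
blocks = pattern.findall(text)

for block in blocks:
    print(repr(block))
```

The leading "." lines are skipped because no match can start there, and the last block runs to the end of the string, trailing newline included.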
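Answer 1: (score: 1)
It may be easier to use a simple regex to detect the lines of interest and then split the text at those lines manually.
Given this test string: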
txt='''\
.
.
LAST_NAME_IN_CAPS_1, First_1 Cell Phone: 999-999-999
Address
Needham MA 02135
Status: Attender Marital: Married Adult: Y M/F: M Env.No.:

Salutation 1:

LAST_NAME_IN_CAPS_2, First_2 Cell Phone: 999-999-999
Address
Needham MA 02135
Status: Attender Marital: Married Adult: Y M/F: M Env.No.:

Salutation 2:


LAST_NAME_IN_CAPS_3, First_3 Cell Phone: 999-999-999
Address 
Needham MA 02135
E-mail : email@gmail.com
Status: Attender Marital: Married Adult: Y M/F: F Env.No.:

Salutation 3:
.
.'''
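Try:
import re
# Start offset of every line that begins with an all-caps name and a comma
idx = [m.start(1) for m in re.finditer(r'^([A-Z_0-9]+,\s+)', txt, re.S | re.M)]
print([txt[i:j] for i, j in list(zip([0]+idx, idx+[None]))[1:]])
Prints: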
['LAST_NAME_IN_CAPS_1, First_1 Cell Phone: 999-999-999\nAddress\nNeedham MA 02135\nStatus: Attender Marital: Married Adult: Y M/F: M Env.No.:\n\nSalutation 1:\n\n',
'LAST_NAME_IN_CAPS_2, First_2 Cell Phone: 999-999-999\nAddress\nNeedham MA 02135\nStatus: Attender Marital: Married Adult: Y M/F: M Env.No.:\n\nSalutation 2:\n\n\n',
'LAST_NAME_IN_CAPS_3, First_3 Cell Phone: 999-999-999\nAddress \nNeedham MA 02135\nE-mail : email@gmail.com\nStatus: Attender Marital: Married Adult: Y M/F: F Env.No.:\n\nSalutation 3:\n.\n.']
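Note: I used r'^([A-Z_0-9]+,\s+)' only because it matches the pattern in the example; if your all-caps names follow a different pattern, obviously use that instead.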
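A variation on the same idea, sketched here as an alternative rather than as part of the original answer, is to let re.split cut the text at the start of each header line. This assumes Python 3.7 or later, where re.split can split on zero-width matches; the shortened txt and the names header, parts, and records are made up for the example.

```python
import re

# Hypothetical shortened sample in the same shape as the test string.
txt = """.
.
LAST_NAME_IN_CAPS_1, First_1 Cell Phone: 999-999-999
Address
Salutation 1:
LAST_NAME_IN_CAPS_2, First_2 Cell Phone: 999-999-999
Address
Salutation 2:
"""

# Zero-width split point: the start of any line that begins with an
# all-caps name followed by a comma and whitespace.
header = re.compile(r'^(?=[A-Z_0-9]+,\s)', re.M)

parts = header.split(txt)
# parts[0] is whatever precedes the first record (here the two "." lines).
records = parts[1:]
print(records)
```

Because the split points consume no characters, joining parts back together gives the original text unchanged.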
For a pure Python solution (no regex), you can do this:
lines = txt.splitlines()
# Indexes of lines whose text before the first comma is all upper case
line_idx = [i for i, line in enumerate(lines)
            if line.partition(',')[0].isupper()]
print([lines[i:j] for i, j in zip([0]+line_idx, line_idx+[None])][1:])
Prints:
[['LAST_NAME_IN_CAPS_1, First_1 Cell Phone: 999-999-999', 'Address', 'Needham MA 02135', 'Status: Attender Marital: Married Adult: Y M/F: M Env.No.:', '', 'Salutation 1:', ''],
['LAST_NAME_IN_CAPS_2, First_2 Cell Phone: 999-999-999', 'Address', 'Needham MA 02135', 'Status: Attender Marital: Married Adult: Y M/F: M Env.No.:', '', 'Salutation 2:', '', ''],
['LAST_NAME_IN_CAPS_3, First_3 Cell Phone: 999-999-999', 'Address ', 'Needham MA 02135', 'E-mail : email@gmail.com', 'Status: Attender Marital: Married Adult: Y M/F: F Env.No.:', '', 'Salutation 3:', '.', '.']]
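An advantage of the non-regex version is that it supports international characters more easily:
# -*- coding: utf-8 -*-
txt='''\
ABÇ,
ABC,
abc,
ĖFG, '''
# str.isupper() also recognizes accented capitals such as Ç and Ė
print([i for i, line in enumerate(txt.splitlines())
       if line.partition(',')[0].isupper()])
# [0, 1, 3]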
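The same isupper test can be wrapped in a small generator that groups lines into records as it goes. This is a minimal sketch under a few assumptions: the function name iter_records and the sample names are hypothetical, and it additionally requires the header line to contain a comma, which is slightly stricter than the check above.

```python
def iter_records(lines):
    """Yield lists of lines, one list per record.

    A new record starts at any line that contains a comma and whose
    text before the first comma is upper case (str.isupper, so
    accented capitals count too). Lines before the first header
    are skipped.
    """
    current = None
    for line in lines:
        if ',' in line and line.partition(',')[0].isupper():
            if current is not None:
                yield current
            current = [line]
        elif current is not None:
            current.append(line)
    if current is not None:
        yield current


# Hypothetical sample with non-ASCII capitals in the names.
sample = """.
ÇELIK, Ayse Cell Phone: 999-999-999
Address
ĖFG, First Cell Phone: 999-999-999
Address
Needham MA 02135
"""

records = list(iter_records(sample.splitlines()))
for record in records:
    print(record)
```

Since it consumes any iterable of lines, the generator could also be fed an open file object, in which case each line would keep its trailing newline.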