Splitting the result of a captured group in re.sub()

Date: 2018-04-26 12:52:25

Tags: python regex

My code:

import re
InputString = r'On <ENAMEX TYPE="DATE">August 17</ENAMEX> , <ENAMEX TYPE="GPE">Tai wan</ENAMEX> is investigation department.'
p1 = r'<ENAMEX TYPE="(\S+)">(.+?)</ENAMEX>'
p2 = '_'.join(r'\2'.split(' '))
plain_text = re.sub(p1,p2,InputString)

Expected output:

On August_17 , Tai_wan is investigation department.

Unfortunately, I got this result:

On August 17 , Tai wan is investigation department.

How can I split the captured group inside re.sub()?

1 answer:

Answer 0 (score: 0)

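It seems you want to replace each match with just the second group of your pattern (the text between the ENAMEX tags), with every space in it replaced by _.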
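
As a sketch of why the original attempt leaves the spaces untouched: the expression '_'.join(r'\2'.split(' ')) is evaluated once, before re.sub() runs, on the literal two-character template \2 rather than on the captured text. The names template, text, and unchanged are introduced here for illustration only.

```python
import re

# The replacement string is built once, before re.sub() sees any match.
template = r'\2'
p2 = '_'.join(template.split(' '))

# r'\2' contains no space, so split/join hands it back unchanged.
print(p2 == template)  # True

text = '<ENAMEX TYPE="GPE">Tai wan</ENAMEX>'
p1 = r'<ENAMEX TYPE="(\S+)">(.+?)</ENAMEX>'

# re.sub() therefore substitutes group 2 verbatim, spaces included.
unchanged = re.sub(p1, p2, text)
print(unchanged)  # Tai wan
```

A callable replacement avoids this, because re.sub() calls it once per match and passes it the match object, so the captured text is available when the replacement is computed.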

You can use:

import re
InputString = r'On <ENAMEX TYPE="DATE">August 17</ENAMEX> , <ENAMEX TYPE="GPE">Tai wan</ENAMEX> is investigation department.'
p1 = r'<ENAMEX TYPE="[^"]+">(.*?)</ENAMEX>'
plain_text = re.sub(p1,lambda p2: p2.group(1).replace(' ', '_'),InputString)
print(plain_text)
# => On August_17 , Tai_wan is investigation department.

See the Python demo.

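Here, <ENAMEX TYPE="[^"]+">(.*?)</ENAMEX> matches <ENAMEX TYPE=", then one or more characters other than ", then ">, and then captures any zero or more characters other than line breaks, as few as possible, into Group 1. Finally, the </ENAMEX> substring is matched. The lambda expression returns only the contents of Group 1, with literal spaces replaced by underscores. Note that if you want to replace any whitespace character with an underscore, you can use re.sub(r'\s', '_', p2.group(1)) inside the lambda instead.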
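
The whitespace variant could look like the following minimal sketch. It assumes that a run of whitespace should collapse into a single underscore, which is why it uses \s+ rather than \s; the names ENAMEX_RE, underscore_entity, and the sample string are hypothetical and not part of the original answer.

```python
import re

# Precompiled for reuse; same pattern as in the answer.
ENAMEX_RE = re.compile(r'<ENAMEX TYPE="[^"]+">(.*?)</ENAMEX>')

def underscore_entity(match):
    # Hypothetical helper: turn each whitespace run in group 1 into one "_".
    return re.sub(r'\s+', '_', match.group(1))

# The first entity contains a tab followed by a space.
sample = 'On <ENAMEX TYPE="DATE">August\t 17</ENAMEX> , <ENAMEX TYPE="GPE">Tai wan</ENAMEX>.'
result = ENAMEX_RE.sub(underscore_entity, sample)
print(result)  # On August_17 , Tai_wan.
```

With r'\s' in place of r'\s+', each whitespace character would become its own underscore, so the tab and space in the sample would yield August__17.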