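How can I remove XML tags using regular expressions in Python?

Time: 2018-04-04 09:08:13

Tags: python regex xml

A string in Python can contain plain text together with a few XML blocks that carry some information. For example: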

The student XYZ abc has been terminated from the institute. 
you can find the details of student below:
<info StatusCode="End">
    <user_detail>
        <name>
            <first_name>ABC</first_name>
            <last_name>XYZ</last_name>
        </name>
        <contact_details>
            <contact_number>
                <number_type>landline</number_type>
                <number>1234567</number>
            </contact_number>
            <address>
                <address_field1> lorem ipsum, qwerty </address_field1>
                <address_field2> lorem ipsum2, qwerty2 </address_field2>
                <city> asdfgh </city>
                <state> zxcvbn </state>
                <country> India </country>
            </address>
        </contact_details>
    </user_detail>
    <flight_detail>
        ...
    </flight_detail>
</info>
Lorem ipsum dolor sit amet, pro ea dicat velit regione, modo putant 
sensibus pri id, ut bonorum scripserit sit. Ex nec tation alienum, est ut 
nemore efficiendi interpretaris, vis te reque eleifend. 
<xml_tag>
...
</xml_tag>
Laudem delectus
reprehendunt ei mei, has nisl dolorem mnesarchum no, ad eos modo singulis
euripidis. Quo no consul offendit. Eu alia utroque argumentum vix, no 
case primis eum.
<xml_tag>
....
</xml_tag>

The opening XML tag is not fixed to <info>: it could be <session StatusCode="End">, in which case the closing tag would be </session>.

Currently, I use the following to remove the XML tags:
data = re.sub(r'<[^<]+>', "", data)
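
To illustrate what this substitution does, here is a minimal sketch on a shortened, made-up sample (the `data` string below is a hypothetical stand-in for the real text). It strips the tags themselves but leaves the text between them in place:

```python
import re

# Shortened, hypothetical sample standing in for the real text.
data = 'before\n<info StatusCode="End">\n    <name>ABC</name>\n</info>\nafter'

# Removes every <...> tag, but not the content enclosed by the tags.
stripped = re.sub(r'<[^<]+>', "", data)

print(repr(stripped))  # the tags are gone, but "ABC" is still present
```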

However, I now want to remove all of the XML content from this text, not just the tags. The final output I want is:

The student XYZ abc has been terminated from the institute. 
you can find the details of student below:
Lorem ipsum dolor sit amet, pro ea dicat velit regione, modo putant 
sensibus pri id, ut bonorum scripserit sit. Ex nec tation alienum, est ut 
nemore efficiendi interpretaris, vis te reque eleifend. 
Laudem delectus
reprehendunt ei mei, has nisl dolorem mnesarchum no, ad eos modo singulis
euripidis. Quo no consul offendit. Eu alia utroque argumentum vix, no 
case primis eum. 

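I tried matching with </\S+>, but that removes everything up to the first closing XML tag. How can I remove all XML content from a string that is otherwise plain text?

1 Answer:

Answer 0 (score: 1)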

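With the single-line option enabled (so that . also matches newlines; re.DOTALL in Python),

<(.*?>)(.*)</\1 matches the XML you want to remove. The inner XML is in the second group.

See https://regex101.com/r/HwiA2t/1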
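
In Python this idea could be sketched roughly as follows. Note that this is a variation on the pattern above, not the exact regex from the answer: it assumes group 1 should capture only the tag name (so an opening tag with attributes, such as `<info StatusCode="End">`, still pairs with `</info>`), and it matches the inner content lazily so that plain text between two separate blocks is kept. The names `XML_BLOCK` and `strip_xml_blocks` are made up for illustration.

```python
import re

# Assumed variation of the answer's pattern:
#   <(name) attrs>  ...lazy inner content...  </name>
# The trailing [ \t]*\n? also consumes the line break after the closing tag.
XML_BLOCK = re.compile(
    r'<([A-Za-z_][\w.\-]*)\b[^<>]*>.*?</\1>[ \t]*\n?',
    re.DOTALL,  # the "single-line" option: . also matches newlines
)

def strip_xml_blocks(text):
    """Remove each top-level XML element, including everything inside it."""
    return XML_BLOCK.sub('', text)

# Shortened, hypothetical sample in the shape of the question's text.
sample = (
    'intro\n'
    '<info StatusCode="End">\n'
    '    <name>ABC</name>\n'
    '</info>\n'
    'middle\n'
    '<xml_tag>\n'
    '...\n'
    '</xml_tag>\n'
    'end\n'
)

print(strip_xml_blocks(sample))
```

Because the inner match is lazy, an element that contains a nested element with the same tag name would be cut off at the first inner closing tag, so this sketch only suits input like the sample in the question. For arbitrary XML, a real parser is the safer choice.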