Question

这不是一个重复的问题（How to delete “\r\n”, but only in certain sections of my text?）我正在寻找更灵活的解决方案;

这不是XML文件的问题;只是为了这个样本它恰好是xml形式;这是关于如何实施这个想法的一般性问题;

这是一个非常大的文本样本：

<nx1>home</nx1>
<nx2>living</nx2>
<out>text one
 text continues
 and at last!</out>
<m2>dog</m2>

如何搜索<out>...</out>块，然后在该块中用空字符串替换所有\r\n个字符（删除换行符），然后将新创建的文本替换为旧{ {1}}阻止？

所以这个样本的结果是：

<out>...</out>

如何使用正则表达式在python中实现它？

Answer 1

如果更喜欢使用正则表达式，我会使用函数作为替代。

>>> import re
>>> def as_tag(match):
...     return match.group(1).replace('\r\n', '')
...
>>> text = '''
<nx1>home</nx1>
<nx2>living</nx2>
<out>text one 
 text continues
 and at last!</out>
<m2>dog</m2>
'''
>>> re.sub(r'(<out>[^<]*</out>)', as_tag, text)

输出

<nx1>home</nx1>
<nx2>living</nx2>
<out>text one text continues and at last!</out>
<m2>dog</m2>

Answer 2

由于此特定输入是XML，因此这是一个涉及lxml解析器和normalize-space()的解决方案：

>>> from lxml import etree
>>> text = """
... <root>
...     <nx1>home</nx1>
...     <nx2>living</nx2>
...     <out>text one
...      text continues
...      and at last!</out>
...     <m2>dog</m2>
... </root>
... """
>>> tree = etree.fromstring(text)
>>> out = tree.find('out')
>>> out.text = out.xpath('normalize-space(text())')
>>> print etree.tostring(tree)
<root>
    <nx1>home</nx1>
    <nx2>living</nx2>
    <out>text one text continues and at last!</out>
    <m2>dog</m2>
</root>

Answer 3

我建议您使用Python The ElementTree XML API。请查看以下部分：修改XML文件

在正则表达式匹配中搜索和替换

3 个答案: