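I am trying to do this with lxml, but in the end it is really a question about the right XPath.
I want to select everything from a <pgBreak> element up to the end of its parent, which in this case is <p>.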
    <root>
      <pgBreak pgId="1"/>
      <p>
        some text to fill out a para
        <pgBreak pgId="2"/>
        some more text
        <quote> A quoted block </quote>
        remainder of para
      </p>
    </root>
The result I am after is:
    <root>
      <pgBreak pgId="1"/>
      <p>
        some text to fill out a para
      </p>
      <pgBreak pgId="2"/>
      <p>
        some more text
        <quote> A quoted block </quote>
        remainder of para
      </p>
    </root>
Answer 0 (score: 1)
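What you are trying to do is not trivial: you not only want to match the 'pgBreak' element and all of its following siblings, you also want to move them out of the parent's scope and wrap the siblings in a 'p' element. Fun stuff.
The following code should give you an idea of how to achieve this (disclaimer: example only, needs cleanup, edge cases are probably not handled). The code is intentionally left uncommented, so you will have to figure it out :)
I have modified the input XML slightly to better illustrate the functionality.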
    import lxml.etree

    text = """
    <root>
      <pgBreak pgId="1"/>
      <p>
        some text to fill out a para
        <pgBreak pgId="2"/>
        some more text
        <quote> A quoted block </quote>
        remainder of para
        <pgBreak pgId="3"/>
        <p>
          blurb
        </p>
      </p>
    </root>
    """

    root = lxml.etree.fromstring(text)

    for pgbreak in root.xpath('//pgBreak'):
        inner = pgbreak.getparent()
        if inner is root:
            continue
        outer = inner.getparent()
        pgbreak_index = inner.index(pgbreak)
        inner_index = outer.index(inner) + 1
        siblings = inner[pgbreak_index + 1:]
        inner.remove(pgbreak)
        outer.insert(inner_index, pgbreak)
        if siblings[0].tag != 'p':
            p = lxml.etree.Element('p')
            p.text = pgbreak.tail
            pgbreak.tail = None
            for node in siblings:
                p.append(node)
            outer.insert(inner_index + 1, p)
        else:
            for node in siblings:
                inner_index += 1
                outer.insert(inner_index, node)

    print(lxml.etree.tostring(root, encoding='unicode'))
The output is:
    <root>
      <pgBreak pgId="1"/>
      <p>
        some text to fill out a para
      </p>
      <pgBreak pgId="2"/>
      <p>
        some more text
        <quote> A quoted block </quote>
        remainder of para
      </p>
      <pgBreak pgId="3"/>
      <p>
        blurb
      </p>
    </root>
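
As for the XPath part of the question, here is a minimal sketch, assuming lxml is installed. The `following-sibling::node()` axis selects everything that comes after a given 'pgBreak' up to the end of its parent, text nodes included. The trimmed-down document and the variable names below are made up for illustration and are not part of the original question.

```python
import lxml.etree

# Hypothetical, whitespace-free version of the question's input,
# so the text nodes are easy to read.
doc = lxml.etree.fromstring(
    '<root><pgBreak pgId="1"/>'
    '<p>some text to fill out a para'
    '<pgBreak pgId="2"/>some more text'
    '<quote> A quoted block </quote>remainder of para</p>'
    '</root>'
)

# Take the pgBreak that sits inside the paragraph.
pgbreak = doc.xpath('//p/pgBreak')[0]

# Everything after it, up to the end of its parent <p>.
# lxml returns text nodes as str subclasses and elements as Element objects.
following = pgbreak.xpath('following-sibling::node()')

for node in following:
    if isinstance(node, str):
        print('text:   ', repr(str(node)))
    else:
        print('element:', node.tag)
```

This only covers the selection. Moving the selected nodes out of the parent still needs tree manipulation like the loop above, because in lxml text is stored in the `.text` and `.tail` attributes of elements rather than as separate movable nodes.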