XPath: selecting from a child element to the end of its parent element

Date: 2013-03-20 22:19:47

Tags: python xml xpath lxml

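I am trying to do this with lxml, but in the end it comes down to getting the XPath right. I want to select everything from a <pgBreak> element to the end of its parent element, in this case <p>.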

XML IN:

<root>
  <pgBreak pgId="1"/>
  <p>
    some text to fill out a para
    <pgBreak pgId="2"/>
    some more text
    <quote> A quoted block </quote>
    remainder of para
  </p>
</root>

XML OUT:

<root>
  <pgBreak pgId="1"/>
  <p>
    some text to fill out a para
  </p>
  <pgBreak pgId="2"/>
  <p>
    some more text
    <quote> A quoted block </quote>
    remainder of para
  </p>
</root>

1 answer:

Answer 0 (score: 1)

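What you are trying to do is not trivial: you need to match not only the 'pgBreak' element and all of its following siblings, but also move them out of the parent's scope and wrap the siblings in a 'p' element. Fun stuff.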
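The matching half of that task can be expressed in XPath alone with the following-sibling axis. Below is a minimal sketch, assuming the question's sample document with the whitespace stripped for brevity; the names doc, nodes and elements are illustrative and not part of the original answer.

```python
import lxml.etree

# Sample document from the question, whitespace removed for brevity.
doc = lxml.etree.fromstring(
    '<root><pgBreak pgId="1"/><p>some text to fill out a para'
    '<pgBreak pgId="2"/>some more text'
    '<quote> A quoted block </quote>remainder of para</p></root>'
)

# node() matches text nodes as well as elements, so this selects
# everything after pgBreak 2 up to the end of its parent <p>.
nodes = doc.xpath('//pgBreak[@pgId="2"]/following-sibling::node()')

# * matches elements only, so the loose text is skipped here.
elements = doc.xpath('//pgBreak[@pgId="2"]/following-sibling::*')

for node in nodes:
    if isinstance(node, str):
        # lxml returns text nodes as str subclasses ("smart strings").
        print('text:', repr(str(node)))
    else:
        print('element:', node.tag)
```

XPath can only select these nodes. Moving them and wrapping them in a new 'p' element still has to be done through the lxml tree API.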

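The following code should give you an idea of how to do this (disclaimer: it is only an example, it needs cleaning up, and edge cases may not be handled). The code is deliberately left uncommented, so you will have to work it out yourself :)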

I modified the input XML slightly to better illustrate the functionality.

import lxml.etree

text = """
<root>
  <pgBreak pgId="1"/>
  <p>
    some text to fill out a para
    <pgBreak pgId="2"/>
    some more text 
    <quote> A quoted block </quote>
    remainder of para
    <pgBreak pgId="3"/>
    <p>
       blurb
    </p>
  </p>
</root>
"""

root = lxml.etree.fromstring(text)
for pgbreak in root.xpath('//pgBreak'):
    inner = pgbreak.getparent()
    if inner is root:
        continue
    outer = inner.getparent()
    pgbreak_index = inner.index(pgbreak)
    inner_index = outer.index(inner) + 1
    siblings = inner[pgbreak_index + 1:]
    inner.remove(pgbreak)
    outer.insert(inner_index, pgbreak)
    if not siblings or siblings[0].tag != 'p':
        p = lxml.etree.Element('p')
        p.text = pgbreak.tail
        pgbreak.tail = None
        for node in siblings:
            p.append(node)
        outer.insert(inner_index + 1, p)
    else:
        for node in siblings:
            inner_index += 1
            outer.insert(inner_index, node)

print(lxml.etree.tostring(root, encoding='unicode'))
The output is (indentation tidied for readability):

<root>
  <pgBreak pgId="1"/>
  <p>
    some text to fill out a para
  </p>
  <pgBreak pgId="2"/>
  <p>
    some more text 
    <quote> A quoted block </quote>
    remainder of para
  </p>
  <pgBreak pgId="3"/>
  <p>
    blurb
  </p>
</root>
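One detail of the code above that is easy to miss is how lxml stores mixed content: text that follows an element is kept in that element's .tail, and the tail travels with the element when it is moved. The sketch below isolates that step on a smaller, hypothetical document (the names root, p, pgbreak and new_p are chosen for illustration only), and is not meant as a complete solution.

```python
import lxml.etree

root = lxml.etree.fromstring(
    '<root><p>before<pgBreak/>after<quote>q</quote>rest</p></root>'
)
p = root[0]
pgbreak = p[0]

# 'after' is stored as the tail of <pgBreak/>, not as text of <p>.
new_p = lxml.etree.Element('p')
new_p.text = pgbreak.tail   # the tail becomes the leading text of the new <p>
pgbreak.tail = None         # otherwise 'after' would move along with pgBreak

# Slicing returns a list copy, so moving nodes while iterating is safe.
for node in p[1:]:
    new_p.append(node)      # append() moves the node together with its tail

root.append(pgbreak)        # moves pgBreak out of the original <p>
root.append(new_p)

result = lxml.etree.tostring(root, encoding='unicode')
print(result)
```

Clearing pgbreak.tail before the move is what keeps the text 'after' inside the new paragraph instead of leaving it stranded next to the page break.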