Remove XML elements based on a match

Time: 2018-04-28 15:38:53

Tags: python xml parsing

I would like to remove elements based on a match against a child element. Example file.xml:

 <entry>
  <title>TEST1</title>
  <profile>
    <title>Default</title>
    <pid>
      <pidNumber>1880</pidNumber>
      <ContentType>PMT</ContentType>
      <isScrambled>0</isScrambled>
    </pid>
    <pid>
      <pidNumber>201</pidNumber>
      <ContentType>Video</ContentType>
      <isScrambled>0</isScrambled>
    </pid>
    <pid>
      <pidNumber>301</pidNumber>
      <ContentType>Audio</ContentType>
      <isScrambled>0</isScrambled>
    </pid>
    <pid>
      <pidNumber>302</pidNumber>
      <ContentType>Audio</ContentType>
      <isScrambled>0</isScrambled>
    </pid>
    <pid>
      <pidNumber>310</pidNumber>
      <ContentType>Audio</ContentType>
      <isScrambled>0</isScrambled>
    </pid>
  </profile>
</entry>

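As you can see, there are many pid values (201, 301, 302-310). I want to remove every pid whose number falls in the range 302-310. Here is my code, but it raises an error.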

# -*- coding: utf-8 -*-
import re
from xml.etree import ElementTree as ET

root = ET.parse("file.xml").getroot()
regex = r"[3][0-1][02-9]"
getpid = root.iter("pid")

for item in getpid:
    pidnum = item.find('.//pidNumber')
    pidnum = pidnum.text
    match = re.findall(regex, pidnum)
    match = ''.join(match)
    if pidnum == match:
        ET.dump(item)
        item.remove(getpid)

tree = ET(root)
tree.write("out.xml")

I get this error:

  

    self._children.remove(element)
ValueError: list.remove(x): x not in list

How can I fix this? I think I am close. Thanks for looking and helping.

2 answers:

Answer 0 (score: 1)

  

"I want to remove every pid that matches 302-310."

I think your regex logic is flawed. If you had a pidNumber of 319 (or 312, 313, etc.), those pid elements would be removed as well.
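A quick way to check this is to run the pattern over a range of numbers. The sketch below uses re.fullmatch as a stand-in for the findall/join comparison in the question (an assumption; the two agree for three-digit strings) and collects the matches that fall outside 302-310.

```python
import re

# Pattern copied from the question.
regex = re.compile(r"[3][0-1][02-9]")

# Every number from 290 to 329 whose full text matches the pattern.
matched = [n for n in range(290, 330) if regex.fullmatch(str(n))]

# Matches that are NOT in the intended 302-310 range.
unwanted = [n for n in matched if not 302 <= n <= 310]
print(unwanted)
```

Besides 312 through 319, the pattern also accepts 300, so it is wider than the intended range at both ends.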

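Also, your code does not remove the pid entirely; it removes its children and leaves an empty pid element behind. (Maybe that is what is wanted, but it does not sound like it, based on "I like to delete elements based on sub-element matching".)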
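The difference between emptying an element and removing it can be seen in a small sketch. It assumes the emptying is done with Element.clear(); the inline XML string is made up for illustration and is not the real file.xml.

```python
from xml.etree import ElementTree as ET

SAMPLE = "<profile><pid><pidNumber>302</pidNumber></pid></profile>"

# Case 1: clear() empties the pid but leaves the element in its parent.
cleared_root = ET.fromstring(SAMPLE)
cleared_pid = cleared_root.find("pid")
cleared_pid.clear()
pid_still_present = cleared_root.find("pid") is not None
children_left = len(list(cleared_pid))

# Case 2: remove() on the parent takes the whole pid out of the tree.
removed_root = ET.fromstring(SAMPLE)
removed_root.remove(removed_root.find("pid"))
pid_gone = removed_root.find("pid") is None

print(ET.tostring(cleared_root, encoding="unicode"))
print(ET.tostring(removed_root, encoding="unicode"))
```

In the first case the output still contains a pid element with no content; in the second case no pid element remains.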

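Instead of using getroot(), try using find() to get the profile element. That is the parent of pid, and we need the parent in order to remove the pid itself.

Instead of matching pidNumber with a regex, just do a basic numeric comparison.

Example...

file.xml (with extra pid elements added for testing)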
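The reason the parent matters is that Element.remove() only removes direct children; calling it on any other element raises ValueError. A minimal sketch of that behavior, using a made-up one-pid document rather than the real file.xml:

```python
from xml.etree import ElementTree as ET

root = ET.fromstring(
    "<entry><profile><pid><pidNumber>302</pidNumber></pid></profile></entry>"
)
profile = root.find("profile")
pid = profile.find("pid")

# pid is a grandchild of entry, not a direct child, so this fails.
try:
    root.remove(pid)
    removed_from_root = True
except ValueError:
    removed_from_root = False

# profile is the direct parent, so this succeeds.
profile.remove(pid)
remaining = len(profile.findall("pid"))
```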

<entry>
    <title>TEST1</title>
    <profile>
        <title>Default</title>
        <pid>
            <pidNumber>1880</pidNumber>
            <ContentType>PMT</ContentType>
            <isScrambled>0</isScrambled>
        </pid>
        <pid>
            <pidNumber>201</pidNumber>
            <ContentType>Video</ContentType>
            <isScrambled>0</isScrambled>
        </pid>
        <pid>
            <pidNumber>301</pidNumber>
            <ContentType>Audio</ContentType>
            <isScrambled>0</isScrambled>
        </pid>
        <pid>
            <pidNumber>302</pidNumber>
            <ContentType>Audio</ContentType>
            <isScrambled>0</isScrambled>
        </pid>
        <pid>
            <pidNumber>303</pidNumber>
            <ContentType>Audio</ContentType>
            <isScrambled>0</isScrambled>
        </pid>
        <pid>
            <pidNumber>309</pidNumber>
            <ContentType>Audio</ContentType>
            <isScrambled>0</isScrambled>
        </pid>
        <pid>
            <pidNumber>310</pidNumber>
            <ContentType>Audio</ContentType>
            <isScrambled>0</isScrambled>
        </pid>
        <pid>
            <pidNumber>319</pidNumber>
            <ContentType>Audio</ContentType>
            <isScrambled>0</isScrambled>
        </pid>
    </profile>
</entry>

Python

from xml.etree import ElementTree as ET

tree = ET.parse("file.xml")
profile = tree.find("profile")

for pid in profile.findall(".//pid"):
    nbr = int(pid.find("pidNumber").text)
    if 302 <= nbr <= 310:
        profile.remove(pid)

tree.write('out.xml')

out.xml

<entry>
    <title>TEST1</title>
    <profile>
        <title>Default</title>
        <pid>
            <pidNumber>1880</pidNumber>
            <ContentType>PMT</ContentType>
            <isScrambled>0</isScrambled>
        </pid>
        <pid>
            <pidNumber>201</pidNumber>
            <ContentType>Video</ContentType>
            <isScrambled>0</isScrambled>
        </pid>
        <pid>
            <pidNumber>301</pidNumber>
            <ContentType>Audio</ContentType>
            <isScrambled>0</isScrambled>
        </pid>
        <pid>
            <pidNumber>319</pidNumber>
            <ContentType>Audio</ContentType>
            <isScrambled>0</isScrambled>
        </pid>
    </profile>
</entry>
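The ElementTree code above relies on every pid being a direct child of profile. If pid elements could sit at different depths, one common workaround is to build a child-to-parent map first. The following is only a sketch under that assumption; the helper name remove_pids_in_range, the group wrapper element, and the sample string are hypothetical and not part of the original file.

```python
from xml.etree import ElementTree as ET


def remove_pids_in_range(root, low, high):
    """Remove every pid whose pidNumber is in [low, high], at any depth."""
    # ElementTree elements do not know their parent, so map child -> parent.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    removed = 0
    # list() takes a snapshot so the tree is not modified while being walked.
    for pid in list(root.iter("pid")):
        number = pid.findtext("pidNumber")
        if number is not None and low <= int(number) <= high:
            parent_of[pid].remove(pid)
            removed += 1
    return removed


# Hypothetical sample: one pid is nested one level deeper inside <group>.
sample = """<entry><profile>
  <pid><pidNumber>301</pidNumber></pid>
  <group><pid><pidNumber>305</pidNumber></pid></group>
  <pid><pidNumber>310</pidNumber></pid>
  <pid><pidNumber>319</pidNumber></pid>
</profile></entry>"""

root = ET.fromstring(sample)
removed = remove_pids_in_range(root, 302, 310)
remaining = [p.findtext("pidNumber") for p in root.iter("pid")]
print(removed, remaining)
```

With lxml this map is unnecessary, because each element exposes getparent().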

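Another option is to use lxml instead of ElementTree. It gives you full XPath support, so you can do the comparison in a predicate.

Using the file.xml input above, the following Python produces the same out.xml output as above.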

from lxml import etree

tree = etree.parse("file.xml")
for pid in tree.xpath(".//pid[pidNumber[. >= 302][310 >= .]]"):
    pid.getparent().remove(pid)

tree.write("out.xml")

A third option is to use XSLT (thanks to @Parfait for the suggestion)...

Python

from lxml import etree

tree = etree.parse("file.xml")
xslt = etree.parse("test.xsl")
new_tree = tree.xslt(xslt)
new_tree.write_output("out_xslt.xml")

XSLT 1.0 (test.xsl)

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output indent="yes"/>
  <xsl:strip-space elements="*"/>

  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <xsl:template match="pid[pidNumber[. >= 302][310 >= .]]"/>

</xsl:stylesheet>

Again, this produces the same result as the other options when given the same input.

Answer 1 (score: 0)

Here is the working code:

import re
from xml.etree import ElementTree as ET

tree = ET.parse("file.xml")
root = tree.getroot()
regex = r"[3][0-1][02-9]"
getpid = list(root.iter("pid"))  # getiterator() was removed in Python 3.9

for item in getpid:
    pidnum = item.find('.//pidNumber')
    pidnum = pidnum.text
    match = re.findall(regex, pidnum)
    match = ''.join(match)
    if pidnum == match:
        item.clear()
# create a new XML file with the results
tree.write('out.xml')

Thanks, everyone.