Python: parse XML to remove elements by a certain attribute and replace text in attributes

Asked: 2018-04-22 16:00:33

Tags: python xml xml-parsing elementtree

I have the following XML file:

<tv>
    <programme channel="BBC Red Button 1" start="20180422123000 +0000" stop="20180422125500 +0000">
        <title lang="en">Live Snooker: The World Championship: Day Two - 2018</title>
        <desc lang="en">Coverage of day two at the Crucible Theatre in Sheffield</desc>
        <category lang="en">Sport</category>
        <icon src="http://images.radiotimes.com/remote/static.radiotimes.com.edgesuite.net/pa/70/26/webANXsnookerlivebbc.jpg?quality=60&amp;mode=crop&amp;width=130&amp;height=100&amp;404=tv" />
    </programme>
    <programme channel="BBC Red Button 1" start="20180422125500 +0000" stop="20180422150000 +0000">
        <title lang="en">Live UEFA Women's Champions League</title>
        <desc lang="en">Manchester City v Lyon (Kick-off 1.00pm)</desc>
        <category lang="en">Sport</category>
        <icon src="http://images.radiotimes.com/assets/images/holding/tv.png?quality=60&amp;mode=crop&amp;width=130&amp;height=100&amp;404=tv" />
     </programme>
</tv>

First, I want to remove the icon elements whose src is equal to the one below:

<icon src="http://images.radiotimes.com/assets/images/holding/tv.png?quality=60&amp;mode=crop&amp;width=130&amp;height=100&amp;404=tv" />

Then, for the remaining icons, I am trying to replace quality=60&amp;mode=crop&amp;width=130&amp;height=100 with quality=100&amp;mode=crop&amp;width=1200&amp;height=723.

So once the XML file has been processed, it should look like this:

<tv>
    <programme channel="BBC Red Button 1" start="20180422123000 +0000" stop="20180422125500 +0000">
        <title lang="en">Live Snooker: The World Championship: Day Two - 2018</title>
        <desc lang="en">Coverage of day two at the Crucible Theatre in Sheffield</desc>
        <category lang="en">Sport</category>
        <icon src="http://images.radiotimes.com/remote/static.radiotimes.com.edgesuite.net/pa/70/26/webANXsnookerlivebbc.jpg?quality=100&amp;mode=crop&amp;width=1200&amp;height=723&amp;404=tv" />
    </programme>
    <programme channel="BBC Red Button 1" start="20180422125500 +0000" stop="20180422150000 +0000">
        <title lang="en">Live UEFA Women's Champions League</title>
        <desc lang="en">Manchester City v Lyon (Kick-off 1.00pm)</desc>
        <category lang="en">Sport</category>
     </programme>
</tv>

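I first need to remove the unwanted icons from the XML file before replacing the other values, so that I don't end up changing the value of an icon I am about to remove. So far I have tried the following to remove the icons, without success: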

#!/bin/sh

from xml.etree.ElementTree import ElementTree

t = ElementTree()
t.parse('/volume1/TVMosaic/Freeview-WG++/guide.xml')
programmeList = t.findall('tv/programme/icon')
for programmeEl in programmeList:
    if programmeEl.attrib['src'] in ('http://images.radiotimes.com/assets/images/holding/tv.png?quality=60&amp;mode=crop&amp;width=130&amp;height=100&amp;404=tv') and \
            programmeEl.attrib['src'] == programmeEl.text:
        del programmeEl.attrib['src']
t.write('/volume1/TVMosaic/Freeview-WG++/PhasedGuide.xml')

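Can anyone help me remove the icons with the src I mentioned, and then replace the values in the remaining icons with the ones given above?

Thanks.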

1 Answer:

Answer 0 (score: 0)

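The problem is that the string you are searching for is XML-escaped (note the "&amp;"s), whereas the strings are unescaped when the file is parsed ("&amp;" is converted to "&", and the same happens for a few other entities). The parsed attribute value therefore never equals the escaped string in your comparison. For details, see [Python.Wiki]: Escaping XML.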
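A minimal sketch of that unescaping, using a shortened, hypothetical icon element (the example.com URL is an assumption for illustration, not taken from the question). The parser hands back the attribute with plain "&" characters, so a comparison against a string that still contains "&amp;" cannot match:

```python
from xml.etree import ElementTree as ET

# Hypothetical, shortened element; the URL is illustrative only.
XML_TEXT = '<icon src="http://example.com/tv.png?quality=60&amp;mode=crop" />'

icon = ET.fromstring(XML_TEXT)
parsed_src = icon.get("src")

# The parser has already turned "&amp;" into "&".
print(parsed_src)

# Comparing against the escaped form fails; the unescaped form matches.
matches_escaped = parsed_src == "http://example.com/tv.png?quality=60&amp;mode=crop"
matches_unescaped = parsed_src == "http://example.com/tv.png?quality=60&mode=crop"
print(matches_escaped, matches_unescaped)
```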

code.py

#!/usr/bin/env python3

import sys
from xml.etree import ElementTree as ET
from xml.sax.saxutils import escape, unescape


INPUT_FILE_NAME = "guide.xml"
OUTPUT_FILE_NAME = "PhasedGuide.xml"
SRC_ATTR_TEXT = "http://images.radiotimes.com/assets/images/holding/tv.png?quality=60&amp;mode=crop&amp;width=130&amp;height=100&amp;404=tv"
SRC_ATTR_REPLACE_TEXT = "quality=60&amp;mode=crop&amp;width=130&amp;height=100"
SRC_ATTR_REPLACE_WITH_TEXT = "quality=100&amp;mode=crop&amp;width=1200&amp;height=723"


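def main():
    # Parsing unescapes attribute values, so "&amp;" in the file arrives as "&".
    tree = ET.parse(INPUT_FILE_NAME)
    tv_node = tree.getroot()
    for programme_node in tv_node.findall("programme"):
        icon_node = programme_node.find("icon")
        if icon_node is not None:
            # Re-escape the value so it can be compared with the escaped constants above.
            src_attr = escape(icon_node.get("src", ""))
            if src_attr == SRC_ATTR_TEXT:
                # Holding image: drop the whole icon element.
                programme_node.remove(icon_node)
            elif src_attr:
                # Swap the size parameters, then unescape; tree.write() escapes again on output.
                icon_node.set("src", unescape(src_attr.replace(SRC_ATTR_REPLACE_TEXT, SRC_ATTR_REPLACE_WITH_TEXT)))
    tree.write(OUTPUT_FILE_NAME)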


if __name__ == "__main__":
    print("Python {:s} on {:s}\n".format(sys.version, sys.platform))
    main()
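
The code above searches from the root node with findall("programme"). As a side note, here is a small sketch of why the path used in the question, tv/programme/icon, finds nothing regardless of escaping: ElementTree resolves paths relative to the root element, which is already tv. The trimmed-down XML and its placeholder src value are assumptions for illustration.

```python
from xml.etree import ElementTree as ET

# Trimmed-down version of the guide; the src value is a placeholder.
XML_TEXT = """<tv>
    <programme channel="BBC Red Button 1">
        <icon src="placeholder" />
    </programme>
</tv>"""

root = ET.fromstring(XML_TEXT)  # root is the <tv> element itself

# Looks for a <tv> child inside the root <tv>, so nothing matches.
from_question = root.findall("tv/programme/icon")

# Relative to the root, the icons live at programme/icon.
relative_to_root = root.findall("programme/icon")

print(len(from_question), len(relative_to_root))
```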

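Notes

  • The algorithm loads and parses the file, and gets the root node (tv)
  • It iterates over all programme children
  • For each one, it tries to find the icon child and, if found, reads its src attribute (with the value escaped)
  • Then, depending on the (escaped) attribute value, it performs the required action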
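
As an alternative to the escape/unescape round trip described in the notes, the constants can be written in unescaped form and compared with the parsed values directly. This is a sketch under that assumption, not the method used in code.py; the names HOLDING_SRC, OLD_PARAMS, NEW_PARAMS and clean_icons are hypothetical, and the sample XML is a shortened version of the question's file.

```python
from xml.etree import ElementTree as ET

# Same values as in the question, but written unescaped ("&" instead of "&amp;").
HOLDING_SRC = "http://images.radiotimes.com/assets/images/holding/tv.png?quality=60&mode=crop&width=130&height=100&404=tv"
OLD_PARAMS = "quality=60&mode=crop&width=130&height=100"
NEW_PARAMS = "quality=100&mode=crop&width=1200&height=723"


def clean_icons(root):
    """Remove holding icons and enlarge the remaining ones, in place."""
    for programme in root.findall("programme"):
        # findall returns a list, so removing children inside the loop is safe.
        for icon in programme.findall("icon"):
            src = icon.get("src", "")
            if src == HOLDING_SRC:
                programme.remove(icon)
            else:
                icon.set("src", src.replace(OLD_PARAMS, NEW_PARAMS))
    return root


# Shortened sample: one real icon and one holding icon.
SAMPLE = """<tv>
    <programme channel="BBC Red Button 1">
        <icon src="http://images.radiotimes.com/remote/static.radiotimes.com.edgesuite.net/pa/70/26/webANXsnookerlivebbc.jpg?quality=60&amp;mode=crop&amp;width=130&amp;height=100&amp;404=tv" />
    </programme>
    <programme channel="BBC Red Button 1">
        <icon src="http://images.radiotimes.com/assets/images/holding/tv.png?quality=60&amp;mode=crop&amp;width=130&amp;height=100&amp;404=tv" />
    </programme>
</tv>"""

root = clean_icons(ET.fromstring(SAMPLE))
result = ET.tostring(root, encoding="unicode")
print(result)
```

When the tree is serialized, ElementTree escapes the "&" characters again, so the written file still contains "&amp;" in the src attributes.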

Output

(py35x64_test) e:\Work\Dev\StackOverflow\q049967927>"e:\Work\Dev\VEnvs\py35x64_test\Scripts\python.exe" code.py
Python 3.5.4 (v3.5.4:3f56838, Aug  8 2017, 02:17:05) [MSC v.1900 64 bit (AMD64)] on win32


(py35x64_test) e:\Work\Dev\StackOverflow\q049967927>type PhasedGuide.xml
<tv>
    <programme channel="BBC Red Button 1" start="20180422123000 +0000" stop="20180422125500 +0000">
        <title lang="en">Live Snooker: The World Championship: Day Two - 2018</title>
        <desc lang="en">Coverage of day two at the Crucible Theatre in Sheffield</desc>
        <category lang="en">Sport</category>
        <icon src="http://images.radiotimes.com/remote/static.radiotimes.com.edgesuite.net/pa/70/26/webANXsnookerlivebbc.jpg?quality=100&amp;mode=crop&amp;width=1200&amp;height=723&amp;404=tv" />
    </programme>
    <programme channel="BBC Red Button 1" start="20180422125500 +0000" stop="20180422150000 +0000">
        <title lang="en">Live UEFA Women's Champions League</title>
        <desc lang="en">Manchester City v Lyon (Kick-off 1.00pm)</desc>
        <category lang="en">Sport</category>
        </programme>
</tv>