Need help parsing tags in an XML file and replacing them with new values

Asked: 2013-08-20 14:37:08

Tags: python python-2.7 xml-parsing

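I'm trying to modify that phone number script from a few weeks ago to help out a friend. This is the script I'm using as a starting point.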

# import regular expressions 
import re
# import argv 
from sys import argv

#arguments to provide at command line 
script, filename = argv

# load the file and read its contents; the with-block closes the file afterwards
with open(filename) as data:
    read_file = data.read()

# create a regular expression to filter out phone numbers 
phone_finder = re.compile(r"\(\d{3}\)\s*\d{3}-\d{4}")

# r marks the pattern as a raw string, so backslashes are passed to re unchanged
# \( to match "("
# \d{3} to match 3 digits
# \) to match ")"
# \s* to match zero or more whitespace characters (so a missing space is fine)
# \d{3} to match 3 digits
# - to match an "-"
# \d{4} to match 4 digits

# print the results
print phone_finder.findall(read_file)
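
To see what the pattern matches without needing a file on disk, here is a minimal sketch that runs the same expression against an inline string. The sample text and the name `sample` are made up for illustration; only the pattern comes from the script above.

```python
import re

# Same pattern as the script above: "(ddd)", optional whitespace, "ddd-dddd".
phone_finder = re.compile(r"\(\d{3}\)\s*\d{3}-\d{4}")

# Hypothetical sample text, not taken from any real file.
sample = "Call (555) 123-4567 or (555)987-6543; 555-000-1111 is not matched."

matches = phone_finder.findall(sample)
print(matches)  # ['(555) 123-4567', '(555)987-6543']
```

The third number is skipped because the pattern requires the area code to be wrapped in parentheses.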

He wants a way to search an XML file and find "<excerpt:encoded><![CDATA[]]></excerpt:encoded>" or

<excerpt:encoded><![CDATA[We love having a frother to make a latte or cappuccino, and think you'll enjoy some hot milk on these cold winter nights to put you to sleep as well.]]></excerpt:encoded>

and replace all instances with

<excerpt:encoded><![CDATA[]]></excerpt:encoded>

But I'm not sure how to go about this, because in the second example the text is different in every instance.

I'm new to Python, so any help is appreciated. Thanks for your time.

1 Answer:

Answer 0: (score: 0)

To remove all content from the <excerpt:encoded> elements:

# xml.etree.ElementTree works on Python 2.7 and 3.x (cElementTree was removed in Python 3.9)
import xml.etree.ElementTree as etree

etree.register_namespace('excerpt', 'your namespace') # to preserve prefix

# read xml
doc = etree.parse(filename)

# clear elements
for element in doc.iter(tag='{your namespace}encoded'): 
    element.clear()

# write xml
doc.write(filename + '.cleared')

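You should replace 'your namespace' with the actual namespace URI that the excerpt prefix refers to, i.e. the value of the xmlns:excerpt attribute declared in the XML file.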
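
If the namespace URI is not known in advance, it can be read from the file itself. The following is a minimal sketch, not part of the original answer: it parses a small in-memory document, collects the prefix-to-URI declarations with `iterparse` and its `start-ns` event, and then clears the matching elements the same way the answer does. The sample XML, the URI `http://example.com/excerpt/`, and the helper name `clear_encoded` are all assumptions made for illustration.

```python
import io
import xml.etree.ElementTree as etree

# Hypothetical sample document; the namespace URI is a placeholder.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:excerpt="http://example.com/excerpt/">
  <item>
    <excerpt:encoded><![CDATA[Some text that differs per item.]]></excerpt:encoded>
  </item>
  <item>
    <excerpt:encoded><![CDATA[]]></excerpt:encoded>
  </item>
</rss>"""


def clear_encoded(xml_bytes, prefix="excerpt"):
    """Empty every <prefix:encoded> element and return the new XML as bytes."""
    namespaces = {}
    root = None
    # 'start-ns' events yield (prefix, uri) pairs as declarations are seen.
    for event, payload in etree.iterparse(io.BytesIO(xml_bytes),
                                          events=("start", "start-ns")):
        if event == "start-ns":
            ns_prefix, uri = payload
            namespaces[ns_prefix] = uri
        elif root is None:
            root = payload  # the first 'start' event is the root element

    uri = namespaces[prefix]
    etree.register_namespace(prefix, uri)  # keep the prefix on output

    for element in root.iter("{%s}encoded" % uri):
        tail = element.tail  # clear() also resets the tail, so save it
        element.clear()
        element.tail = tail

    return etree.tostring(root)


result = clear_encoded(SAMPLE)
print(result.decode("utf-8"))
```

Because the text is removed per element, it does not matter that each excerpt contains different text. Note that ElementTree does not keep CDATA sections, so the emptied elements are written as `<excerpt:encoded />` rather than `<excerpt:encoded><![CDATA[]]></excerpt:encoded>`; whether that matters depends on the program that will read the file.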